
Graz University of Technology
Institute for Computer Graphics and Vision
Head: Prof. Dr. Franz Leberl

Dissertation

Visual Localization within a World composed of Planes

Friedrich Fraundorfer

Graz, May 2006

Thesis supervisor and first reviewer:
Prof. Dr. Horst Bischof
Institute for Computer Graphics and Vision, Graz University of Technology

Second reviewer:
Prof. Dr. David Nister
Center for Visualization and Virtual Environments, University of Kentucky


The prize is the pleasure of finding the thing out, the kick in the discovery, the observation that other people use it [...]

Richard P. Feynman (when asked about the honors of the Nobel Prize in 1967)


Abstract

Visual map building and localization for mobile robots is a widespread field of research. Research done so far has already produced a vast variety of approaches, yet key questions are still open. In this work we present novel approaches focusing on visual map building and global localization. First, we propose a piece-wise planar world representation which uses small planar patches as landmarks. The new world representation is designed to ease the landmark correspondence problem. The map is augmented with the original appearances of the landmarks and with invariant descriptors, combining geometry- and appearance-based features in a local approach. For building the piece-wise planar map we make use of recent advances in wide-baseline stereo matching based on local detectors. The current state of the art in local detectors is reviewed in this work, and a new method to evaluate the performance of the different detectors is proposed. Based on the evaluation results, new methods for wide-baseline region matching and piece-wise planar scene reconstruction are presented. A map building algorithm is presented which creates a piece-wise planar world map consisting of a set of linked metric sub-maps. Second, a novel algorithm for global localization from a single small landmark match is presented. The method produces an accurate 6 DOF pose estimate, benefiting from the piece-wise planar world representation. Accurate pose estimation from a single small landmark makes the localization very robust, even against large occlusions. Map building and localization are experimentally evaluated on two indoor scenarios and prove to be competitive with other state-of-the-art approaches. In fact, the localization accuracy is comparable to recent approaches, although it is computed from a single landmark match only. The experimental results demonstrate the benefits and strengths of our novel approach.


Kurzfassung

The localization of mobile robots and automatic map building by means of optical systems is a broad field of research. Research so far has resulted in a large variety of different approaches, but still leaves fundamental questions open. In this thesis we present new methods for map building and localization. First, we propose a piece-wise planar world representation in which small planar segments form the landmarks. The new world representation is tailored to easing the landmark correspondence problem. The map contains the original images of the landmarks as well as an invariant description, with the result that geometric and appearance-based features are combined in a local approach. To build the piece-wise planar map, we exploit the recent advances in wide-baseline stereo matching with local detectors. The current state of the art in local detectors is reviewed in this thesis, and a new method for evaluating the different approaches is presented. Based on the evaluation results, new methods for wide-baseline stereo matching and for piece-wise planar scene reconstruction are proposed. A map building algorithm is presented that generates a piece-wise planar world map consisting of a set of linked metric sub-maps. Second, a new algorithm for global localization from a single small landmark is presented. The method computes a pose (with full 6 degrees of freedom) by exploiting the piece-wise planar world representation. Accurate pose estimation from a single small landmark makes the localization very robust, even against large occlusions. Map building and localization are evaluated experimentally on two scenes. The results of map building and localization prove comparable to the current state of the art. In fact, the localization accuracy matches the current state of the art even though it is computed from a single small landmark only. The experiments further demonstrate convincingly the advantages and strengths of our new approach.


Contents

1 Introduction to mobile robotics and vision
  1.1 Localization and map building in mobile robotics
  1.2 Why vision?
  1.3 What has already been achieved?
  1.4 Why is it hard?
  1.5 How can it get solved?
  1.6 Contribution of this thesis
  1.7 Structure of the thesis

2 Visual localization
  2.1 Localization in metric maps
  2.2 Localization from point features
  2.3 Localization from line features
  2.4 Localization from plane features
  2.5 Summary

3 Local detectors
  3.1 Interest point detectors
    3.1.1 Harris detector
    3.1.2 Hessian detector
  3.2 Scale invariant detectors
    3.2.1 Scale-invariant Harris detector
    3.2.2 Scale-invariant Hessian detector
    3.2.3 Difference of Gaussian detector (DOG)
    3.2.4 Salient region detector
    3.2.5 Normalization
  3.3 Affine invariant detectors
    3.3.1 Affine-invariant Harris detector
    3.3.2 Affine-invariant Hessian detector
    3.3.3 Maximally stable region detector (MSER)
    3.3.4 Affine-invariant salient region detector
    3.3.5 Intensity extrema-based region detector (IBR)
    3.3.6 Edge based region detector (EBR)
    3.3.7 Normalization
  3.4 Comparison of the described methods

4 Evaluation on non-planar scenes
  4.1 Measures
    4.1.1 Repeatability score
    4.1.2 Matching score
    4.1.3 Complementary score
  4.2 Representation of the detections
  4.3 Detection correspondence
    4.3.1 Transferring an elliptic region
    4.3.2 Calculating the overlap area from the point set representation
    4.3.3 Justification of the approximation
  4.4 Point transfer using the trifocal tensor
  4.5 Ground truth generation
    4.5.1 Trifocal tensor
    4.5.2 Dense matching
    4.5.3 Ground truth quality
  4.6 Experimental evaluation
    4.6.1 Repeatability and matching score
    4.6.2 Combining local detectors

5 Maximally Stable Corner Clusters (MSCC's)
  5.1 The MSCC detector
    5.1.1 Interest point detection
    5.1.2 Multi scale clustering
    5.1.3 Selection of stable clusters
  5.2 Region representation
  5.3 Computational complexity
  5.4 Parameters
  5.5 Detection examples
  5.6 Detector evaluation: Repeatability and matching score
    5.6.1 Evaluation of the "Doors" scene
    5.6.2 Evaluation of the "Group" and "Room" scene
  5.7 Combining MSCC with other local detectors

6 Wide-baseline methods
  6.1 Wide-baseline region matching
    6.1.1 Matching and registration
  6.2 Piece-wise planar scene reconstruction
    6.2.1 Reconstruction using homographies
    6.2.2 Piece-wise planar reconstruction
    6.2.3 Experimental evaluation
    6.2.4 Real Images

7 Living in a piecewise planar world
  7.1 Map building
    7.1.1 Sub-map identification
    7.1.2 Sub-map creation
    7.1.3 Structure computation
    7.1.4 Landmark extraction
    7.1.5 Sub-map linking
  7.2 Localization
    7.2.1 Localization from a single landmark
    7.2.2 The local plane score
    7.2.3 Algorithms

8 Map building and localization experiments
  8.1 Experimental setup
    8.1.1 ActivMedia PeopleBot
    8.1.2 Laser range finder
    8.1.3 Camera setup
  8.2 Map building experiments
    8.2.1 Office environment
    8.2.2 Hallway environment
  8.3 Localization experiments
    8.3.1 Localization accuracy
    8.3.2 Path reconstruction
    8.3.3 Evaluation of the sub-sampling scheme
  8.4 Summary

9 Conclusion
  9.1 Future work

A Projective transformation of ellipses
  A.1 Projective ellipse transfer
  A.2 Affine approximation of ellipse transfer

B The trifocal tensor and point transfer
  B.1 The trifocal tensor
  B.2 Point transfer

Bibliography


Chapter 1

Introduction to mobile robotics and vision

Computer vision is a fascinating and challenging scientific area. Vision science is clearly motivated by our own ability to see. Our visual system allows us to complete tasks with ease which would be difficult or almost impossible without our eyes. Our eyes allow us to identify friends and colleagues, recognize objects we have seen before, categorize objects (even unseen ones), estimate properties of objects such as size and shape, estimate distances to objects, and let us know where we are. Generally speaking, we get an idea of the world around us. For computer vision researchers, the challenge is to develop computer systems capable of achieving these tasks. In other words, it is the challenge to build computers that see. Although the satisfaction of building such computer systems is by itself motivation enough for most researchers, the applications are manifold. While vision capabilities already play a big role for immovable computer systems, e.g. access systems or surveillance systems, they may play an even bigger role for mobile systems. A lot of research in mobile robotics has already been done and much has been achieved. Nowadays mobile robots have entered domestic areas and operate outside laboratory environments, for example the museum tour guide robots Rhino [14] and Minerva [104]. Mobile robots have already been sent to other planets, like the NASA Mars rover "Sojourner" in 1997 and its successors "Spirit" and "Opportunity" in 2004. However, state-of-the-art systems still lack autonomy: they can only be used in very constrained environments or have to be supervised by human operators. Therefore current research is focusing on building more autonomous systems, on building cognitive systems (see the EU project CoSy, http://www.cognitivesystems.org). A very important necessity for a mobile robot is the capability of knowing where it is, that is, answering the "Where am I?" question. Localizing itself in the environment is essential for navigation; it is necessary to compute a path to the target destination or even to recognize that the target destination has already been reached. In 1991 Cox [20] stated that "using sensory information to locate the robot in its environment is the most fundamental problem to providing a mobile robot with autonomous capabilities". Current systems like Minerva [104] rely on laser range finders and sonar sensors to localize themselves. However, a vision sensor provides the most general world representation, and research has already been done on using vision sensors for mobile robot localization. First attempts date back to Moravec [80] in 1980, but progress has not been as fast as anticipated after the first astonishing results. Key problems remain open and leave much research still to be done in visual robot localization.



1.1 Localization and map building in mobile robotics

Closely connected to navigation and localization is the issue of map building. Maps play a key role in mobile robotics: they are needed if a mobile robot wants to localize itself or wants to plan a route to a certain position. Based on the role of the world map, robot navigation can be split into three broad groups [23]:

Map-less navigation: This first category of systems does not use a map at all. There is no world representation and no map is ever created during operation. In such a case the robot only has a limited view of the world, given by its current sensor readings. This is, however, enough to allow collision detection. If the application is solely to roam around and detect a searched object, a map is not necessary, but it is then not possible for the robot to return to where it started. Furthermore, the fact that no global localization is possible restricts the set of possible applications. But localization is not a necessity for every application; the currently very popular robotic vacuum cleaners (e.g. iRobot's Roomba, http://www.irobot.com) are very successful despite the lack of localization capabilities.

Map-based navigation: Map-based systems depend on a user-provided world representation. The map must be available in a form usable by a mobile robot. Maps can be 2D floor plans or complete 3D CAD models of the environment. The robot must have the ability to sense its current environment and compare it to the given map. In particular, the map must contain landmarks which can be detected and identified by the sensors of the mobile robot. In visual localization such landmarks may be characteristic edges, corners, or even artificial markers with a characteristic texture which are easy to identify. However, creating such a map is very tedious. Furthermore, once a map has been created, changes in the environment have to be maintained and the map has to be updated frequently. The map building requirement therefore complicates the deployment of robotic systems into domestic areas: the mobile robot cannot simply be moved to a new environment and switched on. Thus map-based systems will be restricted to some special applications only.

Map-building based navigation: Map-building based systems are closely related to map-based systems. They also use a map to localize themselves, but in addition they have the ability to build the necessary map with their own sensors. This is, however, a difficult task. Map building and localization have to be done simultaneously, which is called SLAM (Simultaneous Localization and Map building). The robot must possess the ability to update its world representation, adding new structure and new details. In addition it must be possible to use the current, sometimes incomplete map to localize itself therein. In such a framework the mobile robot frequently has to deal with measurement uncertainties. Map features which are found to be wrong after some time have to be removed or updated. However, the SLAM approach provides the greatest versatility for mobile robots. Such a robot could be switched on in a previously unknown environment; it would start roaming around and building a map of its environment. After the map has been completed the robot is fully operational and can do all the navigation and path planning. In addition it can sense changes in the environment and adapt the map accordingly. A toy illustration of such a map-building and localization loop is sketched below.
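To make the loop concrete, the following deliberately simplified, runnable Python sketch illustrates the cycle of predicting from odometry, correcting with known landmarks, and inserting new ones. It is illustrative only and not the method of this thesis: the robot lives on a line, data association is given for free via landmark ids, and the correction is a crude averaging filter.

import random

# Toy map-building-and-localization loop (illustrative only, not this thesis's method).
random.seed(0)
true_landmarks = {0: 2.0, 1: 5.0, 2: 9.0}      # true 1-D landmark positions (made up)
true_pose, est_pose, est_map = 0.0, 0.0, {}

def sense(pose):
    """Noisy relative measurements: landmark position minus robot position."""
    return {i: (p - pose) + random.gauss(0, 0.05) for i, p in true_landmarks.items()}

for step in range(20):
    # 1. Move and predict the new pose from noisy odometry (dead reckoning).
    true_pose += 0.5
    est_pose += 0.5 + random.gauss(0, 0.02)

    # 2. Observe landmarks; those already in the map can correct the pose.
    obs = sense(true_pose)
    known = {i: z for i, z in obs.items() if i in est_map}
    if known:
        implied = [est_map[i] - z for i, z in known.items()]
        est_pose = 0.5 * est_pose + 0.5 * sum(implied) / len(implied)

    # 3. Insert newly observed landmarks relative to the corrected pose.
    for i, z in obs.items():
        est_map.setdefault(i, est_pose + z)

print("pose estimate %.2f (true %.2f)" % (est_pose, true_pose))
print("map estimate", {i: round(p, 2) for i, p in est_map.items()})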

In the following we will focus on the map-building based approach. Previous research on map-building based robot navigation has generated a lot of different approaches to world representation. A possible classification of the different paradigms, following [65], is: topological, metric, and appearance based.

Topological maps: Topological maps represent the environment as connected graphs. The nodes of the graph are possible places, e.g. rooms. Connected nodes therefore represent places which are located close to each other and are reachable for the robot. Navigation and path planning in such a map can be difficult: the robot only gets the information which places need to be traversed to get to its goal, but in the absence of metric information no direction or distance information can be given.

Metric maps: In a metric map the individual map elements are spatially organized, that is, the position of a map element (landmark) is known in a common world coordinate frame. Metric maps can differ widely in the landmarks they use. One possible metric world representation partitions the known world by a grid into discrete cells [81]. For each cell it is stored whether the position is occupied by an object (e.g. wall, table, etc.) or whether it is free. Such a map is often called an occupancy grid, and basically represents the 2D floor plan of an environment. As the size of each grid cell is known, metric information is available and allows distance computations and metric path planning (a minimal occupancy-grid sketch is given at the end of this classification). Another possibility is to represent the world by geometric features which are positioned in 3D [96]. Such features can be 3D points, 3D lines, etc. Localization in such a world can be done by triangulation. The main difficulty, however, is to find the correspondences between the features in the map and the features detected in the current sensor readings.

Appearance-based maps: In such an approach the world is represented by raw sensor data. The map is simply a collection of all previously acquired sensor readings [51]. Guided navigation and localization are difficult. The main problem, however, is the scalability of the approach: simply storing all the sensor readings is very memory consuming and poses a big problem for large scale maps.

In contrast to this classification, combinations of the different approaches have also been proposed in the literature. In [103] metric grid maps are connected by a topological approach on top to generate the world representation.
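As a concrete illustration of the occupancy-grid idea from the classification above, the following minimal Python sketch (cell size, room dimensions and the wall segment are assumed values, not taken from [81]) stores one occupancy flag per cell; because the cell size is known, metric queries come for free.

import numpy as np

CELL_SIZE = 0.1  # metres per grid cell (assumed value)

class OccupancyGrid:
    """Minimal occupancy grid: one boolean flag per cell of a discretized 2D floor plan."""
    def __init__(self, width_m, height_m):
        rows, cols = int(round(height_m / CELL_SIZE)), int(round(width_m / CELL_SIZE))
        self.grid = np.zeros((rows, cols), dtype=bool)

    def world_to_cell(self, x, y):
        # Convert metric world coordinates to integer grid indices.
        return int(round(y / CELL_SIZE)), int(round(x / CELL_SIZE))

    def mark_occupied(self, x, y):
        self.grid[self.world_to_cell(x, y)] = True

    def is_free(self, x, y):
        return not self.grid[self.world_to_cell(x, y)]

# Example: a 10 m x 10 m room with a 5 m wall segment along y = 2 m.
grid = OccupancyGrid(10.0, 10.0)
for x in np.arange(0.0, 5.0, CELL_SIZE):
    grid.mark_occupied(float(x), 2.0)
print(grid.is_free(1.0, 1.0), grid.is_free(1.0, 2.0))   # True False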

1.2 Why vision?

Most of the maps described in the previous section do not depend on a specific kind of sensor. In fact, research is done with a variety of different sensors. The prominent sensors for robot localization are wheel encoders (odometry), inertial sensors, sonar, infrared, laser range finders and of course vision sensors. Each sensor type has different advantages and disadvantages; a list may be found in [65]. Wheel encoders and inertial sensors provide direct information about the path of the robot. Sonar, infrared sensors and laser range finders are ranging devices: they provide the robot with more or less (depending on the type of sensor) accurate distances to objects in its vicinity, but they only provide distance information. Compared to these sensors, a vision sensor seems to be the most powerful one. A vision sensor can provide odometry information as described in [84]. It can also act as a ranging device, either in a stereo setup (demonstrated in [80]) or with a structure-from-motion approach [106]. In addition, a vision sensor allows recording the appearance of the world surrounding the robot, so the visual appearance of landmarks can be associated with range information. A vision sensor would probably give the most general world representation. In fact, certain tasks require the use of vision sensors. Imagine a mobile robot with the task of finding a certain object, let's say a coffee cup, for its user. Detecting the coffee cup can certainly not be achieved with ranging devices alone. Although one could think of detecting a cup by its 3D shape with a laser range finder, this method cannot distinguish between similar cups differing only in color. Such a task requires a vision sensor, and since vision is then already on board it is tempting to use it for navigation and localization too.

1.3 What has already been achieved?

The use of vision sensors for mobile robot localization has not yet reached as elaborate a state as the use of laser range finders. Mobile robots equipped with laser range finders already navigate safely in unknown and crowded environments [104] and are able to build large and accurate maps [105]. But let us discuss what has been achieved using visual sensors in odometry, localization and map building.

In the absence of a map, or within a featureless environment, visual odometry can be used to compute the path a robot has travelled, and thus its actual position can be derived from that path. For visual odometry, point features are tracked from frame to frame and the robot's movement for each frame is computed with a structure-from-motion approach. The estimation has to be very accurate, because the final position is computed incrementally from all the small movements, and even small inaccuracies may result in large deviations. The capabilities of the current state of the art in visual odometry were shown impressively by NASA's Mars Exploration Rovers "Spirit" and "Opportunity" [86]. The slippery surface did not allow for accurate wheel odometry and laser range finders could not be used in the open outdoor environment. However, fully autonomous vision-based navigation was still not possible, and in the end the rovers were controlled by human operators to compensate for errors of the visual localization system.
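The incremental nature of visual odometry, and why even small per-frame errors matter, can be made explicit with a short sketch. This is illustrative only and not code from any of the cited systems: a planar 3-DOF pose stands in for the full 6-DOF case, and the noise levels are invented.

import numpy as np

def se2(dx, dy, dtheta):
    """Homogeneous 2-D rigid transform (a 3-DOF stand-in for the full 6-DOF pose)."""
    c, s = np.cos(dtheta), np.sin(dtheta)
    return np.array([[c, -s, dx],
                     [s,  c, dy],
                     [0.0, 0.0, 1.0]])

rng = np.random.default_rng(0)
pose_true, pose_est = np.eye(3), np.eye(3)
for _ in range(200):                          # 200 frames of 5 cm forward motion
    pose_true = pose_true @ se2(0.05, 0.0, 0.0)
    # Each frame-to-frame estimate carries a tiny error in translation and heading ...
    pose_est = pose_est @ se2(0.05 + rng.normal(0, 0.001), 0.0, rng.normal(0, 0.002))

# ... and because the pose is the product of all increments, the errors accumulate.
drift = np.linalg.norm(pose_est[:2, 2] - pose_true[:2, 2])
print("travelled %.1f m, accumulated position drift %.3f m" % (200 * 0.05, drift))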

The current state of the art in visual localization is defined by vSlam [56] and the method described in [96]. Both systems are SLAM approaches based on SIFT landmarks [67] and show very similar performance. They allow map building in indoor environments ranging from a single room up to small flats. The robot explores the environment autonomously and creates a map of 3D point landmarks. After map creation is finished, the robot can perform global localization and path planning tasks. The achieved localization accuracy is about 10-15 cm on average. For vSlam the robot has to be equipped with a single camera only; the 3D reconstruction of the landmarks is done with a structure-from-motion approach. The other system uses a stereo setup on the robot for 3D reconstruction. A main limitation of these systems is the size of the maintained environment map: for environments bigger than room size, the map becomes too big to be handled in real time.

The last example deals with the automatic map building of large scale and outdoor environments. The system proposed in [10] is capable of accurately mapping a path several kilometers long. The large scale map is composed of connected metric sub-maps which contain 3D line features. The system allows loop closing by matching the 3D lines from the current reconstruction to the 3D lines of the sub-maps, and a global optimization ensures the high accuracy of the map. However, the map features are purely geometric, so the system will have difficulties in buildings with highly similar structures. Moreover, the method does not allow global localization in general. The map in the presented form cannot be used for robot navigation and localization.



These three examples show that visual methods are basically capable of performing odometry, SLAM in constrained environments, and large scale map building. More details about the potential of the state-of-the-art methods in visual localization are given in Chapter 2.

1.4 Why is it hard?

The previous section showed the current state of the art in visual localization and gave an idea of what is still missing. But what are the difficulties in visual localization that have so far prevented researchers from coming up with reliable and versatile solutions? Difficulties arise as technical issues and as conceptual issues.

High complexity of image processing algorithms: A major technical difficulty is that mobile robots require real-time processing of the sensor data, that is, real-time image processing. Current computer systems which can be used on mobile robots have very limited computational power. This means that the most advanced methods in computer vision simply cannot be used on a mobile robot. Very often sub-optimal algorithms are used which are known not to produce the best results but which are computable on mobile robots. Although the processing speed of computers is increasing fast, the need for real-time processing is a major constraint, and a lot of achievements in computer vision are therefore simply not used in mobile robotics.

Changing environment: Other technical difficulties lie in the changing environment, for example illumination changes and background changes. Computer vision methods in particular are strongly affected by such changes, which are certainly one of the main reasons why visual localization has not yet reached the same level as localization with other sensors. Illumination and background changes, for instance, pose no problems for laser range finders or sonar sensors.

Scalability: Scalability also poses a big technical difficulty for mobile robots. Systems which work for small indoor environments (see [56]) will not work in larger or outdoor scenarios. Computation time and memory requirements depend on the size of the maintained environment map: real-time processing simply cannot be carried out for large maps and the robot may run out of memory. Scalability is a problem independent of the sensors used, but it is a much more critical issue for vision based systems. Digital cameras, especially those with high resolution sensors, produce an enormous amount of data, posing problems for real-time processing.

Uncertainties: A rather conceptual difficulty arises in the treatment of uncertainties. Uncertainties occur as simple measurement uncertainties, but they also occur in top level reasoning and navigation processes. Uncertainties in measurements and in geometric representations are already well understood, e.g. for elements like 3D points and 3D lines [96]. But uncertainties in navigational or task oriented decisions pose an enormous problem [23]; it is not well understood at which precision such uncertainties should be taken into account.

Cognitive abilities: The most challenging difficulty, however, is that for reliable autonomous navigation the robot must be aware of the meaning of objects and structures in its environment. The cognitive abilities necessary for tasks like automatic scene interpretation, fully autonomous navigation and interaction with people are still the subject of intense research. The current state of the art in computer vision allows very specific and constrained tasks to be solved, but has not yet reached a level where it could provide the desired autonomy for mobile robots.

The above list covers the most prominent difficulties for visually controlled mobile robots. However, having identified the difficulties and problems, research can be focused in the right directions.

1.5 How can it get solved?

Reconsidering the difficulties listed in the previous section, the question arises how they can be overcome, or, formulated differently, how we should proceed to build a vision based mobile robot. Some researchers believe that there will not be a single, general algorithm which suffices to guide a mobile robot [22]. In fact, there is some evidence that this is also not the case for biological navigation systems [90]. Instead, it seems that vision for animals and humans alike works as a collection of specialized behaviors developed over long periods of evolution. This has been stated very memorably by Ramachandran: "Vision is just a bag of tricks" [90]. Carried over to the domain of mobile robotics, this can be interpreted as the mobile robot having a collection of different methods, each very specialized to well defined cases. To solve an actual problem, only the method that works has to be selected. The individual methods would only be required to work in very well constrained situations, which would ease their development. Such an approach has already been proposed by Brooks [12]. The proposed robot control system consists of a collection of specialized behaviors organized in hierarchical layers. All the behaviors run concurrently and the overall robot action is determined by voting. The different behaviors do not share a common world representation; each method has its own specialized representation. Such a scheme is already very familiar from sensor fusion, where measurements from different modalities are combined to achieve higher accuracy and robustness. It is a straightforward step to apply this scheme at the level of localization methods.

The scheme sounds very promising. The deficiencies of current vision based solutions can be analyzed and specialized methods can be developed to overcome them in a decoupled way. With such an approach it will be possible to produce a very versatile visual localization. Leading-edge visual localization like vSlam might already cope with 95% of all encountered situations; in the remaining 5% of cases, human assistance is necessary to resolve problems. For non-stop operation of a mobile robot, even these 5% of problematic cases are too much, and they will require a whole set of specialized methods. Doing computer vision research with the goal of completing the "bag of tricks" therefore seems a very promising way forward. It will provide robot engineers with specialized methods for many difficult cases, and research to identify special cases and provide proper solutions will be very valuable. The fusion of all the tricks, however, will be the responsibility of AI researchers and possibly carried out as described in [12].

1.6 Contribution of this thesis

In the spirit of the approach described in the previous section, this thesis does not deal with the development of a complete visual SLAM method but focuses on the development of some primary key technologies. The main focus of this thesis is on global localization in indoor environments. Global localization applies when a robot operates in a known environment, that is, the map has already been built. Global localization is then the computation of the pose of the robot, consisting of position and heading, from the actual sensor reading, i.e. the camera image, without using previous pose information. Global localization is needed quite frequently, namely in the following situations:

Switching the robot on: After switching on a mobile robot, its position is not known. However, the robot may already possess a complete environment map, e.g. from a previous run. But before it can start useful operations (like navigation and path planning) its position has to be determined first. Global localization allows the pose of the mobile robot to be computed from the actual camera image.

Kidnapped robot problem: The kidnapped robot problem has been stated by Engelson and McDermott [27]. In the kidnapped robot problem a well-localized robot is teleported (or simply moved with switched-off sensors) to some other location. The problem with this scenario is that the robot still believes it is at the location from which it has been kidnapped. Path planning and navigation based on such an assumption will not work: the environment predicted from the map and its assumed position will not match. The kidnapped robot problem is basically a test of the ability of a robot to recover from a catastrophic localization failure. Global localization is the solution to the kidnapped robot problem; with the ability of global localization the robot can immediately determine that it has been moved from outside.

Recover from failure: Recovery from a failure is also possible with global localization. Consider the case when a mobile robot moves into an area without landmarks. Abruptly, the current sensor readings do not contain any landmarks. For vision based systems this can easily occur when the robot moves into an untextured area, e.g. a part of a room containing only white walls. In such a case the robot would lose track. The robot would then move around randomly to get back into an area where landmarks can be detected. However, when it finally enters an area where landmarks reappear, its global position has been lost and global localization is necessary. It would be possible to rely on the robot's wheel odometry in the landmark-less area, but keeping track with odometry only would introduce too large deviations in the pose estimate and global localization would still be necessary.

Loop closing: Loop closing is the ability of a mobile robot to recognize previously visited areas. It is important in the stage of map building. During map building a mobile robot traverses the environment and adds new structure and landmarks to the world map. If the robot enters an already mapped area and does not recognize this, the same features will be added twice, usually not at the same position because of small drift errors. Global localization can be used to notice that the current location has already been visited, that is, that the loop has been closed. This provides the information that the landmarks may already be in the map and that an update is appropriate rather than a simple insertion.

Homing: The last example of a situation requiring global localization is homing. Homing is the task of a mobile robot having to go back to some start position. The start position may for instance mark an automatic charging device, and the robot will go there to recharge its batteries. Global localization can tell the robot when its target position is reached.



The vision based global localization proposed in the following will be based on wide-baseline stereo methods and will work with a fully 3D piece-wise planar world representation. A new approach which allows global localization from a single landmark will be presented in this work. Furthermore, the methods to build a piece-wise planar world representation from an image sequence acquired by a mobile robot will be described. Localization and map building are implemented for an ActivMedia PeopleBot (http://www.activrobots.com/robots/peoplebot.html, see Figure 1.1). The robot is equipped with a single camera and a wide-angle lens, and localization and map building are done solely with this camera setup. The robot is also equipped with a laser range finder, infrared and sonar sensors; these additional sensors are used to obtain ground truth for the localization experiments.

Figure 1.1: (a) The mobile robot used for localization and map building (ActivMedia PeopleBot). It is equipped with a single camera. (b) A closeup of the camera setup.

The following topics are the main contributions of this thesis:

Performance evaluation of local detectors: Local detectors are a key ingredient for solving the correspondence problem in robot localization. There already exists a variety of different methods with different properties and performance, but it is not clear which of the proposed methods is best suited for visual localization. So far the best source of information is the comparison by Mikolajczyk et al. [76], which reveals the differences and properties of the various methods. However, that comparison evaluates the detectors on simple planar test scenes, and the method is not applicable to the realistic, complex scenes that will be encountered in mobile robot experiments. One contribution of this thesis therefore is the development of a method to evaluate the different local detectors on realistic complex scenes. The resulting comparison shows a significant difference to the previous evaluation on the restricted test cases.

Maximally Stable Corner Clusters: Based on the new evaluation results we propose a new local detector, the so called Maximally Stable Corner Cluster (MSCC) detector. Interest regions are formed by clusters of simple corner points in images. The detection algorithm includes a stability criterion for robust detection. The evaluation of the new detector shows good repeatability. Comparison with other methods reveals that the new detector finds regions at image locations left out by the other methods; it is thus complementary to them. This complementarity is the key property of the new detector, as it allows an effective combination with current state-of-the-art methods.

3D piece-wise planar world map: Another key contribution of this thesis is the development of a new world representation. We propose a piece-wise planar world map, where each landmark is a small planar patch associated with a SIFT descriptor and the original appearance from the image. The world representation is a crucial element for localization, and we will show that our proposed map design enables new, successful localization methods. A batch method is proposed to automatically build the piece-wise planar map from an image sequence acquired by a mobile robot.

Global localization from a single landmark: Based on the newly developed map, a global localization algorithm is proposed which computes the pose of the robot by solving the 3D-2D correspondence problem. The main achievement of this algorithm is that full 3D pose estimation is possible from a single landmark match. Furthermore, a selection criterion, the lp-score, is introduced to select the best pose estimate from a set of hypotheses, which allows accurate pose estimation from an extremely small image region (an area of around 400 pixels). Thus the global localization can deal with a high level of occlusion, as is necessary for crowded environments. A minimal sketch of this kind of 3D-2D pose computation is given below.
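The sketch below illustrates only the generic 3D-2D step, with OpenCV's general-purpose solvePnP used as a stand-in and with made-up intrinsics and patch coordinates; the thesis's own single-landmark algorithm and the lp-score selection are the subject of Chapter 7.

import numpy as np
import cv2

# Pose from the 3D-2D correspondences of one small planar landmark (illustrative sketch).
K = np.array([[500.0, 0.0, 320.0],            # assumed camera intrinsics
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

# Four corners of a 10 cm planar patch (the landmark), expressed in map coordinates.
patch_3d = np.array([[0.0, 0.0, 2.0],
                     [0.1, 0.0, 2.0],
                     [0.1, 0.1, 2.0],
                     [0.0, 0.1, 2.0]])

# Their image positions in the current view; here they are synthesized from a known
# ground-truth pose purely to keep the example self-contained.
rvec_true = np.array([0.0, 0.2, 0.0])
tvec_true = np.array([0.05, -0.02, 0.5])
patch_2d, _ = cv2.projectPoints(patch_3d, rvec_true, tvec_true, K, None)

# Solving the 3D-2D correspondence problem recovers the full 6 DOF camera pose.
ok, rvec, tvec = cv2.solvePnP(patch_3d, patch_2d, K, None)
print(ok, rvec.ravel(), tvec.ravel())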

1.7 Structure of the thesis

The next two chapters discuss the current state of the art in visual localization (Chapter 2) and local detectors for wide-baseline stereo (Chapter 3). Next, it will be discussed how wide-baseline methods were used and extended for mobile robot localization. For wide-baseline stereo [70], methods have been developed to allow stereo reconstruction for scenes which deviate largely from the normal stereo case, including large baselines, large projective distortions, scale change and rotation. The main achievement of wide-baseline methods is to solve the correspondence problem for wide-baseline cases, i.e. to compute point matches between the images. Having solved the initial correspondence problem, the epipolar geometry between the images can be estimated and 3D reconstruction can be performed with known standard methods [44]. The correspondence problem, however, is a key issue in mobile robot localization too. For robot localization it is necessary to detect correspondences between map landmarks and landmarks extracted from the current image. This is a difficult task, as there are big viewpoint changes while the robot moves around. The problems in mobile robotics are therefore very similar to wide-baseline stereo, and the application of wide-baseline methods to mobile robot localization represents the main focus of this thesis.

The key ingredients for solving the correspondence problem in wide-baseline stereo are local detectors [76], i.e. interest point and interest region detectors which allow repeated detection of the same locations in images taken from widely different viewpoints. Recent research in wide-baseline stereo has produced a variety of detectors with quite different properties. Available comparisons of the different methods [76] were made on very restricted test cases, not comparable to the scenarios which appear in mobile robot localization: the evaluation method in [76] only allows an evaluation of the detectors on scenes containing a single plane, whereas the scenes encountered by mobile robots (offices, hallways, etc.) usually contain complex and arbitrary structure. Hence, to choose the best detector, a new evaluation method (based on the trifocal tensor) has been developed which allows evaluation on realistic complex scenes (see Chapter 4). A comparison was performed with the currently available local detectors and a significant difference to the previous evaluation could be observed. Based on these evaluation results we propose a new local detector, the so called Maximally Stable Corner Cluster (MSCC) detector (see Chapter 5). MSCC regions are clusters of simple corner points and are detected in structured, textured image parts. The evaluation showed that the areas where MSCC's are detected are often left out by the other methods; the MSCC regions are thus complementary to the other detections, which allows an effective combination with other methods. MSCC regions are a valuable enrichment of the current pool of local detectors.
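To give a rough feel for the idea of grouping corners into regions, here is a schematic Python sketch on a synthetic image. It is illustrative only and deliberately incomplete: the actual MSCC detector, including its multi-scale clustering and the stability-based selection of clusters, is defined in Chapter 5, and the corner detector, clustering method and parameters below (DBSCAN with an assumed radius) are stand-ins.

import numpy as np
import cv2
from sklearn.cluster import DBSCAN

# Schematic only: nearby corner points are grouped into candidate regions.
# The real MSCC detector additionally clusters over multiple scales and keeps
# only clusters that remain stable across scales; that step is omitted here.
rng = np.random.default_rng(1)
img = np.zeros((240, 320), dtype=np.uint8)
img[60:140, 80:200] = rng.integers(0, 256, (80, 120), dtype=np.uint8)  # textured patch

corners = cv2.goodFeaturesToTrack(img, maxCorners=300, qualityLevel=0.01, minDistance=3)
pts = corners.reshape(-1, 2)

labels = DBSCAN(eps=15, min_samples=5).fit(pts).labels_     # spatial clustering of corners
for label in sorted(set(labels) - {-1}):
    cluster = pts[labels == label]
    print("cluster %d: %d corners around (%.0f, %.0f)"
          % (label, len(cluster), cluster[:, 0].mean(), cluster[:, 1].mean()))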

For localization and map building, specialized wide-baseline methods were developed. A wide-baseline region matcher, which solves the correspondence problem reliably and robustly, is described in Chapter 6. The key technique is to iteratively register planar regions detected in images from different viewpoints. The proposed method produces very accurate and highly reliable matches with a very small false-positive rate. For the map building algorithm, a method to reconstruct piece-wise planar scenes from wide-baseline images has been developed (also in Chapter 6). The method works by using inter-image homographies and produces a segmentation of an image into scene planes and a piece-wise planar 3D reconstruction of the scene.
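For reference, the two-view relation that underlies such homography-based reconstruction is the standard plane-induced homography (textbook material, stated here under one common sign convention, not the thesis's specific formulation from Chapter 6): for a scene plane satisfying $\mathbf{n}^{\top}\mathbf{X} = d$ in the first camera frame and a relative motion $(R, \mathbf{t})$ between the views,
\[
\mathbf{x}' \simeq H\,\mathbf{x}, \qquad
H = K' \left( R + \frac{\mathbf{t}\,\mathbf{n}^{\top}}{d} \right) K^{-1},
\]
so that, conversely, an estimated inter-image homography together with the camera matrices constrains the plane parameters $(\mathbf{n}, d)$.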

This method is used in the map building algorithm described in Chapter 7. The landmarks in the proposed map are small planar patches associated with a SIFT descriptor and the original appearance from the image. The landmark planes are fully parameterized in 3D. The map is composed of connected metric sub-maps; the individual sub-maps are connected by rigid transformations into a common world coordinate frame. Map building is described as an automatic batch process. The algorithm gets as input an image sequence acquired from an arbitrary robot run. Map building is performed in three steps. In the first step, images for sub-map reconstruction are automatically identified. In the second step, the individual sub-maps are created by piece-wise planar scene reconstruction and landmark extraction. In the third step, the individual sub-maps are connected to form the complete world representation using wide-baseline region matching. The map created in this way can then be used for global localization from a single current view. The proposed global localization method works by computing the pose of the robot within a local sub-map; the global pose is then computed by transforming the position from the local to the global coordinate frame. The pose in a local sub-map is computed from 3D-2D point correspondences. First, landmark matches between the current view and the map are detected, representing 2D-2D matches. By using the 3D plane information, 3D parameters of the map landmarks can be computed, which yield the necessary 3D-2D point correspondences. Each single landmark yields a set of 3D-2D point correspondences, and the correspondences computed from a single landmark are enough for pose estimation. Experiments with the map building and localization algorithm are described in Chapter 8. Finally, a discussion and an outlook conclude the thesis in Chapter 9.


Chapter 2

Visual localization

This chapter discusses the current state-of-the-art in visual localization, or, as coined in the previous chapter, the current "bag of tricks". Different approaches to visual localization will be investigated and their strengths and weaknesses discussed. The focus, however, is on the methods which provide the basis for this thesis and directly influence the methods proposed here. The chapter closes with a summary and a comparison of the different methods in table form.

2.1 Localization in metric maps

The most complete positional representation for mobile robots is a full 6 DOF pose in a global world coordinate system. As this is difficult to achieve, many approaches were developed that allow robot applications without a complete pose description or without an explicit pose computation. Nevertheless, a localization method which produces a full 6 DOF pose would be highly favorable for all mobile robot applications.

One possibility to compute a 6 DOF pose is visual odometry. Visual features are tracked from frame to frame and the translation and rotation of the robot are updated with the movements between the frames. Successful visual odometry systems were implemented by Nister et al. [84] and Olson et al. [86]. Such approaches do not even need a world map. However, as with wheel odometry, they suffer from a fundamental problem: small errors in the computation of the inter-frame movements accumulate in the final pose, so such systems run into serious trouble when used over longer periods of time. Map building based approaches alleviate these problems and allow pose computation without knowledge of the previous position; we will therefore focus on them. However, not all map building based approaches compute a full 6 DOF position. Some approaches are limited to place recognition, where the map is partitioned into distinct places and localization returns the information at which place the robot currently is. Although very simple, this still allows navigation and path planning. Such an approach has been described by Lowe [67]. In [60] the method of Lowe has been extended with a Hidden Markov Model and localization results are presented for an indoor scenario. Another approach [35] describes place recognition which assumes the planarity of the landmarks to increase the reliability of landmark matching. One can understand place recognition as a pre-stage to full 6 DOF pose estimation: it is able to restrict the complex pose estimation to a smaller part of the entire map, thus gaining a speedup. Accurate navigation, as needed for service robots for instance, however requires full 6 DOF pose estimation.


Computing a full 6 DOF pose in a map building based approach requires a metric map, where each map feature is positioned in a global coordinate frame. Usable map features are points, lines and planes. Different features require different localization methods, and in the next sections the methods developed so far for localization in the different kinds of metric maps will be described.

2.2 Localization from point features

Most of the localization methods developed so far work with world maps containing point features. The map landmarks are 3D points associated with a feature vector to solve the correspondence problem. Localization from point features is already very well understood. A well known method is triangulation: angle measurements to three distinct landmarks allow the pose estimation. This approach has been used in the work of Davison and Murray [21]. Angular measurements were made with an active stereo head carrying two digital cameras. The stereo head can perform panning and turning movements where both cameras are moved together; in addition, each camera can rotate around a vertical axis to produce converging viewpoints. The mechanical resolution is very accurate and the stereo head delivers accurate odometry information to relate the camera position to the robot position. The angle measurements are performed by fixation. Fixation means directing the cameras to point directly at a landmark, i.e. the landmark gets located at the principal point of the image. For fixation, the left camera is first centered onto a landmark. Then the right image is searched along the corresponding epipolar line for a matching landmark using normalized cross-correlation, and the right camera is moved to the found match. The angle to the landmark can be computed from the angles of the cameras and the stereo head. Multiple landmarks are fixated in this way and the measured angles are used for triangulation. Although the angles are determined mechanically, the approach is quite accurate. The measurement accuracy depends on the geometry of the stereo head and the accuracy of the image matching. The two fixated image points may show a localization error of at most 1 pixel, which results in a possible angular error of 0.3°. With the 0.338 m baseline of the stereo setup this allows accurate depth measurements in a range from 0 to 2 m. The uncertainty of the depth estimate increases with the distance; for a distance of 5 m the expected uncertainty is almost 1 m. The angular measurements, however, remain very accurate even at large distances. The analysis of the authors also shows that the mechanical odometry of the head is more than accurate enough for this task. The maximal angular error of the head odometry is 0.005°, which is orders of magnitude lower than the 0.3° error introduced by the allowed 1 pixel error in image fixation. This fixation method is also used for the initial map construction. A possible landmark, detected by the Harris corner detector [40], is fixated to determine its angle and distance to the robot. The measurements are then used to compute the full 3D parameters of the point feature, and the resulting point is added to the map. However, this approach is rather slow as the fixation process requires mechanical movements of the cameras, separately for each of multiple landmarks.
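To make the geometry behind fixation-based depth measurement concrete, the following minimal NumPy sketch triangulates the depth of a fixated point from the signed pan angles of two verging cameras with a known baseline. It is not the implementation of Davison and Murray; the function name and the numeric values (a 0.338 m baseline and a 0.3° angular error) are only illustrative.

```python
import numpy as np

def depth_from_fixation(angle_left_deg, angle_right_deg, baseline_m):
    """Triangulate the depth of a fixated point from the signed pan angles of
    two verging cameras (left camera at -b/2, right camera at +b/2 on the
    x-axis, angles measured from each camera's forward direction).

    Geometry: tan(theta_left) - tan(theta_right) = baseline / depth.
    """
    t_l = np.tan(np.radians(angle_left_deg))
    t_r = np.tan(np.radians(angle_right_deg))
    return baseline_m / (t_l - t_r)

# Illustrative numbers only: a point roughly 1 m ahead of a 0.338 m baseline
# head, and the same point with a 0.3 degree fixation error on the left camera.
z = depth_from_fixation(9.6, -9.6, 0.338)
z_err = depth_from_fixation(9.9, -9.6, 0.338)
print(f"depth: {z:.3f} m, with 0.3 deg error: {z_err:.3f} m")
```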

A different approach has been presented by Karlsson et al. [56]. Localization uses computer vision methods and works by computing the robot pose from 3D-2D point correspondences. In the current view, interest points are detected and matched with the landmarks in the map by SIFT [96] feature matching, which establishes 3D-2D point correspondences. The pose is computed from the 3D-2D correspondences with the POSIT algorithm [24], an iterative method which requires at least 4 non-coplanar points. The pose estimated by the POSIT algorithm is then refined by non-linear minimization, and the full 6 DOF pose is recovered. Typically a pose is estimated from 10 to 40 3D-2D correspondences. The visual pose is then combined with measurements from wheel odometry within a probabilistic SLAM framework [78]; in fact, during frames with no detected visual landmarks, navigation continues based on wheel odometry alone. Map building is also vision based. The robot starts driving around in an initially unknown environment, building a world map. The 3D points in the map are associated with SIFT features and an original view of the landmark cropped from the original image. Each landmark can have a set of associated SIFT features, describing the landmark for various viewpoints. A 3D landmark is reconstructed from three images, taken in sequence at a spacing of 20 cm. Interest points are detected and matched between the three images using SIFT feature matching. With a structure-from-motion approach the 3D landmarks are reconstructed and the camera positions (robot positions) are computed. The landmarks are reconstructed in a local coordinate frame; by adding this position to the current position of the robot, the landmarks are transformed into the global coordinate system and this position is stored in the map database. Map building continues until the whole environment has been traversed and no new landmarks are found. The authors describe experiments for a 2-bedroom apartment: map building lasted 32 minutes and the robot created a map containing 82 landmarks. During operation, map updates are possible; updates of the landmark positions are maintained by a Kalman filter [54]. The average localization error measured in the experiments is about 20 cm to 25 cm, which is quite high. However, it should be stressed that rather simple methods are used in this approach to let the software run in real time on low-cost computers. The approach of Karlsson et al. is especially interesting as it is available as the commercial localization software vSlam (http://www.evolution.com/core/navigation/vslam.masn) for the robots sold by Evolution Robotics (http://www.evolution.com). vSlam achieves map building and navigation with a single low-cost camera. The most limiting factor of the approach, according to the authors, is the size of the landmark database: each landmark needs about 40 kB to 500 kB of memory, which restricts the method to small indoor environments. Another critical issue worth discussing is the reconstruction of the landmarks during map building. A landmark is reconstructed from three images at different positions. However, as the camera usually faces forward, the three views contain only translational forward motion, which imposes very bad conditions for 3D reconstruction. In fact, the reconstruction of a plane (e.g. a wall) will show depth estimation errors of about 10 cm in practice. Such uncertainties are, however, handled within the SLAM framework.
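The core computational step of this approach, pose from 3D-2D correspondences, can be sketched as follows. The sketch uses OpenCV's iterative solvePnP as a stand-in for the POSIT plus non-linear refinement pipeline described above; the intrinsic matrix, the landmark coordinates and the synthetic ground-truth pose are placeholders.

```python
import numpy as np
import cv2

def pose_from_3d_2d(points_3d, points_2d, K):
    """Recover the camera rotation R and translation t from 3D map landmarks
    and their 2D detections in the current view (full 6 DOF pose)."""
    ok, rvec, tvec = cv2.solvePnP(points_3d.astype(np.float64),
                                  points_2d.astype(np.float64),
                                  K, None)          # iterative PnP by default
    if not ok:
        raise RuntimeError("pose estimation failed")
    R, _ = cv2.Rodrigues(rvec)                      # rotation vector -> 3x3 matrix
    return R, tvec

# Toy example: project six hypothetical landmarks with a known pose, then
# recover that pose from the resulting 3D-2D correspondences.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
pts3d = np.array([[0.0, 0.0, 2.0], [0.5, 0.0, 2.2], [0.0, 0.5, 1.8],
                  [0.4, 0.6, 2.5], [-0.3, 0.2, 2.1], [0.2, -0.4, 1.9]])
rvec_true = np.array([0.05, 0.10, 0.00])
tvec_true = np.array([0.10, -0.05, 0.30])
pts2d, _ = cv2.projectPoints(pts3d, rvec_true, tvec_true, K, None)
R, t = pose_from_3d_2d(pts3d, pts2d.reshape(-1, 2), K)
```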

A different approach has been presented by Se, Lowe and Little [96]. In their work they actually propose three different localization methods. The robot movement is, however, assumed to be restricted to a plane, so the pose estimate only contains 3 DOF; the map itself contains the full 3D coordinates of the landmarks. All three methods basically work by computing the pose from 3D-3D landmark matches. The robot is equipped with a trinocular stereo head (Triclops, http://www.ptgrey.com) which produces 3D coordinates for each landmark in the current view. The first localization approach is based on the Hough transform [48]. A discretized 3D Hough space representing the robot poses with three parameters (X, Z, θ) is constructed. Each landmark match votes for possible poses in the Hough space, and the maximum vote determines the parameters (X, Z, θ) of the robot pose. The second proposed method is a RANSAC scheme [28]. From two landmark matches the translation and rotation necessary for alignment, and thus the robot pose, can be computed. This is repeated for a number of randomly chosen landmark samples within a RANSAC scheme. For each sample the pose hypothesis is verified by checking how many landmark matches out of the complete set agree with the pose estimate. The landmarks supporting the pose estimate form the consensus set and are called inliers. Finally, a least-squares estimate of the pose is performed using all inlier landmarks of the pose hypothesis with the largest consensus set. The third method computes the pose by map alignment. It works by constructing a local sub-map from landmarks of multiple frames; this local sub-map is then aligned with a part of the world map. The local sub-map is created while the robot rotates a little, from −15° to 15°. The map alignment is implemented with the RANSAC scheme of the previous method. This method is to be preferred if only a few landmarks are currently in the field of view of the robot.

Besides localization, the authors describe a complete framework for visual SLAM including global localization. The system is designed for indoor operation. Without an a priori map, the robot starts to construct a map by driving around randomly; map building is completed when no new features are detected. DoG keypoints [67] are detected in each image frame and a SIFT descriptor [67] is computed for each detection. The 3D parameters for each detected image point are computed with the calibrated trinocular stereo system. The reconstructed image points are stored in the map as landmarks associated with the corresponding SIFT description. The detected image points are tracked in the subsequent frames and the SIFT descriptions from these frames are additionally added to the 3D landmarks. Thus a landmark's entry in the database consists of the 3D parameters of the point and a collection of SIFT descriptions from different viewpoints. The acquired image data is not stored any further. A sub-map concept is used for map building: 3D landmarks extracted from an image are not immediately added to the map, but to a local sub-map first. If the landmarks can be tracked for some time, the whole sub-map is added to the global map. The local sub-map is aligned to the already existing landmarks in the global map; new landmarks are added, while already existing landmarks are updated. Each landmark has an associated uncertainty which decreases with multiple measurements. The uncertainty is represented by a 3 × 3 covariance matrix, and a Kalman filter [54] is used to propagate the uncertainty of the landmarks. If a landmark is re-detected, the uncertainty shrinks, indicating that the landmark is better localized. Experiments for map building and localization are shown for a room of size 10 × 10 m. The measured average position error for global localization was reported to be 7 cm, while the average rotational error was about 1°. The experiments show that reliable pose estimation requires a minimum of 10 landmark matches. The approach is a very reliable visual SLAM algorithm. With a frame rate of 2 Hz reported on a relatively slow computer, it basically runs in real time. The key component of the method is the use of the SIFT descriptor for the landmarks. This makes it possible to generate a map of natural landmarks which can be reliably re-detected and matched. The SIFT descriptor is based on orientation histograms and is therefore very robust to illumination changes. It allows the correspondence problem, which is basically the most crucial part of visual systems, to be solved quickly and reliably. The achieved localization accuracy is high enough to allow safe and useful navigation through the environment. Difficulties in 3D reconstruction are avoided by using a fixed stereo setup which directly outputs 3D coordinates. However, this is much more expensive than the use of a single camera and is not suited for small-scale robots. It is worth mentioning that the created 3D map is a sparse set of 3D landmarks. It cannot be used for visualization purposes and it is difficult to use for navigation and path planning tasks, because much of the structure of the environment is not contained in the map, but only some distinct landmark points.
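The two-landmark RANSAC scheme described above can be sketched in a few lines of NumPy for the planar case, where a landmark match pairs a 2D ground-plane position observed from the current view with its 2D map position. The sample count, inlier threshold and the Procrustes-style least-squares refinement are illustrative choices of this sketch, not the authors' implementation.

```python
import numpy as np

def two_point_rigid_2d(p, q):
    """Rotation and translation in the ground plane mapping the two observed
    landmark positions p (2x2 array) onto their map counterparts q (2x2)."""
    dp, dq = p[1] - p[0], q[1] - q[0]
    theta = np.arctan2(dq[1], dq[0]) - np.arctan2(dp[1], dp[0])
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    return R, q[0] - R @ p[0]

def ransac_pose_2d(obs, mapped, iters=200, thresh=0.1, seed=0):
    """RANSAC over randomly drawn 2-landmark samples: keep the pose hypothesis
    with the largest consensus set and refine it on all of its inliers."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(obs), dtype=bool)
    for _ in range(iters):
        i, j = rng.choice(len(obs), size=2, replace=False)
        R, t = two_point_rigid_2d(obs[[i, j]], mapped[[i, j]])
        inliers = np.linalg.norm(mapped - (obs @ R.T + t), axis=1) < thresh
        if inliers.sum() > best.sum():
            best = inliers
    # Least-squares refinement (orthogonal Procrustes) on the consensus set.
    a, b = obs[best], mapped[best]
    ac, bc = a - a.mean(axis=0), b - b.mean(axis=0)
    U, _, Vt = np.linalg.svd(ac.T @ bc)
    D = np.diag([1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = b.mean(axis=0) - R @ a.mean(axis=0)
    return R, t, best
```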

Another approach to visual localization uses invariant sets of points. In the work by Atiya and Hager [1] the pose is computed from invariant point triples. A different approach has been developed by Sim and Dudek [98], where the pose is computed from the transformation of learned natural landmarks.

2.3 Localization from line features

Using lines as landmarks was investigated already in the early days of mobile robotics. One reason might be that line extraction works well even under large viewpoint changes: an edge detector, e.g. [15], will detect lines repeatably despite viewpoint and illumination changes. The difficulty, however, remains in matching the lines extracted from an image to the landmark lines in the map. As a solution for this correspondence problem, geometric matching was investigated.

One approach is known as the FINALE system developed by Kosaka and Kak [59]. The approach uses a CAD model of the environment as map. The CAD model is composed of lines only. The goal is to match the lines extracted from the current view with the lines in the CAD model and thus determine the position of the robot. The FINALE system allows for incremental localization only, that is, the robot's previous position must be known and must be close to the actual position. The robot's position is therefore maintained using a Kalman filter. For localization, the lines of the CAD model visible from the previous position are first projected onto the image plane. The 2D representations of the map lines created in this way are then matched with a simple nearest-neighbor approach to the lines extracted from the current view. Once the correspondences are established, the position maintained by the Kalman filter is updated depending on the deviation of the matched lines from the projected lines. For this approach one has to take care that the CAD model contains edges which are detectable by an edge detector. This is usually the case for the edges created where a wall meets the ceiling and the floor, or for edges created by doors. However, the need for a pose estimate for the projection of the lines is a drawback of this method.

A much more recent line based approach has been proposed by Bosse et al. [10]. In their work a method for large scale mapping and localization using line features is described. The proposed localization algorithm works by sub-map matching. The world map is composed of multiple sub-maps, each covering a small area created from a small number of image frames, and the sub-maps are linked by rigid 3D transformations. A sub-map contains 3D lines extracted and reconstructed from an image sequence, but also 3D points and vanishing points. Only lines which correspond to a vanishing point are stored as landmarks in the sub-map. This discards a lot of small edge segments and selects mostly vertical and horizontal lines coming from the gross structure of buildings. The 3D points are reconstructed from KLT feature tracks [107] along the image frames. Localization is performed by building a local sub-map from a short image sequence and aligning it with the world map using an extension of ICP [8] which handles point and line features. The method uses omnidirectional images generated by a catadioptric camera-mirror system, together with a very original method to extract lines and vanishing points from such images. The authors demonstrated the mapping of large areas where the robot traversed several kilometers, and encountered loops were successfully closed. The method is not limited to planar movements but allows full 6 DOF localization.

Another example of localization from line features is the work by Goedeme et al. [39]. The approach uses a topological map, so localization does not provide a metric robot pose but works in the sense of place recognition. The line features used in this work differ from those of other approaches in that they do not originate from an edge detector. Instead, vertical lines are detected in a gradient image of the original image. For this, a gradient magnitude image is computed first using the Sobel operator. Then the image is processed column by column to detect line segments: a line feature is defined by the line segment between two local gradient magnitude maxima. For each detected line feature a descriptor based on viewpoint-invariant measures is computed. The description vector is of length 10 and combines color and intensity properties of the pixels of the line feature. Line features detected in this way are invariant to viewpoint changes, provided that the robot movements are restricted to a horizontal plane. Map building is described as an off-line batch process. A KD-tree is built to store the descriptors of the detected line features. Localization then proceeds by extracting line features from the current view and matching them with the map features in the KD-tree. The approach has been used for autonomous wheelchair navigation.
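A minimal sketch of the KD-tree lookup used for this kind of descriptor matching is given below, using SciPy's cKDTree; the random arrays stand in for the 10-dimensional line descriptors described above, and the rejection threshold is purely illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)

# Placeholder map: one 10-dimensional descriptor per stored line feature.
map_descriptors = rng.normal(size=(5000, 10))
tree = cKDTree(map_descriptors)

# Descriptors of the line features extracted from the current view.
query_descriptors = rng.normal(size=(40, 10))

# Nearest map feature for every query; keep only sufficiently close matches.
dist, idx = tree.query(query_descriptors, k=1)
matches = [(q, int(m)) for q, (d, m) in enumerate(zip(dist, idx)) if d < 2.5]
```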

Further examples of model-based localization using a CAD model are described in [2, 17, 110, 114]. In the work of Tsubouchi and Yuta [110], color information is used as an additional cue. The system developed by Vincze et al. [114] deals with robot navigation within a ship structure. The work of Folkesson et al. [29] uses lines detected on the ceiling for localization. Lines extracted from images of a panoramic stereo sensor are used by Yuen and MacDonald [118]. In [82] Neira et al. describe how to build a stochastic map from line features.

2.4 Localization from plane features

Only few approaches exist that use plane features in their world representation. The identification of plane features requires algorithms which are more complicated than those for the detection of line or point features. Furthermore, outdoor scenes often do not contain many scene planes, so such approaches would be restricted to indoor scenarios.

One of the methods proposed so far has been developed by Hayet et al. [45]. The map contains planar landmarks such as posters, doors, windows etc. The landmarks are represented by their contour, and for localization the robot pose can be computed from the 3D contour in the map and the extracted 2D contour of the landmark in the current view. For their approach they use a single camera mounted on a pan-tilt unit; active vision, however, is not the main focus of their approach. The key concepts are:

• Detection of planar quadrangular visual landmarks

• Map building using a laser range finder and stereo reconstruction

• A visibility map

The choice of planar landmarks seems very suitable for indoor environments such as offices: posters or paintings attached to walls provide reliable landmarks. However, only quadrangular landmarks are selected; this restriction eases the detection process. Landmark detection is based on perceptual grouping of edge segments. First, edge detection is applied to the images and grouping is applied to obtain connected edge segments. Then combinations of edge segments are searched which fulfill the necessary constraints of a perspective projection of a quadrangular landmark. After identification, the landmark is normalized so that it can be stored invariantly in a map database. In a first step the landmark is rectified to a quadrangular area of fixed size by applying a homography transform. This representation is invariant to scale and viewpoint change. A describing feature vector is extracted from this normalized representation; two approaches are proposed. In the first approach, Harris corners [40] detected in the image patch are used as descriptor. Landmark matching can then be done by computing the partial Hausdorff distance [49] between a landmark in the map and a landmark detected in the current view. In addition to the Hausdorff distance, Hayet et al. propose to incorporate the gray-level information as a feature vector too. The second representation approach uses Principal Component Analysis to compute a representative feature vector. Using these landmarks, map building is described as an off-line approach. For map building, a robot equipped with a camera and a laser range finder is steered through the environment. Planar landmarks are detected by the previously described method. For each landmark, a 3D reconstruction of the contour is performed from two successive frames. The reconstructed landmark is put into the global coordinate system by using the robot pose information from the laser range finder. Robot localization then works by matching planar landmarks detected in the current view with the map landmarks. For a matched landmark, the four corner points are used to create 3D-2D point correspondences, and the pose of the robot is computed from these four 3D-2D point correspondences using the planar P4P method and a subsequent iterative refinement [45]. A key concept of this approach is the visibility map for localization. It is assumed that a path-planning process defines a trajectory in the world coordinate frame. According to this path, the best suited landmarks are selected for the different sections of the path. Localization is performed in this approach from a single landmark; the active camera is used to keep the landmarks selected from the visibility map in the field of view. The restriction to quadrangular landmarks, however, strongly limits the set of potential landmarks, even in indoor environments. Moreover, the localization algorithm relies on the four corner points: a single occluded corner point renders the whole landmark invalid. For practical applications these constraints will be too rigid.
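The rectification step used above, warping a detected quadrangle to a fixed-size patch, can be sketched with a single homography. The patch size, the corner ordering and the synthetic test image are assumptions of this sketch, not details taken from [45].

```python
import numpy as np
import cv2

def rectify_quad(image, corners, size=64):
    """Warp the quadrangular landmark delimited by four image corners (ordered
    top-left, top-right, bottom-right, bottom-left) to a size x size patch."""
    src = np.asarray(corners, dtype=np.float32)
    dst = np.float32([[0, 0], [size - 1, 0], [size - 1, size - 1], [0, size - 1]])
    H = cv2.getPerspectiveTransform(src, dst)           # 3x3 homography
    return cv2.warpPerspective(image, H, (size, size))  # normalized patch

# Synthetic image with a bright quadrilateral and hypothetical corner detections.
img = np.zeros((480, 640), np.uint8)
quad = np.array([[210, 160], [390, 140], [410, 310], [190, 290]], dtype=np.int32)
cv2.fillConvexPoly(img, quad, 255)
patch = rectify_quad(img, [(210, 160), (390, 140), (410, 310), (190, 290)])
```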

A different SLAM approach using plane features has been proposed by Molton et al. [77]. The pose is computed by image alignment between learned and detected landmarks.

2.5 Summary

The focus of the state-of-the-art presented in the last sections was on methods which allow the construction of a world map and global localization therein. Both SLAM approaches and batch approaches were covered. A SLAM approach allows incremental map building during operation, whereas in a batch approach a map has to be created prior to robot operation. This does not mean that the map building could not be automatic. A mobile robot navigating with a laser range finder could traverse the environment and acquire image data; the world map could then be constructed off-line from the image data. Afterwards this map could be used for localization and navigation by other robots equipped only with a digital camera. This makes sense as digital cameras are much cheaper than a laser range finder and the robots which operate later on only need a digital camera. In general a SLAM approach is more challenging to develop, but in most cases it would be possible to extend existing batch versions to full SLAM approaches. The approaches proposed by Se et al. [96], Karlsson et al. [56] and Davison et al. [21] are SLAM approaches; the others are batch approaches. The approach of Kosaka and Kak [59] even requires a manually constructed CAD model of the world. All of the presented SLAM approaches use a metric map containing 3D point features. Extraction and reconstruction of point landmarks can be done very fast. Other landmarks such as lines, planes or vanishing points require more complex detection and reconstruction methods and might not be applicable for the real-time updating of the landmarks required in a SLAM framework. Until now the influence of the map features on navigation and path planning has been neglected; however, it is worth discussing. For path planning the robot has to know which locations are obstructed and which are free. Clearly, this information should be provided by the world representation. The described SLAM approaches construct sparse world representations consisting of distinct point features only. A path planning algorithm does not know whether the space between the landmarks is occupied or not, which is clearly a bad situation for navigation. The situation is similar for maps based on line features: most of the line features will be vertical, so the situation is not very different from that of point features. In contrast, planar landmarks have a spatial extension in 3D. A planar landmark projected onto the ground floor appears as a line and provides more information about obstructing objects than point and line features.

The discussed methods also differ strongly in the camera systems used. The approach described in [96] uses a fixed stereo head. It allows 3D reconstruction of landmarks from a single location, which eases map building enormously; it also simplifies localization to the computation of a rigid transformation between two sets of matched landmarks. However, equipping a robot with a stereo head is costly and in any case more expensive than using a single camera. In [56] only a single camera is used. Landmark reconstruction is thus not possible from a single location; instead, the robot has to use three images acquired from different positions. For localization, the pose has to be computed from 3D-2D landmark correspondences, which, however, can be done very efficiently. A stereo head is also used in the approach described in [21]. In their work it is an active stereo head, i.e. the cameras can be moved independently, which allows an original method for landmark reconstruction and localization by triangulation. The method described in [10] even uses an omnidirectional sensor. This complicates map building, and the approach described is a batch method; however, one gains a lot of benefits from having a 360° field of view. Nevertheless, a mobile robot which needs to be equipped with only a single standard camera would be most preferable, for reasons of cost and simplicity.

Another major difference between the compared methods lies in the pose representation. In [56, 59, 96] the robot's pose is only represented in 2D, by the triple (x, y, θ), where x and y are the position and θ is the heading of the robot. It is a quite common assumption that the mobile robot is restricted to move in a horizontal plane. A much more general representation, however, is a full 6 DOF representation by a 3D translation and a 3D rotation. This would allow ramps and different height levels to be contained in the world map.

The characteristics of the reviewed methods are summarized in Table 2.1. Based on this review of the current state-of-the-art we can also identify the following main deficiencies:

6 DOF pose: A lot of approaches for mobile robot localization simply assume that the robot is moving on a horizontal plane only. Imposing such restrictions simplifies the localization algorithms as only a 3 DOF pose has to be estimated. Clearly the more general 6 DOF pose representation is favorable as it allows the robot to operate on different height levels or move onto ramps etc. For outdoor environments the horizontal plane assumption does not hold anyway.

Single camera solution: Many systems proposed so far use specialized camera setups, like stereo setups, active stereo setups or even trinocular camera systems. The use of advanced imaging devices certainly eases many tasks. However, specialized hardware is expensive and often more prone to malfunctions. If robotic systems are to be deployed in domestic environments, e.g. as service robots, the cost factor cannot be neglected and therefore the cheaper single camera solution is needed.

Landmark correspondence problem: The correspondence problem is well known in the computer vision community and it is known to be hard. Detecting landmark correspondences is also one of the most important issues in robot localization. Recent advances in wide-baseline stereo already allow efficient and reliable landmark matching (see [56, 96]). However, still a high number of false matches are produced, carrying the potential to compromise the localization algorithm. Any new method which provides more reliable landmark matching will therefore increase the overall localization performance.

Localization despite large occlusions: Occlusions of the robot's view will occur frequently if the robot is operating in a crowded environment. The view onto landmarks will therefore quite often be limited. Localization algorithms should therefore be capable of computing an accurate pose from only a minimal number of detected landmark matches. The methods described in [56, 96] require about 10-20 landmark matches for a reliable pose estimate, quite a high number to be met in a crowded and heavily occluded environment.

Automatic map interpretation: Automatic map interpretation is a necessity to allow mobile robots to interact autonomously with the world and to carry out more complex tasks than vacuum cleaning. Nowadays systems can already get confused by a simple door. Assume that the mobile robot maps a room with an open door. In the map this will be reflected as an opening to traverse. Imagine that the other day the robot is heading towards the door and finds it closed. A simple localization algorithm will believe in a false position estimate. If however the robot knows about the functionality, it can reason that the door has an open and a closed state and thus does not get confused. A well working service robot needs to know even more about the environment: the names of the objects, the functionalities of the objects, which objects are movable, etc. Clearly this goes hand in hand with research in object recognition, but it should be considered how the world representation of a mobile robot can support achieving this goal.

Authors               | World map                | Sensor                     | Map features                                        | Landmark matching | Map building | Global localization (# landmarks*)   | Pose representation
----------------------|--------------------------|----------------------------|-----------------------------------------------------|-------------------|--------------|--------------------------------------|---------------------
Se, Lowe, Little [96] | sparse metric            | stereo                     | 3D points + SIFT                                    | feature matching  | SLAM         | triangulation, map alignment (>= 10) | 2D (3 DOF)
Karlsson et al. [56]  | sparse metric            | monocular                  | 3D points + SIFT + appearance                       | feature matching  | SLAM         | 3D-2D (>= 4)                         | 2D (3 DOF)
Davison et al. [21]   | sparse metric            | active stereo              | 3D points                                           | correlation       | SLAM         | triangulation (>= 3)                 | 3D (6 DOF)
Bosse et al. [10]     | sparse metric            | omnidirectional            | 3D points + 3D lines + vanishing points             | nearest neighbor  | batch        | map matching (approx. 30)            | 3D (6 DOF)
Goedeme et al. [39]   | topological              | monocular, omnidirectional | 2D lines + color descriptor + intensity descriptor  | feature matching  | batch        | line matching and voting             | topological location
Kosaka et al. [59]    | sparse metric, CAD model | monocular                  | 3D lines                                            | nearest neighbor  | manual       | -                                    | 2D (3 DOF)
Hayet et al. [45]     | sparse metric            | monocular                  | quadrangular 3D planes + PCA descriptor             | feature matching  | batch        | 3D-2D (1)                            | 3D (6 DOF)

Table 2.1: Main characteristics of the reviewed literature approaches. (* number of landmark matches necessary for robust pose estimation)


Chapter 3

Local detectors

Research on local detectors can be dated back to 1977, when Hans Moravec described an interest operator which is today known as the Moravec operator [79]. In [80] Hans Moravec described obstacle avoidance and navigation for a mobile robot. He used his interest operator to detect interest points in stereo image pairs and in images from different viewpoints, using them as features to build a 3D map of the environment. Feature matching was achieved by correlation of 6 × 6 pixel image patches around the detected feature locations. The Moravec operator is based on the auto-correlation function, that is, it measures the gray-level difference between a window and a shifted window in four directions. Calculating the sum of squared differences in the window gives a measure for every shift. The values are high if the gray-level variance is high (textured region) and low if the gray-level variance is low (e.g. a homogeneous region). If the measures for every direction are high, the pixel location is a good candidate for an interest point, and the smallest measure is used as a quality measure for the interest point. In most cases the detected locations lie on edges and corners, where already a small shift causes a difference. An obvious deficiency, however, is the anisotropic behavior caused by using only a discrete set of shifts. This basic idea was carried on, leading to the well known Harris corner detector [40]. The idea was re-formulated using the structure tensor [9] and the second moment matrix respectively, leading to different variants of corner detectors [30, 61, 91, 107]. Other approaches [7, 57] use the second derivatives (Hessian matrix [115]) instead of the first derivatives. All these approaches can be considered as belonging to one class of simple interest point detectors. They all have in common that they detect a location only. That means that for a subsequent task like image matching via cross-correlation, the size of the necessary matching window has to be chosen independently. This limitation shows up when dealing with images that exhibit a scale change: although the detector might be able to detect the corresponding location, the correlation window will not contain the same gray-values and the matching will fail.

This limitation was addressed by estimating a proper scale for every detected interest point. With this information the scale of the matching window can be normalized and cross-correlation works again. The first work going in this direction was done by Tony Lindeberg [64] in 1998. Other approaches followed shortly by David Lowe [66] and Krystian Mikolajczyk [72]. This class of interest operators is usually called scale-invariant interest operators.

However, research went one step further. Following the success of interest operators invariant to scale change, methods were sought to create interest operators invariant to a larger class of image transformations. This was driven mostly by developments in wide-baseline image matching, where significant perspective distortions occur. Research therein led to a new class of interest detectors, the affine-invariant detectors. In most cases such a detection consists of a point location and an elliptical delineation of the detection. The ellipse representation captures the affine transformation of the detection; by normalizing the ellipse to a unit circle the affine transformation can be removed. This method was first suggested in 2000 by Baumberg et al. [6] and led to a wide variety of affine-invariant detectors [53, 70, 73, 112]. The common property of these approaches is that they provide information about how the region around the detection can be normalized to allow image matching. The detections themselves, however, may no longer be simple point locations. In the case of the MSER detector [70] a detection is a whole image region showing similar gray-values. Approaches like that are usually referred to as distinguished region detectors; moreover, every affine detector defines its own support region too. Thus the term 'local detector' emerged, which stands for simple point detectors as well as region detectors.

3.1 Interest point detectors

Interest point detectors are equivalent to corner detectors. A corner point shows strong intensity change in both the x and y direction; see Figure 3.1 for an example. Such corner points can easily be detected by examining the gradients in the x and y direction.

Figure 3.1: Corner point showing strong intensity change in x and y direction. (Image adapted from [87])

When speaking of an interest point one usually means the x and y coordinates of a corner point; the interest point is defined only by its position. When using interest points for feature matching, a description of a certain window around the interest point position has to be computed. An interest point detector, however, does not define the size and shape of such a window. Let us call such a window the 'measurement region'. We will see that other local detectors, which will be described later on, are able to define different types of measurement regions. For now, let us re-state that an interest point detector defines a position, but no measurement region. Let us now look at the details of two popular interest point detectors, the Harris detector and the Hessian detector.

3.1.1 Harris detector

The Harris detector is probably the best known and most widely used interest point and corner detector. It is an extension of the Moravec operator [79] and dates back to 1988 [40]. The Moravec operator calculates the auto-correlation function (that is, the gray-level difference between a window and a shifted window) in four directions. The auto-correlation function will be high if the gray-level variance is high (textured region) and low if the gray-level variance is low (e.g. a homogeneous region). If the measures for every direction are high, the pixel location is a good candidate for an interest point. This idea is carried over to the Harris corner detector, but the anisotropic behavior caused by using only a discrete set of shifts is extended to an isotropic formulation. This is done by a first-order Taylor-series expansion of the auto-correlation function; to cope with image noise, Gaussian filtering is applied as well. Written in matrix form, the resulting value of the auto-correlation E for a small shift (x, y) is

matrix <strong>for</strong>m the resulting value of the auto-correlation E <strong>for</strong> a small shift (x, y) is<br />

E(x, y) = (x, y)M(x, y) T (3.1)<br />

where M is the 2 × 2 matrix<br />

[<br />

M = exp − x2 +y 2<br />

2σ 2 ⊗<br />

( ∂I<br />

∂x )2<br />

( ∂I ∂I<br />

∂x<br />

)(<br />

∂y )<br />

( ∂I ∂I<br />

∂x<br />

)(<br />

∂y )<br />

]<br />

( ∂I<br />

∂y )2<br />

[<br />

= exp − x2 +y 2 I<br />

2<br />

2σ 2 ⊗ x I x I y<br />

I x I y<br />

I 2 y<br />

]<br />

. (3.2)<br />

I(x, y) is the gray-level intensity of an image I at position (x, y), and \exp\left(-\frac{x^2+y^2}{2\sigma^2}\right) \otimes denotes convolution with a 2D Gaussian filter with some predefined σ. The matrix M is computed for every pixel location in the image I, and from M a cornerness measure for every pixel location is computed. Harris and Stephens defined the following cornerness measure R:

R = \det M - k\,(\operatorname{trace} M)^2 \qquad (3.3)

R is often also denoted as the 'corner response'. The scalar factor k is set to 0.04, a value defined by experimental validation. A positive value of R characterizes a corner: the higher the value of R, the stronger the corner. A small value close to zero denotes a homogeneous image region. The value of R can be negative as well, in which case it indicates an edge, i.e. a pixel location with R < 0 is an edge pixel. Figure 3.2(a) shows example detections; the interest points are marked with yellow crosses.

The original Harris corner algorithm computes the partial derivatives in x and y direction by simple difference computation. The gradients are computed by convolution with the following kernels:

\frac{\partial I}{\partial x} = I(x, y) \otimes (-1, 0, 1) \qquad (3.4)

\frac{\partial I}{\partial y} = I(x, y) \otimes (-1, 0, 1)^T \qquad (3.5)

By computing the gradients with Gaussian derivatives, Schmid et al. reported a significant improvement in robustness and stability [95]. The choice of the standard deviation σ for the Gaussian filters is also very important, as the corner response differs strongly for different values of σ. The parameter σ can be seen as a scale parameter. For large values of σ only strong corners will be detected; for small values of σ smaller corners will be detected as well, and usually a small σ leads to multiple close-by detections. Non-maxima suppression should therefore be performed, which reduces the number of nearby detections also in the case of small values of σ.
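The Harris pipeline described so far fits in a few lines of NumPy/SciPy. The sketch below follows Equations (3.1)-(3.5): difference-kernel gradients, Gaussian smoothing of the structure tensor entries, the cornerness measure R, and a simple 3 × 3 non-maxima suppression. The parameter values and the response threshold are illustrative, and this is not the implementation used in this thesis.

```python
import numpy as np
from scipy.ndimage import convolve, gaussian_filter, maximum_filter

def harris_response(image, sigma=1.5, k=0.04):
    """Cornerness R = det(M) - k * trace(M)^2 of Eq. (3.3), with M built from
    Gaussian-smoothed products of the difference-kernel gradients."""
    img = image.astype(float)
    ix = convolve(img, np.array([[-1.0, 0.0, 1.0]]))      # dI/dx, Eq. (3.4)
    iy = convolve(img, np.array([[-1.0], [0.0], [1.0]]))  # dI/dy, Eq. (3.5)
    ixx = gaussian_filter(ix * ix, sigma)                 # entries of M, Eq. (3.2)
    iyy = gaussian_filter(iy * iy, sigma)
    ixy = gaussian_filter(ix * iy, sigma)
    det = ixx * iyy - ixy ** 2
    trace = ixx + iyy
    return det - k * trace ** 2   # R > 0: corner, R < 0: edge, |R| ~ 0: flat

def harris_corners(image, sigma=1.5, k=0.04, thresh=1e6):
    """Corner locations after 3x3 non-maxima suppression (the threshold is
    image-dependent and purely illustrative)."""
    r = harris_response(image, sigma, k)
    peaks = (r == maximum_filter(r, size=3)) & (r > thresh)
    return np.argwhere(peaks)     # (row, col) coordinates of detected corners
```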

The matrix M is also known as the structure tensor from the work of Bigün [9]. This relation allows a different interpretation of the Harris corner measure in terms of the eigenvalues of the structure tensor, which gives significant insight into the properties of the detector. The reader is referred to [87] and [26] for details. Besides the Harris corner measure, a vast variety of detectors based on the structure tensor exists; a good overview can be found in [87].



3.1.2 Hessian detector

The Hessian detector is very similar to the Harris detector. Instead of the structure tensor, the Hessian matrix is computed to identify corners:

H = \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right) \otimes
\begin{bmatrix}
\frac{\partial^2 I}{\partial x^2} & \frac{\partial^2 I}{\partial x \partial y} \\
\frac{\partial^2 I}{\partial x \partial y} & \frac{\partial^2 I}{\partial y^2}
\end{bmatrix}
= \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right) \otimes
\begin{bmatrix}
I_{xx} & I_{xy} \\
I_{xy} & I_{yy}
\end{bmatrix} \qquad (3.6)

As a measure for interest points, the determinant det(H) of the Hessian matrix is used:

\det(H) = I_{xx} I_{yy} - I_{xy}^2 \qquad (3.7)

This measure was first introduced by Beaudet [7] in 1978. Figure 3.2(b) shows example detections; the interest points are marked with yellow crosses.
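A corresponding determinant-of-Hessian response (Equation (3.7)) can be sketched as follows. In this sketch the second derivatives are computed as Gaussian derivatives at scale sigma rather than with plain difference kernels, which is an implementation choice of the sketch, not a detail taken from the detectors above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def hessian_response(image, sigma=1.5):
    """Determinant-of-Hessian measure of Eq. (3.7); large positive values
    indicate corner/blob-like structures."""
    img = image.astype(float)
    ixx = gaussian_filter(img, sigma, order=(0, 2))  # d2I/dx2
    iyy = gaussian_filter(img, sigma, order=(2, 0))  # d2I/dy2
    ixy = gaussian_filter(img, sigma, order=(1, 1))  # d2I/dxdy
    return ixx * iyy - ixy ** 2
```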

Another measure based on the Hessian matrix has been proposed by Kitchen and Rosenfeld [57]. The measure K is defined as

K = \frac{I_{xx} I_y^2 + I_{yy} I_x^2 - 2 I_{xy} I_x I_y}{I_x^2 + I_y^2}. \qquad (3.8)

Recently it has been shown by Mikolajczyk [75] that the Hessian matrix can be used for scale selection. It will be described in detail in the next section.

Figure 3.2: Detection examples for interest point detectors on the "Group" scene. (a) Harris detector. (b) Hessian detector.

3.2 Scale invariant detectors

Scale invariant detectors are interest point detectors which additionally define a circular measurement region. In addition to x and y, a third parameter s, the size of the circular measurement region, is found by the detectors. The important point is that the measurement region is found identically in two images where one is a scaled version of the other. That means that if an image is reduced in size by a factor of 2, then the measurement region for the same interest point in the smaller image is half the size of the one in the original; thus the detectors are called scale invariant. This is illustrated in Figure 3.3. This property is important for feature matching: it is now possible to normalize detections so that the feature vector is always computed from a support region of identical size. Normalization means a scale transformation of one of the measurement regions to the size of the other. Feature vectors computed from normalized patches ease correspondence detection enormously and allow much more complicated situations to be handled than when using simple interest point detectors. In the following we describe four different scale invariant detectors.

Figure 3.3: Example for scale-invariant Harris detector. The left image shows a detection with scale estimate on the original image. The right image shows the detection on a smaller version of the image (60% size of original). The scale is estimated so that the same image region as in the original is selected.

3.2.1 Scale-invariant Harris detector

The scale-invariant Harris detector has been proposed by Mikolajczyk et al. [72]. It detects interest points with an associated circular measurement region around the center. The scale of the measurement region is geometrically stable, that means applying the detector to a re-scaled version of the image produces a detection with identical (but re-scaled) image content within the measurement region. The detection is a two-step process. First, Harris corners are detected on multiple scales. For this a scale-adapted Harris detector is used. In a second step a characteristic scale for each Harris corner is identified. The characteristic scale directly determines the size of the resulting measurement region. Extrema of the Laplacian-of-Gaussian are used to detect the characteristic scale of an interest point. Thus the detector is also known as Harris-Laplace detector.

A necessity for the first step is the scale-adapted Harris detector. The original Harris detector [40] is not invariant to scale change. To overcome this, the authors of [72] propose a combination with the automatic scale selection described by Lindeberg [64]. The combination leads to the scale-adapted second moment matrix. The second moment matrix describes the gradient distribution in a local neighborhood of a point and is the basis for corner detection with the Harris method. The scale-adapted second moment matrix is defined by:

\[ M(\mathbf{x}, \sigma_i, \sigma_d) = \sigma_d^2 \, g(\sigma_i) \otimes \begin{bmatrix} I_x^2(\mathbf{x}, \sigma_d) & I_x I_y(\mathbf{x}, \sigma_d) \\ I_x I_y(\mathbf{x}, \sigma_d) & I_y^2(\mathbf{x}, \sigma_d) \end{bmatrix} \tag{3.9} \]



g(σ_i) is a 2-dimensional Gaussian kernel with standard deviation σ_i, and σ_d is the differentiation scale. The local derivatives are computed with Gaussian derivatives, and the differentiation scale σ_d determines the size of the Gaussian filter. σ_i is the so-called integration scale and determines the size of the Gaussian window which is used for smoothing the gradients in the local neighborhood. The Harris measure for the scale-adapted second moment matrix is now defined by:

\[ R = \det M(\mathbf{x}, \sigma_i, \sigma_d) - k \, \mathrm{trace}^2 M(\mathbf{x}, \sigma_i, \sigma_d) \tag{3.10} \]

By computing the second moment matrix and the cornerness measure R for different values of σ_i and σ_d, a scale-space representation of Harris corners can be established. The authors propose in [75] to compute a scale-space representation for pre-selected scales σ_n = ξ^n σ_0, where ξ is the scale factor between successive levels. In [64] Lindeberg suggests ξ = 1.4. The integration scale σ_i for computation of the second moment matrix is set to σ_i = σ_n. The differentiation scale σ_d is set to σ_d = sσ_n = sσ_i, where s is a constant factor; s is set to 0.7 in [75]. This couples the integration scale and the differentiation scale by a multiplicative scalar factor. Harris corners are finally identified by thresholding the cornerness value R and non-maxima suppression in an 8-neighborhood for every scale level.

The next step in the algorithm is the detection of a characteristic scale for the Harris detections. In the previous step Harris corners were independently detected on each scale level. Now for each such detection a characteristic scale is estimated using the Laplacian-of-Gaussian (LoG) function. A characteristic scale is determined by a local maximum of the following function:

\[ |LoG(\mathbf{x}, \sigma_n)| = \sigma_n^2 \, |L_{xx}(\mathbf{x}, \sigma_n) + L_{yy}(\mathbf{x}, \sigma_n)| \tag{3.11} \]

For every detected point location the function is evaluated over all available scales. The characteristic scale corresponds to the local maximum. If more than one local maximum exists, multiple characteristic scales are assigned to the detection. Besides the LoG other functions would be possible; however, an evaluation in [72] revealed that the LoG performs best. If an evaluated point location does not show a LoG maximum, or if the response is below a threshold, the point will be discarded. All the other detections are reported as results of the detector, where the characteristic scale directly determines the size of the measurement region in pixels. In [72] the radius of the measurement region in pixels is 2.8σ_n. The performance of the Harris-Laplace detector is evaluated very thoroughly in [75] and compared to other methods. Figure 3.4(a) shows example detections for the Harris-Laplace method. Each detection is visualized by the center point (yellow cross) and the characteristic scale drawn as a circle around the center point.
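The two steps could be sketched as follows; the code combines the scale-adapted Harris measure of Eq. (3.10) with the LoG scale selection of Eq. (3.11). It is a simplified illustration with assumed parameter values, not the reference implementation of [72, 75].

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def scale_normalized_log(img, sigma):
    """|LoG| response sigma^2 |L_xx + L_yy| used for characteristic-scale selection."""
    Lxx = gaussian_filter(img, sigma, order=(0, 2))
    Lyy = gaussian_filter(img, sigma, order=(2, 0))
    return sigma ** 2 * np.abs(Lxx + Lyy)

def harris_laplace(img, sigma0=1.0, xi=1.4, n_levels=8, s=0.7, k=0.04, thresh=0.01):
    """Sketch of Harris-Laplace: multi-scale Harris corners plus LoG scale selection."""
    img = img.astype(np.float64)
    sigmas = [sigma0 * xi ** n for n in range(n_levels)]
    log_stack = np.stack([scale_normalized_log(img, s_n) for s_n in sigmas])
    keypoints = []  # entries: (row, col, characteristic scale)
    for level, sigma_i in enumerate(sigmas):
        sigma_d = s * sigma_i
        Ix = gaussian_filter(img, sigma_d, order=(0, 1))
        Iy = gaussian_filter(img, sigma_d, order=(1, 0))
        # scale-adapted second moment matrix entries, cf. Eq. (3.9)
        Ixx = gaussian_filter(Ix * Ix, sigma_i) * sigma_d ** 2
        Iyy = gaussian_filter(Iy * Iy, sigma_i) * sigma_d ** 2
        Ixy = gaussian_filter(Ix * Iy, sigma_i) * sigma_d ** 2
        R = Ixx * Iyy - Ixy ** 2 - k * (Ixx + Iyy) ** 2
        peaks = (R == maximum_filter(R, size=3)) & (R > thresh * R.max())
        for r, c in np.argwhere(peaks):
            # keep the corner if the LoG attains a local maximum over scale at this level
            lo, hi = max(level - 1, 0), min(level + 1, n_levels - 1)
            if log_stack[level, r, c] == log_stack[lo:hi + 1, r, c].max():
                keypoints.append((r, c, sigma_i))
    return keypoints
```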

3.2.2 Scale-invariant Hessian detector

The scale-invariant Hessian detector is very similar to the previously described Harris-Laplace detector. It is also known as Hessian-Laplace detector and described in [71]. The detection algorithm is basically identical to the Harris-Laplace detector, with the only exception that the initial interest points are identified with the Hessian matrix instead of the second moment matrix. As cornerness measure the determinant of the Hessian matrix is used. The Hessian-Laplace detector produces very similar results to the Harris-Laplace detector, which is not very surprising as the algorithms are almost identical. Example detections for the Hessian-Laplace method are shown in Figure 3.4(b). Each detection is represented by the center point (yellow cross) and the characteristic scale drawn as a circle around the center point.



Figure 3.4: Detection examples for the scale invariant Harris and Hessian detectors on "Group" scene. (a) Harris-Laplace detector. (b) Hessian-Laplace detector.

3.2.3 Difference of Gaussian detector (DOG)

The Difference of Gaussian detector has been developed by David Lowe and was first presented in [66]. The DOG-keypoints were introduced in combination with a suitable descriptor, the SIFT-descriptor. This has led to quite a misconception, and often DOG-keypoints are called SIFT-keypoints. However, despite the fact that DOG-keypoints and SIFT-descriptor are often used in combination, each method also stands on its own, and DOG-keypoints should therefore not be reduced to SIFT-keypoints.

The essence of the DOG-detector is to find blob-like structures in a scale-space [117] created from the input image. This is done by computing the difference of Gaussians for multiple scales and searching for local extrema therein. The difference of Gaussians is a close approximation of the Laplacian of Gaussians investigated in [63]. The main reason for the use of the difference of Gaussians is computational efficiency. Moreover, most of the DOG-detection algorithm is designed for efficiency.

A scale-space of the image I is defined as a function L(x, y, σ). It is gained by convolution of a variable-scale Gaussian G(x, y, σ) with the image I(x, y):

\[ L(x, y, \sigma) = G(x, y, \sigma) * I(x, y). \tag{3.12} \]

G(x, y, σ) is a 2D Gaussian kernel with the scale parameter σ:

\[ G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} \exp\left(\frac{-(x^2 + y^2)}{2\sigma^2}\right) \tag{3.13} \]

The difference-of-Gaussian function D(x, y, σ) is now defined as

\[ D(x, y, \sigma) = L(x, y, k\sigma) - L(x, y, \sigma) \tag{3.14} \]



where k is a constant multiplicative factor. This means that D(x, y, σ) is simply the subtraction of two neighboring discrete scale-space representations of the image I. The scale-space for DOG detection is defined in the following manner. It consists of a pre-defined number of partitions, called octaves. Each new octave starts with a σ of double the value of the previous octave. Each octave is partitioned into a number of s discrete scale-space representations, where s is an integer number. With this condition the parameter k is defined as k = √2. For each octave the image I is re-sampled down to half of the size of the previous image. Re-sampling is done by simply selecting every other pixel of the image. This is done for computational efficiency. Doing the re-sampling every time σ doubles is consistent with scale-space theory. The difference of Gaussian function D(x, y, σ) is now produced by subtracting the neighboring scale-space slices within each octave. The next step after computation of D(x, y, σ) is the detection of local extrema therein. The extrema to be detected are the local minima and maxima of D(x, y, σ). Every pixel of the scale-space representation is checked for being an extremum of D(x, y, σ). If a pixel is an extremum then it is selected as a DOG-keypoint. If the extremum is located on one of the re-sampled octaves, the x and y coordinates in the original image scale have to be computed. The characteristic scale of the DOG-point is the value of the σ of the scale-space slice on which the extremum has been found. For extremum detection all 26 neighbor pixels in scale-space are investigated. The pixel is a local maximum if its value is higher than the values of all its neighbors, and it is a local minimum if it is smaller than all of its neighbors. The 26 neighbors are defined by an 8-connected neighborhood extended to scale-space: they consist of the 8 neighbors in the same slice, 9 neighbors on the upper scale level and 9 neighbors on the lower scale level. Point detection in such a way only gives detections with pixel accuracy. In a subsequent step, a sub-pixel keypoint localization is performed. This step ensures that keypoints are located exactly on corners or edges. To gain sub-pixel accuracy a 3D quadratic function is fitted to the local scale-space region. The keypoint is finally localized at the interpolated maximum or minimum of the quadratic function (for more details see [13]).
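A simplified single-octave sketch of the DOG construction and the 26-neighbor extremum test is given below; octave re-sampling, sub-pixel refinement and contrast filtering are omitted for brevity, and the parameter values are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def dog_extrema(img, sigma0=1.6, k=np.sqrt(2), n_slices=5):
    """Single-octave sketch: build DoG slices and find 3x3x3 scale-space extrema."""
    img = img.astype(np.float64)
    sigmas = [sigma0 * k ** i for i in range(n_slices)]
    gaussians = np.stack([gaussian_filter(img, s) for s in sigmas])
    dog = gaussians[1:] - gaussians[:-1]          # difference of neighboring slices
    # A pixel is an extremum if it is the maximum or minimum of its 26 scale-space neighbors
    is_max = dog == maximum_filter(dog, size=(3, 3, 3))
    is_min = dog == minimum_filter(dog, size=(3, 3, 3))
    candidates = []
    for s_idx, r, c in np.argwhere(is_max | is_min):
        if 0 < s_idx < dog.shape[0] - 1:          # require a slice above and below in scale
            candidates.append((r, c, sigmas[s_idx]))
    return candidates   # entries: (row, col, approximate characteristic scale)
```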

However, not all detected extrema are suited to finally act as keypoints. Detected extrema with low contrast are not well suited as keypoints. Scale-space extrema also tend to be located on edges; however, they are not well localized along the edge itself. A final filtering step eliminates such ambiguous detections. Edge responses are eliminated by eigenvalue analysis of the Hessian matrix H at the keypoint location. The process is very similar to corner detection using the Hessian matrix. The ratio of the two principal curvatures is computed and the keypoint is eliminated if one direction is significantly stronger than the other one. The ratio is approximated by the ratio of the squared trace to the determinant. If

\[ \frac{\mathrm{trace}(H)^2}{\det(H)} < \frac{(r + 1)^2}{r} \tag{3.15} \]

the location is accepted as DOG-keypoint, where r = 10 is a reasonable value for many situations.
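The edge-response test of Eq. (3.15) can be sketched as follows; in practice the second derivative images would of course be computed once per scale level rather than once per keypoint.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def passes_edge_test(img, r_idx, c_idx, sigma, r=10.0):
    """Reject keypoints on edges using the trace^2/det ratio of the 2x2 Hessian."""
    img = img.astype(np.float64)
    # second derivatives at the keypoint location (full images computed only for brevity)
    Dxx = gaussian_filter(img, sigma, order=(0, 2))[r_idx, c_idx]
    Dyy = gaussian_filter(img, sigma, order=(2, 0))[r_idx, c_idx]
    Dxy = gaussian_filter(img, sigma, order=(1, 1))[r_idx, c_idx]
    det_H = Dxx * Dyy - Dxy ** 2
    if det_H <= 0:                       # principal curvatures of opposite sign: discard
        return False
    return (Dxx + Dyy) ** 2 / det_H < (r + 1) ** 2 / r
```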

It is possible to implement the necessary steps of the DOG-detector very efficiently. The DOG-detector is therefore a candidate of choice if one wants to build a real-time system. Figure 3.6(a) shows examples for DOG-keypoints. Each keypoint is represented by the center point (yellow cross) and the characteristic scale drawn as a circle around the center point.



3.2.4 Salient region detector

The salient region detector has been proposed by Kadir and Brady [52]. As with the other scale invariant detectors, a location and a characteristic scale are detected. However, a major difference lies in the selection of the location. The goal is to detect salient image regions. Kadir and Brady propose as a measure for saliency the entropy of the gray-value distribution within an image region. The entropy H of an image region is defined by

\[ H = -\sum_i p(d_i) \log_2 p(d_i) \tag{3.16} \]

where p(d_i) is the probability of gray-value d_i in the image region. The values p(d_i) can be computed from the histogram of the image region. The histogram counts the frequency of occurrence of each gray-value, and the entropy can be computed from the normalized histogram counts. The goal is to select regions which show a distributed histogram. A distributed histogram indicates highly textured, thus salient regions. A peaked histogram indicates low texture and lots of similar gray-values. Figure 3.5 depicts examples for peaked and distributed histograms.

Figure 3.5: Example for peaked and distributed histograms. The image patch corresponding to the peaked histogram shows low texture. The distributed histogram corresponds to a highly textured region.

Peaked and distributed histograms can be distinguished by their entropy value H. Distributed histograms show a larger entropy value than peaked histograms. To detect salient regions the entropy is computed for different window sizes. Different window sizes lead to different histograms. Consider an image with a homogeneous background showing a textured object. Computing the histogram for the object yields a distributed histogram. If the window size for the histogram is increased, the window will contain more and more of the homogeneous background and the histogram will change from distributed to peaked. Such changes now indicate salient regions. In detail, a peak in the function H(w) indicates a salient region. The window size w of the peak in H(w) can be seen as the characteristic scale of the salient region.

The algorithm can be summarized as follows. First, compute the entropy value for multiple window sizes for every pixel location. Search for a peak in H(w) for every pixel location. Select the locations which show a peak in H(w). The selected locations can be stored as triplets ⟨x, y, s⟩, where x, y is the location in the image and s is the window size, or scale respectively. Each triplet corresponds to an entropy value computed at location x, y with window size s. For many pixel locations H(w) does not contain a peak but is monotonically increasing or decreasing. Such pixel locations will be discarded. The remaining triplets indicate salient regions; however, a sort of non-maximum suppression is needed. With clustering in x, y, s space, nearby detections are merged and the cluster centers are the resulting salient regions, each containing a position x, y and a scale s. Different to other detectors, an absolute saliency measure can be computed for each detection, which introduces an ordering of the detections. The saliency Y is computed by

\[ Y = H(w) W(w). \tag{3.17} \]

W(w) is a weight function which measures the inter-scale unpredictability. In simple words, it measures the gray-value difference between two adjacent scales. A scale step which produces a high gray-value difference is a measure for high saliency. The inter-scale unpredictability W(w) is defined as

\[ W(w) = w \int \left| \frac{\partial}{\partial w} p(d_i, w) \right| \mathrm{d}i. \tag{3.18} \]

W(w) can be computed practically as the absolute sum of differences between the histograms of adjacent scales, multiplied with the current window size. The absolute saliency measure Y allows limiting the detection to the n most salient features.
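A sketch of the entropy and saliency computation for a single pixel location follows; the histogram binning, the square window shape and the discrete approximation of W(w) are illustrative assumptions.

```python
import numpy as np

def saliency_over_scales(img, x, y, window_sizes, n_bins=32):
    """Sketch of the Kadir-Brady measure: entropy H(w) and Y(w) = H(w) * W(w)
    for square windows of half-width w around pixel (x, y)."""
    img = np.asarray(img, dtype=np.float64)
    hists = []
    for w in window_sizes:
        patch = img[max(y - w, 0):y + w + 1, max(x - w, 0):x + w + 1]
        h, _ = np.histogram(patch, bins=n_bins, range=(0, 256))
        hists.append(h / h.sum())                       # normalized histogram p(d_i)
    hists = np.array(hists)
    # Entropy per window size; empty bins contribute 0 (0 * log 0 := 0)
    with np.errstate(divide="ignore", invalid="ignore"):
        H = -np.sum(np.where(hists > 0, hists * np.log2(hists), 0.0), axis=1)
    # Inter-scale unpredictability: histogram change between adjacent sizes, times w
    W = np.array([w * np.abs(h1 - h0).sum()
                  for w, h0, h1 in zip(window_sizes[1:], hists[:-1], hists[1:])])
    Y = H[1:] * W
    return H, Y   # a peak in H(w) together with a large Y marks a salient region / scale
```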

In Figure 3.6(b) examples for detected salient regions are shown. Each salient region is visualized by the center point (yellow cross) and the characteristic scale drawn as a circle around the center point.

The entropy need not necessarily be computed from the image gray-values. The method can also be applied to directional data as produced by an edge detector, as shown in [52]. The method can readily be applied to a lot of different descriptors, which gives it great versatility. For example, in [31] salient regions are detected by computing the entropy on the cornerness value of the Harris detector.

3.2.5 Normalization

The additional scale information of the scale-invariant interest detectors can be used for normalization. The scale parameter defines a circular measurement region around the detection. This is important for image matching, where corresponding detections are searched. In most cases matching works by extracting an image descriptor from the measurement region of the detection or by area-based correlation of the measurement regions. Scale changes are therefore a big problem for matching. This can be overcome by normalization of the detection. By knowing the characteristic scale of the detection, the measurement region can be re-sampled to a fixed canonical size which is used for correlation or descriptor extraction. Re-sampling and interpolation, however, have to be performed carefully. Best results are obtained if the bigger measurement region is downsized to fit the smaller one. If the scale change is high, downsizing according to scale-space theory (Gaussian filtering and re-sizing) should be performed. Normalizing the measurement region in the described way is possible for all previously described detector methods.

Figure 3.6: Detection examples for scale invariant DOG and salient region detector on "Group" scene. (a) DOG detector. (b) Salient region detector.

In addition to scale normalization, Lowe describes a method to normalize the DOG-keypoints for an arbitrary rotation [66]. The method works by computing a histogram of the gradients which occur in the measurement region. A histogram with 36 bins covering the 360° is reported to give good results. The histogram maximum defines the principal orientation of the detection. If the histogram contains multiple equally strong local maxima, the detection gets assigned multiple orientations. To compensate for the low resolution of the histogram (every bin accounts for 10°), a parabola is fit to the maximum and its neighbors and the interpolated peak of the parabola is used as the principal orientation. This principal orientation can now be used in the re-sampling stage of the normalization to rotate the detection into a canonical orientation. This orientation estimation has been successfully used for the DOG-keypoints. However, the method can also be applied without restrictions to all of the previously described scale-invariant interest detectors.
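A sketch of this orientation assignment on a normalized patch, including the parabolic refinement of the histogram peak, is given below. The 80% acceptance rule for secondary peaks is an assumption borrowed from common practice, not something stated above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def principal_orientations(patch, n_bins=36, peak_ratio=0.8):
    """Gradient-orientation histogram of a normalized patch; returns the dominant
    angles in degrees, each refined by a parabola fit through the peak bin."""
    patch = patch.astype(np.float64)
    gx = gaussian_filter(patch, 1.0, order=(0, 1))
    gy = gaussian_filter(patch, 1.0, order=(1, 0))
    magnitude = np.hypot(gx, gy)
    angle = np.degrees(np.arctan2(gy, gx)) % 360.0
    hist, _ = np.histogram(angle, bins=n_bins, range=(0, 360), weights=magnitude)
    orientations = []
    for i in np.flatnonzero(hist >= peak_ratio * hist.max()):
        left, right = hist[(i - 1) % n_bins], hist[(i + 1) % n_bins]
        if hist[i] < left or hist[i] < right:   # keep only local maxima of the histogram
            continue
        # parabolic interpolation of the peak position between the three bins
        denom = left - 2 * hist[i] + right
        offset = 0.0 if denom == 0 else 0.5 * (left - right) / denom
        orientations.append(((i + 0.5 + offset) * 360.0 / n_bins) % 360.0)
    return orientations
```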

3.3 Affine invariant detectors

Affine invariant detectors are designed to produce repetitive detections despite an arbitrary affine transformation of the image. For an example see Figure 3.7. Figure 3.7(a) shows a single detection on the original image. Figure 3.7(b) shows the detection on an affine transformed version of the original. The affine transformation contains a rotation of 30°, an axis shear of 10°, and a scale change in x and y of 90% and 80% respectively. Despite the distortion the center point is detected repetitively and the ellipse detected in the transformed image covers the same content as in the original image. Affine invariant detectors were developed to cope with heavy perspective distortions in wide-baseline scenarios where large view-point changes occur. The effect of a perspective distortion can be approximated locally by an affine transformation. Locally means in this respect a small area (measurement region) around an interest point detection. The affine invariant detectors presented in the following are based on different concepts. The affine-invariant Harris and Hessian detectors are based on simple interest point detectors and search for an affine invariant measurement region for each point detection. Another method, the MSER detector, finds homogeneous, characteristically delineated image regions with a method not affected by an affine transformation. However, independent of the detection method, each detector detects a measurement region and a characteristic point within the measurement region. The shape of the measurement region may vary between methods. Some detect elliptical measurement regions, others detect measurement regions of arbitrary shape; however, every method finds an affine transformation to transform the measurement region into a normalized canonical coordinate frame. In the case of an elliptical measurement region the computed affine transformation transforms the ellipse into a circular region of unit size. The affine transformation removes the different scaling in the two principal directions and the shear. What remains, however, is an arbitrary rotation between the canonical representations. A different normalization scheme can be used for the MSER detector, where a characteristic outline of the region is detected. Here the normalization works by creating a so-called local affine frame (LAF). Such a normalization also accounts for the rotation. In the following the most prominent methods are described in detail.

Figure 3.7: Example for an affine invariant detector. The left image shows a detection on the original image. The right image shows the detection on an affine transformed version of the original. The center point is detected repetitively and the ellipse detected in the transformed image covers the same content as in the original image.

3.3.1 Affine-invariant Harris detector

The affine-invariant Harris detector, also known as Harris-Affine detector, has been introduced by Mikolajczyk [73] as an extension to the scale invariant Harris-Laplace detector [72]. The detector works by estimating the affine shape of a local structure in the neighborhood of a scale-adapted Harris corner. The method assumes that the local neighborhood of an interest point is an affine transformed, and thus anisotropic, version of an originally isotropic structure. By finding the parameters of this affine transformation the local anisotropic structure can be transformed back to the isotropic structure. An isotropic structure could be represented by a circular region; its affine transformation results in an ellipse. The regions of the Harris-Affine detector are therefore ellipses representing the affine, anisotropic transformation of the local structure. The detector does not only detect the interest regions invariant to an affine transformation, but also returns the transformation parameters to normalize them into shapes which show an isotropic local neighborhood.

The first step of the algorithm is the detection of scale-adapted Harris corners on different scale levels. This step is identical to the Harris-Laplace detector. In a next step, for each detected Harris corner the shape adaptation is performed to estimate the anisotropic structure of the local neighborhood. The characteristic scale as defined previously is used as an initial value for the affine shape adaptation. The anisotropic shape of a local image structure can be estimated with the second moment matrix. This has been shown by Lindeberg [64] and later Baumberg [6]. The second moment matrix in an affine scale-space is given by:

\[ M(\mathbf{x}, \Sigma_I, \Sigma_D) = \det(\Sigma_D) \, g(\Sigma_I) \otimes \left( (\nabla I)(\mathbf{x}, \Sigma_D) (\nabla I)(\mathbf{x}, \Sigma_D)^T \right) \tag{3.19} \]

Σ_I is a covariance matrix which determines the integration Gaussian kernel, used for smoothing the gradient values over a local neighborhood. Σ_D is the covariance matrix for the differentiation Gaussian kernel, which steers the Gaussian derivatives for the gradient computation. In [73] the authors propose to set Σ_I = sΣ_D, where s is a scalar. This limits, with a little loss of generality, the number of possible kernel combinations to make the computation feasible in practice. Basically this means that the differentiation and integration kernel differ only in size and not in shape. The second moment matrix M(x, Σ_I, Σ_D) can now be used to transform the local structure into an isotropic structure with

\[ \mathbf{x} = M^{-\frac{1}{2}} \mathbf{x}' \tag{3.20} \]

where x′ is a point in the original anisotropic neighborhood and x is a point in the normalized isotropic neighborhood. The transformation matrix is the inverse matrix square root of the second moment matrix M of the local structure at the point x′. For two points x_L in a left image and x_R in a right image which are related by an affine transformation x_R = A x_L, a relation between x_L and x_R can be derived which relates both points in terms of the second moment matrices:

\[ M_R^{\frac{1}{2}} \mathbf{x}_R = R \, M_L^{\frac{1}{2}} \mathbf{x}_L \tag{3.21} \]

This relation is determined up to an arbitrary rotation R. The task of shape adaptation is now to estimate the second moment matrix M which transforms the local neighborhood into an isotropic structure. The eigenvalues of the second moment matrix can be interpreted as a measure of isotropy. Equal eigenvalues indicate an isotropic structure. The ratio of the eigenvalues then gives a normalized measure for the isotropy:

\[ Q = \frac{\lambda_{min}}{\lambda_{max}} \tag{3.22} \]

The value of Q lies in the range [0..1], where 1 indicates a perfectly isotropic structure. This measure is now used to evaluate the current estimate of the transformation matrix U which transforms a local structure into a perfectly isotropic one. The transformation matrix U is a concatenation of square roots of second moment matrices:

\[ U = \prod_k \left(M^{-\frac{1}{2}}\right)^{(k)} U^{(0)} \tag{3.23} \]

(M^{-1/2})^{(k)} is the inverse square root of the second moment matrix estimated in step (k) of the iterative algorithm, and U^{(0)} is the 2 × 2 identity matrix. For each iteration the second moment matrix is estimated, with the characteristic scale as initial value for Σ_I and Σ_D. The estimated transformation is then applied to the local neighborhood and U is updated. In the next step the second moment matrix is estimated again and the transformation is applied again. This is iterated until the measure Q of the second moment matrix is close to 1, that means the structure is almost isotropic. The sequence of transformations U is then used to represent the elliptic shape of the detection. The algorithm converges fast; usually fewer than 10 iterations are necessary.
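The iterative shape adaptation loop can be sketched independently of the image-warping machinery. In the sketch below, second_moment(U) is an assumed placeholder callback that warps the local neighborhood of the interest point with U and evaluates Eq. (3.19) on it.

```python
import numpy as np

def isotropy(M):
    """Q = lambda_min / lambda_max of a 2x2 second moment matrix (1 = isotropic)."""
    eigvals = np.linalg.eigvalsh(M)          # ascending order
    return eigvals[0] / eigvals[1]

def inv_sqrt(M):
    """Inverse matrix square root of a symmetric positive definite 2x2 matrix."""
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def affine_shape_adaptation(second_moment, q_target=0.95, max_iter=10):
    """Iterate U <- M^{-1/2} U until the local structure is (almost) isotropic.

    `second_moment(U)` is an assumed callback returning the 2x2 second moment
    matrix of the neighborhood warped by the current transformation U."""
    U = np.eye(2)
    for _ in range(max_iter):
        M = second_moment(U)
        if isotropy(M) >= q_target:
            break
        U = inv_sqrt(M) @ U
        U /= np.sqrt(np.linalg.det(U))   # keep the overall scale of the transformation fixed
    return U   # maps the anisotropic neighborhood to the isotropic (circular) frame
```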

Figure 3.8(a) shows examples for the Harris-Affine detector. Each detection is visualized by its center point (yellow cross) and its associated elliptical measurement region. For more details, including implementation details, the interested reader may be referred to [73].



3.3.2 Affine-invariant Hessian detector

The affine-invariant Hessian detector is very similar to the previously described Harris-Affine detector. It is also known as Hessian-Affine detector and described in [71]. The detection algorithm is basically identical to the Harris-Affine detector, with the only exception that the initial interest points are identified with the Hessian-Laplace method instead of the Harris-Laplace method. The Hessian-Affine detector produces very similar results to the Harris-Affine detector, which is not very surprising as the algorithms are almost identical. Figure 3.8(b) shows examples for the Hessian-Affine detector. Each detection is represented by its center point (yellow cross) and its associated elliptical measurement region.

Figure 3.8: Detection examples for affine invariant Harris and Hessian detectors on "Group" scene. (a) Harris-Affine detector. (b) Hessian-Affine detector.

3.3.3 Maximally stable region detector (MSER)

The MSER detector is a currently very popular affine invariant region detector developed by Matas [70]. The concept of the detector is very different from the previously described detectors. One of the biggest differences is that the measurement region can be of arbitrary shape. The MSER region is defined by its border pixels, a connected set of pixels. The border pixels and all pixels inside constitute the MSER region. An MSER region is a part of an image, delineated by a boundary, where all pixels inside the boundary are either brighter or darker than the pixels outside the boundary. Such image regions have a variety of interesting, favorable properties. First, the region definition is unaffected by monotonic changes of image intensities. The region is defined only by a relative ordering of the intensities. Common models for photometric changes, like a change in illumination, will not affect the detection of the interest regions. The most important property, however, is that the definition is invariant to continuous geometric transformations. A connected set of pixels will again be transformed into a connected set of pixels by a continuous geometric transformation. Rotation, scale change and plane-perspective transformations will not influence the repetitive detection of an MSER region. This is an enormously valuable property when dealing with wide-baseline scenarios.

The MSER detection algorithm is related to thresholding, which is already anticipated by the definition of an MSER region. In terms of thresholding the algorithm can easily be defined as follows. Imagine all possible thresholdings of a gray-level image; for an 8-bit gray-level image we have the 256 thresholds t_0 < t_1 < ... < t_255. Let the thresholded binary images show white pixels if the gray-value is higher than the threshold and black otherwise. Now imagine a movie showing the binary images, starting with the one computed with t_0 and the others following in increasing order. The first frame will be completely white, but soon black regions will appear and grow with increasing thresholds. Some of the appearing black regions will stay stable for a series of thresholds, and these regions are the ones to be detected. Maximally stable regions are the ones whose area does not change for a certain number of thresholds. These image regions will be reported by the algorithm. With increasing threshold, initially distinct regions will merge and eventually create another stable region out of two others. Generally the algorithm might produce nested detections on different scales. Referring back to the definition of the MSER regions, the above described thresholding method will detect regions where all the inside pixels have a higher gray-value than the boundary pixels. The other variant of the MSER regions can be computed by reversing the order and starting with the binary image for the highest threshold t_255. This will produce the regions where the pixels inside the border show lower gray-values than the outside pixels. An efficient implementation, however, will not compute the single binary images. Instead a sorted list of pixel values is created, similar to a histogram but including the pixel locations. For an image with n pixels this can be done in O(n) time. Now connected components at each level have to be detected and maintained over all different levels. The change in area for the identified regions has to be computed to identify stable regions. This can efficiently be solved by using the union-find algorithm [97] as proposed in [70]. The union-find algorithm then determines the overall complexity of O(n log log n). The described algorithm returns an MSER region as a set of connected pixels. A lot of applications, however, e.g. epipolar geometry estimation, need a single point location associated with each detection. For the MSER detection this can be done by computing the center of gravity (COG) of the MSER pixels. As shown in [85] the COG of an MSER region is invariant to affine transformations, in the sense that the COGs of two MSER regions connected by an affine transformation are connected by the same affine transformation. The COG, however, is not localized on a special image feature like a corner or an edge feature; thus the localization accuracy is determined by the pixel set of the MSER region only.
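For illustration only, the following naive sketch mimics the thresholding view of MSER detection (dark-on-bright variant) by labelling connected components for a sweep of thresholds and flagging components whose area stays nearly constant. It does not use the efficient union-find formulation described above and may report duplicate regions across neighboring thresholds; all parameter values are assumptions.

```python
import numpy as np
from scipy.ndimage import label

def naive_mser(img, delta=5, step=4, max_variation=0.25):
    """Very simplified MSER sweep for illustration.

    For every threshold t the connected components of {I <= t} are labelled and
    the area of the component containing each pixel is recorded.  A pixel marks a
    maximally stable region at t if the relative area change over +-delta
    thresholds is small."""
    img = np.asarray(img)
    thresholds = np.arange(0, 256, step)
    areas = np.zeros((len(thresholds),) + img.shape, dtype=np.int64)
    labels_per_t = []
    for i, t in enumerate(thresholds):
        lab, _ = label(img <= t)
        counts = np.bincount(lab.ravel())
        areas[i] = counts[lab]          # area of the component each pixel belongs to
        areas[i][lab == 0] = 0          # pixels above the threshold have no component
        labels_per_t.append(lab)
    regions = []
    d = max(delta // step, 1)
    for i in range(d, len(thresholds) - d):
        a = areas[i].astype(np.float64)
        with np.errstate(divide="ignore", invalid="ignore"):
            variation = (areas[i + d] - areas[i - d]) / a
        stable = (a > 0) & (variation < max_variation)
        for comp in np.unique(labels_per_t[i][stable]):
            if comp != 0:
                regions.append(np.argwhere(labels_per_t[i] == comp))
    return regions   # each entry: pixel coordinates of one (possibly duplicated) region
```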

The MSER detection algorithm is not only theoretically efficient but also really fast in an actual implementation. Current implementations allow a real-time detection of MSER regions at about 25 frames per second. This behavior makes the MSER detector extremely interesting for the development of computer vision systems. Figure 3.9 shows examples for the MSER detector. In Figure 3.9(a) the detections are visualized by ellipses fitted to the MSER detections. Figure 3.9(b) shows the original contour based representation of the detections.

Figure 3.9: Detection example for MSER regions on "Group" scene. (a) Ellipse representation. (b) Original contour based representation.

3.3.4 Affine-invariant salient region detector

The affine-invariant salient region detector is a straightforward extension of the salient region detector presented in the previous section. The extension has been proposed by the original authors in [53]. The affine invariance is gained by an affine shape adaptation of the detections of the original salient region detector. The algorithm works by first detecting scale invariant salient regions. Starting with a circular salient region, the shape adaptation allows transforming the original region to an elliptical shape. The goal is to adapt the shape so that the saliency measure Y is maximal. The elliptical shape is parameterized by 3 values: the scale s, the ratio of the major to the minor axis ρ and the orientation of the major axis θ. The length of the major axis of the ellipse is defined by √(s/ρ) and the minor axis by sρ. This 3-vector replaces the previously scalar scale parameter s. Let us remember that the absolute saliency measure is defined by Y = H(w)W(w). W(w) is the inter-scale unpredictability and is to be maximized by the shape adaptation. The axis ratio ρ and the orientation θ are changed until a maximum value for W(w) has been found. After the orientation and axis ratio of the ellipse have been fixed, the scale is varied to find again the peak of H(w), now for the elliptically shaped region.

As with many other affine invariant detectors, the detection is parameterized as a center point and an elliptical measurement region around the detection. Different to others, the ellipse is, however, not parameterized by the second moment matrix (gained from the gray-value distribution) but by the orientation θ and the major and minor axes. There is a difference therein which is worth discussing. The shape adaptation follows the idea initially presented by Baumberg [6], where the second moment matrix of the gray-values from the measurement region of the detection is used to remove an arbitrary affine transformation. The transformation specified by the second moment matrix accounts for the different modalities of the affine transform, that is rotation, scale change in x and y direction and shear (a possible translation is assumed to be removed already). In the shape adaptation used for the salient regions, however, the transformation is not computed from the second moment matrix but from the ellipse parameters found through shape adaptation. The transformation includes orientation and scaling in x and y direction (from the axis ratio) but not the shear! Parameterizing an affine transform from such an ellipse description cannot produce a transformation containing shear. There is no way, using this method, to normalize two regions differing by an affine transformation containing shear into the same canonical coordinate frame. This is a big limitation of the proposed method and unfortunately it is not discussed in their paper.

Another issue to be discussed is the local search strategy for the optimal shape. The authors propose a brute-force strategy which is computationally expensive. In addition, the tested ellipse parameters are discrete. It is not clear from the paper if the method will really find the optimal shape with the brute-force approach, and theoretical considerations about the convergence are not given.

A last point concerns the practical application of the method. The method contains a lot of computationally expensive steps: histogram computation for different scales, clustering, brute-force shape adaptation. The method is inherently very slow. For most practical applications running on state-of-the-art computers the method is in fact too slow. Examples for the affine salient region detector are depicted in Figure 3.10. Each affine salient region is visualized by its center point (yellow cross) and its associated elliptical measurement region.

Figure 3.10: Detection example for affine invariant salient region detector on "Group" scene.

3.3.5 Intensity extrema-based region detector (IBR)

Intensity extrema-based regions were first introduced by Tuytelaars and Van Gool [113]. The detector selects anchor points using a gray-value intensity criterion and then identifies a region border around the anchor point in an affine invariant way. The resulting regions show in general arbitrary shapes around a blob-like homogeneous anchor point. The first step of the algorithm is the detection of anchor points. Unlike previous methods, the algorithm does not use a corner or edge detector. Instead, image locations which show a local intensity extremum are used. For this, in a first step the image I is smoothed to remove image noise, e.g. with a Gaussian filter. Local intensity extrema are then identified by non-maximum suppression. Due to the smoothing the intensity extrema do not show a strong peak and therefore are weakly localized; however, without smoothing a lot of anchor points would be produced because of image noise. The identified anchor points are invariant to monotonic intensity transformations. In a second step a region delineation is searched in an affine invariant way for every detected anchor point.

Searching for a border works by emanating rays from the center of the anchor point. The rays are distributed uniformly around the full 360°. Along each ray the intensity profile gets analyzed to find a characteristic gray-value change which is invariant to an affine transform. The function f_I(t) evaluated for each ray is defined by

\[ f_I(t) = \frac{|I(t) - I_0|}{\max\left( \dfrac{\int_0^t |I(t) - I_0| \, \mathrm{d}t}{t}, \; d \right)} \tag{3.24} \]

where t is the distance of the current evaluation position on a ray from the anchor point, I(t) is the intensity value along the ray at distance t, I_0 is the intensity value of the anchor point, and d is a small number which prevents a division by zero. f_I(t) typically shows a maximum when the intensity along the ray is changing significantly compared to the average changes along the ray.

For instance this will happen if the ray crosses the border of a rather homogeneous image region. The function f_I(t) is chosen to produce easily detectable extrema on intensity changes. It would be possible to detect the extrema in the plain intensity function I(t) along the rays, which would in theory be affine invariant as well. However, the extrema in I(t) are shallow and not as stable as for f_I(t). In the case that the global extremum along the ray does not significantly differ from other local extrema, the extremum is selected which is located at a similar distance as the ones from the neighboring rays. After analyzing all rays, the border of the region is given by a distinct set of point locations around the anchor point. By connecting the distinct points and computing the convex hull a possible region delineation is created. Another possibility, which is also preferred by the original authors, is to fit an ellipse to the distinct points. The ellipse then defines the region border. The ellipse parametrization provides a simpler handling of the regions for subsequent matching tasks. It is important to note that the ellipse fitting creates an ellipse which is not necessarily centered around the original anchor point. The original anchor point is then replaced by the computed ellipse center in the region description. Figure 3.11(a) shows example detections for the IBR detector. Each detection is visualized by its center point (yellow cross) and its associated elliptical measurement region.
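A sketch of evaluating f_I(t) of Eq. (3.24) along a single ray is given below; nearest-neighbor sampling, the fixed maximum ray length and the role of eps as the constant d are simplifications.

```python
import numpy as np

def ray_profile_extremum(img, anchor, direction, max_len=50, eps=1e-6):
    """Evaluate f_I(t) of Eq. (3.24) along one ray; return the distance of its maximum."""
    img = np.asarray(img, dtype=np.float64)
    y0, x0 = anchor
    I0 = img[y0, x0]
    dy, dx = np.asarray(direction, dtype=np.float64) / np.linalg.norm(direction)
    f = []
    running_sum = 0.0
    for t in range(1, max_len):
        y, x = int(round(y0 + t * dy)), int(round(x0 + t * dx))
        if not (0 <= y < img.shape[0] and 0 <= x < img.shape[1]):
            break
        diff = abs(img[y, x] - I0)
        running_sum += diff                        # approximates the integral of |I(t)-I0|
        f.append(diff / max(running_sum / t, eps)) # eps plays the role of d in Eq. (3.24)
    return int(np.argmax(f)) + 1 if f else None    # t with the strongest relative change
```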

For an implementation of this algorithm a proper value for the angle between two neighboring rays has to be set, which trades off speed against the accuracy of the region border. Using a high number of rays it will take longer to evaluate the intensity profiles, but the region border will be approximated more accurately. A small number of rays will be faster but will provide a poorer approximation of the region border. Another implementation issue is the sampling of the intensity profiles. As the intensity profiles are sampled in different directions, a proper interpolation or smoothing will be necessary. Both issues are not addressed by the original authors.

As a final remark I would like to mention one specific property stated by the authors in [113]. As the used anchor points are not corner points, the chance that the region is located on a 3D corner is much smaller. Regions located on 3D corners are much more complicated to match than planar regions.

3.3.6 Edge based region detector (EBR)

The edge based region detector (EBR) has been described by Tuytelaars et al. in [111]. As the method is based on geometric constraints it is also known as a geometry-based method. The method exploits the fact that an image corner usually appears where two image edges meet. The image corner and the two edges are then used to define an affine invariant region. In a first step of the algorithm, corners and edges have to be detected in the image. The authors propose to use the Harris corner detector [40] to detect the anchor points for the algorithm. For edge detection the authors propose to use the Canny edge detector [15]. As corner and edge detection are performed by different methods, it is not guaranteed that the corner is located exactly at the intersection of the edges. This is, however, not a necessary criterion for the region detection. The method described in the following works on non-straight lines; for straight lines a special adaptation of the method will be described afterwards. The method works by constructing parallelograms from the corner point p and points p_1 and p_2 located on each edge.

The parallelogram construction is driven by an affine invariant. The functions l_1 and l_2 are relative affine invariants:

\[ l_1 = \int \left| \det\!\left( \frac{\mathrm{d}p_1(s_1)}{\mathrm{d}s_1} \quad p - p_1(s_1) \right) \right| \mathrm{d}s_1 \tag{3.25} \]

\[ l_2 = \int \left| \det\!\left( \frac{\mathrm{d}p_2(s_2)}{\mathrm{d}s_2} \quad p - p_2(s_2) \right) \right| \mathrm{d}s_2 \tag{3.26} \]

The ratio l_1/l_2 is an absolute affine invariant, and the association of a point on the one edge with a point on the other edge is also affine invariant. Two points p_1 and p_2 are associated when l_1 = l_2.

We will denote this relation simply as l. The points p_1 and p_2 are then parameterized by a single parameter l, which ensures a family of affine invariant parallelogram constructions. Now certain photometric properties of the pixels inside the defined parallelograms are evaluated, and the parallelogram constructions yielding an extremum of the photometric properties are reported as affine invariant regions. The following functions on the pixels inside a parallelogram can be used for this task.

\[ f_1(\Omega) = \frac{1}{|\Omega|} \sum_{\Omega} d_i \tag{3.27} \]

The function f_1 represents the average intensity over the region of the parallelogram Ω. Ω is the set of all pixels inside the parallelogram, d_i is the intensity of a single pixel and |.| denotes the cardinality. Note that the average intensity itself is not invariant to affine photometric and geometric changes, but an extremum of the average intensities over a family of parallelograms is. The goal is therefore to identify the parallelogram construction which shows an extremum in the average intensity function f_1. The function f_2 represents an absolute affine invariant:

\[ f_2(\Omega) = \frac{|p - q \quad p - p_g|}{|p - p_1 \quad p - p_2|} \tag{3.28} \]

The function f_2 is a ratio of areas depending on the center of gravity p_g. q is the corner of the parallelogram opposite to the point p and is defined as q = p_1 + p_2 - p. Although f_2 is an absolute affine invariant, in practice the best results are obtained when searching again for extrema of the function. In further work by Tuytelaars [112] two additional evaluation functions are introduced.
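A sketch of evaluating f_1 and f_2 for one parallelogram follows. The interior is sampled on a regular grid, and p_g is taken here as the intensity-weighted center of gravity, which is an assumption since the weighting is not specified above.

```python
import numpy as np

def parallelogram_functions(img, p, p1, p2, n_samples=40):
    """Evaluate f1 (mean intensity) and f2 (area ratio of Eq. 3.28) for the
    parallelogram spanned by the corner p and the edge points p1, p2."""
    img = np.asarray(img, dtype=np.float64)
    p, p1, p2 = (np.asarray(v, dtype=np.float64) for v in (p, p1, p2))
    q = p1 + p2 - p                               # corner opposite to p
    # sample the interior: x = p + a*(p1-p) + b*(p2-p) with a, b in [0, 1]
    a, b = np.meshgrid(np.linspace(0, 1, n_samples), np.linspace(0, 1, n_samples))
    pts = p + a[..., None] * (p1 - p) + b[..., None] * (p2 - p)
    rows = np.clip(np.round(pts[..., 0]).astype(int), 0, img.shape[0] - 1)
    cols = np.clip(np.round(pts[..., 1]).astype(int), 0, img.shape[1] - 1)
    intensities = img[rows, cols]
    f1 = intensities.mean()
    # intensity-weighted center of gravity (assumed weighting)
    pg = (pts * intensities[..., None]).sum(axis=(0, 1)) / intensities.sum()
    area = lambda u, v: abs(u[0] * v[1] - u[1] * v[0])   # |det(u v)|
    f2 = area(p - q, p - pg) / area(p - p1, p - p2)
    return f1, f2
```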

Let us now discuss the case of straight lines. Straight lines emanating from a corner point occur frequently in images, thus this case cannot be neglected. For straight lines the functions l_1 and l_2 yield l_1 = l_2 = 0. Thus it is not possible to use the relation l_1 = l_2 to associate points on one edge with points on the other edge. It is therefore necessary to construct parallelograms for all combinations of both edge points. This gives a 2-dimensional search space in the parameters s_1 and s_2. However, as shown in [111], a single function does not give a well localized extremum but a valley. By simultaneous evaluation of two functions, say f_1 and f_2, two valleys are created which intersect each other. The intersection of the valleys then defines the parameters of the parallelogram reported as affine invariant region.

The edge based regions differ from the regions of other detectors as they are not centered around the initial anchor point. Instead the anchor point is located at one corner of the parallelogram-shaped region. It would be possible to extend the parallelogram in a way that the anchor point is located at the intersection of the diagonals. But this would enlarge the initial detection, and as corners are very often located on depth discontinuities, it increases the chance that the enlarged region is located on a depth discontinuity, which is not a desired property for region matching. In [112] the authors also describe fitting an ellipse to the parallelogram-shaped regions to create a representation similar to other detectors in order to compare the performance of different detectors. Figure 3.11(b) shows examples for such detections, where each EBR region is represented by its center point (yellow cross) and its associated elliptical measurement region.

Figure 3.11: Detection examples for affine invariant detectors on "Group" scene. (a) Intensity based regions. (b) Edge based regions.

3.3.7 Normalization

All of the previously described methods allow representing the detections as a center point associated with an elliptical measurement region. The elliptical measurement region is given by an estimate of the affine second moment matrix. This representation is natural for the Harris-Affine and Hessian-Affine methods and the affine invariant salient region detector. The regions of the MSER, EBR and IBR detectors are originally represented differently; however, they can be represented as ellipses as well, although this generally causes the loss of some information. The following normalization method is based on the elliptical shape representation using the affine second moment matrix. For the EBR and IBR regions the authors did not provide their own normalization method, therefore this method applies as well. For the MSER regions the original authors propose a normalization based on a local affine frame (LAF) [85], which will also be outlined in this section.

Normalization of the elliptical point detections works by re-sampling the region area into a canonical isotropic coordinate system. The necessary affine transformation is directly given by the inverse square root of the second moment matrix. The points within the measurement region can be transformed into the canonical representation by

\[ x_c = M^{-\frac{1}{2}} x \tag{3.29} \]

where x is the pixel location in the original coordinate system, x_c is the pixel location in the canonical coordinate system and M is the corresponding affine second moment matrix. Please note that the transformation M^{-1/2} assumes the center point of the detection to be the origin of the transformation coordinate system. Normalization with the second moment matrix results in a circular image region, where the different scalings and the shear are removed. However, the patch is arbitrarily rotated. For image matching a rotation invariant descriptor has to be used, or the additional rotation normalization as described for scale-invariant regions above has to be applied. The normalization is illustrated in Figure 3.12. An original isotropic local structure is transformed using two different affine transformations. The isotropic structure is represented by two orthogonal lines. Figures 3.12(a),(b) show the initial detections using the Harris-Affine detector. The elliptical measurement region is represented using the second moment matrix. Figure 3.12(c) shows the normalization of the detection in Figure 3.12(a), and Figure 3.12(d) the normalization of the detection in Figure 3.12(b). The isotropic structure (visualized by the two lines) has been reconstructed nicely by the normalization; the original orthogonality has been recovered. However, the normalized detections still differ by an arbitrary rotation.
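To make the normalization step concrete, the following minimal sketch resamples a detection into the canonical isotropic frame of Eq. (3.29). It is an illustration only, not the implementation used in this work; the patch radius and the handling of the overall scale factor that maps the measurement ellipse to a fixed patch size are assumptions, and the remaining rotation ambiguity is, as discussed above, left unresolved.

import numpy as np
from scipy.linalg import sqrtm
from scipy.ndimage import map_coordinates

def normalize_region(image, center, M, radius=20):
    """Resample an elliptical detection into a canonical isotropic patch.

    image  : 2D gray-value array
    center : (x, y) location of the detection
    M      : 2x2 affine second moment matrix of the detection
    radius : half-size of the canonical patch in pixels (a free choice here)
    """
    # Regular grid of canonical coordinates x_c centered at the origin.
    coords = np.arange(-radius, radius + 1, dtype=float)
    xc, yc = np.meshgrid(coords, coords)
    pts_c = np.stack([xc.ravel(), yc.ravel()])        # 2 x N

    # Invert Eq. (3.29): x = M^{1/2} x_c maps canonical points into the image.
    A = np.real(sqrtm(M))
    pts = A @ pts_c
    xs = pts[0] + center[0]
    ys = pts[1] + center[1]

    # Bilinear sampling of the gray values (map_coordinates expects rows, cols).
    patch = map_coordinates(image, [ys, xs], order=1, mode='nearest')
    return patch.reshape(2 * radius + 1, 2 * radius + 1)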

For the regions of the MSER detector the normalization can be done using a so-called local affine frame (LAF) [85]. The basic idea is to identify 3 points which are invariant to an affine transformation. These three points define the axes of a coordinate system which represents the LAF. The points can then be used to parameterize an affine transformation to a canonical coordinate system in which the axes are orthogonal and of equal length in both directions. Normalization is done by applying the affine transform constructed in this way to the detection. The normalized patches will be perfectly aligned; also the orientation will be recovered. The critical point of this method is, however, the identification of the affine invariant points within the measurement region. This is possible for MSER regions because the region is represented by its contour. The first invariant point is the center of gravity (COG) of the detected region. It is shown in [85] that the COG of an MSER region is invariant to an affine transformation. Further points are topological extremal points of the region's contour. Such extremal points are invariant to affine transformations. With the COG and two additional contour points a LAF can be constructed and used for normalization. In [85] several possible methods to create LAFs for MSERs are described.
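As an illustration of the LAF idea, the following sketch constructs the affine transformation that maps a frame given by the COG and two contour points to a canonical unit frame. The choice of the canonical frame and the function name are assumptions of this sketch; [85] describes several refined ways to select the contour points.

import numpy as np

def laf_to_canonical(cog, p1, p2):
    """Affine map taking a local affine frame (COG plus two contour points)
    to a canonical frame with origin (0, 0) and axis endpoints (1, 0), (0, 1)."""
    cog, p1, p2 = (np.asarray(v, dtype=float) for v in (cog, p1, p2))
    B = np.column_stack([p1 - cog, p2 - cog])   # frame axes as matrix columns
    A = np.linalg.inv(B)                        # maps the frame axes to the unit axes
    t = -A @ cog                                # maps the COG to the origin
    return A, t                                 # x_canonical = A @ x + t

Warping the region with (A, t) aligns the patches of corresponding MSERs including their orientation, as described above.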

3.4 Comparison of the described methods

As the long list of detectors described above shows, there exists a vast variety of different methods. Each method has its pros, cons and peculiarities. In this section we present a table (Table 3.1) comparing the properties of the most important local detectors against each other. The table is based on the extensive evaluation performed by Mikolajczyk and Schmid [72–74, 76] and on the publicly available implementations of the detectors used in



Figure 3.12: Example for normalization; an isotropic local structure is transformed using two different affine transformations. (a)(b) Initial detections using the Harris-Affine detector; the elliptical measurement region is represented using the second moment matrix. (c) Normalization of the detection in (a). (d) Normalization of the detection in (b). The isotropic structure (visualized by the two lines) has been reconstructed nicely by the normalization. However, the normalized detections still differ by an arbitrary rotation.

the evaluation.¹ The table is useful when one needs to select a proper method for an application or wants to get a general overview of the performance of state-of-the-art methods.

The table contains ratings for invariance, the number of detections, the repeatability score, the matching score, the speed of the method and an overall rating based on a combination of the other ratings. In the following, the ratings and the terminology used in the table are described.

Invariance: In this column the detector's invariance to a class of transformations is given. The detectors are classified into three groups: no invariance ('none'), invariant to scale change ('scale') and invariant to affine transformations ('affine'). One method is rated with 'affine*' because the detector is not fully invariant to an affine transformation; for more details see the description of the detector in the previous section.

Number of detections: The number of detections is quite different for the various detectors. Although for most methods the number of detections depends on the parameter settings, each method has a quite characteristic number of useful detections. The detection number is classified qualitatively into four categories: low, medium, high, very high. The rating 'low' corresponds to about 100 detections, whereas 'very high' corresponds to several thousand detections.

¹ Implementations were collected by Krystian Mikolajczyk and are available at http://www.robots.ox.ac.uk/~vgg/research/affine/

Repeatability: The repeatability score is an important quality criterion for a local detector. It has been introduced in [72]. The repeatability scores published in [72–74, 76] have been used to rank the detectors. The different values have been qualitatively divided into four categories: low, medium, high, very high.

Matching score: The matching score is also a measure introduced in [72] and is used in combination with the repeatability score. The scores published in [72–74, 76] have been used to rank the detectors based on their matching properties using the SIFT descriptor [66]. The different values have been qualitatively divided into four categories: low, medium, high, very high.

Speed: The detection speed is very relevant for building practical applications. The speed has been evaluated using the publicly available implementations and is divided into five categories: very slow, slow, medium, fast, very fast. Methods rated 'fast' or 'very fast' can achieve real-time frame rates.

Overall rate: The overall rate assesses the usefulness of the different methods for practical applications. The rating is based on the evaluations but also reflects our personal experience with the different methods. It is divided into four categories: bad, ok, good, very good.

Detector                invariance   number of     repeat.   matching   speed       overall
                                     detections              score
Harris                  none         very high     high      low        very fast   ok
Hessian                 none         very high     high      low        very fast   ok
Harris-Laplace          scale        medium        high      medium     medium      ok
Hessian-Laplace         scale        medium        high      medium     medium      ok
DOG                     scale        medium        high      medium     fast        very good
Salient region          scale        low           low       low        very slow   bad
Harris-Affine           affine       medium        high      high       medium      good
Hessian-Affine          affine       medium        high      high       medium      good
MSER                    affine       low           high      high       fast        very good
Affine salient region   affine*      low           low       low        very slow   bad
IBR                     affine       low           high      high       slow        ok
EBR                     affine       low           medium    medium     slow        ok

Table 3.1: Comparison of the properties of different local detectors. The ratings are based on the evaluations in [72–74, 76]. Please see the text for a description of the different properties.


Chapter 4

Evaluation on non-planar scenes¹

From the previous chapter we already know that there exists an astonishing variety of different local detectors. Each method is based on different image features and in most cases was developed to perform well on a specific set of image data, mostly driven by the application. The development of a new method is then justified by achieving a better performance compared to previous methods. Thus it is quite common to compare a new method with current state-of-the-art methods. One example of this procedure is the work of Carneiro and Jepson [16]. They present a new local detector, so-called phase-based local features, and compare this method to the Harris-Laplace detector [72] and the DoG detector [66]. Although the testing is extensive, the new method is not compared to all state-of-the-art detectors, mainly because this would involve a big effort to gather implementations of all detectors and to put them into a common framework.

Nevertheless, this task was pursued by Mikolajczyk and Schmid. With considerable effort they collected implementations of most state-of-the-art detectors and put them into a common evaluation framework. They managed to obtain the implementations from the original authors themselves to assure that the compared algorithms are the most efficient versions. The test results as well as the evaluation methods are published in [71, 74]. For measuring the performance of the detectors a repeatability score and a matching score are evaluated. A local detector is assumed to be good if it produces interest points and regions repetitively at the same locations on an object, independent of acquisition conditions like viewpoint, illumination and scale changes. The evaluation of the repeatability of local detectors in the case of a viewpoint change needs an automatic procedure for ground truth generation. Obviously this cannot be done by matching because every known method will introduce mis-matches or will miss corresponding regions. Nevertheless, the ground truth can be established by geometric means. On planar surfaces a homography can be estimated. By using this homography it is possible to check whether an interest point or region on a planar patch will occur in an image from a different viewpoint at the same location. The homography describes the geometry of the test scene. It acts as ground truth and has to be verified for each plane manually, but it allows an automatic verification of all the interest point correspondences in the scene.

¹ Based on the publications:
F. Fraundorfer and H. Bischof. Evaluation of local detectors on non-planar scenes. In Proc. 28th Workshop of the Austrian Association for Pattern Recognition, Hagenberg, Austria, pages 125–132, 2004 [32]
F. Fraundorfer and H. Bischof. A novel performance evaluation method of local detectors on non-planar scenes. In Workshop Proceedings Empirical Evaluation Methods in Computer Vision, IEEE Conference on Computer Vision and Pattern Recognition, San Diego, California, 2005 [33]

But using a plane-to-plane homography limits the possible test cases to planar scenes only. Because of this limitation it is questionable whether the results of previous detector evaluations will hold for realistic, non-planar scenes, especially for changing viewpoints. If interest regions are primarily detected on depth discontinuities, their appearance changes significantly when viewed from a different viewpoint, which will result in lower matching performance. Therefore the results of the detector evaluation may change considerably when using 3D scenes. Figure 4.1 shows an example for Hessian-Affine regions and MSER regions. A significant number of Hessian-Affine regions are located on depth discontinuities while the MSER detector seems to avoid such areas.

This motivates our approach to apply the evaluation of local detectors to complex, realistic, and practically relevant scenes. The basic idea enabling this extension is to exploit the properties of the trifocal geometry [44]. Instead of defining the ground truth for 2 images we propose to use 3 images of the scene. A fundamental property of the trifocal geometry allows the coordinate transfer of a point correspondence from 2 views into the third view. This transfer is not restricted to planes but is valid for arbitrary scenes. The proposed evaluation framework compares different local detectors according to 3 measures. A repeatability score measures the capability of local detectors to produce detections repetitively at the same locations in the presence of viewpoint changes. A matching score compares the descriptive and discriminative qualities of the detected regions. The matching score will also reflect the cases where a local detector tends to produce detections on depth discontinuities. As a last measure the absolute number of correct matches is introduced, which is interesting in object recognition, where a higher number of matches increases the robustness against partial occlusions, as well as in geometry estimation, where a higher number usually increases the accuracy.

Figure 4.1: (a) Hessian-Affine regions (a significant fraction of the detections is located on depth discontinuities). (b) MSER regions (no detections on depth discontinuities).



4.1 Measures

This section defines the measures which are used to evaluate the different local detectors. In the previous evaluations of Mikolajczyk and Schmid [71] a repeatability and a matching score were defined. To be comparable with the previous evaluations we chose to use the same measures. In fact, the repeatability and matching score capture the most important property of local detectors, their repeatability. Basically, local detectors are designed to select a subset of pixels of an image. The goal is that if the operator is applied to two images which show the same scene but differ by some transformation like scale change, rotation, translation or viewpoint change, the same subset of pixels is selected. In practice one gets two subsets which show some overlap. The pixels in the overlapping part can be said to be detected repetitively. The repeatability score thus assesses the number of repetitively detected pixel locations. Measuring the repeatability score is straightforward: one counts the number of detections which correspond. However, identifying the corresponding detections is the difficult task therein. It will be dealt with in detail in Section 4.3. The repeatability score obviously measures the most basic property of a local detector.

The matching score measures a property at the next higher level: it evaluates the quality of the detections. One needs to consider a complete framework for local appearance based methods. After the detection of interest regions, a matcher is applied to identify corresponding detections in the two images. Such a matcher builds a description of a detection from the gray-value characteristics around it. Matching then amounts to finding a similar feature vector in the other image. Matching heavily relies on discriminative descriptions, i.e. the descriptions of two different detections should be easy to distinguish. One prerequisite therefore is that the considered areas around the detections show a characteristic gray-value variance. The matching score assesses how many of the detections are correctly matched. The results of course may differ for matching schemes which use different descriptors, but this allows finding detector-descriptor pairs which in combination achieve the best performance.

In addition to the repeatability and matching score we extend the previous evaluation framework with a new measure, the complementary score. The complementary score comes from the idea of combining two or more of the available detectors. This has already been done, e.g. in the Video Google system of Sivic and Zisserman [99], where Harris-Affine regions are used alongside MSER regions. This resulted in a better recognition rate than using one of the detectors alone. This raises the fundamental question which of the detectors can be used in combination to increase the performance. We call two detectors complementary if their detections do not overlap and are located in different areas. This diversity is measured with the complementary score. Ideally one would use a combination of all available methods. However, most applications are time critical and would not allow the computation of all possible methods. Here an evaluation allows selecting the best detector combination for the specific application and the available computing time.

Let us start with the details of the repeatability score.

4.1.1 Repeatability score

The repeatability score r_i is a measure computed from two images. Let us assume an image sequence I_1, ..., I_n as illustrated in Figure 4.4, where the images are taken with increasing viewpoint change. The repeatability score is calculated for image pairs, where one reference image is paired with all the others to get a sequence of increasing viewpoint angle. The arising pair sequence is then I_1 ↔ I_2, I_1 ↔ I_3, ..., I_1 ↔ I_n. The repeatability score r_i for image I_i is the ratio of the number of point-to-point (region-to-region) correspondences between the reference image I_1 and I_i and the smaller number of points (regions) detected in one of the two images. Only points (regions) located in the part of the scene present in both images are taken into account. It is given in Eq. (4.1).

\[ r_i = r_{1i} = \frac{|C_{1i}|}{\min(|R_1|, |R_i|)} \tag{4.1} \]

R_i is the set of all detected regions in image I_i and |.| denotes the cardinality of a set. C_ij is the set containing all true region correspondences between the images I_i and I_j. C_ij contains only single correspondences, i.e. no element of R_i corresponds to more than one element of R_j.

4.1.2 Matching score

The matching score m_i is the ratio of the number of correct matches and the smaller number of regions detected in one of the two images. It is given in Eq. (4.2).

\[ m_i = m_{1i} = \frac{|M_{1i}|}{\min(|R_1|, |R_i|)} \tag{4.2} \]

M_ij is the set containing all detected true region matches between the images I_i and I_j. In addition we define a matching score related to the number of possible matches |C_ij| (see Eq. (4.3)).

\[ m_i = m_{1i} = \frac{|M_{1i}|}{|C_{1i}|} \tag{4.3} \]
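A minimal sketch of Eqs. (4.1)–(4.3), assuming the correspondence set C_1i and the match set M_1i have already been determined as described in Section 4.3:

def repeatability_score(C_1i, R_1, R_i):
    """Eq. (4.1): true correspondences relative to the smaller number of
    detections (only detections in the commonly visible scene part)."""
    return len(C_1i) / min(len(R_1), len(R_i))

def matching_score(M_1i, R_1, R_i):
    """Eq. (4.2): correct matches relative to the smaller number of detections."""
    return len(M_1i) / min(len(R_1), len(R_i))

def matching_score_relative(M_1i, C_1i):
    """Eq. (4.3): correct matches relative to the geometrically possible matches."""
    return len(M_1i) / len(C_1i)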

4.1.3 Complementary score

The complementary score c^n_i relates the number of correctly matched non-overlapping regions between two different viewpoints to the sum of all matching detections. It is given in Eq. (4.4).

\[ c^n_i = \frac{|M^1_i \cup M^2_i \cup \ldots \cup M^n_i|}{|M^1_i| + |M^2_i| + \ldots + |M^n_i|} \tag{4.4} \]

M^j_i is the set of correctly matched correspondences for detector type j between the images I_1 and I_i, and n is the number of combined detectors. The complementary score lies between 0 and 1. A complementary score of 0 means that there are no non-overlapping regions, i.e. the detectors produce the same regions. A complementary score of 1 states that the detections of the detectors are completely different. Thus a complementary score close to 1 reveals good detector combinations.
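A sketch of Eq. (4.4). It assumes that matched regions are represented by identifiers such that regions of different detectors which overlap map to the same identifier; how that identification is done follows from the overlap criterion of Section 4.3.

def complementary_score(match_sets):
    """Eq. (4.4): size of the union of the correctly matched region sets of all
    combined detectors divided by the sum of their individual sizes.

    match_sets : one set of region identifiers per detector, where overlapping
                 regions of different detectors share the same identifier.
    """
    union = set().union(*match_sets)
    total = sum(len(m) for m in match_sets)
    return len(union) / total if total else 0.0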

4.2 Representation of the detections

The variety of local detectors is vast and their results may differ substantially. However, most of the detectors represent their results in a similar manner. In most cases the detectors return a center location and a measurement area around the center. Most of the difference lies in the representation of the measurement area. In the case of simple interest point detectors only a location is returned for a detection. Scale invariant detectors commonly return a center location and a circular measurement region based on the center location. Most of the affine invariant detectors return a center location and an elliptical measurement region based on the center location. Within this framework we thus distinguish between two representations, a point representation (PR) and a region representation (RR). The point representation only contains the x and y coordinates of the detection; the region representation contains the x and y coordinates of the detection and an elliptical measurement region centered at the given location. For interest point operators the point representation is used. For scale invariant detectors and affine invariant detectors the region representation is used. A special case, however, is the MSER detector. It returns a measurement region whose shape cannot be described by an ellipse. In detail, the detector returns a point set which describes the outline of the border of the detection. As an approximation of the region shape the ellipse defined by the covariance matrix of the border pixels is used.
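A small sketch of this approximation, assuming the detector output is an N x 2 array of contour pixels:

import numpy as np

def mser_to_ellipse(border_points):
    """Approximate an MSER contour by the ellipse of its border pixels."""
    pts = np.asarray(border_points, dtype=float)
    center = pts.mean(axis=0)                 # ellipse center
    cov = np.cov(pts, rowvar=False)           # 2x2 covariance of the border pixels
    return center, cov                        # ellipse: (x-c)^T cov^{-1} (x-c) = const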

4.3 Detection correspondence

Detecting the correspondences which are necessary for the calculation of the previously described measures is done by geometric means: a detection in one image is projected into the other image. The two detections correspond if the projection from the first image and the detection in the second image are at the same location. Here we have to distinguish between the two representations of the detections. Let us consider the point representation first.

In the point representation we have a detection p = [x y] in image I and a detection q = [x y] in image I'. By geometric means we project the detection p into image I' and denote it by p'. We define that p and q correspond if the Euclidean distance between q and p' is smaller than a threshold t_p. According to previous evaluations [71], t_p is set to 1.5 pixels.

In the region representation the two detections p and q are ellipses. p' is p transferred into the image I'. In general p' is not an ellipse anymore; the transfer may change the shape of the ellipse p into a complex form depending on the 3D structure of the region. The correspondence of p and q is determined by checking whether the areas of q and p' overlap. We therefore calculate the overlap of both structures as follows:

\[ \mathrm{overlap} = \frac{q \cap p'}{q \cup p'} \tag{4.5} \]

The two detections p and q correspond if the overlap is higher than a threshold t_r. According to previous evaluations [71], t_r is set to 50%. How to calculate the intersection and union areas of p and q is outlined in the next section.

4.3.1 Transferring an elliptic region

For correspondence detection it is necessary to compute how an ellipse detected in image I is seen from the vantage point of the second image I' and where the ellipse is located in image I'. We refer to this as transferring an ellipse from image I to image I'. The result depends strongly on the underlying 3D structure of the scene. If the elliptic image structure lies on a plane in 3D, the corresponding pixel coordinates in the other image form an ellipse too. What is more, the shape can be calculated analytically if the geometric relations between both images (vantage points) are known (see Appendix A). In every other case the shape of the corresponding pixel coordinates changes according to the underlying 3D structure. In general it is not a conic anymore, and it is no longer possible to calculate the corresponding shape analytically. An approximation of the resulting shape can be computed by sampling the original ellipse border with a raster and transferring each point individually into the other image. By connecting all points in the same order as in the original image we get a sampled (i.e. polygonal) representation of the resulting shape. The area covered is then defined by transferring the pixel coordinates inside the ellipse.

Let us denote an ellipse detected in image I as E_1 and let E'_1 be the ellipse detected in I and transferred to the other image I'. Ellipse E'_2 is the ellipse detected in I'. When the parameter form of the ellipses E_1, E'_1 and E'_2 is known, the values necessary to calculate the overlap can be computed by pixel counting. The intersection area q ∩ p' is computed by counting the number of pixels which lie within both ellipses E'_1 and E'_2. The union area q ∪ p' is computed by counting all pixels which lie within either E'_1 or E'_2, without counting the pixels belonging to both ellipses twice. The overlap can be computed at arbitrary accuracy by choosing an accordingly fine pixel raster.

However, the parameter form of E'_1 is only known for the planar case. For the non-planar case the transfer result is only determined by a set of pixels of arbitrary shape.
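For the planar case the pixel counting described above can be sketched as follows. The ellipse parameterization (center c and matrix A with the ellipse given by (x - c)^T A (x - c) <= 1) and the raster step are assumptions of this sketch; detector output may need to be converted into this form first.

import numpy as np

def ellipse_overlap_by_counting(c1, A1, c2, A2, step=0.25):
    """Overlap of Eq. (4.5) for two ellipses by counting raster points."""
    c1, c2 = np.asarray(c1, float), np.asarray(c2, float)
    # generous common bounding box from the largest semi-axes
    r1 = 1.0 / np.sqrt(np.linalg.eigvalsh(A1).min())
    r2 = 1.0 / np.sqrt(np.linalg.eigvalsh(A2).min())
    lo = np.minimum(c1 - r1, c2 - r2)
    hi = np.maximum(c1 + r1, c2 + r2)
    X, Y = np.meshgrid(np.arange(lo[0], hi[0], step),
                       np.arange(lo[1], hi[1], step))
    pts = np.stack([X.ravel(), Y.ravel()], axis=1)

    def inside(c, A):
        d = pts - c
        return np.einsum('ni,ij,nj->n', d, A, d) <= 1.0

    in1, in2 = inside(c1, A1), inside(c2, A2)
    return (in1 & in2).sum() / (in1 | in2).sum()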

4.3.2 Calculating the overlap area from the point set representation

By point set representation we denote the case where an ellipse is represented by the set of points within its border. For the non-planar case a transfer is only possible with this representation. As stated before, after the transfer the original elliptical region may end up with an arbitrary shape. The projective transfer can introduce gaps in the resulting structure, and depth discontinuities may even split up the transferred structure into several pieces. Calculating an exact solution for the transferred area under these circumstances is not possible. We therefore calculate an approximation of the needed areas.

After transferring the point set of E_1 into I' we can assign the points to two sets. One set P contains all points which are located inside the ellipse E'_2 and the other set Q contains all other points. We can define a ratio r as

\[ r = \frac{|P|}{|P| + |Q|}. \tag{4.6} \]

The intersection area is approximated by the area of the convex hull of the set of transferred points P. This approximation also gives a good estimate if there is a significant scale change and the transferred points are spread out.

The area of the union is approximated as the sum of the area of the original ellipse E'_2 and the area represented by Q. The area of E'_2 can be calculated exactly. However, it is not possible to approximate the area of Q by the convex hull as done for P, because Q is not assumed to represent one connected structure. The area of Q is therefore estimated from the ratio r between the point sets P and Q.

\[ \mathrm{area}(Q) = \frac{(1 - r)\,\mathrm{area}(P)}{r} \tag{4.7} \]

\[ \mathrm{overlap} = \frac{\mathrm{area}(P)}{\mathrm{area}(E'_2) + \mathrm{area}(Q)} \tag{4.8} \]

Figure 4.2 illustrates the area approximation. The black ellipse is the ellipse E'_2, whose area can be calculated exactly. The red ellipse is the exactly transferred ellipse E'_1. The convex hull of the part of E'_1 which is located within E'_2 is drawn in blue. The blue and red crosses mark the pixel locations which represent the transferred ellipse E'_1 as a point set and which are used for the area approximation.
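The approximation of Eqs. (4.6)–(4.8) can be sketched as follows; the inputs (the transferred pixel set, a mask marking which transferred points fall inside E'_2, and the exact area of E'_2) are assumed to be available from the transfer and from the ellipse parameters.

import numpy as np
from scipy.spatial import ConvexHull

def approximate_overlap(transferred_pts, inside_mask, area_E2):
    """Overlap of Eq. (4.5) approximated from the point set representation."""
    transferred_pts = np.asarray(transferred_pts, dtype=float)
    inside_mask = np.asarray(inside_mask, dtype=bool)
    P = transferred_pts[inside_mask]
    n_P, n_Q = len(P), int((~inside_mask).sum())
    if n_P < 3:                                  # too few points for a hull
        return 0.0
    r = n_P / (n_P + n_Q)                        # Eq. (4.6)
    area_P = ConvexHull(P).volume                # in 2D, 'volume' is the hull area
    area_Q = (1.0 - r) * area_P / r              # Eq. (4.7)
    return area_P / (area_E2 + area_Q)           # Eq. (4.8)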


Figure 4.2: Illustration of the overlap area approximation.

4.3.3 Justification of the approximation

We give an experimental justification of the presented approximation method by transferring ellipses and comparing the approximated overlap with the true overlap. The relative error of the approximation was calculated for various overlap situations with varying viewing angle (from -45° to 45°) and with increasing scale factor (from 0.5 to 2). The various test cases are illustrated in Figure 4.3. Figure 4.3(a-d) shows the scale steps 0.5, 1, 1.5, 2. Figure 4.3(e-h) shows the viewpoint angles of -45°, -30°, 0°, 45°. The initially uniformly distributed point set gets perspectively distorted. Figure 4.3(i-l) shows various overlap scenarios. Table 4.1 summarizes the results. The approximation error increases with increasing viewing angle. Especially for large scale changes the approximation introduces high errors. However, such cases can be identified by analyzing the distribution of the transformed point set and can then be highlighted for manual inspection.

Figure 4.3: (a-d) Scale change from 0.5 to 2. (e-h) Viewpoint change from -45° to 45°. (i-l) Various overlap scenarios.


viewing angle [°]   error [%]   error [%]   error [%]   error [%]
                    scale 0.5   scale 1.0   scale 1.5   scale 2.0
-45                 7.4         10.4        20.1        26.2
-40                 8.1         7.9         12.7        19.8
-35                 8.2         7.6         8.7         15.0
-30                 7.7         7.7         7.8         11.0
-25                 9.1         7.4         7.3         7.9
-20                 8.9         7.1         7.3         7.1
-15                 6.3         7.1         6.7         7.1
-10                 3.5         6.2         6.7         6.7
-5                  4.6         5.8         6.4         6.3
0                   5.6         5.3         6.3         6.6
5                   5.4         5.6         6.4         6.3
10                  3.5         6.2         6.8         6.7
15                  5.5         7.3         6.7         7.1
20                  8.9         7.1         7.3         7.1
25                  9.1         7.4         7.3         7.9
30                  8.3         7.8         7.8         11.1
35                  8.2         7.6         8.8         15.0
40                  8.1         7.9         12.7        19.8
45                  7.4         10.5        20.1        26.2

Table 4.1: Overlap approximation error compared to the exact overlap for viewing angles from -45° to 45° and scale changes from 0.5 to 2.

4.4 Point transfer using the trifocal tensor

For non-planar scenes the pixel-by-pixel transfer of the ellipses in point set representation can be computed using the trifocal tensor. The trifocal geometry describes the relations between images taken from 3 different vantage points. That means in trifocal geometry there are 3 images, say I, I' and I''. Point locations are denoted in the same way, p, p' and p'', where p is a homogeneous vector containing the x and y coordinates, p = [x y 1]^T. The geometry between the 3 images is encapsulated by the trifocal tensor T, which can be estimated from point correspondences p ↔ p' ↔ p'' in the 3 images. The point transfer property allows computing the location of a matched pair of points p ↔ p' in a third view I'', provided the trifocal tensor between the three views is known (see Appendix B for details). This relation can be written as

\[ p'' = f(T, p \leftrightarrow p'), \tag{4.9} \]

where T is the trifocal tensor. Assume that we want to transfer the pixels of an ellipse from view I to I''. One consequence of this relation is that for each ellipse point in I we need to know the corresponding point in a second image I' to carry out the transfer. This can be achieved by establishing a dense matching between the images I and I', i.e. for every pixel location in I we know the corresponding location in I'. This allows transferring each location of I to I''. The entities needed for the point transfer, i.e. the trifocal tensor and the dense matching, are further denoted as ground truth.
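A sketch of the point transfer of Eq. (4.9) in the standard formulation of [44]: the point p'' is obtained by contracting the tensor with p and a line l' through p'. Choosing l' perpendicular to the epipolar line of p is the usual way to avoid a degenerate choice; passing the fundamental matrix F of the first two views in for this purpose is an assumption of this sketch.

import numpy as np

def transfer_point(T, p1, p2, F):
    """Point transfer p'' = f(T, p <-> p') of Eq. (4.9).

    T  : 3x3x3 trifocal tensor, indexed as T[i, j, k]
    p1 : homogeneous point [x, y, 1] in view I
    p2 : homogeneous point [x, y, 1] in view I'
    F  : fundamental matrix of I and I' (epipolar line of p1 in I' is F @ p1),
         used only to pick a non-degenerate line through p2
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    le = F @ p1                                   # epipolar line of p1 in I'
    # line through p2 perpendicular to the epipolar line
    lp = np.array([le[1], -le[0],
                   (le[0] * p2[1] - le[1] * p2[0]) / p2[2]])
    # p''^k = p1^i * l'_j * T[i, j, k]
    p3 = np.einsum('i,j,ijk->k', p1, lp, T)
    return p3 / p3[2]

Applying this to every pixel of an ellipse in I, with the corresponding pixel in I' taken from the dense matching, yields the transferred point set used in Section 4.3.2.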



4.5 Ground truth generation

The ground truth is the geometric information for the test images which is necessary to perform the ellipse transfer between two images. It is composed of a dense matching between two nearby images and the trifocal tensors between the image triplets used for evaluation. A typical evaluation scenario is illustrated in Figure 4.4. A 3D scene (or object) is imaged from various vantage points. It is convenient to move the camera on a circular path around the 3D object so that the viewpoint change can be annotated in degrees. One image has to be chosen as the reference image. An image close to the reference image should be used for the dense matching; it will serve as the intermediate image for the point transfer. The other images of the sequence can then be used for the evaluation. However, there are basically no geometrical restrictions on an image sequence to be used for evaluation. It is not necessary to acquire the test images with a special setup (e.g. a turn-table). Nothing more than the images themselves is necessary to create the ground truth data and to do the evaluation. That means it is possible to create evaluation ground truth for whatever images one obtains (e.g. downloads from the internet).

4.5.1 Trifocal tensor

The trifocal tensor encapsulates the geometry between three images. It is the analogue of the fundamental matrix of the two-view case. The trifocal tensor can be calculated from 7 point correspondences across the three images [44]. The calculation of the trifocal tensor from the point correspondences is straightforward; the difficult part, however, is the detection of the point correspondences. Due to the nature of the test cases, wide-baseline methods are needed to generate the point correspondences. In our evaluation framework point correspondences are automatically established by detecting MSER regions [70] and matching them using the SIFT descriptor [67]. For cases where this automatic method fails the correspondences must be selected manually.

4.5.2 Dense matching

For dense matching of two nearby images there exists a variety of algorithms [46, 58, 93, 102]. However, one special requirement for our dense matching is sub-pixel accuracy, such that the points of the reference image lie on the pixel raster and the points in the intermediate image are sub-pixel shifted to achieve the best correlation. Therefore we do not simply employ one of the standard algorithms but implement a dense matching which fits our needs. Our matching method is outlined in Algorithm 1.

Algorithm 1 Dense matching
Interest point detection and matching on low resolution images
Robust fundamental matrix estimation (RANSAC)
Image rectification
Initial iterative point matching (enforcing the epipolar constraint)
Upgrade to dense sub-pixel matching

The first step of the dense matching is to estimate the fundamental matrix, a necessary precondition for image rectification. Harris corners [40] are extracted and matched using template matching with normalized cross correlation. This is done on re-sampled lower resolution versions of the images, which speeds up the initial matching enormously. The detected point correspondences are used to calculate the fundamental matrix. The Gold standard method for fundamental matrix estimation is used [44]; it is robust against outliers and minimizes the re-projection error. The next step is to rectify the images. The projective rectification method proposed by Hartley is used [43]. The method is able to work with uncalibrated images; the prerequisites are the fundamental matrix and a small set of point correspondences. It works by factorizing the fundamental matrix and estimating a matching pair of image transformations which are applied to the images. The corresponding points have to be outlier-free and very accurate, as inaccurate point matches affect the algorithm badly. Thus only a subset of the initial point matches which fit the epipolar geometry best is selected for that step. The rectified images are then matched again with an iterative method [50]. The algorithm returns a 4×4 grid matching at sub-pixel accuracy and enforces the epipolar constraint. For sub-pixel accuracy the method of Lan and Mohr [62] is used, which is reported to achieve a matching precision better than 0.1 pixels for selected interest points. In the next step the matching is densified by filling in the matches between the grid points. Because of the grid matching it is possible to restrict the search window for template matching to a very small area. In fact, by establishing an affine transformation between 3 neighboring grid points the expected position of a corresponding point in the other image can be calculated. In most cases only a sub-pixel correction of the point match is necessary. As a last step the point matches are de-rectified to obtain the point correspondences in the original image coordinates.
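The densification step between grid points can be illustrated with the following simplified stand-in (not the thesis implementation): on rectified images the corresponding point lies on the same scanline, so a small normalized cross correlation search around the disparity predicted from the neighboring grid matches, followed by parabolic interpolation, gives a sub-pixel estimate. Window size, search range and boundary handling are assumptions of this sketch.

import numpy as np

def refine_match(left, right, x, y, disp0, half=3, search=2):
    """Sub-pixel disparity at (x, y) of the rectified left image around a
    predicted integer disparity disp0 (boundary checks omitted)."""
    t = left[y - half:y + half + 1, x - half:x + half + 1].astype(float)
    t = (t - t.mean()) / (t.std() + 1e-9)
    scores = []
    for d in range(disp0 - search, disp0 + search + 1):
        w = right[y - half:y + half + 1,
                  x - d - half:x - d + half + 1].astype(float)
        w = (w - w.mean()) / (w.std() + 1e-9)
        scores.append((t * w).mean())            # normalized cross correlation
    scores = np.array(scores)
    i = int(scores.argmax())
    if 0 < i < len(scores) - 1:                  # parabolic sub-pixel refinement
        a, b, c = scores[i - 1], scores[i], scores[i + 1]
        i = i + 0.5 * (a - c) / (a - 2 * b + c + 1e-12)
    return disp0 - search + i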

4.5.3 Ground truth quality

Inaccuracies of the generated ground truth directly affect the evaluation results. But first let us consider the relevant quality characteristics:

• False point correspondences
• Inaccurate point correspondences
• Regions without dense matching because of homogeneous texture
• Regions without dense matching because of occlusions
• Inaccurate trifocal tensors

Most of the listed characteristics concern the dense matching. Let us discuss the different cases in more detail. It is almost impossible to create a dense matching which is completely free of false point correspondences. Although one can enforce the epipolar constraint and an additional ordering constraint, this does not guarantee that all false correspondences get discarded. However, in most dense matching algorithms the number of false correspondences is close to zero. For our application, where a set of point matches is used to represent an ellipse, the occurrence of one or two false matches would hardly influence the overall result, because this number of false matches is negligible compared to the number of correct points used for the representation.

The accuracy of the point correspondences, however, is a very crucial issue. It directly affects the point transfer. In fact, pixel accurate matches are not accurate enough; sub-pixel accuracy is necessary. With the used sub-pixel method [62] the necessary accuracy can be achieved.


Another critical issue is when the dense matching does not cover the whole image. It is known that homogeneous and non-textured regions cause problems for correlation based matching algorithms. Correlation based matching requires a local variance of the gray values. If the non-textured region is bigger than the correlation window it is not possible to identify the matching pixel location. Thus such image parts may not be covered by dense matches. The consequence is that the representation of local detections for the evaluation is not complete, which may affect the evaluation results.

Parts without dense matching may also occur because of occlusions or depth discontinuities. As we are dealing with non-planar scenes and different vantage points such cases will definitely occur. However, as the dense matching is done on short-baseline images, the influence of occlusions and depth discontinuities is only minor compared to the parts missing because of non-textured regions.

The points discussed so far were issues of the dense matching. However, the accuracy of the trifocal tensor also directly affects the accuracy of the point transfer. The trifocal tensor is calculated from wide-baseline matches across three views. Inaccuracies in the estimation may result from a low number of point correspondences as well as from inaccurate point correspondences themselves. Wide-baseline images taken from widely different vantage points may show strong occlusions, which makes it difficult to establish point matches which are well distributed over the whole image area. Such configurations can also result in an inaccurate estimation of the trifocal geometry.

Now that we have identified the different effects which influence the quality of the ground truth, we can think about assessing the quality in a quantitative way. The following quantities can be measured:

• Re-projection error of the point correspondences<br />

• Re-projection error <strong>for</strong> trifocal tensor<br />

• Number of non-matched image pixels<br />

The re-projection error [44] of the dense point correspondences gives a measure of the accuracy of the matching. It is calculated by building the 3D reconstruction of the point matches, re-projecting it into the images and calculating the distance to the original point correspondences. To calculate the 3D reconstruction it is necessary to estimate the fundamental matrix from the point matches.

The re-projection error for the trifocal tensor is calculated similarly. The difference is that the 3D reconstruction is computed using the trifocal tensor and that the re-projection error is summed over 3 images.

The number of non-matched image pixels is easily computed by simple counting.
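A sketch of the first quantity, assuming a projective reconstruction from a canonical camera pair derived from the fundamental matrix (P = [I | 0], P' = [[e']_x F | e']); the Gold standard estimation mentioned above would additionally refine F itself. OpenCV is used here only for the triangulation.

import numpy as np
import cv2

def reprojection_error(F, pts1, pts2):
    """Mean re-projection error of point matches pts1 <-> pts2 (N x 2 arrays)
    in a projective reconstruction defined by the fundamental matrix F."""
    _, _, Vt = np.linalg.svd(F.T)                 # e' is the left null vector of F
    e2 = Vt[-1]
    ex = np.array([[0, -e2[2], e2[1]],
                   [e2[2], 0, -e2[0]],
                   [-e2[1], e2[0], 0]])
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = np.hstack([ex @ F, e2.reshape(3, 1)])

    X = cv2.triangulatePoints(P1, P2, pts1.T.astype(float), pts2.T.astype(float))
    err = 0.0
    for P, pts in ((P1, pts1), (P2, pts2)):
        proj = P @ X
        proj = (proj[:2] / proj[2]).T             # back to inhomogeneous pixels
        err += np.linalg.norm(proj - pts, axis=1).mean()
    return err / 2.0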

Another idea is not to evaluate the ground truth data itself but the ground truth generating methods. This can be done by generating a synthetic test scene (with known ground truth) and applying the ground truth generation to this scene. The estimated ground truth can be compared with the known ground truth and the estimation errors can be reported. This evaluation can measure the following values:

• The number of false point correspondences<br />

• Error distance of dense matches <strong>and</strong> synthetic matches in intermediate image<br />

• Transfer error of the estimated trifocal tensor



Figure 4.4: Two nearby images (e.g. the first two) from the whole sequence are used to create the dense matching. With the trifocal tensor it is possible to transfer a point location given in the first image into every other image of the sequence.

The listed values can be calculated in a straightforward way. For the synthetic scene all point correspondences are known, so the false correspondences can easily be identified and counted. The error distance between the established matches and the synthetically known point correspondences can also be calculated very easily; its standard deviation characterizes the quality of the matching. The quality of the trifocal tensor estimation can be characterized by the transfer error. For the synthetic data the position of every pixel of the reference image in the other views is known. When transferring the pixels from the reference image to another view using the synthetic point matches, the only error source lies in the trifocal tensor. The transfer error is the distance of the transferred point to the exact point location. This gives a quality measure for the trifocal tensor.

4.6 Experimental evaluation

Ground truth was calculated for 2 different complex scenes. The test scene "Group" shows two boxes and was acquired on a turntable. This scene is piece-wise planar. The second test scene "Room" shows a part of a room. This scene is of higher complexity than the first one. Both image sequences consist of 19 images and the viewpoint varies from 0° to 90°. Figure 4.5(a), (c) show examples of both scenes. Figure 4.5(b), (d) show the depth maps resulting from the dense matching. Black image parts contain no matches. The "Group" scene with a resolution of 896×1024 pixels is covered by matches to 96.5% (452167 pixels), excluding the background. The "Room" scene with a resolution of 800×600 pixels is covered by matches to 71.4% (342668 pixels). Most of the missing parts are due to large homogeneous regions. This does not severely bias the evaluation results because most detectors will not find regions in homogeneous image parts. For the interest point evaluation the average distance to the nearest matched point was 0.43 pixels. For the interest region evaluation the average coverage of the regions with matched points is 86%.


Figure 4.5: (a) Test scene "Group". (b) Depth map for the "Group" scene (not matched parts are black). (c) Test scene "Room". (d) Depth map for the "Room" scene.

4.6.1 Repeatability and matching score

We evaluate 7 different detectors under increasing viewpoint change. The compared values are the repeatability score and the matching score. The evaluated detectors are the Maximally Stable Extremal Regions (MSER) [70], the Hessian-Affine regions [73], the Harris-Affine regions [73], the intensity based regions (IBR) [112], the Difference of Gaussian keypoints (DOG) [67], and Harris and Hessian corners [40].

For the detectors we use the publicly available implementations from Mikolajczyk. Figure 4.6 shows the repeatability scores for the "Group" scene. The best performances are obtained by the MSER and the DOG detector. In fact, the repeatability score remains above 40% even for viewpoint changes up to 90°. Figure 4.7 shows the evaluation results for the "Room" scene. The best performance is achieved by the DOG and IBR detectors. The IBR detector in particular shows



Figure 4.6: (a) Repeatability score for the "Group" scene. (b) Absolute number of correspondences.

high repeatability scores for large viewpoint changes too. Overall, the repeatability scores for this complex scene are lower than those for the "Group" scene. This is because the "Group" scene is composed of only 2 piece-wise planar objects while the "Room" scene contains many more objects of arbitrary shape. Generally speaking, the results fulfill our expectations. While the repeatability of the simple interest corner detectors drops very fast with increasing viewpoint change, the scores for the more advanced affine invariant detectors stay quite high. However, the plot of the absolute number of repetitive detections shows that for the "Room" scene the number of repetitive detections of the MSER detector drops below 20 for viewpoint changes larger than 45°. For some algorithms such a low number of possible matches would not allow them to run robustly. Other approaches like the DOG detector are still able to produce more than 150 possible matches at such large viewpoint changes.

Figure 4.8(a) shows the matching scores for the "Group" scene relative to the number of



Figure 4.7: (a) Repeatability score for the "Room" scene. (b) Absolute number of correspondences.

detected regions. The number of matches is related to the smaller number of detected regions in both images. Figure 4.8(b) shows the matching scores related to the number of possible matches. Possible matches are region correspondences established geometrically using the ground truth. The first measure represents how many of the initial detections could be correctly matched. The second measure reveals how well the detector selects discriminative image regions, as it relates the number of correct matches to the number of possible matches. Figure 4.8(c) shows the absolute number of correct matches, which is interesting if subsequent algorithms require a certain number of correspondences, e.g. epipolar geometry estimation. Figure 4.9 shows the matching scores for the "Room" scene.

In this experiment we expect to see significant differences between the simple point detectors, the scale invariant detectors and the affine invariant detectors. In particular, the normalization of the affine invariant regions should compensate for the viewpoint change. And indeed, the MSER detector achieves on average the best matching scores. The DOG detector shows surprisingly low matching scores. While it starts similar to the other detectors, its matching score drops very fast with increasing viewpoint change. This is all the more surprising as the DOG detector was initially introduced together with the SIFT descriptor. Most impressive are the results for the simple point detectors. For small viewpoint changes their results rank among the top three. With increasing viewpoint change, however, the matching scores drop dramatically.

Comparing these results to the previous evaluation of Mikolajczyk [74] on planar scenes only, one can see two main differences. First, the MSER detector provides a significantly higher performance than the other affine invariant detectors, especially for the matching score. This comes from the fact that the MSER detector is not corner based and does not tend to detect regions on depth discontinuities. Second, the evaluations on the "Room" scene show that the achievable repeatability and matching scores for complex scenes are considerably lower than those achieved on planar scenes. This means one must expect a much lower number of matches in practice than the previous evaluations suggested.

4.6.2 Combining local detectors

This experiment evaluates the benefit gained by combining different detectors. A benefit can be gained if the combined detectors produce detections in different parts of the image. To assess this we measure the complementary score c_i^n. Figure 4.10 shows a cumulative plot of the relative numbers of non-overlapping matched interest regions from 5 different detectors for the "Group" and "Room" scenes. Every line shows how many new regions are added to the previous set of interest regions by the specific detector. The graphs show clearly that combining local detectors leads to a larger set of distinct image regions over a wide range of viewpoint changes. It is remarkable that the regions from a combination of all 5 detectors still contain less than 20% overlapping ones. In real applications, however, performance issues usually do not permit running all detectors on an input image. A good choice for combining 2 detectors would be the MSER and DOG detectors, which are apparently the 2 fastest detectors; the graphs in Figure 4.11 show only a small number of overlapping regions for this combination. Combining the Harris-Affine and Hessian-Affine detectors, in contrast, creates a significant number of overlapping regions, as seen in Figure 4.12. This is expected as the algorithms of both methods are quite similar.
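The counting of non-overlapping regions can be sketched as follows. This is a simplified illustration, not the evaluation code used here: regions are approximated by boolean masks, and the 40% overlap threshold is borrowed from the criterion stated in Section 5.7, so it is an assumption for this experiment.

```python
# Simplified sketch: count regions of a further detector that do not overlap
# the regions accumulated so far by more than max_overlap.
import numpy as np

def region_mask(center, radius, shape):
    """Boolean mask of a circular region (the actual regions are ellipses;
    circles are used here only to keep the sketch short)."""
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
    return (xx - center[0]) ** 2 + (yy - center[1]) ** 2 <= radius ** 2

def add_non_overlapping(accepted_masks, candidate_masks, max_overlap=0.4):
    """Keep candidates whose overlap with every accepted region stays below max_overlap.

    Overlap is measured as intersection area divided by the smaller region area."""
    added = []
    for cand in candidate_masks:
        ok = True
        for acc in accepted_masks:
            inter = np.logical_and(cand, acc).sum()
            if inter / max(1, min(cand.sum(), acc.sum())) > max_overlap:
                ok = False
                break
        if ok:
            accepted_masks.append(cand)
            added.append(cand)
    return added

# Cumulative plot: start with the first detector's matched regions, then add the
# non-overlapping regions contributed by each further detector in turn.
```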


view change [°] | MSER | DOG | Har.-Aff. | Hes.-Aff. | IBR | Harris | Hessian
10 | 76.3 / 180 | 68.2 / 1113 | 48.5 / 586 | 56.4 / 457 | 57.6 / 364 | 69.3 / 1244 | 70.4 / 853
15 | 67.8 / 164 | 63.8 / 1056 | 45.0 / 543 | 50.9 / 412 | 52.6 / 318 | 59.6 / 1070 | 64.2 / 734
20 | 59.8 / 143 | 59.6 / 950 | 39.9 / 482 | 46.7 / 358 | 49.8 / 308 | 38.3 / 662 | 47.7 / 538
25 | 56.0 / 126 | 55.8 / 889 | 36.3 / 436 | 43.5 / 359 | 48.8 / 290 | 39.2 / 636 | 44.6 / 519
30 | 57.7 / 127 | 52.5 / 811 | 36.3 / 439 | 40.6 / 341 | 48.5 / 275 | 34.3 / 534 | 40.0 / 464
35 | 55.6 / 120 | 50.9 / 754 | 33.8 / 408 | 41.2 / 353 | 47.2 / 265 | 31.9 / 494 | 38.3 / 459
40 | 50.7 / 114 | 50.0 / 742 | 33.6 / 406 | 37.2 / 236 | 45.3 / 267 | 26.4 / 412 | 31.9 / 401
45 | 50.9 / 112 | 49.2 / 703 | 31.0 / 375 | 34.0 / 253 | 44.3 / 255 | 28.0 / 443 | 32.0 / 402
50 | 46.6 / 108 | 48.9 / 713 | 31.4 / 379 | 35.6 / 258 | 46.8 / 256 | 28.9 / 457 | 32.9 / 414
55 | 47.2 / 110 | 46.7 / 695 | 32.0 / 387 | 34.4 / 279 | 45.0 / 247 | 27.6 / 437 | 32.2 / 405
60 | 45.9 / 111 | 44.8 / 693 | 32.3 / 390 | 33.1 / 268 | 44.2 / 258 | 25.3 / 418 | 31.8 / 400
65 | 43.3 / 107 | 42.5 / 676 | 29.7 / 359 | 31.9 / 258 | 43.1 / 249 | 23.7 / 409 | 32.1 / 404
70 | 44.1 / 109 | 43.1 / 676 | 29.4 / 355 | 30.7 / 249 | 39.1 / 236 | 21.5 / 365 | 28.8 / 362
75 | 41.2 / 100 | 41.6 / 625 | 28.6 / 346 | 26.9 / 218 | 42.2 / 232 | 21.9 / 367 | 28.3 / 356
80 | 42.5 / 97 | 42.5 / 642 | 26.5 / 320 | 26.2 / 212 | 43.3 / 229 | 21.9 / 354 | 30.2 / 377
85 | 41.7 / 91 | 44.0 / 615 | 25.5 / 293 | 26.0 / 206 | 43.3 / 218 | 21.4 / 336 | 29.3 / 353
90 | 37.9 / 83 | 43.7 / 584 | 26.2 / 292 | 25.6 / 188 | 40.8 / 200 | 20.3 / 304 | 30.4 / 336

Table 4.2: Repeatability score [%] / absolute number of correspondences for the "Group" scene with changing viewpoint.


Figure 4.8: (a) Matching score for the "Group" scene relative to the number of detections. (b) Matching score for the "Group" scene relative to the number of possible matches. (c) Absolute number of correct matches.


Figure 4.9: (a) Matching score for the "Room" scene relative to the number of detections. (b) Matching score for the "Room" scene relative to the number of possible matches. (c) Absolute number of correct matches.


view change [°] | MSER | DOG | Har.-Aff. | Hes.-Aff. | IBR | Harris | Hessian
15 | 44.6 / 54 | 47.3 / 371 | 35.6 / 208 | 41.9 / 240 | 46.1 / 111 | 35.4 / 349 | 46.2 / 363
20 | 38.1 / 43 | 46.0 / 331 | 31.3 / 188 | 34.0 / 191 | 42.5 / 99 | 29.3 / 289 | 39.1 / 307
25 | 31.3 / 35 | 39.1 / 281 | 26.6 / 158 | 28.0 / 160 | 35.8 / 81 | 23.4 / 231 | 32.3 / 254
30 | 24.8 / 28 | 35.8 / 260 | 26.8 / 150 | 27.8 / 157 | 37.7 / 81 | 19.0 / 187 | 26.0 / 204
35 | 23.3 / 27 | 33.5 / 237 | 24.6 / 142 | 26.2 / 145 | 41.9 / 91 | 18.2 / 179 | 24.6 / 193
40 | 21.4 / 21 | 29.2 / 210 | 20.8 / 116 | 22.5 / 113 | 31.9 / 61 | 12.5 / 123 | 13.4 / 105
45 | 23.1 / 24 | 28.4 / 205 | 19.8 / 108 | 20.5 / 113 | 29.4 / 60 | 12.5 / 123 | 14.0 / 110
50 | 20.0 / 21 | 25.7 / 181 | 15.8 / 89 | 15.8 / 80 | 28.0 / 58 | 9.5 / 94 | 12.0 / 94
55 | 19.3 / 22 | 26.9 / 202 | 16.2 / 85 | 14.7 / 82 | 27.2 / 55 | 10.3 / 102 | 10.1 / 79
60 | 12.9 / 13 | 24.0 / 164 | 14.4 / 82 | 15.5 / 85 | 24.0 / 49 | 7.8 / 77 | 8.9 / 70
65 | 18.9 / 21 | 24.4 / 161 | 13.8 / 82 | 15.1 / 86 | 29.0 / 56 | 8.7 / 86 | 8.8 / 69
70 | 13.5 / 15 | 22.9 / 164 | 12.8 / 75 | 12.7 / 71 | 27.1 / 58 | 8.4 / 83 | 7.4 / 58
75 | 17.9 / 20 | 22.9 / 163 | 14.4 / 85 | 15.5 / 82 | 30.0 / 67 | 9.9 / 98 | 10.3 / 81
80 | 14.4 / 16 | 23.8 / 175 | 12.6 / 76 | 13.2 / 72 | 29.9 / 66 | 9.1 / 90 | 9.2 / 72
85 | 13.5 / 14 | 21.7 / 164 | 11.9 / 74 | 11.4 / 66 | 27.6 / 61 | 9.2 / 91 | 9.3 / 73
90 | 11.5 / 13 | 22.1 / 159 | 12.1 / 73 | 11.7 / 68 | 28.8 / 66 | 8.2 / 81 | 9.4 / 74

Table 4.3: Repeatability score [%] / absolute number of correspondences for the "Room" scene with changing viewpoint.


view change [°] | MSER | DOG | Har.-Aff. | Hes.-Aff. | IBR | Harris | Hessian
10 | 66.9 / 75 / 162 | 35.1 / 38.5 / 579 | 29.4 / 31.6 / 355 | 34.3 / 36.8 / 280 | 33.5 / 37.3 / 212 | 47.2 / 88.1 / 883 | 53.1 / 88.2 / 644
15 | 56.5 / 69 / 140 | 22.7 / 25.7 / 379 | 24.3 / 26.9 / 293 | 28.3 / 31.2 / 231 | 26.9 / 30.1 / 163 | 35.5 / 77.6 / 647 | 43.4 / 77.1 / 497
20 | 43.3 / 57.3 / 106 | 17.1 / 19.4 / 276 | 18.8 / 22.1 / 227 | 21.6 / 25.6 / 176 | 23.8 / 26.9 / 147 | 25.9 / 67.2 / 448 | 29.9 / 61.9 / 338
25 | 36.8 / 50.6 / 85 | 12.4 / 14.3 / 199 | 16.3 / 20.5 / 196 | 19.1 / 23.7 / 156 | 22.7 / 27.8 / 135 | 23.0 / 58.4 / 374 | 24.9 / 55.7 / 290
30 | 33.6 / 46.3 / 76 | 9.6 / 11.3 / 150 | 14.3 / 18.7 / 173 | 13.6 / 17.6 / 111 | 18.5 / 23.1 / 105 | 17.7 / 51.4 / 275 | 18.0 / 43.7 / 209
35 | 27.0 / 38.2 / 60 | 8.7 / 10.3 / 131 | 11.5 / 15.5 / 139 | 14.1 / 17.9 / 115 | 19.4 / 24.1 / 109 | 13.8 / 43.1 / 214 | 15.1 / 39 / 181
40 | 24.2 / 37.6 / 56 | 7.3 / 8.5 / 110 | 10.8 / 14.2 / 131 | 11.3 / 15.2 / 92 | 17.3 / 21.4 / 102 | 8.3 / 30.4 / 129 | 9.3 / 28.6 / 118
45 | 24.8 / 37.3 / 56 | 5.9 / 6.9 / 85 | 8.7 / 11.8 / 105 | 10.5 / 14.2 / 86 | 14.2 / 17.8 / 82 | 7.0 / 24.2 / 110 | 7.6 / 23.9 / 99
50 | 18.5 / 29.5 / 44 | 6.2 / 6.9 / 91 | 8.2 / 11.3 / 99 | 9.1 / 11.8 / 74 | 14.8 / 16.7 / 81 | 4.6 / 15.7 / 73 | 6.3 / 19.7 / 83
55 | 20.5 / 32.7 / 49 | 5.0 / 5.9 / 76 | 6.5 / 8.6 / 78 | 7.1 / 9.4 / 58 | 13.2 / 15.8 / 73 | 3.4 / 12.1 / 54 | 4.0 / 12.6 / 52
60 | 14.5 / 24.2 / 36 | 4 / 5 / 64 | 6.7 / 9 / 81 | 6.1 / 8.2 / 50 | 13.4 / 16.9 / 79 | 2.3 / 9 / 39 | 3.7 / 11.8 / 49
65 | 14.1 / 24.5 / 36 | 3.9 / 5 / 65 | 7.5 / 10.3 / 90 | 6.5 / 9.1 / 53 | 11.1 / 13.8 / 65 | 1.9 / 8 / 34 | 2.7 / 8.3 / 35
70 | 11.4 / 20.1 / 30 | 2.7 / 3.5 / 45 | 4.8 / 6.7 / 58 | 5.2 / 7.5 / 42 | 10.6 / 14 / 67 | 1.7 / 7.9 / 30 | 1.4 / 4.7 / 18
75 | 9.4 / 17.7 / 25 | 2.8 / 3.7 / 47 | 5.2 / 7.7 / 63 | 5 / 7.9 / 41 | 10.5 / 13.6 / 62 | 0.7 / 3.4 / 13 | 1.4 / 4.8 / 18
80 | 9.6 / 18.1 / 25 | 2.0 / 2.6 / 34 | 4.0 / 5.8 / 48 | 4.5 / 7.2 / 37 | 7.4 / 9.2 / 44 | 0.7 / 3.5 / 13 | 1.1 / 3.4 / 14
85 | 6.3 / 12.7 / 16 | 1.9 / 2.5 / 31 | 4.2 / 6.3 / 51 | 3.6 / 6 / 29 | 7.4 / 9.6 / 43 | 0.5 / 2.8 / 10 | 0.6 / 2 / 8
90 | 6.2 / 13.2 / 16 | 1.5 / 1.9 / 25 | 2.5 / 3.8 / 30 | 3.6 / 6.1 / 29 | 6.1 / 8 / 35 | 0.4 / 2.2 / 7 | 0.4 / 1.4 / 5

Table 4.4: Matching score [%] / matching score relative to the number of possible matches [%] / absolute number of correct matches for the "Group" scene with changing viewpoint.


view change [°] | MSER | DOG | Har.-Aff. | Hes.-Aff. | IBR | Harris | Hessian
15 | 19 / 50.8 / 31 | 11.2 / 18.1 / 102 | 12.8 / 21.7 / 91 | 16.6 / 24.8 / 106 | 12.8 / 19.8 / 36 | 16.4 / 56.4 / 202 | 2 / 5.1 / 19
20 | 9.2 / 29.4 / 15 | 6.7 / 10.9 / 58 | 8.5 / 14.6 / 60 | 11.5 / 18.4 / 73 | 12.8 / 22.9 / 36 | 9.9 / 39.8 / 115 | 1.2 / 3.8 / 12
25 | 5.5 / 20.5 / 9 | 4.8 / 9.3 / 42 | 7.3 / 14.1 / 51 | 9.0 / 17.4 / 57 | 8.9 / 18.7 / 25 | 5.1 / 24.5 / 57 | 0.4 / 1.6 / 4
30 | 4.3 / 20.6 / 7 | 3.4 / 7.2 / 31 | 6 / 12.2 / 42 | 5.7 / 11.3 / 36 | 7.8 / 16.4 / 22 | 3.5 / 21 / 39 | 0.2 / 0.9 / 2
35 | 3.7 / 18.2 / 6 | 2.2 / 4.5 / 20 | 3.2 / 6.7 / 23 | 3.9 / 8 / 25 | 4.6 / 9.6 / 13 | 2.0 / 13.3 / 24 | 0.2 / 1 / 2
40 | 4.6 / 25.9 / 7 | 1.3 / 3.5 / 12 | 2.7 / 7.1 / 19 | 2.6 / 6.5 / 16 | 4 / 11 / 11 | 0.7 / 7.1 / 9 | 0.2 / 1.8 / 2
45 | 3.7 / 20.7 / 6 | 0.7 / 1.7 / 6 | 0.9 / 2.4 / 6 | 1.9 / 4.8 / 12 | 3.9 / 11.3 / 11 | 0.9 / 8.7 / 11 | 0.1 / 0.8 / 1
50 | 3.7 / 24 / 6 | 0.7 / 1.9 / 6 | 2.4 / 7.6 / 17 | 1.0 / 3 / 6 | 1.5 / 4.6 / 4 | 0.4 / 5 / 5 | 0 / 0 / 0
55 | 2.5 / 13.8 / 4 | 0.7 / 1.7 / 6 | 1.6 / 5 / 11 | 0.9 / 2.9 / 6 | 2.1 / 6.1 / 6 | 0.3 / 3.8 / 4 | 0.1 / 1.2 / 1
60 | 0.7 / 5 / 1 | 0.2 / 0.7 / 2 | 0.6 / 1.8 / 4 | 0.5 / 1.6 / 3 | 2.6 / 8.6 / 7 | 0.2 / 2.5 / 2 | 0 / 0 / 0
65 | 1.8 / 10 / 3 | 0.2 / 0.6 / 2 | 0.6 / 1.7 / 4 | 0.3 / 0.9 / 2 | 2.2 / 6.2 / 6 | 0.2 / 2.2 / 2 | 0 / 0 / 0
70 | 1.2 / 9.5 / 2 | 0.2 / 0.6 / 2 | 0.6 / 1.7 / 4 | 0.2 / 0.6 / 1 | 1.1 / 2.9 / 3 | 0 / 0 / 0 | 0 / 0 / 0
75 | 1.2 / 7.1 / 2 | 0.1 / 0.3 / 1 | 0 / 0 / 0 | 0.6 / 2.2 / 4 | 1.8 / 4.3 / 5 | 0.1 / 1 / 1 | 0 / 0 / 0
80 | 0 / 0 / 0 | 0.1 / 0.3 / 1 | 0.7 / 2.3 / 5 | 0.2 / 0.5 / 1 | 1.1 / 2.8 / 3 | 0.1 / 1.1 / 1 | 0 / 0 / 0
85 | 0.6 / 4.2 / 1 | 0.1 / 0.3 / 1 | 0.4 / 1.4 / 3 | 0.5 / 1.6 / 3 | 1.8 / 4.1 / 5 | 0.2 / 2.2 / 2 | 0 / 0 / 0
90 | 1.2 / 10.5 / 2 | 0.3 / 0.9 / 3 | 0.3 / 0.9 / 2 | 0.5 / 1.6 / 3 | 1.1 / 2.5 / 3 | 0.2 / 2.4 / 2 | 0 / 0 / 0

Table 4.5: Matching score [%] / matching score relative to the number of possible matches [%] / absolute number of correct matches for the "Room" scene with changing viewpoint.


Figure 4.10: Relative numbers of non-overlapping matched regions for the combination of 5 detectors (MSER, Harris-Affine, Hessian-Affine, IBR, DOG). (a) "Group" scene. (b) "Room" scene.


Figure 4.11: Relative numbers of non-overlapping matched regions for the combination of the MSER and DOG detectors. (a) "Group" scene. (b) "Room" scene.


Figure 4.12: Relative numbers of non-overlapping matched regions for the combination of the Harris-Affine and Hessian-Affine detectors. (a) "Group" scene. (b) "Room" scene.


Chapter 5

Maximally Stable Corner Clusters (MSCC's)^1

^1 Based on the publications: F. Fraundorfer, M. Winter, and H. Bischof. MSCC: Maximally stable corner clusters. In Proc. 14th Scandinavian Conference on Image Analysis, Joensuu, Finland, pages 45–54, 2005 [36]. F. Fraundorfer, M. Winter, and H. Bischof. Maximally stable corner clusters: A novel distinguished region detector and descriptor. In Proc. 1st Austrian Cognitive Vision Workshop, Zell an der Pram, Austria, pages 59–66, 2005 [37].

The development of this novel local detector is motivated by the need for highly descriptive and discriminative regions for wide-baseline feature matching. Discriminability means that the detection has a unique appearance and is thus easy to distinguish from other detections. Descriptiveness means that the detection should possess a significant gray-value variance (e.g. texture) so that a meaningful feature vector can be built. Both measures play a crucial role in finding correspondences between different images, so it is not surprising that already the Moravec operator [79] accounted for them: by selecting points which show a high correlation difference when the image window is shifted only a little, point locations are detected which are highly descriptive. For instance, image regions with little texture (homogeneous regions) show a small response, while highly textured regions (which usually contain high intensity frequencies) show a high response. The aim for the new detector was therefore to identify highly textured, and thus descriptive, regions. The observation that a highly textured image region gives rise to a high number of Harris corners led to the idea of detecting distinguished regions based on conglomerations of Harris corners. Such regions can be detected by means of clustering, each detected cluster representing a distinguished region: the cluster center defines the position of the detection, whereas the outline of the region is defined by the cluster border. Regarding descriptiveness and discriminability it is interesting to think about the best possible detections. Obviously, the larger an image region, the larger its descriptiveness and discriminability; the most discriminating feature vector would be computed from a region extending from the center of the detection out to the image borders. However, only small regions are robust against occlusions. Thus, one is interested in detecting image regions that are as small as possible while offering a maximum of descriptiveness. Another important property of a local detector is its repeatability, i.e. detections are repeatedly reported at the same locations although the image undergoes transformations like rotation, scale change, lighting change, viewpoint change, etc. Such transformations easily occur in practice, and a good detector should be invariant to them, or at least robust


against them. Already in 1998, Schmid et al. [95] investigated state-of-the-art interest point detectors with respect to their robustness against certain classes of transformation, in particular rotation, scale and viewpoint change. The results for rotation and viewpoint change are reproduced in Figures 5.1 and 5.2. The experiments show the repeatability score of different interest point detectors under rotation and viewpoint change.

Figure 5.1: Results of the interest point detector evaluation of Schmid et al. [95]. The Harris detector shows high repeatability for rotated images. (a-b) Harris detections on the original and the rotated image. (c) Repeatability score. (Images from [95])

The experiments are reproduced here to stress the good repeatability score achieved by the simple Harris detector. Moreover, if single points are stable, then point clusters will be stable as well. In fact, point clusters will be even more stable, since a few missing single points will not affect the cluster itself. This leads straightforwardly to the idea of a new local detector based on clusters of interest points. In addition, point clusters provide a delineation of the region, yielding a textured and thus highly descriptive image region. In the following we describe a local detector based on this principle; we call the detected regions Maximally Stable Corner Clusters (MSCC).

Figure 5.2: Results of the interest point detector evaluation of Schmid et al. [95] on viewpoint changes. The Harris detector achieves high repeatability. (a-c) Examples of the test images; viewpoint change introduces perspective distortions. (d) Repeatability score. (Images from [95])

5.1 The MSCC detector

The detection of MSCC regions is equivalent to the detection of clusters in a 2-dimensional feature space. The features used are the x and y coordinates of the detected interest points. Clustering can be performed using graph-based methods, where each interest point represents a node. Clusters may appear in different sizes (scales) and may be nested, thus a hierarchical approach is needed. An important concept of the MSCC detector is a stability criterion which yields only reliable clusters: only clusters which are detected for varying scale parameters are selected. A detected MSCC is finally defined by the extent of the distribution of points contributing to the constellation.

The MSCC algorithm proceeds along the following three steps:

1. Detect single interest points all over the image, e.g. Harris corners.

2. Perform graph-based point clustering on multiple scales.

3. Select clusters which stay stable over a certain number of scales.

5.1.1 Interest point detection

To detect the interest points acting as cluster primitives we employ the Harris corner detector [40]. We select a large number of corners (all local maxima above the noise level) as corner primitives, which ensures that we are not dependent on a cornerness threshold. We do not apply non-maxima suppression, which would be common for other applications: in our case we are interested in Harris corners in close spatial proximity, and non-maxima suppression would thin out possible clusters.
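One possible way to obtain such corner primitives is sketched below using scikit-image; this is not the implementation used for the experiments, and the response scaling of the cornerness measure (and hence the threshold value) depends on the particular Harris implementation.

```python
# Illustrative corner extraction: keep all local maxima of the Harris response
# above a low threshold, without enforcing a suppression radius (min_distance=1),
# so that corners in close spatial proximity survive and can form clusters.
import numpy as np
from skimage.feature import corner_harris, corner_peaks

def harris_primitives(gray_image, sigma=0.5, threshold=1.0):
    """Return an (n, 2) array of (x, y) corner coordinates."""
    response = corner_harris(gray_image, sigma=sigma)
    peaks = corner_peaks(response, min_distance=1, threshold_abs=threshold)  # (row, col)
    return peaks[:, ::-1].astype(float)                                      # -> (x, y)
```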

5.1.2 Multi scale clustering

We would like to find high-density clusters of corners which are stable, i.e. a few missing corners or the addition of a few corners does not change the cluster structure. Since we do not know the number of clusters, we have to use a non-parametric clustering method. Clustering is performed by first computing the minimal spanning tree (MST) of the detected interest points and then removing edges so that the MST splits into multiple subtrees. Each subtree then corresponds to a cluster. The subdivision method is inspired by the MSER detector [70].

The MST is computed by interpreting the interest points with coordinates x_i = (x_1, x_2) as the nodes of an undirected weighted graph in 2D. The weight of the edge between two graph nodes i, j is their geometric distance

d_ij = sqrt((x_1^i − x_1^j)^2 + (x_2^i − x_2^j)^2),

to which we also refer as the edge length. The minimal spanning tree is the subset of edges which connects all nodes with the smallest cumulative edge length; by computing it we create edges between nearby nodes. A well-known method to compute the MST is Kruskal's algorithm [19]. Figure 5.3 shows a typical MST computed from detected Harris corners.

Figure 5.3: (a) Image with detected Harris corners. (b) MST computed from the Harris corners.

Given a threshold T on the edge length, we obtain a subdivision of the MST into subtrees by removing all edges with an edge length above this threshold. Different values of T produce different subdivisions of the MST, i.e. different point clusters. To create a multi-scale clustering we compute subdivisions of the MST for p regularly spaced thresholds T_1, ..., T_p between the minimal and maximal edge length occurring in the MST. An example of splitting an MST into subtrees is depicted in Figure 5.4. The full MST is shown in Figure 5.4(f). Five subdivisions are computed by applying 5 different thresholds T_1, ..., T_5 with T_i < T_{i+1}. Some subtrees stay the same for different thresholds, e.g. the two subtrees at the top of the image.

5.1.3 Selection of stable clusters

The previous step produced p different cluster sets. We are now interested in clusters which do not change their shape over several scales, i.e. those that are stable. As stability criterion for a cluster we compare the sets of their points: clusters consisting of the same set of points across r different scales are defined as stable and constitute the output of the MSCC detector. A similar stability criterion is used with great success by Matas et al. in the MSER detector [70].
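The clustering and stability selection (steps 2 and 3) can be summarized in a short sketch. The following is a minimal illustration, not the original implementation: it uses SciPy's minimum spanning tree and connected components, fixes the 1-pixel threshold spacing described in Section 5.4, and introduces a hypothetical minimum cluster size to discard tiny components.

```python
# Minimal sketch of MSCC clustering: MST on the corner points, multi-scale
# subdivision by edge-length thresholds, and selection of clusters that stay
# identical over p_r consecutive thresholds.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def mscc_clusters(points, p_r=5, min_cluster_size=4):
    """points: (n, 2) array of corner coordinates; returns a list of point clusters."""
    dist = squareform(pdist(points))                      # dense Euclidean distance graph
    mst = minimum_spanning_tree(dist).toarray()           # MST edge weights (others are 0)
    edge_lengths = mst[mst > 0]
    # Regularly spaced thresholds between the shortest and longest MST edge (1 px steps).
    thresholds = np.arange(edge_lengths.min(), edge_lengths.max() + 1.0, 1.0)

    history = {}                                          # cluster (frozenset) -> consecutive count
    stable = set()
    for t in thresholds:
        # Remove all MST edges longer than the current threshold; the remaining
        # connected components are the clusters at this scale.
        pruned = np.where(mst <= t, mst, 0.0)
        _, labels = connected_components(pruned, directed=False)
        clusters = set()
        for label in np.unique(labels):
            members = frozenset(np.flatnonzero(labels == label))
            if len(members) >= min_cluster_size:
                clusters.add(members)
        # A cluster is stable if it reappears unchanged for p_r consecutive thresholds.
        history = {c: history.get(c, 0) + 1 for c in clusters}
        stable.update(c for c, count in history.items() if count >= p_r)
    return [points[list(c)] for c in stable]
```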

Figure 5.5 illustrates the method on a synthetic test image showing 4 differently sized squares. The Harris corner detection step produces several responses at the corners of the squares. Connecting the single points with the MST reveals a structure in which clustering can easily be done by removing the longer edges. Clusters of interest points are indicated by ellipses around them. The test image demonstrates the capability of detecting stable clusters at multiple scales, from very small clusters at the corners of the individual squares up to the cluster containing all detected interest points.

Figure 5.4: (a-e) Subdivisions of the MST with 5 regularly spaced thresholds T_1, ..., T_5. Note that the two top subtrees do not change for the first three thresholds; they are stable. (f) Full MST computed from the Harris corners.

Figure 5.5: Example of the MSCC detector on a synthetic test image (clustered interest points are indicated by ellipses around them).

5.2 Region representation

As mentioned before, an MSCC region is defined by a clustered set of points C. Unlike many other detectors, the MSCC clusters have arbitrary shapes; an approximate delineation may be obtained by convex hull construction or by fitting ellipses.

Delineation using the convex hull is the preferred method; ellipse fitting gives only a coarse estimate of the region delineation. As an ellipse, however, the detection can be described efficiently with 4 parameters: the length of the major axis a, the length of the minor axis b, the angle of the major axis α and the ellipse center C = (c_x, c_y).

The ellipse parameters are defined by the covariance ellipse (covariance matrix) of the point distribution C. The covariance matrix Σ is defined as

Σ = E[(X − E[X])(X − E[X])^T],    (5.1)


where X is a column vector with n scalar random variable components and E[X] is the expected value of X. In our case X is an n × 2 matrix whose rows are the x and y coordinates of the n points forming the corner cluster,

X = [x_1 y_1; ... ; x_n y_n].    (5.2)

Σ is then the 2 × 2 sample covariance matrix, estimated as

Σ = (1/(n − 1)) (X − E[X])^T (X − E[X]).    (5.3)

A 2 × 2 covariance matrix can be represented as an ellipse; let us denote it the region ellipse. The parameters of the region ellipse, the length of the major axis a_e, the length of the minor axis b_e and the rotation angle α_e of the major axis, are encoded in the covariance matrix and can be computed by eigenvalue decomposition of Σ. The eigenvalue decomposition gives λ_1 and λ_2 with λ_1 > λ_2, and we set a_e = λ_1 and b_e = λ_2. Figure 5.6 illustrates the region ellipse for an MSCC point cluster. The region ellipse is drawn in black; the black crosses mark the individual corners of the point cluster from which the covariance matrix Σ is computed. The region ellipse is rotated according to the angle α_e, pointing into the main direction of the point distribution. The main direction is defined by the eigenvector for λ_1, denoted as v_1 = (v_x, v_y), and

α_e = arctan(v_y / v_x).    (5.4)

The region delineation is now created by scaling the region ellipse to the size of the point distribution. The length of the major axis a is set to the maximum distance of a cluster point to the ellipse center C, which is the center of gravity of the point distribution:

a = max_i ‖C_i − C‖.    (5.5)

The length of the minor axis b is scaled accordingly,

b = b_e · a / a_e.    (5.6)

The scaling leads to the final region delineation, shown as the blue ellipse in Figure 5.6.

Similar to other local detectors, the covariance matrix Σ of the cluster points can be used for affine normalization as described by Baumberg et al. [6]. Transforming the point distribution with the inverse square root of Σ removes an affine distortion up to a remaining rotation. The normalized MSCC C_n is computed as

C_n = Σ^(−1/2) C.    (5.7)

Σ^(1/2) is the matrix square root, which can be computed by Cholesky decomposition. An example of MSCC normalization is shown in Figure 5.7. Figure 5.7(a) shows the detected MSCC region; the black crosses are the corners constituting the cluster, and a region delineation computed as the convex hull of the corners is shown for illustration. The region ellipse defined by the covariance matrix Σ is shown in black. Figure 5.7(b) shows the same MSCC region after applying an arbitrary affine transformation; the distortion effects are clearly visible. Figure 5.7(c) shows the normalized original region: the corners constituting the MSCC region are transformed with the inverse square root of Σ. The effect of the normalization is visualized with the region ellipse, which is transformed into a circle of unit radius. Normalizing the affine-distorted region of Figure 5.7(b) with its covariance matrix results in Figure 5.7(d). The resulting MSCC is in the same canonical coordinate system as the one of Figure 5.7(c); they differ only by an unknown rotation.

Figure 5.6: Region ellipse and region delineation for an MSCC point cluster. The black crosses mark the individual corners of the point cluster. The black ellipse is the region ellipse; the blue ellipse is a scaled version of it, resulting in the final region delineation.
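The computations of equations (5.1)–(5.7) are straightforward to express in code. The following is a minimal sketch under the conventions above (the cluster given as an (n, 2) array of points); function names are illustrative, not from the original implementation.

```python
# Minimal sketch of the region-ellipse parameters (Eqs. 5.3-5.6) and the affine
# normalization (Eq. 5.7) for a cluster of corner points given as an (n, 2) array.
import numpy as np

def region_ellipse(cluster):
    """Return (center, a, b, alpha): delineation ellipse of an MSCC point cluster."""
    center = cluster.mean(axis=0)                       # center of gravity
    centered = cluster - center
    sigma = centered.T @ centered / (len(cluster) - 1)  # 2 x 2 covariance (Eq. 5.3)
    eigvals, eigvecs = np.linalg.eigh(sigma)            # eigenvalues in ascending order
    a_e, b_e = eigvals[1], eigvals[0]                   # region-ellipse axes
    v1 = eigvecs[:, 1]                                  # main direction (for lambda_1)
    alpha = np.arctan2(v1[1], v1[0])                    # Eq. 5.4
    a = np.linalg.norm(centered, axis=1).max()          # Eq. 5.5: scale to the points
    b = b_e * a / a_e                                   # Eq. 5.6
    return center, a, b, alpha

def normalize_cluster(cluster):
    """Affine-normalize the cluster with the inverse square root of Sigma (Eq. 5.7)."""
    centered = cluster - cluster.mean(axis=0)
    sigma = centered.T @ centered / (len(cluster) - 1)
    # Inverse matrix square root via eigendecomposition (Cholesky works as well).
    eigvals, eigvecs = np.linalg.eigh(sigma)
    sigma_inv_sqrt = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T
    return centered @ sigma_inv_sqrt
```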

5.3 Computational complexity

Steps 2 and 3 of the algorithm can be implemented very efficiently. It is possible to perform the multi-scale clustering as well as the selection of stable clusters already during the MST construction. The time complexity of the algorithm is therefore determined by the time complexity of the MST construction, which is in our case O(m log n) for Kruskal's algorithm [19], where m is the number of edges in the graph and n the number of nodes. Checking the stability of the clusters introduces a constant factor depending on the number of thresholds p, but produces only very little overhead.

Ultimately a linear time complexity of O(m) would be possible by using the randomized MST construction proposed by Karger et al. [55], which finds the MST in linear time with high probability.

Figure 5.7: (a) Original detected MSCC. (b) Affinely distorted MSCC. (c) Normalized original MSCC. (d) Normalized affinely distorted MSCC. Both normalized regions are in the same canonical coordinate system, differing only by a rotation.

5.4 Parameters

The properties of the MSCC detector can be adjusted with 3 parameters. In the following these 3 parameters are described in detail and suggestions for choosing useful values are given. Some parameters depend on the interest point detector; in our case we describe the method using the Harris corner detector.

Harris cornerness threshold p_h: When using the Harris corner detector, one parameter is the cornerness threshold p_h. The Harris corner detector computes a cornerness measure for every pixel position; a corner is defined by a high positive value. Usually corners show cornerness values in the range of 10^3 to 10^5. In our case we simply want to find all corners above the noise level, so a low threshold in the range of 1 to 100 works very well.

Gaussian filter size p_s: Another parameter of the Harris corner detector is the variance p_s of the involved Gaussian filters. Simply speaking, p_s defines the scale on which the corners are detected. Our application requires detection on a small scale, thus an appropriate value for p_s is in the range of 0.5 to 1.5.

Stability parameter p_r: The last parameter is the stability parameter p_r, which decides whether a cluster is stable and should be selected as a region. If a cluster fulfills the stability criterion for p_r threshold steps, the cluster is denoted as stable. The thresholds start with the minimal edge length in pixels and are increased by 1 pixel each step until the maximal edge length is reached. A high value produces only very stable clusters, lower values also less stable ones. Useful values for p_r are in the range of 5 to 10.

parameter | "Box" | "Group" | "Doors"
cornerness threshold of Harris detector p_h | 1 | 1 | 1
sigma of Harris detector p_s | 0.5 | 0.5 | 0.5
stability parameter p_r | 5 | 5 | 5

Table 5.1: Parameter values used for the detection examples.
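Putting the pieces together, the detector can be run with the parameter values of Table 5.1. The snippet below reuses the illustrative helpers harris_primitives and mscc_clusters sketched earlier in this chapter; it is not the original implementation, and the cornerness threshold scale in particular depends on the Harris implementation used.

```python
# Illustrative end-to-end call with the parameter values from Table 5.1
# (p_h = 1, p_s = 0.5, p_r = 5), using the hypothetical helpers sketched above.
from skimage.io import imread
from skimage.color import rgb2gray

image = rgb2gray(imread("box_view_01.png"))                     # hypothetical input image
corners = harris_primitives(image, sigma=0.5, threshold=1.0)    # p_s, p_h
regions = mscc_clusters(corners, p_r=5)                         # p_r
print(f"{len(corners)} corners -> {len(regions)} MSCC regions")
```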

5.5 Detection examples

This section shows detection examples for three different image sequences. Each sequence contains images with increasing viewpoint change up to wide-baseline cases, to demonstrate the repeatability of the MSCC detector under viewpoint change. The interest points are shown as red crosses, the MSCC regions as blue ellipses.

"Box" scene: Figure 5.8 shows the MSCC detections for the "Box" scene, a set of images of a box from different viewpoints acquired on a turntable. The images have a resolution of 800 × 600 pixels. Many regions are detected repeatedly in each image; the multi-scale clustering detects very small as well as large regions.

"Group" scene: Figure 5.9 shows the MSCC detections for the "Group" scene. The scene consists of two piecewise planar objects on a turntable. The overall viewpoint change for the whole image sequence is almost 90°. Again many regions are detected repeatedly in each image despite the large viewpoint change. The image resolution is 1024 × 896 pixels.

"Doors" scene: Figure 5.10 shows the MSCC detections for the "Doors" scene. The "Doors" image set is from a robot localization experiment. The image resolution is 720 × 288 pixels. The poster in the example contains a lot of written text; the MSCC detector manages to identify the different sections of the text as MSCC regions.

The parameter settings for the 3 scenes are given in Table 5.1.


Figure 5.8: Detection examples on the "Box" scene.

Figure 5.9: Detection examples on the "Group" scene.

Figure 5.10: Detection examples on the "Doors" scene.


5.6 Detector evaluation: Repeatability and matching score

The performance of the MSCC detector is compared to other approaches in terms of the repeatability and matching score (see Chapter 4 for details). The MSCC detector is evaluated on the planar "Doors" scene using the publicly available evaluation framework of Mikolajczyk and Schmid [74]. In addition, the MSCC detector is evaluated on the non-planar "Group" and "Room" scenes using the evaluation method of Chapter 4.

5.6.1 Evaluation of the "Doors" scene

The "Doors" scene consists of 10 images from a robot localization experiment. Figure 5.10 shows the images of the test set along with the detected MSCC regions. To comply with the evaluation framework, ellipses are fitted to the MSCC regions, i.e. the ellipse parameters are calculated from the covariance matrix of the interest points belonging to the region. We compare the repeatability score and the matching score of the MSCC detector to 4 other detectors under increasing viewpoint change up to 130°. For the matching score the SIFT descriptor [67] is used. Figure 5.11 shows the repeatability and matching score of the MSCC detector compared to the Maximally Stable Extremal Regions (MSER) [70], the Hessian-Affine regions (HESAFF) [73], the Harris-Affine regions (HARAFF) [73] and the intensity based regions (IBR) [112]. The experiment reveals a competitive performance of our novel detector compared to the other approaches, and the regions detected by our approach are consistently different from those of the other detectors (see also Section 5.7).

5.6.2 Evaluation of the "Group" and "Room" scene

Using the evaluation method described in Chapter 4 we compare the MSCC detector to 4 other local detectors on the "Group" and "Room" scenes. The repeatability and matching score of the MSCC detector is compared to the Maximally Stable Extremal Regions (MSER) [70], the Hessian-Affine regions [73], the Harris-Affine regions [73] and the intensity based regions (IBR) [112].

"Group" scene: Figure 5.12 shows the repeatability scores for the "Group" scene. The graph of the MSCC detector starts with a lower value than the other detectors. With increasing viewpoint change the repeatability score decreases at a rate similar to the other detectors; from 30° to 75°, however, it stays constant while the scores of the other detectors still decrease. For the last part, the MSCC detector matches the values of the Hessian-Affine and Harris-Affine detectors. Figure 5.12(b) shows that the MSCC detector produces as many regions as the MSER detector. Figure 5.14 shows the achieved matching scores. The matching score (relative to the number of possible matches) of the MSCC detector is competitive with the scores of the IBR, Harris-Affine and Hessian-Affine detectors and is only outperformed by the MSER detector. For the last part, the MSCC matching score is even higher than that of the Hessian-Affine and Harris-Affine detectors. Table 5.2 and Table 5.4 show the corresponding numbers.

"Room" scene: Figure 5.13 shows the repeatability scores for the "Room" scene. Up to 50° viewpoint change the repeatability score of the MSCC detector is similar to those of the MSER, Harris-Affine and Hessian-Affine detectors. For viewpoint changes of more than 50° the MSCC detector is second best, outperformed only by the IBR detector. Figure 5.15 shows the matching scores for the "Room" scene. None of the detectors achieves outstanding matching scores on this scene. The corresponding numbers are given in Table 5.3 and Table 5.5.

Figure 5.11: (a) Repeatability score for the "Doors" scene. (b) Matching score for the "Doors" scene.

5.7 Combining MSCC with other local detectors

This experiment evaluates the complementarity of the MSCC detector. This is done by counting the non-overlapping correctly matched regions from different detectors; regions from different detectors are counted as non-overlapping if they do not overlap by more than 40%. Matching is done using SIFT descriptors and nearest neighbor search (as implemented in Mikolajczyk's evaluation framework). The experiment is carried out on the "Doors" scene. Figure 5.16(a) shows the absolute number of matched MSER regions, of MSER regions combined with HESAFF regions, of the combination of MSER, HESAFF and HARAFF, of the combination of MSER, HESAFF, HARAFF and IBR, and of the combination of the previous detectors with the MSCC detector. Figures 5.16(b-e) show the region numbers for combining the MSCC detector with each of the other detectors individually. The graphs show that the MSCC detector is able to add a significant number of new matches to those of the other detectors. Figures 5.16(f) and (g) show an example for 120° viewpoint change: the dashed dark ellipses mark the matches from the combination of MSER, HESAFF, HARAFF and IBR, while the bright ellipses mark the additional matches obtained from the MSCC detector.

Figure 5.12: (a) Repeatability score for the "Group" scene. (b) Absolute number of correspondences.


Figure 5.13: (a) Repeatability score for the "Room" scene. (b) Absolute number of correspondences.


Figure 5.14: (a) Matching score for the "Group" scene relative to the number of detections. (b) Matching score for the "Group" scene relative to the number of possible matches. (c) Absolute number of correct matches.


Figure 5.15: (a) Matching score for the "Room" scene relative to the number of detections. (b) Matching score for the "Room" scene relative to the number of possible matches. (c) Absolute number of correct matches.


view change [°] | MSER | Har.-Aff. | Hes.-Aff. | IBR | MSCC
10 | 76.3 / 180 | 48.5 / 586 | 56.4 / 457 | 57.6 / 364 | 41.5 / 207
15 | 67.8 / 164 | 45.0 / 543 | 50.9 / 412 | 52.6 / 318 | 37.5 / 187
20 | 59.8 / 143 | 39.9 / 482 | 46.7 / 358 | 49.8 / 308 | 34.7 / 173
25 | 56.0 / 126 | 36.3 / 436 | 43.5 / 359 | 48.8 / 290 | 30.5 / 152
30 | 57.7 / 127 | 36.3 / 439 | 40.6 / 341 | 48.5 / 275 | 24.4 / 122
35 | 55.6 / 120 | 33.8 / 408 | 41.2 / 353 | 47.2 / 265 | 20.4 / 101
40 | 50.7 / 114 | 33.6 / 406 | 37.2 / 236 | 45.3 / 267 | 21.2 / 106
45 | 50.9 / 112 | 31.0 / 375 | 34.0 / 253 | 44.3 / 255 | 21.8 / 109
50 | 46.6 / 108 | 31.4 / 379 | 35.6 / 258 | 46.8 / 256 | 23.4 / 117
55 | 47.2 / 110 | 32.0 / 387 | 34.4 / 279 | 45.0 / 247 | 22.4 / 112
60 | 45.9 / 111 | 32.3 / 390 | 33.1 / 268 | 44.2 / 258 | 23.2 / 116
65 | 43.3 / 107 | 29.7 / 359 | 31.9 / 258 | 43.1 / 249 | 22 / 110
70 | 44.1 / 109 | 29.4 / 355 | 30.7 / 249 | 39.1 / 236 | 22.2 / 111
75 | 41.2 / 100 | 28.6 / 346 | 26.9 / 218 | 42.2 / 232 | 21.1 / 105
80 | 42.5 / 97 | 26.5 / 320 | 26.2 / 212 | 43.3 / 229 | 25.2 / 111
85 | 41.7 / 91 | 25.5 / 293 | 26.0 / 206 | 43.3 / 218 | 25.4 / 101
90 | 37.9 / 83 | 26.2 / 292 | 25.6 / 188 | 40.8 / 200 | 25.6 / 91

Table 5.2: Repeatability score [%] / absolute number of correspondences for the "Group" scene with changing viewpoint.


view change [°] | MSER | Har.-Aff. | Hes.-Aff. | IBR | MSCC
15 | 44.6 / 54 | 35.6 / 208 | 41.9 / 240 | 46.1 / 111 | 28.4 / 99
20 | 38.1 / 43 | 31.3 / 188 | 34.0 / 191 | 42.5 / 99 | 27.2 / 88
25 | 31.3 / 35 | 26.6 / 158 | 28.0 / 160 | 35.8 / 81 | 26.8 / 86
30 | 24.8 / 28 | 26.8 / 150 | 27.8 / 157 | 37.7 / 81 | 22.6 / 74
35 | 23.3 / 27 | 24.6 / 142 | 26.2 / 145 | 41.9 / 91 | 19.0 / 60
40 | 21.4 / 21 | 20.8 / 116 | 22.5 / 113 | 31.9 / 61 | 18.1 / 58
45 | 23.1 / 24 | 19.8 / 108 | 20.5 / 113 | 29.4 / 60 | 19.7 / 57
50 | 20.0 / 21 | 15.8 / 89 | 15.8 / 80 | 28.0 / 58 | 19.2 / 63
55 | 19.3 / 22 | 16.2 / 85 | 14.7 / 82 | 27.2 / 55 | 19.5 / 61
60 | 12.9 / 13 | 14.4 / 82 | 15.5 / 85 | 24.0 / 49 | 18.3 / 59
65 | 18.9 / 21 | 13.8 / 82 | 15.1 / 86 | 29.0 / 56 | 16.8 / 57
70 | 13.5 / 15 | 12.8 / 75 | 12.7 / 71 | 27.1 / 58 | 18.4 / 61
75 | 17.9 / 20 | 14.4 / 85 | 15.5 / 82 | 30.0 / 67 | 18.2 / 61
80 | 14.4 / 16 | 12.6 / 76 | 13.2 / 72 | 29.9 / 66 | 18.4 / 62
85 | 13.5 / 14 | 11.9 / 74 | 11.4 / 66 | 27.6 / 61 | 18.8 / 62
90 | 11.5 / 13 | 12.1 / 73 | 11.7 / 68 | 28.8 / 66 | 18.7 / 61

Table 5.3: Repeatability score [%] / absolute number of correspondences for the "Room" scene with changing viewpoint.


view change [°] | MSER | Har.-Aff. | Hes.-Aff. | IBR | MSCC
10 | 66.9 / 75 / 162 | 29.4 / 31.6 / 355 | 34.3 / 36.8 / 280 | 33.5 / 37.3 / 212 | 13.8 / 27.5 / 69
15 | 56.5 / 69 / 140 | 24.3 / 26.9 / 293 | 28.3 / 31.2 / 231 | 26.9 / 30.1 / 163 | 9.2 / 19.0 / 46
20 | 43.3 / 57.3 / 106 | 18.8 / 22.1 / 227 | 21.6 / 25.6 / 176 | 23.8 / 26.9 / 147 | 11.8 / 24.8 / 59
25 | 36.8 / 50.6 / 85 | 16.3 / 20.5 / 196 | 19.1 / 23.7 / 156 | 22.7 / 27.8 / 135 | 10.6 / 24.2 / 53
30 | 33.6 / 46.3 / 76 | 14.3 / 18.7 / 173 | 13.6 / 17.6 / 111 | 18.5 / 23.1 / 105 | 8.2 / 21.4 / 41
35 | 27.0 / 38.2 / 60 | 11.5 / 15.5 / 139 | 14.1 / 17.9 / 115 | 19.4 / 24.1 / 109 | 5.1 / 18.3 / 25
40 | 24.2 / 37.6 / 56 | 10.8 / 14.2 / 131 | 11.3 / 15.2 / 92 | 17.3 / 21.4 / 102 | 5.2 / 18.4 / 26
45 | 24.8 / 37.3 / 56 | 8.7 / 11.8 / 105 | 10.5 / 14.2 / 86 | 14.2 / 17.8 / 82 | 3.0 / 10.4 / 15
50 | 18.5 / 29.5 / 44 | 8.2 / 11.3 / 99 | 9.1 / 11.8 / 74 | 14.8 / 16.7 / 81 | 5.6 / 19.1 / 28
55 | 20.5 / 32.7 / 49 | 6.5 / 8.6 / 78 | 7.1 / 9.4 / 58 | 13.2 / 15.8 / 73 | 4.2 / 13.6 / 21
60 | 14.5 / 24.2 / 36 | 6.7 / 9 / 81 | 6.1 / 8.2 / 50 | 13.4 / 16.9 / 79 | 5.0 / 16.9 / 25
65 | 14.1 / 24.5 / 36 | 7.5 / 10.3 / 90 | 6.5 / 9.1 / 53 | 11.1 / 13.8 / 65 | 3.2 / 11.4 / 16
70 | 11.4 / 20.1 / 30 | 4.8 / 6.7 / 58 | 5.2 / 7.5 / 42 | 10.6 / 14 / 67 | 4.0 / 15.8 / 20
75 | 9.4 / 17.7 / 25 | 5.2 / 7.7 / 63 | 5 / 7.9 / 41 | 10.5 / 13.6 / 62 | 2.8 / 11.4 / 14
80 | 9.6 / 18.1 / 25 | 4.0 / 5.8 / 48 | 4.5 / 7.2 / 37 | 7.4 / 9.2 / 44 | 2.4 / 9.5 / 12
85 | 6.3 / 12.7 / 16 | 4.2 / 6.3 / 51 | 3.6 / 6 / 29 | 7.4 / 9.6 / 43 | 2.2 / 8.3 / 11
90 | 6.2 / 13.2 / 16 | 2.5 / 3.8 / 30 | 3.6 / 6.1 / 29 | 6.1 / 8 / 35 | 1.4 / 5.4 / 7

Table 5.4: Matching score [%] / matching score relative to the number of possible matches [%] / absolute number of correct matches for the "Group" scene with changing viewpoint.


view change [°] | MSER | Har.-Aff. | Hes.-Aff. | IBR | MSCC
15 | 19 / 50.8 / 31 | 12.8 / 21.7 / 91 | 16.6 / 24.8 / 106 | 12.8 / 19.8 / 36 | 5.4 / 24.1 / 21
20 | 9.2 / 29.4 / 15 | 8.5 / 14.6 / 60 | 11.5 / 18.4 / 73 | 12.8 / 22.9 / 36 | 2.7 / 10.8 / 10
25 | 5.5 / 20.5 / 9 | 7.3 / 14.1 / 51 | 9.0 / 17.4 / 57 | 8.9 / 18.7 / 25 | 2.9 / 11.7 / 11
30 | 4.3 / 20.6 / 7 | 6 / 12.2 / 42 | 5.7 / 11.3 / 36 | 7.8 / 16.4 / 22 | 2.3 / 12.3 / 9
35 | 3.7 / 18.2 / 6 | 3.2 / 6.7 / 23 | 3.9 / 8 / 25 | 4.6 / 9.6 / 13 | 1.3 / 7.7 / 5
40 | 4.6 / 25.9 / 7 | 2.7 / 7.1 / 19 | 2.6 / 6.5 / 16 | 4 / 11 / 11 | 0.3 / 2 / 1
45 | 3.7 / 20.7 / 6 | 0.9 / 2.4 / 6 | 1.9 / 4.8 / 12 | 3.9 / 11.3 / 11 | 0 / 0 / 0
50 | 3.7 / 24 / 6 | 2.4 / 7.6 / 17 | 1.0 / 3 / 6 | 1.5 / 4.6 / 4 | 0.3 / 1.9 / 1
55 | 2.5 / 13.8 / 4 | 1.6 / 5 / 11 | 0.9 / 2.9 / 6 | 2.1 / 6.1 / 6 | 1.3 / 7.9 / 5
60 | 0.7 / 5 / 1 | 0.6 / 1.8 / 4 | 0.5 / 1.6 / 3 | 2.6 / 8.6 / 7 | 0.8 / 5.2 / 3
65 | 1.8 / 10 / 3 | 0.6 / 1.7 / 4 | 0.3 / 0.9 / 2 | 2.2 / 6.2 / 6 | 1.2 / 8.2 / 5
70 | 1.2 / 9.5 / 2 | 0.6 / 1.7 / 4 | 0.2 / 0.6 / 1 | 1.1 / 2.9 / 3 | 1 / 6.4 / 4
75 | 1.2 / 7.1 / 2 | 0 / 0 / 0 | 0.6 / 2.2 / 4 | 1.8 / 4.3 / 5 | 0.7 / 4.3 / 3
80 | 0 / 0 / 0 | 0.7 / 2.3 / 5 | 0.2 / 0.5 / 1 | 1.1 / 2.8 / 3 | 0 / 0 / 0
85 | 0.6 / 4.2 / 1 | 0.4 / 1.4 / 3 | 0.5 / 1.6 / 3 | 1.8 / 4.1 / 5 | 0.5 / 3.2 / 2
90 | 1.2 / 10.5 / 2 | 0.3 / 0.9 / 2 | 0.5 / 1.6 / 3 | 1.1 / 2.5 / 3 | 0 / 0 / 0

Table 5.5: Matching score [%] / matching score relative to the number of possible matches [%] / absolute number of correct matches for the "Room" scene with changing viewpoint.


Figure 5.16: Absolute numbers of non-overlapping matched regions. (a) Combining all detectors. (b) Combining MSER and MSCC. (c) Combining IBR and MSCC. (d) Combining HARAFF and MSCC. (e) Combining HESAFF and MSCC. (f), (g) Matches for the combination of all detectors at 120° viewpoint change; the bright ellipses mark the additional matches obtained from the MSCC detector.


Chapter 6

Wide-baseline methods^1

^1 Based on the publication: F. Fraundorfer, K. Schindler, and H. Bischof. Piecewise planar scene reconstruction from sparse correspondences. Image and Vision Computing, 24(4):395–406, 2006 [38].

This chapter deals with image matching and 3D reconstruction for wide-baseline scenarios. The first section provides a solution to the problem of detecting corresponding regions in images taken from widely different viewpoints. The proposed method builds upon the detection of affine-invariant interest regions. Projective transformations introduced by the wide baseline are reduced by affine normalization; in the proposed method the remaining projective distortion is subsequently removed completely until both image patches are registered. In the registered image patches, point correspondences simply have the same pixel coordinates, and since the applied transformations are known, the pixel coordinates in the original image frame can be computed.

The second section describes a method to recover scene planes of arbitrary position and orientation from oriented images using homographies. Given at least 2 wide-baseline images, a piece-wise planar 3D reconstruction can be computed, and the input images are segmented into planar image parts. Planar regions are reconstructed using only sparse, affine-invariant sets of corresponding seed regions. These regions are iteratively expanded and refined using plane-induced homographies. The 3D reconstruction needs a calibrated setup, while the planar segmentation is possible for uncalibrated images too.

6.1 Wide-baseline region matching<br />

In the following, the wide-baseline region matching method is described, which is a key technique for the proposed map building and localization framework. The algorithm is a key ingredient of the plane segmentation and reconstruction method described in Section 6.2.1. The plane segmentation and reconstruction method is used to build piece-wise planar sub-maps (see Section 7.1). Another application of this algorithm is in linking sub-maps into a complete world map (see Section 7.1.5). It is also a key component of the global localization algorithm presented in Section 7.2. The method has been designed to exhibit the following properties:

1. Highly reliable matches, i.e. the algorithm produces a low number of outliers.

2. Exact point correspondences, i.e. with sub-pixel accuracy.

3. A high number of point correspondences.

1 Based on the publication:
F. Fraundorfer, K. Schindler, and H. Bischof. Piecewise planar scene reconstruction from sparse correspondences. Image and Vision Computing, 24(4):395-406, 2006 [38]




Property 1 is achieved by a two-step approach. In a first step, tentative correspondences are identified by nearest neighbor matching in feature space. The tentative matches, however, still contain a lot of outliers. In a second step the tentative matches are verified by area-based matching, calculating the correlation over the whole interest region. This verification step removes the remaining false matches with high reliability.

To achieve property 2, matching patches are exactly registered onto each other by an iterative registration procedure. Registration is performed with sub-pixel accuracy, which results in highly accurate point correspondences.

Unlike other approaches, this algorithm does not simply use the center point of a region match as the final correspondence. Instead, within the matched and registered image regions, new point correspondences are detected. Each matched image region yields about 20-50 new point correspondences (property 3). The registration is done by computing the inter-image homography for each region, which maps one region exactly onto the other. Therefore the method is restricted to planar interest regions; in fact, non-planar matches will be rejected by this method.

6.1.1 Matching and registration

Let us now have a closer look at the details of the method. It is a two-step approach consisting of generating tentative matches and verification (see Algorithm 2 for a compact description). First we describe the generation of the tentative matches. Input is a wide-baseline image pair I and I'. In each of the images interest regions are detected. We denote the set of interest regions in I with L and in I' with L'. The method is not restricted to one special detector; every affine interest region detector (see [76] for examples) can be used. After detection, a local affine frame (LAF) is computed for every region in L and L'. Next, the interest regions are normalized using the LAF. Normalization tries to remove the perspective distortion of a viewpoint change, so that two corresponding regions will appear almost identical. Some normalization methods create multiple normalized images for a single interest region; the multiple appearances are simply added to the region set. For the sets of normalized regions L and L', SIFT descriptors are extracted and stored in D and D'. Each entry in D and D' is a vector of length 128 describing the appearance of a normalized patch using orientation histograms. Corresponding interest regions can now be found by nearest neighbor search in this 128-dimensional feature space. For efficient matching a KD-tree K is built with the feature vectors in D'. Corresponding interest regions for the entries in D are then found by querying the KD-tree. The corresponding region for D_i is the closest feature in D' returned by the KD-tree query. As distance metric the Euclidean distance is used. To avoid random matches, a measure based on the ratio of the nearest to the second closest feature vector is used. A match is accepted if

d_0 / d_1 < d_th ,    (6.1)

where d_0 is the Euclidean distance between the query feature and the nearest neighbor, d_1 is the distance from the query feature to the second closest feature vector, and d_th is a user-set threshold. According to [67] an appropriate threshold is 0.8. We denote correspondences detected in this way as tentative matches. T is the set of tentative matches with T_i = (L_i, L'_j) and is the prerequisite for the verification step.
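The ratio test of Eq. (6.1) together with the KD-tree lookup is straightforward to implement. The following is a minimal sketch (not the thesis implementation), using SciPy's cKDTree on precomputed 128-dimensional descriptor arrays; the function and variable names and the default threshold of 0.8 follow the description above.

import numpy as np
from scipy.spatial import cKDTree

def tentative_matches(D, D_prime, d_th=0.8):
    """Return index pairs (i, j) of tentative matches D[i] <-> D_prime[j].

    D, D_prime: arrays of shape (n, 128) and (m, 128) holding the SIFT
    descriptors of the normalized interest regions of both images.
    """
    tree = cKDTree(D_prime)            # KD-tree over the second image's descriptors
    dists, idxs = tree.query(D, k=2)   # two nearest neighbors per query descriptor
    matches = []
    for i, ((d0, d1), (j0, _)) in enumerate(zip(dists, idxs)):
        # Eq. (6.1): accept only if the best match is clearly better
        # than the second best one.
        if d1 > 0 and d0 / d1 < d_th:
            matches.append((i, j0))
    return matches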

The tentative matches T are now verified by area-based matching. Correspondence is checked by normalized cross-correlation. This procedure is quite slow, but it is applied to the set of tentative matches only, which is significantly smaller than the initial set of detected regions. The cross-correlation is calculated on a registered pair of



interest regions. Due to the affine normalization, matching interest regions are already almost registered. This initial registration is improved by estimating the homography that transforms one interest region of the matching pair into the other one. This is done with an iterative method. Let us denote a match pair in T as t ↔ t'. First, a fixed number of interest points p' (we use Harris corners [40]) are detected in t'. This is justified by the fixed size of the patches of n_x × n_y pixels. We start with the assumption that both patches are already registered. Thus, we establish a set of point correspondences p ↔ p' with p = p' within the region. See Figure 6.1 for a step-wise illustration. p ↔ p' is a set of point pairs, represented by the blue crosses at k = 0 in the illustration. However, as the patches are not perfectly registered, the point matches in p ↔ p' do not represent the best matches. The point locations in p are shifted within a search window to the position of the optimal match (marked with the red cross in Figure 6.1). We define the optimal matching position as the one with the maximal correlation value. Finding the best position is done by searching: the correlation values for all pixel positions within a search window are calculated and the new position is the one with the highest value. The optimal position is refined to sub-pixel accuracy with an interpolation based on the correlation coefficients as described in [87]. Point matches with a maximal correlation value below a threshold c_th are removed from the set. The point matches established and refined in this way are used to compute a transformation which registers the patches t and t'. A homography h_k is estimated from p ↔ p' when at least 4 point correspondences could be established. Patch t is resampled by applying the homography h_k. This step is depicted in Figure 6.1 at k = 1. After the transformation, the difference between the guessed position (blue cross) and the optimal position (red cross) is diminished. However, a small difference still exists, because the calculated homography was not accurate enough². The process therefore needs to be iterated: point correspondences have to be established and refined again, and a new homography has to be computed and applied. Each such iteration registers t and t' more accurately. The process can be stopped when the difference between two successive iterations falls below a threshold ε, or when the estimated homography is equal to the identity matrix up to a given accuracy ε. If t and t' are exactly registered, the homography between both patches will be the identity matrix. The algorithm converges fast; usually fewer than 5 iterations are necessary (see the last row in Figure 6.1). To avoid artifacts introduced by iterative resampling, the subsequent transformations are concatenated and applied to the original image. See Algorithm 3 for a compact outline of the registration method.
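The exact sub-pixel interpolation of [87] is not reproduced here. A common alternative, shown below as a rough sketch under that assumption, fits a parabola through the correlation values around the integer peak and takes its vertex as the refined offset; names are illustrative only.

import numpy as np

def subpixel_peak(corr, x, y):
    """Refine an integer correlation peak (x, y) to sub-pixel accuracy.

    corr: 2D array of correlation values within the search window; the
    peak must not lie on the window border.
    Returns (x + dx, y + dy) from a 1D parabolic fit in each direction.
    """
    def parabolic_offset(c_minus, c_0, c_plus):
        denom = c_minus - 2.0 * c_0 + c_plus
        if abs(denom) < 1e-12:
            return 0.0
        return 0.5 * (c_minus - c_plus) / denom

    dx = parabolic_offset(corr[y, x - 1], corr[y, x], corr[y, x + 1])
    dy = parabolic_offset(corr[y - 1, x], corr[y, x], corr[y + 1, x])
    return x + dx, y + dy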

Point correspondences in the coordinate frame of the whole image can now be computed from every pixel location of the registered image pair. For every location in t' the corresponding location in t can be computed by inversely applying the homography sequence h_0, h_1, ..., h_n. A point location in t is p = h^{-1} p', where h is the composed homography sequence h = h_n ... h_1 h_0. p is now in the coordinate frame of the LAF. By applying the inverse affine transformation used for the patch normalization one gets back into the original image frame. t and t' were created by different LAFs, A_i and A'_i respectively. Point correspondences in the original image frame are given by:

p_o = A_i^{-1} p    (6.2)

p'_o = A'_i^{-1} h^{-1} p'    (6.3)
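A literal transcription of Eqs. (6.2) and (6.3) in numpy, assuming that the LAFs and the per-iteration homographies are stored as 3×3 matrices acting on homogeneous pixel coordinates (the helper names are illustrative, not the thesis code):

import numpy as np

def hom(p):
    """Lift a 2D point to homogeneous coordinates."""
    return np.array([p[0], p[1], 1.0])

def dehom(x):
    """Convert homogeneous coordinates back to a 2D point."""
    return x[:2] / x[2]

def original_frame_correspondence(p_prime, A_i, A_i_prime, hs):
    """Map a pixel p' of the registered patch back to both original images.

    p_prime:        location in the registered patch t'.
    A_i, A_i_prime: 3x3 LAF normalization matrices of t and t'.
    hs:             list of the registration homographies [h_0, ..., h_n].
    """
    h = np.eye(3)
    for h_k in hs:                                # h = h_n ... h_1 h_0
        h = h_k @ h
    p = np.linalg.inv(h) @ hom(p_prime)           # p = h^{-1} p'
    p_o = dehom(np.linalg.inv(A_i) @ p)           # Eq. (6.2)
    p_o_prime = dehom(np.linalg.inv(A_i_prime) @ np.linalg.inv(h) @ hom(p_prime))  # Eq. (6.3)
    return p_o, p_o_prime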

Multiple point correspondences obtained from a single region match are another main benefit of this method. Other wide-baseline matching methods return only a single point per region match; in [70], for example, only the center of gravity of the detected region is returned.

2 The accuracy of the applied sub-pixel interpolation is limited, thus a small deviation remains after one application of the warping.




Figure 6.1: Iterative registration procedure. At k = 0 the patches are aligned by LAF normalization only. Blue crosses denote the same location in both patches. The red cross indicates the position with the highest correlation value for the point location marked in the right patch. The dashed square illustrates the correlation window. The shifted point location (red) and the original point location (blue) are used to estimate a homography. The illustration shows only one point pair; for homography estimation additional point correspondences are established (≥ 4). The left patch is then resampled using the homography. After that we arrive at k = 1 and the procedure is repeated. Iteratively, new homographies are estimated and applied to the left patch until the patches are registered (see k = n). Usually this is achieved in a few iterations. Note that a part of the correlation window at k = 1 lies outside the defined image area; in the illustration the correlation window is drawn enlarged for readability. In the implementation one of course has to choose an appropriate window size to avoid problems at the borders.
on the borders.



Algorithm 2 Region matching algorithm
Detect interest regions in images I and I', resulting in interest region sets L and L'
Normalize each entry in L and L' with the LAF and resample to size 64 × 64 pixels
Compute SIFT descriptor for every entry in L and L', resulting in feature sets D and D'
Construct KD-tree K from feature set D'
for all entries in D do
  Query KD-tree K with D_i (the query returns the n closest feature vectors D'_closest and the Euclidean distances to the query feature D_i, d_closest = (d_0, d_1, d_2, ..., d_n), in ascending order)
  Store L_i <-> L_j, indexed by D_i <-> D'_closest,0, as tentative match in T if d_0 / d_1 < d_th
end for
for all entries in T do
  Register patches T_i = (t, t'). Registration returns correlation coefficient c_i, transfer distance e_i, homography h_i and point correspondences p_i within the patch
  Store T_i as final match in M if c_i > c_th ∧ e_i < e_th
end for

Algorithm 3 Registration
Input: t, t' ... image patches to register
Output: c ... correlation coefficient
Output: e ... point distance
Output: h ... homography matrix, to warp t' onto t
Output: p ... point matches in t
Output: p' ... point matches in t'
Detect n strongest Harris corners in t', store in p'
Initialize p ← p'
h ← 3 × 3 identity matrix
repeat
  for all entries in p do
    Compute d_i = (d_x, d_y) to maximize corr(p_i + d_i, p'_i, t, t')
  end for
  Remove p_i, p'_i with corr(p_i + d_i, p'_i, t, t') < c_th
  Estimate homography h_k (t → t') with p, p'
  h ← h h_k
  Warp t using h_k
  diff = ‖h_k − h_{k-1}‖
until (diff < ε)
c = (1/|p|) Σ_i corr(p_i + d_i, p'_i, t, t')
e = (1/|p|) Σ_i ‖d_i‖



6.2 Piece-wise planar scene reconstruction<br />

In this section we present a method for reconstructing the planar regions of a scene, which is useful for many man-made objects such as buildings or machinery parts. The approach works with inter-image homographies, which are a particularly interesting tool for the reconstruction of planar surfaces: they directly exploit the perspective mapping of planes and thus stay closer to the original data than methods which start with a conventional point-wise reconstruction and then segment the resulting point cloud or depth map. In the following we describe an automatic, image-driven method which simultaneously solves the region segmentation and the matching problem for the planar parts of a scene containing an unknown number of planar regions. This is achieved through a novel combination of state-of-the-art matching and 3D reconstruction methods. It uses well-defined interest points to initialize a piecewise planar model of the scene. Based on this initialization, the raw gray-values are used to refine the initial estimate and to achieve a planar scene segmentation.

Previous methods based on homographies either require lines to restrict each plane to a one-parameter family, require dense image matching, or deliver only sparse reconstructions [3-5, 94, 116, 119]. Our approach recovers scene planes of arbitrary position and orientation using only sparse point correspondences and homographies. Furthermore, the method delivers an approximate delineation of the detected planar object patches.

To obtain a Euclidean 3D reconstruction of the planar structure the camera setup needs to be calibrated, i.e. the projection matrices of all cameras must be known. However, the relations upon which the method is built are also valid for the uncalibrated case. Plane segmentation is then still possible, but plane reconstruction is limited to a projective reconstruction. In the following we assume a calibrated camera setup; the uncalibrated case is dealt with in Section 6.2.2.

6.2.1 Reconstruction using homographies

The idea when using homographies for planar reconstruction is to exploit the fact that a plane in 3D space, viewed by two perspective cameras, induces a homography between their two images. One can think of this as two consecutive perspective projections, one from the first image plane to the object plane and a second one from the object plane to the second image plane. Let the two cameras (without loss of generality) be given by their (3×4) projection matrices C_0 = [I|0] and C_1 = [A|a], and the plane by the homogeneous 4-vector p = [p_1, p_2, p_3, p_4]^T. Then the homography induced by p is given by [69]

H(p) = A + a v^T  with  v = -(1/p_4) (p_1, p_2, p_3)^T    (6.4)

The homography H(p) belongs to a subclass of homographies which has only 3 degrees of freedom, corresponding to the three parameters of a plane in 3D space. The constraints for this subclass are given by the epipolar geometry between the two images, which is encoded in the fundamental matrix F = [a]_× A:

H^T F + F^T H = 0    (6.5)

Given C_0, C_1 and p, the corresponding homography H(p) can be computed, and the image I_0 can be transformed with it: I'_0 = H I_0.




Figure 6.2: Detection of planar regions with homographies. The images of a plane p are related by a homography H, which transforms the first image I_0 to the second image I_1.

If a region in the scene is incident to p, the similarity between corresponding regions in I'_0 and I_1 will be high. A similarity measure S(p) such as the normalized cross-correlation can therefore be used to decide whether p describes the region. Furthermore, given C_0, C_1 and three or more corresponding point pairs on a planar region {x_0,i ↔ x_1,i}, the homography H(p) and the plane p can be computed. In this case the similarity S(p) between I'_0 and I_1 can be employed to find the image regions incident to p.

All these relations are already valid at the projective reconstruction level, since they are built upon the incidence relation, which is invariant under projective transformations.
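As a small numpy sketch of Eq. (6.4), assuming cameras in the canonical form C_0 = [I|0], C_1 = [A|a] and a plane given as a homogeneous 4-vector (variable names are illustrative, not the thesis code):

import numpy as np

def plane_induced_homography(C1, plane):
    """Compute H(p) = A + a v^T for cameras C0 = [I|0], C1 = [A|a].

    C1:    3x4 projection matrix of the second camera.
    plane: homogeneous plane vector (p1, p2, p3, p4) with p4 != 0.
    """
    A = C1[:, :3]
    a = C1[:, 3]
    p1, p2, p3, p4 = plane
    v = -np.array([p1, p2, p3]) / p4
    return A + np.outer(a, v)

# A point x0 on the plane's image in I0 maps to x1 ~ H(p) x0 in I1, so
# warping I0 with H(p) aligns the plane's projections in both views.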

6.2.2 Piece-wise planar reconstruction

The proposed reconstruction method starts with the detection of planar seed regions. Affine invariant detectors as described in Chapter 3 provide suitable seed regions. After detection and matching, the seed regions are grown by adding image points which are consistent with their respective plane-induced homographies. In an iterative framework, the detection of new points of a planar region is alternated with the optimal estimation of the homography based on the newly detected points³. This results in a segmentation of the images into scene planes and simultaneously in a 3D reconstruction of the segmented planes⁴. Algorithm 4 outlines the entire reconstruction method.

Initial homographies from sparse matches

Plane reconstruction starts with the detection of seed regions for the planes, i.e. corresponding image regions originating from a planar part of the scene. In a first step, interest regions are detected in both images of the image pair, leading to two sets of regions R_L, R_R. Region matching using the method described in Section 6.1 gives the set of corresponding regions M_L,R. Each

3 A similar iterative updating procedure has been employed in [89] for fundamental matrix estimation.
4 In the following we assume only two images, which we will call the 'left' image I_L and the 'right' image I_R (this is done only to make the explanation easier to read; the method can readily be extended to more than one 'right' image).



Algorithm 4 Piecewise planar reconstruction outline
Detect interest regions
Match regions (enforcing planarity constraint)
Estimate initial homographies from corresponding regions
repeat
  Grow regions by extrapolation of local homographies
  Generate new point correspondences in the extended regions
  Update homographies with new set of correspondences
until Homographies do not change anymore
Forward project planar regions onto 3D planes
matched pair M_L,R provides a set of point correspondences. These point correspondences are then used to locally estimate the plane-induced homography of the planar region.
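In the calibrated case, one way to obtain the initial plane of a seed region (and with it, via Eq. (6.4), its seed homography) is to triangulate its point correspondences and fit a plane to the 3D points. A rough sketch under these assumptions, not the thesis implementation, could look as follows:

import numpy as np

def triangulate(C0, C1, x0, x1):
    """Linear (DLT) triangulation of one correspondence x0 <-> x1."""
    A = np.vstack([
        x0[0] * C0[2] - C0[0],
        x0[1] * C0[2] - C0[1],
        x1[0] * C1[2] - C1[0],
        x1[1] * C1[2] - C1[1],
    ])
    X = np.linalg.svd(A)[2][-1]
    return X / X[3]

def fit_plane(points_h):
    """Algebraic least-squares plane through homogeneous 3D points (Nx4)."""
    # The plane p minimizing |points_h @ p| is the last right singular vector.
    return np.linalg.svd(np.asarray(points_h))[2][-1]

def initial_plane(C0, C1, pts0, pts1):
    """Plane of a seed region from its matched image points (calibrated case)."""
    X = [triangulate(C0, C1, x0, x1) for x0, x1 in zip(pts0, pts1)]
    return fit_plane(X)

The resulting plane vector can then be fed into the plane_induced_homography sketch given after Section 6.2.1 to obtain the initial homography of the seed region.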

Region growing

Starting from the corresponding planar seed regions, a region-growing scheme can be employed to find the remaining parts of the planar regions they belong to. For each plane, the initially estimated plane-induced homography H of the seed region is used to transform the right image: I'_R = H I_R. With the new image, the seed regions can be expanded by conventional region growing. The homogeneity criterion for adding a pixel x to the region is a high similarity between I_L(x) and I'_R(x). In our implementation, similarity is checked by thresholding the normalized cross-correlation (NCC) in the neighborhood of x. This concept is depicted in Figure 6.3.
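A minimal sketch of this homogeneity test, assuming the right image has already been warped with the current homography (e.g. with cv2.warpPerspective) and using a small window around each candidate pixel; the names and the window size are illustrative, not the thesis values:

import numpy as np

def ncc(a, b):
    """Normalized cross-correlation of two equally sized gray-value windows."""
    a = a.astype(np.float64) - a.mean()
    b = b.astype(np.float64) - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return (a * b).sum() / denom if denom > 0 else 0.0

def is_plane_pixel(I_L, I_R_warped, x, y, win=7, ncc_th=0.5):
    """Homogeneity criterion: accept (x, y) if the local NCC between the
    left image and the homography-warped right image exceeds ncc_th."""
    h = win // 2
    patch_l = I_L[y - h:y + h + 1, x - h:x + h + 1]
    patch_r = I_R_warped[y - h:y + h + 1, x - h:x + h + 1]
    if patch_l.shape != (win, win) or patch_r.shape != (win, win):
        return False  # too close to the image border
    return ncc(patch_l, patch_r) > ncc_th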

Iterative homography improvement

Since each homography has been computed only from points within the seed region, using it for growing is an extrapolation, and the accuracy thus decreases rapidly with increasing distance from the seed region. Therefore, an iterative scheme is required: in the new, extended region, interest points are detected in the left image (our implementation uses the Harris detector). With the current estimate of the homography, these points are transferred to the right image and refined with the sub-pixel matching method of Lan and Mohr [62], which is reported to achieve a matching precision better than 0.1 pixels for selected interest points. With the new, larger set of accurate correspondences, the homography H is updated, and region growing is continued with a new, more accurate image I'_R.

A stopping criterion for the iteration can now easily be derived: if an iteration does not add new point correspondences to the point set, the homography estimate remains unchanged, and further iterations would not change the region anymore. Experiments show that the method converges fast; the algorithm generally finishes in fewer than 10 iterations.

To speed up region growing, a hierarchical representation is used. The images are downscaled during the intermediate growing steps, while the detection and matching of the interest points is done at full resolution. Let us assume that the input images are reduced by a factor N = 2^k. The speedup due to the reduction is twofold: firstly, the required area A_w of the correlation window decreases by a factor of N². The examples shown in Section 6.2.3 have been computed with N = 2 and A_w = (15 × 15) pixels. Secondly, the number of iterations decreases, since the tolerance for corresponding image points x_L and x'_R is raised from 1 to N pixels. After




Figure 6.3: Detecting planar regions with homographies. (a) Left image I_L. (b) Right image I_R. (c) Right image I'_R after transformation with the homography induced by the top plane. (d) Overlay of I_L and I'_R with two rectangular windows marked. (e),(f) The upper window in I_L and I'_R; the similarity is high. (g),(h) The lower window in I_L and I'_R; the similarity is low.

convergence, the final growing step is repeated in the full resolution images to obtain the optimal<br />

result.<br />

The uncalibrated case

So far, we have assumed a calibrated setup, i.e. the projection matrices of all cameras are known, and the principal aim was a Euclidean 3D reconstruction of the planar structures. However, the relations upon which the method is built are also valid in the uncalibrated case, when we have only a set of images with unknown camera parameters. In this case the algorithm can still recover the scene planes, and we will argue that for scenes with a lot of planar structure this facilitates the subsequent orientation and self-calibration.

Given the corresponding regions, the homography now has to be estimated from four correspondences, without using the as yet unknown epipolar constraint. As in the calibrated case, a robust estimator such as RANSAC [28] should be used to make sure that the estimate is not corrupted by any remaining matching errors. There is a subtle difference between the two methods here, which may lead to slight differences in the results: in the calibrated case, both the plane corresponding to the homography and the 3D point corresponding to the two image points are known in Euclidean space. Therefore, one can use the orthogonal distance from the point to the plane to find inliers. In the uncalibrated case, no Euclidean frame is available, hence we use the symmetric transfer error d(x_1, H x_0)² + d(x_0, H⁻¹ x_1)² in the image plane. Note that the uncalibrated case has a degenerate situation: if the camera which took the images underwent only a rotation around its projection center, then the two entire image planes are always related by a single homography, which is not due to any 3D plane.
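A minimal sketch of the symmetric transfer error used as the inlier test in the uncalibrated case (in practice one could also rely on cv2.findHomography with its RANSAC flag; the function names and the pixel threshold below are illustrative assumptions):

import numpy as np

def transfer(H, x):
    """Apply a 3x3 homography to a 2D point given as (x, y)."""
    v = H @ np.array([x[0], x[1], 1.0])
    return v[:2] / v[2]

def symmetric_transfer_error(H, x0, x1):
    """d(x1, H x0)^2 + d(x0, H^-1 x1)^2 for one correspondence x0 <-> x1."""
    e_fwd = np.sum((np.asarray(x1) - transfer(H, x0)) ** 2)
    e_bwd = np.sum((np.asarray(x0) - transfer(np.linalg.inv(H), x1)) ** 2)
    return e_fwd + e_bwd

def inliers(H, pts0, pts1, thresh=4.0):
    """Indices of correspondences consistent with H (threshold in pixels^2)."""
    return [i for i, (x0, x1) in enumerate(zip(pts0, pts1))
            if symmetric_transfer_error(H, x0, x1) < thresh]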



When dealing with scenes which contain a large amount of planar structure, recovering these structures beforehand can benefit subsequent structure-and-motion steps. During the growing stage, a large and well-distributed set of correspondences is recovered, which has already been checked for correctness, because the correspondences satisfy the homography, a stronger constraint than the fundamental matrix. This large and outlier-free point set enables a reliable and accurate estimation of the fundamental matrix. Note that as soon as at least two planar structures are found which are different and cannot be merged, it is guaranteed that we are not dealing with a degenerate case of motion estimation, since

1. the camera motion cannot be a pure rotation, otherwise all detected homographies would be the same and would eventually merge;

2. the recovered scene points cannot be coplanar, since that would again imply that all detected homographies would be the same.

Although we have not further investigated this issue, we conjecture that in the case of more than two images, the large amount of correct and well-distributed points would also benefit self-calibration to upgrade the projective reconstruction to a Euclidean one with a method such as [88].

In Section 6.2.3, some experiments are given for the uncalibrated case, which show that in practice the segmentation into planar scene parts is almost the same as for the calibrated case.

6.2.3 Experimental evaluation

In this section we present experiments on synthetic and real image data. First, we evaluate the reconstruction accuracy and the robustness of the method under large baseline changes on synthetic image data. Second, we show the performance of the method on practically relevant scenes in experiments with real image data.

Synthetic Images

The 'Cube' data-set consists of images with a resolution of 800×800 pixels, which have been rendered from a CAD model of a cube. Each plane has been textured differently using real-world images from the freely available 'Graffiti' image database of Mikolajczyk⁵. The interior and exterior orientation are known from the rendering. Seed regions were detected by an extended version of the salient region detector [31]. The following results were obtained using the calibrated method. Figure 6.4(a) shows the left image with the detected matching planar regions. Figure 6.4(b) shows the initial seed regions. The homographies and planes are calculated from the point correspondences gained in the region matching process. Figure 6.4(c-g) shows the intermediate steps of iterative region growing and homography estimation, as described in the previous section. Figure 6.4(h) shows the final segmentation of the image. The delineation has been improved by intersecting the final planes and snapping to the reprojected intersection lines. The reconstruction is very accurate: Table 6.1 compares the reconstructed edge lengths and the angles of the planes to the z-axis with the ground truth.

The detection and delineation of the planar scene regions can be regarded as a segmentation of the input images into planar regions. For the synthetic 'Cube' data-set, ground truth is also available for this segmentation process, i.e. the correct label is known for every pixel. A

5 'Graffiti' images from Krystian Mikolajczyk, available at http://www.robots.ox.ac.uk/~vgg/research/affine/



plane | edge length (ground truth) | edge length (reconstruction) | angle to z-axis [°] (ground truth) | angle to z-axis [°] (reconstruction)
1 | 1 | 1.0030 | 90 | 89.89
2 | 1 | 1.0029 | 0 | 0
3 | 1 | 1.0035 | 90 | 89.91

Table 6.1: Comparison of edge length and angle to z-axis with ground truth.

quantitative evaluation has therefore been carried out to assess the performance of the proposed algorithm. The algorithm was run on image pairs with increasing baseline and the pixel sets assigned to the visible planes were compared to the ground truth. We counted the number of pixels wrongly assigned to a plane (false positives) and the number of missed pixels (false negatives). No parameter tuning was allowed for different baselines; all image pairs were treated with the same parameter values, which are listed in Table 6.2. The experiment was conducted with both the calibrated and the uncalibrated method. The segmentation results of the evaluation are illustrated in Figure 6.5 (calibrated method). Numerical values for the calibrated method are given in Table 6.3 (top plane) and Table 6.4 (front plane); Figure 6.6(a) and Figure 6.7(a) show the corresponding graphs. The results for the uncalibrated method are given in Table 6.5 (top plane) and Table 6.6 (front plane); the corresponding graphs are shown in Figure 6.6(b) and Figure 6.7(b).

An important observation is that the proposed homography-based region-growing scheme can handle larger baselines than the employed region-matching method. The critical breakdown point was reached when the region matcher was no longer able to provide seed regions (in most cases at more than 60° viewpoint change), while at this point the regions could still be correctly recovered when starting from manually selected seed regions. The evaluation has therefore also been carried out with manual initialization, to show the capabilities of the homography-based region-growing scheme. The stability to viewpoint changes could thus be further improved if a better wide-baseline region matching method were available.

For every test case fewer than 5 iterations were necessary to obtain the resulting segmentation. In summary,

• ≈95% of the points on a visible planar region are correctly assigned (most of the missed pixels are due to homogeneous image regions),

• the rate of non-plane points assigned to a planar region is ≈1% (most of these wrongly classified pixels are located on depth edges at the border of the plane and are therefore difficult to match),

• the error rates are almost constant over a wide range of viewing angles and baselines.

The segmentation results of the calibrated and the uncalibrated method are comparable. However, the calibrated method seems to be more robust against outliers in the point sets used for the initial homographies, leading to more accurate estimates of the initial homographies. This is indicated by the front plane reconstruction at more than 65°, where the uncalibrated method was not able to grow the region from its initial homography while the calibrated method could.




Figure 6.4: Results for the synthetic 'Cube' data-set. (a) Left image with detected seed regions (salient region detector). (b) Seed regions. (c-g) Region growing iterations 1-5; the gray image parts depict the iteratively growing planar regions. (h) Final delineated segmentation. (i) View of the recovered 3D model.

6.2.4 Real Images

The 'Laptop' data-set consists of two images with a resolution of 2160×1440 pixels taken with a calibrated camera. The images were oriented and the described method was applied for reconstruction. Seed regions were detected by an extended version of the salient region detector [31]. Figure 6.8 shows the different steps leading from sparse correspondences to the final segmentation. Region growing converged after five iterations. The scene contains five major planes,



Figure 6.5: Segmentation results for the synthetic 'Cube' data-set with view angle changes from 5° to 75° (calibrated method).

which are more or less textured. Figure 6.8(f) shows that all five planes have been correctly detected and separated. Attention should be drawn to the table and the keyboard of the laptop: both areas are parallel and fairly close to each other, yet the method is accurate



parameter | value
cornerness threshold of Harris detector | 100
size of correlation window | (15×15) pixels
threshold for normalized cross correlation | 0.5

Table 6.2: Parameter values used for the quantitative evaluation of the algorithm. See text for details.

[Figure 6.6: two graphs of plane pixel percentages (Top FN, Top FP, Front FN, Front FP) over view angles from 5° to 75°.]

Figure 6.6: Comparison of 'Cube' plane reconstructions to ground truth for different view angles. Seed regions have been selected manually. (a) Calibrated method. (b) Uncalibrated method.

enough to allow a correct separation. One may notice that the reconstructed planes show holes in homogeneous image regions. In these regions the 'similarity' between the images does not convey any geometric information, hence no reliable reconstruction is possible. We refer to this as the 'safe' reconstruction. If we assume that homogeneous parts within a planar region are part of the region, the missing areas can be filled. This leads to nicer models, but of course this assumption is a heuristic and may in certain cases lead to an incorrect reconstruction. Figure 6.8(b) shows the variance of the gray-values within the correlation window of the left image (dark areas denote low variance in the correlation window, i.e. homogeneous regions). Figure 6.8(f)



[Figure 6.7: two graphs of plane pixel percentages (Top FN, Top FP, Front FN, Front FP) over view angles from 5° to 60°.]

Figure 6.7: Comparison of 'Cube' plane reconstructions to ground truth for different view angles. Seed regions have been matched automatically. (a) Calibrated method. (b) Uncalibrated method.

shows the 'safe' reconstruction of the scene, while the 3D model in Figure 6.8(g) has been created using the homogeneity assumption.

The 'Oberkapfenberg castle' data-set is an outdoor scene recorded with the same calibrated camera at a resolution of 2160 × 1440 pixels. The images were oriented and the described method was applied for reconstruction. As seed regions the MSER regions [70] were used. The eight major planes have been recovered; the results are shown in Figure 6.9. In this particular case, holes in the reconstruction are not exclusively due to homogeneous image parts: some walls we intended to reconstruct were not built completely planar. But even in this complex, partially cluttered scene a reconstruction of the overall structure has been possible.

The results of the experiments can be summarized as follows. The experiments with synthetic data showed that the method allows an accurate reconstruction of the scene planes. They also demonstrate that the method can cope with large baseline changes with almost constant error rates. The experiments on real scenes show the application to practically relevant reconstruction tasks; in both scenes the major planes could be reconstructed. The experiments revealed that in real-world scenes difficulties with non-textured regions and with not completely planar structures occur. However, we showed that it is possible to overcome these



viewing angle [°] | ground truth [pixel] | manual seeds: false neg. [pixel (%)] | manual seeds: false pos. [pixel (%)] | automatic seeds: false neg. [pixel (%)] | automatic seeds: false pos. [pixel (%)]

5 123432 6592 ( 5.34 ) 1082 ( 0.88 ) 7684 ( 6.23 ) 812 ( 0.66 )<br />

10 123491 6131 ( 4.96 ) 1217 ( 0.99 ) 6255 ( 5.07 ) 1203 ( 0.97 )<br />

15 123505 6267 ( 5.07 ) 1505 ( 1.22 ) 6615 ( 5.36 ) 1471 ( 1.19 )<br />

20 123518 6685 ( 5.41 ) 1430 ( 1.16 ) 6239 ( 5.05 ) 1907 ( 1.54 )<br />

25 123558 6405 ( 5.18 ) 1354 ( 1.10 ) 5970 ( 4.83 ) 1763 ( 1.43 )<br />

30 123540 6264 ( 5.07 ) 1389 ( 1.12 ) 6147 ( 4.98 ) 1465 ( 1.19 )<br />

35 123597 6264 ( 5.07 ) 1246 ( 1.01 ) 6175 ( 5.00 ) 1363 ( 1.10 )<br />

40 123540 6227 ( 5.04 ) 1219 ( 0.99 ) 6351 ( 5.14 ) 1185 ( 0.96 )<br />

45 123564 6193 ( 5.01 ) 980 ( 0.79 ) 6135 ( 4.97 ) 1012 ( 0.82 )<br />

50 123559 6355 ( 5.14 ) 824 ( 0.67 ) 6291 ( 5.09 ) 863 ( 0.70 )<br />

55 123594 6511 ( 5.27 ) 612 ( 0.50 ) 6339 ( 5.13 ) 739 ( 0.60 )<br />

60 123544 6700 ( 5.42 ) 519 ( 0.42 ) 5948 ( 4.81 ) 911 ( 0.74 )<br />

65 123571 7043 ( 5.70 ) 394 ( 0.32 ) — —<br />

70 123551 7114 ( 5.76 ) 380 ( 0.31 ) — —<br />

75 123507 7393 ( 5.99 ) 373 ( 0.30 ) — —<br />

Table 6.3: Comparison of ’Cube’ top plane reconstruction to ground truth <strong>for</strong> different view<br />

angles (calibrated method).<br />

viewing angle [°] | ground truth [pixel] | manual seeds: false neg. [pixel (%)] | manual seeds: false pos. [pixel (%)] | automatic seeds: false neg. [pixel (%)] | automatic seeds: false pos. [pixel (%)]

5 200935 11396 ( 5.67 ) 1157 ( 0.58 ) 11120 ( 5.53 ) 2226 ( 1.11 )<br />

10 196926 11070 ( 5.62 ) 1166 ( 0.59 ) 11046 ( 5.61 ) 1291 ( 0.66 )<br />

15 190304 12596 ( 6.62 ) 1257 ( 0.66 ) 12284 ( 6.45 ) 1664 ( 0.87 )<br />

20 181260 12373 ( 6.83 ) 1308 ( 0.72 ) 12805 ( 7.06 ) 864 ( 0.48 )<br />

25 169992 11247 ( 6.62 ) 1251 ( 0.74 ) 11536 ( 6.79 ) 886 ( 0.52 )<br />

30 156766 9953 ( 6.35 ) 1235 ( 0.79 ) 9974 ( 6.36 ) 1092 ( 0.70 )<br />

35 141781 8615 ( 6.08 ) 1167 ( 0.82 ) 8655 ( 6.10 ) 1051 ( 0.74 )<br />

40 125572 6865 ( 5.47 ) 1061 ( 0.84 ) 6753 ( 5.38 ) 1169 ( 0.93 )<br />

45 108269 4878 ( 4.51 ) 1003 ( 0.93 ) 4861 ( 4.49 ) 949 ( 0.88 )<br />

50 90277 3450 ( 3.82 ) 954 ( 1.06 ) 3405 ( 3.77 ) 980 ( 1.09 )<br />

55 72018 2352 ( 3.27 ) 902 ( 1.25 ) 2305 ( 3.20 ) 905 ( 1.26 )<br />

60 53702 1788 ( 3.33 ) 914 ( 1.70 ) — —<br />

65 35725 2127 ( 5.95 ) 879 ( 2.46 ) — —<br />

Table 6.4: Comparison of ’Cube’ front plane reconstruction to ground-truth <strong>for</strong> different view<br />

angles (calibrated method).<br />

problems by using some heuristic assumptions.



viewing angle [°] | ground truth [pixel] | manual seeds: false neg. [pixel (%)] | manual seeds: false pos. [pixel (%)] | automatic seeds: false neg. [pixel (%)] | automatic seeds: false pos. [pixel (%)]

5 123432 8014 ( 6.49 ) 754 ( 0.61 ) 7791 ( 6.31 ) 764 ( 0.62 )<br />

10 123491 6027 ( 4.88 ) 1477 ( 1.20 ) 6487 ( 5.25 ) 1019 ( 0.83 )<br />

15 123505 6388 ( 5.17 ) 1553 ( 1.26 ) 6632 ( 5.37 ) 1376 ( 1.11 )<br />

20 123518 6800 ( 5.51 ) 1360 ( 1.10 ) 6385 ( 5.17 ) 1667 ( 1.35 )<br />

25 123558 6449 ( 5.22 ) 1356 ( 1.10 ) 7662 ( 6.20 ) 1072 ( 0.87 )<br />

30 123540 6285 ( 5.09 ) 1327 ( 1.07 ) 6010 ( 4.86 ) 1381 ( 1.12 )<br />

35 123597 6325 ( 5.12 ) 1320 ( 1.07 ) 6446 ( 5.22 ) 1151 ( 0.93 )<br />

40 123540 6340 ( 5.13 ) 1250 ( 1.01 ) 6339 ( 5.13 ) 1199 ( 0.97 )<br />

45 123564 6168 ( 4.99 ) 1004 ( 0.81 ) 6158 ( 4.98 ) 1005 ( 0.81 )<br />

50 123559 6245 ( 5.05 ) 861 ( 0.70 ) 6225 ( 5.04 ) 864 ( 0.70 )<br />

55 123594 6339 ( 5.13 ) 738 ( 0.60 ) 6274 ( 5.08 ) 749 ( 0.61 )<br />

60 123544 6493 ( 5.26 ) 609 ( 0.49 ) 5950 ( 4.82 ) 897 ( 0.73 )<br />

65 123571 6190 ( 5.01 ) 646 ( 0.52 ) — —<br />

70 123551 6591 ( 5.33 ) 572 ( 0.46 ) — —<br />

75 123507 6915 ( 5.60 ) 485 ( 0.39 ) — —<br />

Table 6.5: Comparison of ’Cube’ top plane reconstruction to ground truth <strong>for</strong> different view<br />

angles (uncalibrated method).<br />

viewing angle [°] | ground truth [pixel] | manual seeds: false neg. [pixel (%)] | manual seeds: false pos. [pixel (%)] | automatic seeds: false neg. [pixel (%)] | automatic seeds: false pos. [pixel (%)]

5 200935 11038 ( 5.49 ) 2545 ( 1.27 ) 11087 ( 5.52 ) 2346 ( 1.17 )<br />

10 196926 11312 ( 5.74 ) 1021 ( 0.52 ) 10855 ( 5.51 ) 1519 ( 0.77 )<br />

15 190304 12503 ( 6.57 ) 1417 ( 0.74 ) 12316 ( 6.47 ) 1626 ( 0.85 )<br />

20 181260 12222 ( 6.74 ) 1430 ( 0.79 ) 12487 ( 6.89 ) 1002 ( 0.55 )<br />

25 169992 11179 ( 6.58 ) 1331 ( 0.78 ) 11122 ( 6.54 ) 1285 ( 0.76 )<br />

30 156766 9852 ( 6.28 ) 1248 ( 0.80 ) 9979 ( 6.37 ) 948 ( 0.60 )<br />

35 141781 8530 ( 6.02 ) 1257 ( 0.89 ) 8486 ( 5.99 ) 1322 ( 0.93 )<br />

40 125572 6810 ( 5.42 ) 1165 ( 0.93 ) 6736 ( 5.36 ) 1165 ( 0.93 )<br />

45 108269 4851 ( 4.48 ) 995 ( 0.92 ) 4847 ( 4.48 ) 993 ( 0.92 )<br />

50 90277 3396 ( 3.76 ) 930 ( 1.03 ) 3388 ( 3.75 ) 932 ( 1.03 )<br />

55 72018 2330 ( 3.24 ) 886 ( 1.23 ) 2310 ( 3.21 ) 860 ( 1.19 )<br />

60 53702 1516 ( 2.82 ) 867 ( 1.61 ) — —<br />

Table 6.6: Comparison of ’Cube’ front plane reconstruction to ground-truth <strong>for</strong> different view<br />

angles (uncalibrated method).




Figure 6.8: Results for the 'Laptop' data-set. (a) Left image with detected seed regions (salient region detector). (b) Confidence map (dark regions denote low variance in the correlation window, i.e. homogeneous regions). (c-e) First three region growing iterations. (f) Final segmentation. (g) 3D model (homogeneous areas have been filled).




Figure 6.9: Results for the 'Oberkapfenberg castle' data-set. (a) Left image with detected seed regions (MSER detector). (b) Final segmentation. (c) 3D model ('safe' reconstruction).


Chapter 7<br />

Living in a piecewise planar world 1<br />

This chapter explains how the most important tasks in mobile robotics, map building and localization, can be accomplished using the wide-baseline methods described in the previous chapters. The approach to be presented differs in multiple points from previous work. One notable property of the new approach is that a dense 3D reconstruction augmented with partial texture is used as the world representation. Current vision-based SLAM approaches as described in Chapter 2 use much simpler primitives as world representation, like 3D lines, 3D points or small planar fiducial markers. Irrespective of the primitives used, previous approaches created only sparse world representations. Throughout this chapter we describe the advantages of our method over previous methods and explain the new method in detail. We will show that the proposed world representation yields valuable benefits. A second novelty of our method is that global localization is possible from a single landmark correspondence only. This enables localization in extreme situations, e.g. when large occlusions occur. Large occlusions or major temporary scene changes pose a big challenge for state-of-the-art localization methods. Robot localization is deeply connected to the underlying world representation, and our method of localizing with a single landmark is made feasible by the new world representation.

The great potential of the proposed world representation resides in the use of 3D plane patches as map primitives instead of 3D points and 3D lines. The geometrical constraints introduced by plane primitives proved extremely valuable, definitely being worth the more complex map building algorithms. However, by using 3D plane primitives we introduce a strong assumption into our world representation, namely that the world is piece-wise planar. The world contains a lot of structure which cannot be modelled by simple plane primitives, and some may consider this assumption too strict. But locally, a piece-wise planar approximation will always come close to the original structure. Moreover, man-made places contain a high degree of planar structure, and most robotic platforms are only capable of driving indoors. Furthermore, the 3D reconstructions used as maps serve robot localization only, so it is not necessary to model all the details; it is only necessary to model enough detail to allow successful localization. In return, the following particular benefits are gained by the piece-wise planar world description:

Localization from a single landmark: A single 3D landmark is a small planar patch with 6 3D parameters. This gives more constraints for pose estimation than a single 3D

1 Based on the publication:
F. Fraundorfer and H. Bischof. Global localization from a single feature correspondence. In Proc. 30th Workshop of the Austrian Association for Pattern Recognition, Obergurgl, Austria, pages 151-160, 2006 [34]




point landmark. In fact, a single plane match already allows pose estimation, while this is not possible with a single 3D point landmark (see the sketch after this list).

Additional geometric constraints: Landmarks which are located on one and the same 3D plane are connected by geometric constraints. Plane projective relations are much more restrictive than general projective relations; a planar homography can be used very efficiently to verify feature matches geometrically.

Feature reduction: By selecting only landmarks located on 3D planes, the number of features stored in the map is reduced significantly. The map uses less memory, and the computation time for feature matching naturally depends on the number of features. The selection also increases robustness and reliability: non-planar features may change their appearance more significantly than planar features under viewpoint changes. Such landmarks are a source of ambiguities in feature matching, and mismatches, which cause problems in pose estimation, will occur more frequently.

Easier matching of planar landmarks: State-of-the-art wide-baseline methods assume that landmarks undergo a planar projective transformation under viewpoint change. The currently most advanced matching methods approximate this projective transformation by an affine transformation to create viewpoint-normalized descriptors. Landmarks located on 3D corners strongly violate this assumption; such features would cause trouble for matching algorithms and should not be stored as landmarks in the map.

Increased accuracy: The accuracy of the 3D reconstruction can be increased with plane information. 3D point reconstructions are coupled by geometric constraints, and the 3D coordinates can be optimized to lie exactly on a plane.
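To illustrate the first point, a single planar landmark carries enough geometry for a full 6DOF pose. The sketch below is an illustration, not the thesis algorithm of Section 7.2: it assumes the four corner points of a planar map patch are known in 3D together with their measured image projections and a calibrated camera, and uses OpenCV's planar PnP solver.

import numpy as np
import cv2

def pose_from_planar_patch(corners_3d, corners_2d, K):
    """6DOF camera pose from a single planar landmark.

    corners_3d: 4x3 array, the patch corners in map coordinates (coplanar).
    corners_2d: 4x2 array, the corresponding image measurements in pixels.
    K:          3x3 camera calibration matrix.
    Returns (R, t) mapping map coordinates into the camera frame.
    """
    ok, rvec, tvec = cv2.solvePnP(
        corners_3d.astype(np.float64),
        corners_2d.astype(np.float64),
        K, distCoeffs=None,
        flags=cv2.SOLVEPNP_IPPE,  # PnP variant for planar targets
    )
    if not ok:
        raise RuntimeError("pose estimation failed")
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec

A single 3D point landmark, in contrast, constrains only the viewing ray and can never fix all six pose parameters.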

In the following, a batch method for map building is presented. Input for the method is an image sequence acquired by a mobile robot equipped with a single perspective camera. The camera needs to be calibrated beforehand. Structure-from-motion algorithms and wide-baseline stereo methods are applied to build the piece-wise planar world representation. The created map can then be used for purely vision-based global localization: a mobile robot equipped with a single perspective camera can estimate its pose with respect to the world map from a single camera image.

The localization approach to be presented is analogous to [56] in that it computes the robot pose from 3D-2D point correspondences. The novelty is the use of small planar patches as 3D landmarks and that the pose can be computed from a single landmark correspondence. This allows localization under extreme conditions, where other methods, which usually require a high number of correspondences, would fail. The novel localization approach is presented in the second part of this chapter.

7.1 Map building<br />

The world is represented as a network of linked metric sub-maps (see Figure 7.1 for an illustration). Each sub-map has its own local coordinate system, and each link between two sub-maps represents a rigid transformation (containing rotation, translation and scaling) connecting both local coordinate systems. Thus it is possible to express a position within a specific sub-map in each local coordinate system. Furthermore, each sub-map contains the transformation into


Figure 7.1: The world is represented as a network of linked metric sub-maps.<br />

one global world coordinate system, yielding one big metric world representation. Robot localization is done within the scope of a single sub-map: the pose is initially expressed in the local coordinate system but can be transferred into the global world coordinate system with the corresponding transformation. A single sub-map is created by 3D reconstruction from a short-baseline image pair. The links between the sub-maps are established via wide-baseline feature matching. Map building is treated as an off-line process. Images are acquired by one or multiple robots (either controlled manually or using additional sensors, e.g. a laser range finder). From this unordered pile of images the environment map is constructed in three steps. In a first step, the image pile is partitioned into smaller piles containing similar images, which will correspond to sub-maps. Next, single sub-maps are created using two images of each smaller pile. In a last step, the individual sub-maps are linked to form the complete world representation. The map created in this way can then be used on a mobile robot equipped only with a single camera for global localization within the mapped environment. In the following, the three steps are outlined in detail; global localization within the proposed map is dealt with subsequently.

7.1.1 Sub-map identification

Starting point is a large set of images I_1, ..., I_n taken at a high frame rate. We assume that the ordering of the images is not known, i.e. that we do not know which images are subsequent to others. The task of this step is to partition the whole set into sub-sets C_1, ..., C_c containing images with short-baseline variation only. Each partition will then act as a sub-map. The partitioning is done by means of clustering: a global similarity criterion is used to group visually similar images into clusters. The requirement for the images in each partition is that a stereo reconstruction is possible.


Figure 7.2: Sub-map identification: An image sequence is partitioned into clusters of visually similar images. Each cluster acts as a sub-map. Images within one partition should show small baseline variations only. There should be some overlap between images from subsequent clusters.

Furthermore, the images in adjacent partitions should have an overlapping part (necessary for sub-map linking). Figure 7.2 illustrates the partitioning of an image sequence.

As similarity measure the Euclidean distance between SIFT descriptors [67] is used. For each image a single SIFT feature vector is computed from a low-resolution version of the image; for feature extraction the images are re-sampled to 64 × 64 pixels. Each image is thus represented by a single feature vector of length 128. This results in n feature vectors x_1, ..., x_n corresponding to the images I_1, ..., I_n, with

x = (x_1, ..., x_128) ∈ ℝ^128.  (7.1)

This results in a 128-dimensional feature space. Visually similar images will form clusters, and the partitions can be found by clustering. Simple k-means clustering [25] worked well on this problem. The algorithm was run with different initial cluster numbers and the solution yielding the most compact clusters was selected. Alternatively, algorithms could be used which do not require an initial guess for the number of clusters, e.g. hierarchical clustering [25] or mean shift clustering [18]. Clustering returns c sets of feature vectors C_1, ..., C_c and the corresponding cluster centers x̄_1, ..., x̄_c. The cluster center is the mean value of the feature vectors of a cluster, written as

x̄_i = (1 / |C_i|) ∑_{x_j ∈ C_i} x_j.  (7.2)

For each cluster two images are chosen to represent the sub-map; the remaining images are not processed further. The two selected images must allow a 3D stereo reconstruction as well as landmark extraction. The selection of the two images is done in feature space. The first image is the one corresponding to the median cluster center. We define the median cluster center as

x_median = argmin_{x_j ∈ C} |x_j − x̄|.  (7.3)

The second image is selected in the following way. For each feature vector within the cluster the Euclidean distance to the feature vector of the first image x_median is calculated. The image corresponding to the feature vector with median distance is then selected as second image:

x = argmedian_{x_j ∈ C} |x_j − x_median|.  (7.4)

To verify the selected images, region matching as described in Chapter 6 is performed. The region matches must satisfy the epipolar constraint, otherwise another image of the cluster gets selected. The sub-map identification step also works as data reduction: from the initially large set of n images only 2c images (c ≪ n) are passed on to the subsequent steps.
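To make the selection rule concrete, the following Python/NumPy sketch implements equations 7.3 and 7.4 for one cluster. It assumes the per-image feature vectors and a cluster assignment (e.g. from an off-the-shelf k-means implementation) are already available; the function and variable names are illustrative only, not part of the original implementation.

```python
import numpy as np

def select_representative_images(X, labels, cluster_id):
    """Pick the two representative images of one cluster (eqs. 7.3 and 7.4).

    X       : (n, 128) array of per-image global SIFT feature vectors
    labels  : (n,) cluster assignment for every image (e.g. from k-means)
    returns : indices of the first and second representative image
    """
    idx = np.flatnonzero(labels == cluster_id)   # images of this cluster
    C = X[idx]                                   # their feature vectors
    center = C.mean(axis=0)                      # cluster center (eq. 7.2)

    # First image: feature vector closest to the cluster center (eq. 7.3).
    d_center = np.linalg.norm(C - center, axis=1)
    first = int(np.argmin(d_center))

    # Second image: median distance to the first image's vector (eq. 7.4).
    d_first = np.linalg.norm(C - C[first], axis=1)
    order = np.argsort(d_first)
    second = int(order[len(order) // 2])

    return idx[first], idx[second]
```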

7.1.2 Sub-map creation

Let us define a sub-map as a 9-tuple

S = ⟨T_L^W, K, I, L, C, Π, D, A, P_L⟩.  (7.5)

Table 7.1 provides a quick overview of the sub-map components; in the following they are discussed in detail. The key components of the sub-map are the landmarks. Landmarks are interest regions detected in image I_i. The position in the image and the 3D position in the local coordinate system of the sub-map are known. A landmark description in the form of a feature vector is available too; it allows the detection of corresponding landmarks. Only image regions which are planar are used as landmarks. I, L, D, Π and A are used to store the landmarks. L is a set of size n containing one image patch for each landmark (n is the number of detected landmarks). The image patches are stored normalized (and re-sampled) with a size of 64 × 64 pixels. Normalization is done by applying an affine transformation. The normalization transformation is different for each landmark and describes how to transform the image patch of the landmark from the original image coordinate system into a canonical coordinate system.


Sub-map component   Description
T_L^W               rigid transformation into the global coordinate system
K                   camera calibration matrix
I                   plane index image
L                   landmark image patches
C                   plane covariances
Π                   3D planes
D                   landmark SIFT descriptors
A                   landmark normalization transformations
P_L                 camera matrix of the local coordinate system

Table 7.1: Components of the piece-wise planar sub-map.

A is a set of size n holding a transformation for each landmark in L; an entry of A is an affine transformation matrix of size 3 × 3. D is a set of size n of feature vectors providing a description for each landmark. Each entry of D is a SIFT feature vector of length 128, computed from the corresponding normalized image patch in L. I and Π are used to represent the 3D coordinates of a landmark. Π is a set of size p of 3D plane descriptions of the sub-map, where p is the number of planes detected in the sub-map. Each plane is described by a 6-vector (parameterized with normal vector and one 3D point) representing the 3D parameters within the local coordinate system. Each landmark is located on one of these planes in 3D space. The corresponding mapping is stored in I, an index image holding the information which pixel in the image space corresponds to which plane in 3D. The map also contains uncertainties for the 3D planes: the set C of size p contains covariance matrices for the different 3D planes. K and P_L define the local coordinate system. P_L is the camera matrix which connects the 3D planes to the image coordinates, and K is the corresponding 3 × 3 camera calibration matrix. T_L^W represents a rigid transformation into the global coordinate system. It is a 4 × 4 similarity transformation matrix (rotation, translation, scale) which transforms 3D points from the local into the global coordinate system. A sub-map defined in this way contains all necessary information for global localization. In the following the computation of the various entries is described.
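As a concrete illustration, the 9-tuple of equation 7.5 could be held in a small container such as the following Python sketch. The field names mirror Table 7.1, but the array shapes and storage choices are assumptions made for illustration, not part of the original definition.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class SubMap:
    """Piece-wise planar sub-map S = <T_L^W, K, I, L, C, Pi, D, A, P_L> (eq. 7.5)."""
    T_WL: np.ndarray                   # 4x4 similarity transform into the world frame
    K: np.ndarray                      # 3x3 camera calibration matrix
    index_image: np.ndarray            # I: per-pixel plane identifier
    patches: List[np.ndarray]          # L: n normalized 64x64 landmark patches
    plane_covs: List[np.ndarray]       # C: p covariance matrices of the 3D planes
    planes: np.ndarray                 # Pi: p x 6 plane parameters (normal + point)
    descriptors: np.ndarray            # D: n x 128 SIFT descriptors of the patches
    norm_transforms: List[np.ndarray]  # A: n affine 3x3 normalization transforms
    P_L: np.ndarray                    # 3x4 camera matrix of the local frame
```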

7.1.3 Structure computation

We will now describe how to extract the 3D map structure from two images. From the sub-map identification step a short-baseline stereo image pair I, I′ is established. The goal of the structure computation is to identify planes in the image scene and compute a 3D reconstruction of these planes. The reconstruction will only contain planes; non-planar image parts are discarded. This results in a piece-wise planar 3D reconstruction of the scene. The segmentation of the scene into planar regions is done with the method described in Chapter 6. A prerequisite for this method is that the camera poses are known. Thus in a first step the camera positions for the images I, I′ have to be computed. DoG interest points are detected in both images I, I′ and SIFT descriptors D, D′ are computed for every detected interest point. Corresponding points are identified by nearest-neighbor search in feature space.
are identified by nearest neighbor search in feature space. As distance measure the Euclidean


As distance measure the Euclidean distance is used. Two features correspond if

d_01 / d_02 < t  (7.6)

where d_01 is the Euclidean distance between the query feature and the nearest feature point from D′, d_02 is the Euclidean distance to the second-closest feature, and t is a distance threshold. Good results can be achieved with t = 0.8. This distance ratio test has been suggested by Lowe [67] for SIFT feature matching. The feature correspondences established in this way can now be used for estimating the camera poses. As already mentioned, we assume calibrated cameras, i.e. the calibration matrix K is known. Thus we can estimate the essential matrix, which encodes the camera positions of the two viewpoints. Essential matrix estimation is performed using the 5-point algorithm of Nister [83] within a standard RANSAC scheme [28]. The essential matrix E can be decomposed into two camera matrices P, P′ where P is the canonical camera matrix and P′ defines the second camera position in the local canonical coordinate frame (see equations 7.7 and 7.8):

P = [I | 0]  (7.7)

P′ = [R | t]  (7.8)

R is a 3 × 3 rotation matrix and t is a translation 3-vector defining the baseline of the stereo pair. P, P′ are input parameters for the subsequent plane segmentation and reconstruction step. The algorithm also requires, as input, initial guesses for small planar regions in the images I, I′ together with the corresponding inter-image homographies. For that, MSER regions are detected in I and I′. Region matching is performed (as described in Section 6.1), which returns point correspondences within each interest region and the corresponding homography transform. This constitutes the initial guesses for the plane segmentation algorithm. Plane segmentation and reconstruction is then performed, yielding the following map components:

• An index image I.

• A set of detected and reconstructed planes Π. Each plane is represented by a 6-vector giving the full 3D parameters in the local coordinate frame.

• Covariances for each plane giving an uncertainty measure for the reconstruction accuracy.

The structure computation is completed by a post-processing step. Planes which extend behind the camera planes are removed. This consistency check deletes incorrectly reconstructed image parts, resulting in higher robustness. The different steps of structure reconstruction are illustrated in Figure 7.3.
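As an illustration of the first part of this pipeline (ratio-test matching followed by essential-matrix estimation and decomposition), the sketch below uses OpenCV as a stand-in for the components described above. OpenCV's findEssentialMat also uses a 5-point solver inside RANSAC, but the parameter values and function boundaries here are illustrative assumptions, not the exact implementation used in this work.

```python
import cv2
import numpy as np

def relative_pose(img1, img2, K, ratio=0.8):
    """Match SIFT features with the ratio test (eq. 7.6) and recover R, t."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    # Ratio test: keep a match only if d01 / d02 < ratio.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    raw = matcher.knnMatch(des1, des2, k=2)
    good = [m for m, n in raw if m.distance < ratio * n.distance]

    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

    # Essential matrix via a 5-point solver inside RANSAC, then decompose
    # into P = [I|0] and P' = [R|t] (eqs. 7.7, 7.8).
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, t
```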

7.1.4 Landmark extraction

Up to now the sub-map is still missing its essential components, the landmarks. Closing this gap is the goal of the next step: landmark appearance has to be connected to 3D information and incorporated into the sub-map. In the following we describe an approach using MSER interest regions as landmarks. However, the definition of the sub-map is general enough to allow the use of any other kind of interest region. The necessary steps are:


Figure 7.3: Sub-map creation: (a),(b) Short-baseline image pair (with landmarks shown). (c) Index image resulting from plane segmentation. (d) Reconstructed 3D planes. (e) Planar landmarks in 3D.

• Detection of interest regions (MSERs)

• Normalization using a local affine frame (LAF)

• SIFT descriptor extraction

• Computation of the 3D coordinates of the landmarks (by projection onto the corresponding 3D plane)

First, interest regions are detected in one of the images of the sub-map's short-baseline pair, in our case MSER regions. Each region is represented by its region border, which is simply stored as a list of the image coordinates of every border pixel. Next, normalization of the regions is performed. We use one of the methods described in [85]: the border is searched for points of maximal concavity or convexity. Two such points A, B together with the region's center of gravity C define a local affine frame (LAF). CA and CB are the axes of a 2D coordinate system that has undergone an arbitrary affine transformation. Normalization can be done by applying a transformation which restores the orthogonality of CA and CB and scales the axes to unit length. The LAF is transformed into a canonical coordinate system, defined as a 64 × 64 pixel image patch, and the extracted MSER region is re-sampled into this normalized frame. Multiple LAFs can be constructed for a single MSER region, yielding different normalized MSER regions. In our framework each new normalization is simply added to the set of landmarks L.
a single MSER region yielding different normalized MSER regions. In our framework each new


In a next step the SIFT descriptor is computed from the normalized MSER regions. The size of 64 × 64 is well suited for computing the SIFT orientation histogram. For each normalized MSER region a feature vector of length 128 is computed; D is the set of all feature vectors of the extracted landmarks. Next, for each landmark the corresponding 3D coordinates are computed. This is done with the index image I representing the segmentation into scene planes. First, the plane corresponding to each landmark has to be identified. The index image I basically works as a look-up table: every gray-value in the index image I works as a plane identifier. For every pixel within a landmark we look at the same pixel position in I and read the plane identifier, which indexes the planes in Π. For robustness, we build a histogram over the looked-up plane identifiers, and the plane with the maximal histogram value gets assigned to the landmark. This approach allows us to deal with imperfect segmentation. Now for every landmark pixel the 3D coordinate can be calculated by computing the intersection of the corresponding 3D plane with the ray connecting the pixel in the image plane and the camera center. This results in an exactly planar 3D reconstruction of the landmark. Furthermore it is not necessary to explicitly store the 3D coordinates; they can simply be computed when needed from the index image I and the plane set Π. An illustration of reconstructed landmarks in 3D is given in Figure 7.3(e).
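The plane look-up and back-projection just described can be written down compactly. The sketch below assumes the sub-map camera is the canonical one (P = [I | 0], so viewing rays are K⁻¹x) and that a plane is stored as a normal n and a point p on the plane; the names and the pixel ordering are illustrative assumptions.

```python
import numpy as np

def landmark_plane_id(index_image, landmark_pixels):
    """Assign a landmark to a plane by voting over the index image I."""
    ids = [index_image[v, u] for (u, v) in landmark_pixels]   # (u, v) = (col, row)
    values, counts = np.unique(ids, return_counts=True)
    return int(values[np.argmax(counts)])                     # most frequent plane id

def backproject_to_plane(x, K, plane_normal, plane_point):
    """Intersect the viewing ray of pixel x with a 3D plane (n, p).

    Assumes the canonical camera P = [I|0], so the ray is X(s) = s * K^{-1} x_h.
    """
    ray = np.linalg.inv(K) @ np.array([x[0], x[1], 1.0])
    s = (plane_normal @ plane_point) / (plane_normal @ ray)   # ray-plane intersection
    return s * ray                                            # exact 3D point on the plane
```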

7.1.5 Sub-map linking

Let us consider two sub-maps,

A = ⟨T_L^W(A), K^(A), I^(A), L^(A), C^(A), Π^(A), D^(A), A^(A), P_L^(A)⟩  (7.9)

and

B = ⟨T_L^W(B), K^(B), I^(B), L^(B), C^(B), Π^(B), D^(B), A^(B), P_L^(B)⟩.  (7.10)

Each sub-map defines its own local coordinate system. The coordinate systems are Euclidean and differ by

• an arbitrary scale factor s,

• a 3 × 3 rotation matrix R,

• and a translation vector t of length 3.

Together this represents a scaled rigid point transformation (or similarity transform). 3D points p^(B) in sub-map B can be transformed into the coordinate frame of sub-map A by

p^(A) = s(R p^(B) + t).  (7.11)

In homogeneous coordinates the sequence of transformations can be encapsulated by

RT = s [ R  t ; 0ᵀ  1 ].  (7.12)

RT is a 4 × 4 transformation matrix acting on homogeneous 3D points. The transformation is written as

p_h^(A) = RT p_h^(B)  (7.13)

where p_h^(A) and p_h^(B) are the homogeneous counterparts of p^(A) and p^(B). The goal of sub-map linking is to estimate the values of the parameters s, R and t. The necessary parameters can be estimated from 3D point correspondences between the two sub-maps with the following steps:


• Establishing 3D point correspondences

• Calculating the scale factor s

• Estimating R and t of the rigid transformation

Let us first focus on the generation of 3D point correspondences. By wide-baseline region matching (see Section 6.1), landmark correspondences between both sub-maps are detected. For matching, the already extracted features L^(A), D^(A) and L^(B), D^(B) can be used. Sub-map linking is possible from a single region match; a higher number of matches will, however, increase the robustness of the method. Let us continue with the case of a single region match only. The region matching returns multiple point correspondences q ↔ q′ (on the order of 20-100) within the region match. As already described in Section 7.1.4, 3D coordinates can be computed by projecting q and q′ onto their corresponding planes. The resulting 3D points Q and Q′ are defined in the local coordinate systems of the sub-maps. Point correspondences in 3D, Q ↔ Q′, are directly known from the 2D point correspondences and do not contain outliers². The scale s between the sub-maps is the first parameter we estimate from the 3D point correspondences Q ↔ Q′. Two point pairs Q_i ↔ Q′_i and Q_j ↔ Q′_j are arbitrarily selected from Q ↔ Q′. The scale change from sub-map B to A is defined as

s = ‖Q_i − Q_j‖ / ‖Q′_i − Q′_j‖  (7.14)

where ‖Q_i − Q_j‖ is the Euclidean distance. Before we continue with the next steps, the scaling transform has to be applied to the points Q′ from sub-map B, resulting in the scaled coordinates

Q′_s = sQ′.  (7.15)

After scaling, the two 3D point sets Q and Q′_s differ only by a rigid transformation. The rigid transformation parameters R and t are computed from Q ↔ Q′_s using the quaternion-based method described by Horn [47]. Now all parameters of the similarity transform are known and it is possible to transform each 3D point in the local coordinate system of sub-map B into the coordinate frame of sub-map A.
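The alignment step can be sketched in a few lines of NumPy. For brevity the rigid part below uses the well-known SVD-based (Kabsch/Umeyama) absolute-orientation solution instead of Horn's quaternion formulation; on outlier-free correspondences both recover the same R and t. This is a minimal illustration, not the implementation used in the thesis.

```python
import numpy as np

def link_submaps(Q, Qp):
    """Estimate s, R, t aligning sub-map B points Qp with sub-map A points Q.

    Q, Qp : (n, 3) arrays of corresponding, outlier-free 3D points.
    The scale follows eq. 7.14; points are mapped as p_A = R (s p_B) + t,
    an equivalent parameterization of the similarity transform of eq. 7.11.
    """
    # Scale from one arbitrary point pair (eq. 7.14), then apply it (eq. 7.15).
    s = np.linalg.norm(Q[0] - Q[1]) / np.linalg.norm(Qp[0] - Qp[1])
    Qs = s * Qp

    # Rigid alignment of the centered, scaled point sets (SVD-based).
    cQ, cQs = Q.mean(axis=0), Qs.mean(axis=0)
    H = (Qs - cQs).T @ (Q - cQ)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                 # proper rotation, det(R) = +1
    t = cQ - R @ cQs
    return s, R, t
```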

We have just developed the method to link two sub-maps. Let us now focus on the problem of linking a number of n sub-maps into one complete environment map. The idea is to store the information about which sub-maps are linked in a graph-like structure, together with the corresponding coordinate transforms. For that we adapt the approach introduced by Schaffalitzky et al. [92], where a set of n images is spatially organized by means of wide-baseline feature matching. The key ingredient is full two-view image matching including epipolar geometry estimation; two views get linked only if the epipolar geometry could be established successfully. However, this usually requires a high overlap in the image data, which is in general not the case with our image data; in particular, we are interested in the cases with small overlapping area. In our approach we start with the computation of a distance matrix over all occurring landmarks. We use the already detected landmarks contained in the sub-map representation; in detail, we calculate a distance matrix over all descriptors D of each sub-map. As distance metric the normalized cross-correlation is used. The correlation value has the favorable property of being limited between −1 and 1, which allows the use of absolute thresholds.

² Dealing with outliers in the registration of two 3D point clouds is very challenging. The way the 3D point correspondences are generated in our method eases the solution of this problem.


The size of the distance matrix is N × N, where N = |D_0| + |D_1| + ... + |D_n|. Each entry of the distance matrix represents a tentative link, and in the following the tentative links are verified, starting with the link with the highest correlation value. In the first iteration the match with the highest correlation value is searched and verified with the wide-baseline matching method described in Section 6.1. If the match is confirmed, the sub-maps are linked with the previously described method using the detected landmark correspondences. A graph structure G is established where the sub-maps represent the nodes and the detected links between sub-maps are the edges; a confirmed match adds an edge to G. Once a match is confirmed, no further links are searched for the two already linked sub-maps. If the match could not be confirmed, the match with the second-highest correlation value is investigated. The algorithm ends if all sub-maps are linked or if all entries of the distance matrix have been processed. This could lead to a worst-case computational complexity of O(N²); in practice, however, one can also end the algorithm if the correlation values of the remaining match pairs drop below some threshold c_th.

Please note that this algorithm does not guarantee a completely linked graph G: G can contain isolated clusters or individual nodes. However, this depends only on the provided image data. A complete environment map can be constructed by acquiring additional images.
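A minimal sketch of this greedy linking loop is given below. It operates on a per-pair correlation matrix (a simplification of the full N × N descriptor-level matrix) and delegates the wide-baseline verification and the similarity-transform estimation to the routines described above, represented here by the hypothetical callables verify_match and estimate_link; the threshold value is likewise only an assumption.

```python
import itertools

def build_map_graph(submaps, ncc, verify_match, estimate_link, c_th=0.7):
    """Greedy sub-map linking from pairwise landmark correlations.

    ncc[i][j]     : best normalized cross-correlation between descriptors of
                    sub-maps i and j (simplified per-pair score)
    verify_match  : hypothetical wide-baseline verification routine
    estimate_link : hypothetical similarity-transform estimation (s, R, t)
    """
    edges = []
    # Tentative links, best correlation first.
    pairs = sorted(itertools.combinations(range(len(submaps)), 2),
                   key=lambda ij: -ncc[ij[0]][ij[1]])
    for i, j in pairs:
        if ncc[i][j] < c_th:                  # remaining candidates too weak
            break
        matches = verify_match(submaps[i], submaps[j])
        if matches:                           # confirmed: add an edge to the graph G
            edges.append((i, j, estimate_link(submaps[i], submaps[j], matches)))
    return edges
```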

7.2 Localization

In the following we define the pose of the robot to be the rotation and position of the single camera mounted on the robot. The pose is defined within a local sub-map by a rotation R and a translation t with reference to the origin of the coordinate system. The origin of the local sub-map coincides with the camera center of the sub-map's landmarks. This relation is illustrated in Figure 7.4.

The matched landmarks define 2D ↔ 2D point correspondences between the current view π_i and the sub-map's view π_0. 3D points can be created by projecting the 2D points onto the 3D planes of the sub-map. This yields 3D ↔ 2D correspondences from which the pose R, t can be computed. The pose estimation method proposed by Lu et al. [68] is known to be fast and robust, and it can deal with planar landmarks, i.e. 3D points located on a plane, which makes the algorithm suitable for our world representation. Pose estimation within such a local sub-map S gives the pose R_L, t_L within the local coordinate frame of the sub-map. The goal is now to compute the pose of the robot R_W, t_W within the global world coordinate system W on the basis of the pose estimates within local sub-maps S, each in a different local coordinate frame L. Each sub-map contains the necessary transform to the global coordinate frame, denoted T_L^W. T_L^W is a 4 × 4 point transform matrix: a 3D point X_L in the sub-map is transformed into the global coordinate system by X_W = T_L^W X_L. The following steps outline the pose estimation and the transfer of the pose into the global coordinate frame.

From region matching, 2D−2D point correspondences x ↔ x′ between a map image and the current image are retrieved; x are the points from the map image and x′ are the points from the current view. The next step is to create 3D−2D point correspondences out of the 2D−2D point correspondences. Projecting the points x onto the corresponding map plane Π_L gives the according 3D points X_L. The 3D points X_L are in the local coordinate frame of the sub-map L defined by the camera matrix P_L = K[R_L|t_L]. Next, the 2D points from the current image are normalized by the calibration matrix K, resulting in x̂′ = K⁻¹x′. Normalization resolves some numerical issues (see Section 7 in [44]). Now the 3D−2D correspondences X_L ↔ x̂′ necessary for the pose estimation algorithm have been set up. Pose estimation is performed and returns R_L, t_L, which is the pose of the camera for the current camera image in the coordinate frame of the local sub-map.


Figure 7.4: The pose of the robot is defined as rotation R and translation t from the origin of the local sub-map.

Transforming R_L, t_L into the global coordinate system using T_L^W is done as follows. First the camera center C_L is expressed explicitly as

C_L = −R_L^T t_L.  (7.16)

The camera center C_L can be transformed directly using the point transform T_L^W:

C_W = T_L^W C_L.  (7.17)

Then the rotation R_L is transformed into the world coordinate frame using only the rotational part R_L^W of the decomposition

T_L^W = [R_L^W | t_L^W] S_L^W,  (7.18)

R_W = R_L^W R_L.  (7.19)

Having computed R_W and C_W, the camera matrix in the world coordinate system P_W can be set up:

P_W = K [R_W | −R_W C_W],  (7.20)

where K is the camera calibration matrix.
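Equations 7.16-7.20 translate directly into a few lines of linear algebra. The sketch below assumes T_L^W is stored as a 4 × 4 homogeneous similarity matrix with the scale folded into its upper-left 3 × 3 block and a last row of (0, 0, 0, 1); variable names are illustrative and the code is not taken from the original implementation.

```python
import numpy as np

def local_pose_to_world(R_L, t_L, T_WL, K):
    """Lift a pose estimated in a sub-map frame into the world frame (eqs. 7.16-7.20)."""
    C_L = -R_L.T @ t_L                              # camera center, eq. 7.16
    C_W = (T_WL @ np.append(C_L, 1.0))[:3]          # eq. 7.17

    A = T_WL[:3, :3]                                # s * R_WL
    s = np.cbrt(np.linalg.det(A))                   # uniform scale factor
    R_WL = A / s                                    # rotational part, eq. 7.18
    R_W = R_WL @ R_L                                # eq. 7.19

    P_W = K @ np.hstack([R_W, (-R_W @ C_W).reshape(3, 1)])   # eq. 7.20
    return R_W, C_W, P_W
```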

7.2.1 Localization from a single landmark

The situation of pose estimation from a single landmark is illustrated in Figure 7.5. The 3D−2D point correspondences from within a single landmark are basically outlier-free, which is assured by the registration process. The 3D points are located exactly on a plane, because they are computed by projecting 2D points onto a scene plane; thus they do not contain noise. The 2D−2D point correspondences, however, are obtained by correlation-based matching and are therefore assumed to be distorted by noise. We assume that the 2D points within the landmark from the current view are distorted by Gaussian noise. In the following we check experimentally how this Gaussian noise influences the pose estimation accuracy.

Figure 7.5: Pose estimation from 3D ↔ 2D point correspondences. The 3D points are exact, the 2D points are assumed to be disturbed by Gaussian noise. The effect of the noise is that the rays from the point correspondences do not intersect exactly at the camera center.

The influence of Gaussian-noise-distorted 2D points is evaluated with synthetic data. Figure 7.6 shows the results of the Lu and Hager pose estimation for our special situation. Pose estimation for synthetic 3D−2D point correspondences has been performed with noise added to the 2D coordinates of the landmark points from the query image only. Gaussian noise of standard deviation σ = 0.1, 0.3 and 0.7 (in pixels) was added to the 2D points, and the experiment was repeated 1000 times. In Figure 7.6 each point denotes an estimated camera position. Figures 7.6(a-c) show the distribution of the camera position for Gaussian noise with σ = 0.1 in different views; the blue cross marks the camera position computed without noise. Noisy 2D coordinates create a characteristic distribution around the true position: perpendicular to the line connecting the true position and the 3D coordinates the points are spread out widely, while the depth distribution is small. This experiment shows that Gaussian noise significantly influences the pose estimation from a small number of point correspondences within a small image region (landmark).
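A simulation of this kind can be reproduced in a few lines. The sketch below uses OpenCV's iterative PnP solver as a stand-in for the Lu-Hager method and an invented planar landmark geometry, so it only illustrates the procedure, not the exact experiment reported here.

```python
import cv2
import numpy as np

rng = np.random.default_rng(0)
K = np.array([[800.0, 0.0, 400.0], [0.0, 800.0, 300.0], [0.0, 0.0, 1.0]])

# Synthetic planar landmark: a small grid of 3D points on the plane Z = 2.
xs, ys = np.meshgrid(np.linspace(-0.1, 0.1, 5), np.linspace(-0.1, 0.1, 5))
X = np.stack([xs.ravel(), ys.ravel(), np.full(xs.size, 2.0)], axis=1)

# Ground-truth view: camera at the origin looking along +Z.
x_true, _ = cv2.projectPoints(X, np.zeros(3), np.zeros(3), K, None)
x_true = x_true.reshape(-1, 2)

positions = []
for _ in range(1000):
    x_noisy = x_true + rng.normal(0.0, 0.3, x_true.shape)    # sigma = 0.3 px
    ok, rvec, tvec = cv2.solvePnP(X, x_noisy, K, None,
                                  flags=cv2.SOLVEPNP_ITERATIVE)
    if ok:
        R, _ = cv2.Rodrigues(rvec)
        positions.append((-R.T @ tvec).ravel())               # estimated camera center

positions = np.array(positions)
print("std of estimated camera centers:", positions.std(axis=0))
```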

Next we investigate a way to alleviate the influence of noise in the 2D−2D point correspondences. We analyze the effect of estimating the pose from small samples of correspondences instead of using all 3D ↔ 2D correspondences. In the following experiment the pose is estimated from 1000 random samples of size 5, 10 and 20 out of 56 correspondences obtained by correlation-based matching. The estimated poses are shown in Figure 7.7. Sub-sampling generates a distribution of poses around the pose computed with all available correspondences.


Figure 7.6: Effect of Gaussian noise added to the 2D points in pose estimation. The blue dot marks the noise-free pose estimate. (a-c) σ = 0.1 (d) σ = 0.3 (e) σ = 0.7

In fact, the pose estimated from all correspondences is optimal with respect to the point correspondences from within the landmark. However, we can evaluate the solutions by means of additional correspondences from outside our landmark. Let us assume a set of 3D ↔ 2D point correspondences distributed over the whole image, denoted Q ↔ q. A pose estimate R, t can then be used to compute 2D coordinates q̂ with

q̂ = [R|t] Q.  (7.21)


Figure 7.7: Effect of pose estimation from random samples. The blue dot marks the pose estimate using all available correspondences. (a) Sample size = 5 (b) Sample size = 10 (c) Sample size = 20

A re-projection error ɛ can be defined as the summed distance between q and q̂,

ɛ = ∑_i ‖q_i − q̂_i‖.  (7.22)

Given multiple pose estimates, the most accurate one can be identified as the one with the smallest re-projection error ɛ. We analyzed the re-projection error for the pose estimates shown in Figure 7.7. The re-projection error is coded in the point color: a dark green coded pose has a re-projection error smaller than the median re-projection error, while for a light green coded pose the re-projection error is larger than the median error. The results are shown in Figure 7.8. The pose estimate using all available point correspondences is marked as a blue dot; the pose estimate with the smallest re-projection error is the red dot. It is evident that the best pose estimate does not coincide with the all-points solution. Furthermore, the color coding reveals the area in 3D space where the best pose estimates are located. The figure also shows that the distribution gets more compact when bigger sample sets are used; a small sample size will create widely spread-out hypotheses.
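The hypothesis scoring used here amounts to projecting the additional 3D points with each candidate pose and summing the 2D residuals (eq. 7.22). A minimal NumPy sketch with illustrative names is given below; it projects with the calibration matrix K, so if the 2D points have already been normalized by K, the identity matrix can be passed instead.

```python
import numpy as np

def reprojection_error(R, t, K, Q, q):
    """Sum of 2D distances between observed points q and projections of Q (eq. 7.22)."""
    Q_cam = (R @ Q.T + t.reshape(3, 1)).T           # 3D points in the camera frame
    proj = (K @ Q_cam.T).T
    q_hat = proj[:, :2] / proj[:, 2:3]              # perspective division
    return np.linalg.norm(q - q_hat, axis=1).sum()

def best_hypothesis(hypotheses, K, Q, q):
    """Select the pose (R, t) with the smallest re-projection error."""
    return min(hypotheses, key=lambda h: reprojection_error(*h, K, Q, q))
```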

The conclusion is that the all-points solution does not guarantee the best pose estimate. A sub-sampling scheme computing multiple pose estimates will in general also contain a better pose estimate than the all-points solution, and by scoring the different hypotheses with the re-projection error the best hypothesis can be selected. However, when dealing with a single landmark match, additional 3D−2D point correspondences are not available to compute a re-projection error. A scoring method is needed for which additional landmark matches are not necessary; in the next section we introduce such a method.

Figure 7.8: Pose estimates color coded with the re-projection error ɛ. Dark green dots: ɛ ≤ median, light green dots: ɛ ≥ median. The blue dot marks the pose estimate using all available correspondences. The red dot marks the pose estimate with the smallest re-projection error. (a) Sample size = 5 (b) Sample size = 10 (c) Sample size = 20. The plots show the translation components t_x, t_y, t_z.

7.2.2 The local plane score

In the previous section we explained how additional 3D−2D point correspondences can be used to score a pose hypothesis. In the absence of additional landmark matches, however, no such point correspondences are available. In the following we introduce a new measure, the local plane score (lp-score, Π_l-score). The lp-score is based on information implicitly stored in the piece-wise planar structure of the world map. It will be shown that with the lp-score good hypotheses can be selected, similarly to the re-projection error ɛ.

We use the fact that each landmark is a small planar patch Π_l and in most cases is part of a bigger plane structure Π (see Figure 7.9 for an illustration). Given a pose hypothesis P_h it is possible to create 2D ↔ 2D point correspondences for the complete extent of the plane Π. We call Π the support area of Π_l. 3D ↔ 2D point correspondences within the support area can be created by projecting image locations onto the 3D plane Π.


Figure 7.9: The landmark (green plane) is part of a bigger planar structure (blue plane). Given a pose estimate C_i, 2D−2D point correspondences for the bigger structure can be computed between π_0 and π_i. The transfer error that the homography estimated from the landmark point set achieves on the additional points defines a quality measure for the pose estimate.

The created 3D points can then be projected into the image plane of the current view I, which is subject to pose estimation, using the pose hypothesis P_h. The 2D ↔ 2D point correspondences Q = (q, q′) created in this way are noise-free and exact. We want to stress that these point correspondences are not necessarily located in the field of view of the current image I; the correspondence is determined purely geometrically based on the pose hypothesis P_h. From landmark matching we already have a set of 2D ↔ 2D point correspondences Q_l = (q_l, q′_l) within the landmark. For Q_l a homography H_l can be computed which relates q_l and q′_l by

q′_l = H_l q_l.  (7.23)

This relation must also hold for the point correspondences Q of the support region if they are computed from a correct pose hypothesis. Let q̂′ be the 2D points obtained from an incorrect hypothesis with

q̂′ = H_l q.  (7.24)

Then q̂′ ≠ q′ and the difference can be quantified with the transfer error ɛ_t (see Figure 7.9 for an illustration). The transfer error ɛ_t is defined as

ɛ_t = ∑_i ‖q′_i − q̂′_i‖.  (7.25)

We denote the transfer error ɛ_t as our local plane score (lp-score). The lp-score is computed for every pose estimate. Figure 7.10 shows some results: each point is a pose estimate and the lp-score is coded in the point color. The poses corresponding to the n smallest lp-scores are coded dark green, the remaining poses light green; n is set to 10% of the number of all poses. The pose estimate using all available point correspondences is marked as a blue dot, the pose estimate with the smallest lp-score as a red dot. The results show that the lp-score is consistent with the results from the re-projection error but can be computed from a single landmark and map information.
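The lp-score therefore only needs the landmark homography and the synthetic support-region correspondences generated from the map plane and the pose hypothesis. A hedged sketch (OpenCV only for the homography fit, otherwise plain NumPy, with illustrative names) is:

```python
import cv2
import numpy as np

def lp_score(H_l, q, q_prime):
    """Local plane score: homography transfer error over the support region (eq. 7.25).

    H_l       : 3x3 homography fitted to the landmark correspondences (eq. 7.23)
    q, q_prime: (n, 2) synthetic support-region correspondences generated from
                the map plane and the pose hypothesis
    """
    q_h = np.hstack([q, np.ones((len(q), 1))])       # homogeneous 2D points
    mapped = (H_l @ q_h.T).T
    q_hat = mapped[:, :2] / mapped[:, 2:3]           # eq. 7.24
    return np.linalg.norm(q_prime - q_hat, axis=1).sum()

# The landmark homography itself can be fitted with a (normalized) DLT, e.g.:
# H_l, _ = cv2.findHomography(q_l, q_l_prime, 0)     # 0 = plain least squares
```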

Figure 7.10: Pose estimates color coded with the lp-score. Dark green dots: ɛ ≤ median, light green dots: ɛ ≥ median. The blue dot marks the pose estimate using all available correspondences. The red dot marks the pose estimate with the smallest lp-score. (a) Sample size = 5 (b) Sample size = 10 (c) Sample size = 20. The plots show the translation components t_x, t_y, t_z.

7.2.3 Algorithms

In the following we describe two algorithms for pose estimation from a single landmark. The first one uses an epipolar criterion to select the best solution; this method can only be applied if other landmark matches are available to compute the epipolar distance. The second one uses the lp-score. Both algorithms are very similar, differing only in the scoring function.

Epipolar criterion method

Let us start with landmark extraction and matching. Similar to the sub-map linking step, MSER regions are extracted from the image of the current view. A LAF is computed for each region and region normalization is performed. A SIFT descriptor is computed from the normalized image patch. Landmark correspondences are detected with the method of Section 6.1. The subsequent pose estimation step requires planar landmarks, thus the planarity has to be checked for each image region; non-planar landmarks are discarded. During map building, images are segmented into piece-wise planar areas: if a landmark is located within the detected planar areas, its planarity is confirmed. The region matching algorithm establishes point correspondences within the support region of the feature (the area the SIFT descriptor is computed from). For each landmark we now compute a set of pose hypotheses Q. For this, 3D−2D point correspondences are first computed from the 2D−2D point correspondences by projection onto the 3D plane. Multiple hypotheses are generated by drawing smaller samples (of size p) from the whole set of 3D−2D correspondences. For each of the n random sub-sets S the pose is computed as follows. First the 5-point algorithm [83] is used to create an initial rotation R from the 2D−2D point correspondences. With this initial rotation R the pose is computed from the 3D−2D correspondences using the iterative method of Lu and Hager [68]. The pose hypothesis is added to Q if it represents a valid configuration, i.e. the 3D points are in front of the camera. This is repeated for all matched landmarks. Once all pose hypotheses have been created, the epipolar distance is computed for each hypothesis and the pose with the smallest epipolar distance is selected. The algorithm is outlined in Algorithm 5.

Algorithm 5 Pose estimation from a single planar landmark (solution selection with epipolar criterion)
Q ← [] {list holding possible solutions R, t for pose estimation}
for all region correspondences do
  project 2D points onto the plane to create 3D-2D matches
  for i = 1 to n do
    select a random subset S of size p from the 3D-2D correspondences
    compute R, t from S using the 5-point algorithm
    3D-2D pose estimation with initial rotation R
    add pose (R, t) to Q if the 3D points are located in front of the camera
  end for
end for
for i = 1 to length(Q) do
  calculate the mean epipolar distance using R, t from Q(i) on the 2D-2D correspondences over all matching regions
end for
return R, t with minimal epipolar distance


lp-score method

The lp-score method is very similar to the previous method; landmark extraction and matching are identical, and the reader is referred to the previous section. Again, for each landmark we compute a set of pose hypotheses Q. For this, 3D−2D point correspondences are first computed from the 2D−2D point correspondences by projection onto the 3D plane. Multiple hypotheses are generated by drawing smaller samples (of size p) from the whole set of 3D−2D correspondences. For each of the n random sub-sets S the pose is computed as follows. First the 5-point algorithm [83] is used to create an initial rotation R from the 2D−2D point correspondences. With this initial rotation R the pose is computed from the 3D−2D correspondences using the iterative method of Lu and Hager [68]. The pose hypothesis is added to Q if it represents a valid configuration, i.e. the 3D points are in front of the camera. In addition, a homography is estimated from the 2D−2D sub-sample; the homographies for the valid pose estimates are maintained in a list H. This is repeated for all matched landmarks. Once all pose hypotheses have been created, the lp-score is computed for each hypothesis using the corresponding homography. If there is only a single landmark match, the pose with the smallest lp-score is selected as the resulting pose. In the case of multiple landmark matches the resulting pose is found by clustering: a subset Q̄ is selected containing the k entries of Q with the smallest lp-scores, so that Q̄ contains pose estimates from various landmarks. Clustering is then applied to the translational part of Q̄ to find the dominant cluster, and the resulting pose is the one closest to the center of the dominant cluster. This approach was found to be more robust in the case of multiple landmark matches. The algorithm is outlined in Algorithm 6.

Algorithm 6 Pose estimation from a single planar landmark (solution selection with lp-score)
Q ← [] {list holding possible solutions R, t for pose estimation}
H ← [] {list holding a homography for each pose in Q}
for all region correspondences do
  project 2D points onto the plane to create 3D-2D matches
  for i = 1 to n do
    select a random subset S of size p from the 3D-2D correspondences
    compute R, t from S using the 5-point algorithm
    3D-2D pose estimation with initial rotation R
    compute homography h from the 2D-2D matches corresponding to subset S using the normalized DLT
    add pose (R, t) to Q and homography h to H if the 3D points are located in front of the camera
  end for
end for
for all entries in H do
  calculate the median transfer error of H(i) on the 2D-2D correspondences generated from the map plane
end for
select the subset Q̄ of Q containing the k entries with the smallest transfer error
compute clustering of the poses in Q̄ (use translation only)
return R, t of the pose with median position within the dominant cluster


Chapter 8

Map building and localization experiments

In map building and localization all methods presented so far work together. The landmarks used in map building are chosen based on the evaluation described in Chapter 4. Map building also uses the wide-baseline methods for region matching and scene reconstruction described in Chapter 6. Localization likewise uses the wide-baseline region matching and the local detectors. Thus the experiments conducted in this chapter do not only evaluate the map building and localization methods proposed in the previous chapter but implicitly also measure the performance of the wide-baseline methods described previously. Localization will only be successful if the localization algorithm can work with reliable landmarks, and the same holds for the map building algorithm. Thus successful localization and map building results do not only confirm the benefits of localization using a piece-wise planar world map but also confirm the capabilities of the wide-baseline region matching, the piece-wise planar scene reconstruction, and the validity of the local detector evaluation results.

The experiments in this chapter are carried out using our mobile robot platform, the ActivMedia PeopleBot¹. They include map building experiments in a room-size office environment as well as a large-scale mapping of a hallway. The "Office" environment is rather small but nevertheless represents a typical localization scenario; accurate localization results would be expected for such an environment. The "Hallway" scenario is much more challenging: it has much bigger extents and contains less useful texture for extracting landmarks. Nevertheless we will demonstrate successful map building, where the overall map consists of 13 sub-maps.

The mapping results will be compared to a map created by elaborate laser mapping. The laser map will also act as ground truth for the localization experiment: the robot positions computed by visual localization are compared to the positions from laser-based localization. The experiments will demonstrate that the proposed localization method achieves results competitive with the current state-of-the-art methods proposed in [96] and [56] while requiring only a single landmark match, and is therefore superior to these methods.

¹ http://www.activrobots.com/robots/peoplebot.html


8.1 Experimental setup

The experimental setup consists of an ActivMedia PeopleBot equipped with a range of sensors including a laser range finder and an additional single camera. In the following the components are described in detail.

8.1.1 ActivMedia PeopleBot

The mobile robot used for these experiments is an ActivMedia PeopleBot. The robot comes fully equipped and operational with extensive localization and navigation software, except for visual localization and navigation. We chose to work with an already elaborate robotics platform in order to focus solely on visual localization. The PeopleBot has a size of 47 cm × 38 cm × 112 cm. It features a differential drive and is able to rotate in place. The safe maximum speed is about 0.8 m/s. The onboard sensors include wheel encoders, range-finding sonar and infrared sensors. The range-finding sonar consists of 24 ultrasonic transducers arranged to provide 360-degree coverage; the sonar range extends from 15 cm to 7 m. Due to the height of the robot, it also includes infrared sensors pointing in a forward-upward direction to detect obstacles above the sonar array. The combination of sonar and infrared sensors already allows safe navigation in an unknown environment. In addition the robot is equipped with a laser range finder (LRF), which allows precise localization and map building. One of the main reasons to use the PeopleBot is its height: it allows the camera to be mounted at a height advantageous for vision-based localization. Figure 8.1 shows the mobile robot and its sensor configuration.

Figure 8.1: The ActivMedia PeopleBot and its sensor configuration (color camera, laser range finder, infrared sensor, ultrasonic sensors) used for our localization and map building experiments.


8.1.2 Laser range finder

The mobile robot is equipped with a SICK LMS-200 laser range finder (LRF). The LRF provides range measurements over a field-of-view of 180° with an angular resolution of 0.5°; thus in our configuration we get 360 range measurements per laser reading. The range measurements are accurate up to +/- 15 mm for a range of 1 m to 8 m. The LRF is mounted at a height of about 30 cm. The robot comes with a laser localization and navigation system, which allows the creation of accurate maps using the LRF. The software ScanStudio stitches individual laser measurements together into a global map and performs a final global registration. The maps and robot positions created in this way are used as ground truth for the visual localization experiments.

8.1.3 Camera setup

For our experiments the mobile robot has been equipped with a 2-megapixel digital camera, an LU-205 from Lumenera. The camera features a CMOS sensor and is able to capture color images with a maximal resolution of 1600 × 1200 pixels; the achieved frame rate at this resolution is 15 frames per second. For our actual experiments we used images with a resolution of 800 × 600 by using the sub-sampling option of the camera.

The camera is equipped with a 4.8 mm wide-angle lens. The wide-angle lens has a field-of-view of about 90° but introduces heavy radial distortion. Thus the captured images have to be re-sampled to remove the radial distortion before further processing. Figure 8.2(a) shows an image before removal of the radial distortion, Figure 8.2(b) the re-sampled image with the radial distortion removed. Re-sampling is done with bilinear interpolation. The radial distortion as well as the principal point and the focal length of the camera setup are estimated using the calibration toolbox of Bouguet².
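Given intrinsics and distortion coefficients from such a calibration, the undistortion step can be reproduced, for example, with OpenCV. The numeric values and file names below are placeholders, not the calibration of our camera.

```python
import cv2
import numpy as np

# Placeholder intrinsics / distortion coefficients (not the actual calibration).
K = np.array([[420.0, 0.0, 400.0],
              [0.0, 420.0, 300.0],
              [0.0, 0.0, 1.0]])
dist = np.array([-0.35, 0.12, 0.0, 0.0, 0.0])   # k1, k2, p1, p2, k3

img = cv2.imread("frame.png")
undistorted = cv2.undistort(img, K, dist)        # re-samples with bilinear interpolation
cv2.imwrite("frame_undistorted.png", undistorted)
```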

Figure 8.2: (a) Original image of the camera setup using a wide-angle lens, showing heavy radial distortion. (b) Re-sampled image with the radial distortion removed.

8.2 Map building experiments

In the next two sections results for the map building algorithm are shown. One experiment was carried out in an office room, another in a hallway of the university building. The experiments will demonstrate the capability of building piece-wise planar maps using the algorithm presented in the previous chapter.

² Camera calibration toolbox for Matlab (available at www.vision.caltech.edu/bouguetj/)

8.2.1 Office environment

The first experiment to be discussed is visual mapping of an office room of about 12m² in size. For this, a mobile robot was driven through the office capturing images and acquiring range measurements with a laser range finder. The goal of this experiment is to construct a piece-wise planar world map from the image data and compare it to the map created from the laser measurements. The map created from the laser range finder readings acts as ground truth. The software ScanStudio³ has been used to create a floor plan of the office room. Figure 8.3 shows the map built from the laser range data as well as the path of the robot run. The robot positions from where laser readings were taken are marked with blue circles. The path of the robot is drawn in black and interpolated between the laser readings.

Figure 8.3: Test environment "Office": floor plan created from laser range finder data. Circles mark the positions of laser readings. The robot's path is drawn in black and interpolated between the laser readings.

³ ScanStudio is available from http://www.activrobots.com/


During the robot run 951 images were taken at a frame rate of 3 frames per second. To reduce the number of input images for the map building algorithm, a manual preselection of the images has been performed: images with too small a baseline for stereo reconstruction have been removed. Map building has been performed as described in the previous section. First, sub-maps are identified in the image stack. Second, each sub-map is reconstructed separately. Finally, the reconstructed sub-maps are linked together to form a global world map. For the "Office" environment 5 sub-maps were identified and reconstructed to build a world map. Figure 8.4 shows the short-baseline image pairs used for sub-map reconstruction. Figure 8.5 shows the individual sub-map reconstructions. The top row of the images shows the reconstructed planar structures of the scene which contain landmarks; the bottom row shows the extracted planar landmarks only. Map building does not create an entire 3D reconstruction; only the parts of a scene useful for robot localization are reconstructed and extracted. The 5 sub-maps together form the complete world map of the "Office" environment. Linking works by searching for landmark correspondences within the different sub-maps. With a single landmark correspondence the similarity transform between two sub-maps can be computed and the sub-maps can be linked; only a small overlap is needed to link two sub-maps. The linked world map is shown in Figure 8.6(a). Figure 8.6(b) shows details of the left corner. Table 8.1 summarizes the intrinsics of the created world map: the number of sub-maps, the number of planes, the number of landmarks and the metric size.
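As a rough illustration of the linking step, the sketch below estimates a similarity transform (scale, rotation, translation) between two sub-maps from the 3D points of one corresponding landmark using the closed-form Umeyama/Procrustes alignment. The variable names are illustrative and this is a generic stand-in, not the exact procedure used in the thesis.

```python
import numpy as np

def similarity_from_correspondence(P, Q):
    """Estimate s, R, t with Q ~ s * R @ P + t from matched 3D points.

    P, Q: (N, 3) arrays holding the 3D points of one landmark expressed in
    the coordinate frames of the two sub-maps to be linked.
    """
    mu_p, mu_q = P.mean(axis=0), Q.mean(axis=0)
    Pc, Qc = P - mu_p, Q - mu_q
    # Cross-covariance and its SVD give the rotation (Umeyama / Procrustes).
    U, S, Vt = np.linalg.svd(Qc.T @ Pc)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # guard against reflections
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / np.sum(Pc ** 2)           # isotropic scale
    t = mu_q - s * R @ mu_p
    return s, R, t

# Usage: transform every point of sub-map A into the frame of sub-map B.
# s, R, t = similarity_from_correspondence(landmark_pts_A, landmark_pts_B)
# linked_pts = (s * (R @ submap_A_points.T)).T + t
```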

To analyze the quality of the world map we compare the visual map with the laser map. As the laser map is available as a floor plan only, we compare a bird's eye view of the visual map with the laser map. The comparison of the visual map and the laser map is shown in Figure 8.7. The laser map is shown in blue and the visual map is shown in green. The alignment of both maps (translation, rotation and scale) has been done manually. The laser map has been acquired at a height of about 30cm above floor level. The visual map in most cases represents scene structures at different heights, thus the room outlines of the visual map and the laser map do not necessarily coincide. This is particularly visible on the right part of the map, where the wall shows an indentation at a height of about 1m: the laser range finder measures the wall at 30cm height while the visual method reconstructed the poster in the indentation. The bird's eye view of the world map reveals that the rectilinear structure got accurately reconstructed. Furthermore, the piece-wise planar representation allows an accurate alignment of the plane structures from different sub-maps. The top wall is represented by 3 different sub-maps and the linking process manages an exactly collinear representation without a gap.

"Office" map intrinsics
Number of sub-maps        5
Number of planes         18
Number of landmarks    1294
Map size [m]          4 × 3

Table 8.1: Intrinsics of the "Office" map.


Figure 8.4: Short-baseline image pairs (a)-(e) used for sub-map reconstruction.


Figure 8.5: Individual sub-maps (a)-(e) of the "Office" environment. The top row shows the reconstructed planar structures containing landmarks; the bottom row shows the extracted planar landmarks only.


Figure 8.6: (a) Piece-wise planar world map of the "Office" environment consisting of 5 linked sub-maps (3D view). (b) Enlarged detail.


Figure 8.7: Laser map (blue) and bird's eye view of the visual map (green) overlaid.


8.2.2 Hallway environment

The second mapping experiment has been conducted in a hallway of our university building. The "Hallway" environment is very challenging. First, the area to be mapped is large, 30m × 8m. Second, the environment contains many un-textured areas. A total of 2293 images was acquired over multiple robot runs, performed on different days. A selection of the whole image set was used to create the world map. For map building, sub-maps were first identified within the image set. In a next step each identified sub-map was reconstructed. The last map building step was the linking of the sub-maps. Map building resulted in the reconstruction and linking of 13 sub-maps. Table 8.2 summarizes the details of the "Hallway" map. Figure 8.9 shows two views of the piece-wise planar world map in 3D. A bird's eye view of the map is shown in Figure 8.8. Each sub-map is shown in a different color. The color coding reveals the sub-map structure of the whole map and shows that the sub-maps differ in size. The upper right part of the map contains a high number of small sub-maps; in that area the sub-maps overlap substantially. The sub-maps in the lower part are bigger, and there sub-map linking has to be performed over wide baselines. The upper and lower parts of the map are linked only by a single landmark contained in the upper left sub-map (brown). Despite being linked by a single landmark only, both parts are nicely parallel, which demonstrates the capabilities of the sub-map reconstruction and linking methods.

"Hallway" map intrinsics
Number of sub-maps       13
Number of planes         37
Number of landmarks    2093
Map size [m]         30 × 8

Table 8.2: Intrinsics of the "Hallway" map.


Figure 8.8: Bird's eye view of the visual map of the "Hallway" environment. The color coding shows the individual sub-maps.


Figure 8.9: (a) Piece-wise planar world map of the "Hallway" environment consisting of 13 linked sub-maps (3D view). (b) Enlarged detail.


8.3 Localization experiments

This section shows the results of different localization experiments within the "Office" and the "Hallway" environments, demonstrating the capabilities and limitations of the proposed approach.

8.3.1 Localization accuracy

To assess the accuracy of the proposed localization method, the pose estimates are compared to pose estimates obtained with the laser range finder, which act as ground truth. As test scenario the "Office" environment has been chosen. The robot was moved to 18 distinct positions (L1-L18) in the room where laser measurements were taken and images were captured. The laser range finder only reports a 2D position and the heading of the robot, while the proposed localization method produces a full 3D position. Thus only the x and y components of the position and only one rotation angle of the heading can be compared to the laser results. For comparison, a position error is calculated as the Euclidean distance between the corresponding laser position and visual position. In addition, a rotation error between the robot heading from the laser and from the visual localization is computed as the absolute difference of the two headings. Table 8.3 shows the average error, the median error, the minimal error, the maximal error and the standard deviation over all locations for the position and rotation errors. Figure 8.10 illustrates the localization results using the epipolar criteria algorithm. The tested positions are labelled L1 to L18. Blue circles mark the pose estimates from the laser localization; green circles mark the pose estimates from the visual localization. Visual pose estimation failed for the positions L4 and L5, the top left positions. Both positions are already very close to the wall, and this area of the wall is mainly un-textured: no landmarks could be detected in the images from this viewpoint. The laser localization, however, had no problems with these positions. Localization has also been performed using the lp-score method. The localization results achieved with the lp-score are almost identical to those of the epipolar criteria; localization for the positions L4 and L5 also failed for the lp-score method. The difference lies only in the accuracy of the pose estimates. Figure 8.11 shows the localization results using the lp-score algorithm.
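For reference, the position and rotation errors summarized in Table 8.3 can be computed as follows. This is only a minimal sketch assuming the poses are given as (x, y, heading) tuples; the array names are hypothetical, and the angular difference is additionally wrapped to the shorter arc, which the thesis text does not state explicitly.

```python
import numpy as np

def pose_errors(laser_poses, visual_poses):
    """laser_poses, visual_poses: (N, 3) arrays of (x [m], y [m], heading [deg])."""
    dxy = laser_poses[:, :2] - visual_poses[:, :2]
    pos_err = np.linalg.norm(dxy, axis=1)                 # Euclidean distance per location
    dtheta = np.abs(laser_poses[:, 2] - visual_poses[:, 2])
    rot_err = np.minimum(dtheta, 360.0 - dtheta)          # absolute heading difference, wrapped
    summary = {name: (fn(pos_err), fn(rot_err))
               for name, fn in [("average", np.mean), ("median", np.median),
                                ("min", np.min), ("max", np.max), ("std", np.std)]}
    return pos_err, rot_err, summary
```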

The visual position estimates show only a small deviation from the laser positions. The position estimation with the epipolar criteria algorithm is accurate up to an average error of 0.061m; the median positional error is 0.059m. The average rotational error is 3.93°. The lp-score algorithm shows slightly higher errors but gives in general similar results. Note that the lp-score algorithm uses no additional landmarks for hypothesis selection. Both algorithms achieve an average epipolar error of about 0.9 pixels on additional landmark matches. The average landmark size used for pose estimation is for both algorithms around 800 pixels. The achieved localization accuracy is competitive with the accuracies reported for the methods of [96] and [56], but requires only a single landmark match of around 800 pixels in size.


                                  epipolar criteria   lp-score
position error (average) [m]            0.061           0.078
position error (std.) [m]               0.031           0.037
position error (min) [m]                0.018           0.027
position error (max) [m]                0.120           0.134
position error (median) [m]             0.059           0.084
rotation error (average) [°]            3.93            4.09
rotation error (std.) [°]               1.88            2.27
rotation error (min) [°]                1.68            0.84
rotation error (max) [°]                7.80            7.39
rotation error (median) [°]             3.92            4.39
avg. epipolar distance [pixel]          0.89            0.92
avg. landmark area [pixel]              876             813

Table 8.3: Positional and rotational error of vision-based localization compared to laser ground truth.

Figure 8.10: Localization experiment with the epipolar criteria. Blue circles mark the laser ground truth; green circles mark the positions estimated by visual localization. For L3 and L4 no landmark matches for visual localization could be detected.


Figure 8.11: Localization experiment with the lp-score. Blue circles mark the laser ground truth; green circles mark the positions estimated by visual localization. For L3 and L4 no landmark matches for visual localization could be detected.


8.3.2 Path reconstruction

The main application of global localization is to compute an initial robot position to initialize a probabilistic SLAM framework, e.g. [78]. The robot position is then usually maintained by an extended Kalman filter. The Kalman filter includes the knowledge of previous robot positions and also previous speed and heading estimates. Propagating these values probabilistically results in a smooth reconstruction of the traversed path.

In the absence of a SLAM framework the traversed path has to be reconstructed by global localization alone, and each position estimate is then computed independently. Figure 8.12 shows a part of the robot's path through the "Office" environment reconstructed by global localization. The path consists of 204 independent pose estimates. The pose estimates are marked with black dots, together forming the traversed path. The red dots mark gross outliers. Table 8.4 summarizes the corresponding numbers. Of 204 total pose estimates, 21 showed a large deviation from the original path (from laser localization) and are thus marked as gross outliers. Such gross outliers would be detected by the Kalman filtering. The average epipolar distance, computed as a measure of accuracy, is 1.45 pixels.

Figure 8.13 shows the reconstruction of a robot's path in the "Hallway" environment; in fact, three different sections of a robot's path are shown. In total 124 positions have been computed. The corresponding paths are drawn as black dots. For this scenario no laser ground truth is available, thus outlier detection was not applied.

            #frames (positions)   #correct (%)    #bad estimates (%)   avg. epipolar distance (std. dev.) [pixel]
Office              204           183 (89.7%)         21 (10.3%)                  1.45 (1.0)
Hallway             124               -                   -                       1.49 (0.74)

Table 8.4: Number of correct and bad pose estimates for the path reconstruction experiment. The accuracy of the pose estimates is expressed by the epipolar distance.


Figure 8.12: Reconstruction of a robot's path through the "Office" environment by global localization. Each position of the path is estimated independently. The path is visualized by the black dots; red dots mark gross outliers.


Figure 8.13: Reconstruction of a robot's path (3 different sections) through the "Hallway" environment by global localization. Each position of the path is estimated independently. The path is visualized by the black dots.


8.3.3 Evaluation of the sub-sampling scheme

This experiment investigates the increase in accuracy gained by the proposed sub-sampling scheme compared to the standard application of the Lu et al. [68] pose estimation using all 3D-2D correspondences, denoted as the LH method. As a measure of accuracy the epipolar distance between 2D image points and epipolar lines is used. Each pose is computed from a single landmark only. The other detected landmark matches are then used to assess the quality of the computed pose by computing the epipolar distance between the 2D point correspondences and the epipolar lines.

The proposed sub-sampling scheme (see Chapter 7) creates n subsets of size p from the point correspondences of a region. That means every region generates n solutions, of which the best is selected. We will show in this experiment that in most cases there exists a subset which produces a better solution than computing the pose from all correspondences. We investigate the two hypothesis selection strategies proposed in the previous chapter, the epipolar criteria and the lp-score method. The LH method uses all correspondences of a landmark for pose estimation. The results of the three methods are compared by means of the epipolar distance measure.
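To make the evaluated procedure concrete, the sketch below shows the generic structure of the sub-sampling scheme with hypothesis selection by the epipolar criterion: n random subsets of size p are drawn from the 3D-2D correspondences of the single landmark, a pose is computed from each subset, and the hypothesis with the smallest epipolar distance on the remaining landmark matches is kept. The functions estimate_pose_3d2d and epipolar_distance are placeholders standing in for the pose estimation of Lu et al. [68] and the distance measure described above; this is not the thesis implementation.

```python
import numpy as np

def select_pose_by_subsampling(pts3d, pts2d, check_matches, estimate_pose_3d2d,
                               epipolar_distance, n=50, p=10, rng=None):
    """Return the pose hypothesis with the smallest epipolar error.

    pts3d, pts2d   : 3D-2D correspondences of the single landmark used for pose estimation
    check_matches  : additional landmark matches used only to score each hypothesis
    estimate_pose_3d2d(pts3d, pts2d) -> pose (e.g. R, t)
    epipolar_distance(pose, check_matches) -> mean point-to-epipolar-line distance [pixel]
    """
    rng = rng or np.random.default_rng()
    best_pose, best_err = None, np.inf
    for _ in range(n):                                       # n hypotheses ...
        idx = rng.choice(len(pts3d), size=p, replace=False)  # ... each from a subset of size p
        pose = estimate_pose_3d2d(pts3d[idx], pts2d[idx])
        err = epipolar_distance(pose, check_matches)         # epipolar criterion on the other matches
        if err < best_err:
            best_pose, best_err = pose, err
    return best_pose, best_err
```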

For our algorithm n was set to 50 and p was set to 10. Figure 8.14 and Table 8.5 summarize the results. We calculated the poses for 3 different sequences (all part of the robot's path through the office). Each sequence uses a different sub-map for localization. The table shows the achieved average epipolar distance (of the best solutions), the minimal and maximal epipolar distances, and the standard deviation. It is evident that our proposed algorithm achieves a smaller epipolar error than the LH method. The sub-sampling scheme with the epipolar criteria achieved the smallest epipolar distances. The lp-score shows slightly higher epipolar distances but is still better than the LH method. The result is even more impressive when looking at individual frames: e.g., for frame 15 in sequence 2 the epipolar distance achieved by the LH method was 13.15 pixels while our method came down to 0.75 pixels, an order of magnitude improvement. The differences between both methods are illustrated in Figure 8.15, where the positions of a forward motion sequence are computed; all the camera positions should therefore be aligned in a row. The result with the sub-sampling method (epipolar criteria) shows only a little deviation from a straight line. The results obtained with the LH method, however, show large deviations; one can hardly see that this should be the path of a strict forward motion.


Figure 8.14: Comparison of the average epipolar distance [pixel] for our sub-sampling method (lp-score and epipolar criteria) and the LH method for 3 different image sequences (S1, S2, S3) and all sequences together. Our sub-sampling methods produce a smaller error than the LH method.

                                     avg. epidist   min. epidist   max. epidist   std. epidist
                                        [pixel]        [pixel]        [pixel]        [pixel]
sequence 1 (31 frames)
sub-sampling (epipolar criteria)         1.07           0.87           1.40           0.12
sub-sampling (lp-score)                  1.40           0.94           1.86           0.24
LH method                                1.47           0.98           2.55           0.31

sequence 2 (21 frames)
sub-sampling (epipolar criteria)         1.42           1.03           1.97           0.28
sub-sampling (lp-score)                  1.55           1.10           2.16           0.31
LH method                                5.60           0.43          28.33           8.67

sequence 3 (9 frames)
sub-sampling (epipolar criteria)         1.09           0.90           1.33           0.14
sub-sampling (lp-score)                  1.28           1.02           1.81           0.24
LH method                                1.53           1.05           1.93           0.29

all sequences (61 frames)
sub-sampling (epipolar criteria)         1.19           0.87           1.97           0.25
sub-sampling (lp-score)                  1.44           0.94           2.16           0.28
LH method                                2.90           0.43          28.33           5.39

Table 8.5: Epipolar distances for pose estimation using the LH method and the sub-sampling method (lp-score and epipolar criteria).


Figure 8.15: Poses computed by global localization for a forward motion sequence using (a) the sub-sampling algorithm and (b) the LH method. The poses estimated by the LH method show large deviations from a strict forward motion.


8.4 Summary

The results gained by the experiments in this chapter are very satisfying. Map building as well as localization experiments with the proposed methods were carried out successfully. Map building has been demonstrated for two different scenarios, the "Office" scenario and the "Hallway" scenario. The "Office" scenario was used to demonstrate the accuracy of the visual map building by comparing it to a map created by a laser range finder. The "Hallway" scenario, on the other hand, is much more challenging; it demonstrates the capability of the method to build maps on a larger scale. The "Hallway" map consists of 13 different sub-maps.

The localization experiments confirm that a piece-wise planar world map is a beneficial world representation for visual global localization. The 3D plane parameters for each landmark incorporated in the piece-wise planar map allow pose estimation from only a single landmark match. The localization experiments in the "Office" environment reveal that the achieved accuracy is competitive with the current state-of-the-art methods [96] and [56]. A detailed analysis of the proposed sub-sampling scheme and the lp-score quantifies the accuracy improvements by measuring the epipolar distance. The results show a significant improvement by sub-sampling and hypothesis selection.


Chapter 9

Conclusion

More than 25 years have passed since Moravec [80] presented the first astonishing results in visual robot localization, but progress has been slower than people expected from the advances of the early days. Visual robot localization turned out to be a challenging task and research is still going on. A lot of people have already participated in this challenge and their research provided bits and pieces towards a reliable and robust visual localization system for mobile robots. Systems like those reported in [96] and [56] are already on the edge of fully operational and reliable visual localization and mapping for constrained indoor environments. However, even if today's systems work reliably in 95% of the cases, 5% are still missing, and the last 5% may represent a bigger challenge than the already achieved 95%. It is reasonable to expect that closing the last 5% gap will require a set of different and specialized methods and algorithms to be developed and integrated into current systems, keeping lots of researchers busy.

This thesis focused on the development of such a specialized method for visual global localization. Global localization is a hard problem and very important in mobile robotics. Global localization is a key technique for resolving the following situations:

• Initial position after power on
• Kidnapped robot problem
• Recovery from failure
• Loop closing
• Homing

Analyzing the current state-of-the-art in visual global localization revealed the deficiencies of the current approaches and showed the necessity for further research. This resulted in the following research issues chosen as the main objectives of this thesis:

Robust global localization: A lot of different effects influence the performance of global localization, and occlusion of the landmarks is one of the worst. Most of the time a mobile robot is forced to operate in a dynamically changing environment where people are moving around close to the robot and occlude large parts of the robot's view. Thus global localization needs to be robust to such occlusions and should be possible even in cases where only a few landmarks are visible.


Accurate pose estimation from a small number of landmark matches: The accuracy of pose estimation increases with the number of detected landmark matches. Current methods require about 10-20 landmark matches to achieve a reasonable accuracy. The goal is to achieve accurate pose estimation for cases with fewer than 10 landmark matches, or even for the case of a single landmark match.

Reliable detection of landmark correspondences: The correspondence problem is an inherent problem in mobile robotics, not only when using vision sensors but for all sensor modalities. Especially mis-matches pose great difficulties for localization algorithms. For vision sensors, recent advances in wide-baseline image matching successfully demonstrated how to tackle the correspondence problem under a variety of image transforms, including illumination change, viewpoint change, etc. Thus the application of wide-baseline methods for landmark matching in mobile robotics is very promising.

The above listed main objectives were tackled by applying wide-baseline methods to the field of mobile robotics. The research resulted in the following main contributions:

Performance evaluation of local detectors: Local detectors are a key ingredient for wide-baseline image matching. A wide variety of different methods already exists, each having different advantages and disadvantages, and for each application the best fitting method should be chosen. The detector comparison of Mikolajczyk et al. [76] provides a basis for such an assessment; however, it does not evaluate the detectors on scenes significant for mobile robot applications, and the evaluation method is not applicable to the realistic, complex scenes that are encountered in mobile robot experiments. One contribution of this thesis therefore was the development of a method to evaluate the different local detectors on realistic, complex scenes. The resulting comparison showed a significant difference to the previous evaluation on the restricted test cases.

Maximally Stable Corner Clusters: The analysis of the new detector evaluation results led to the development of a new local detector, the Maximally Stable Corner Cluster (MSCC) detector. MSCC regions are clusters of simple corner points in images, robustly detected by applying a stability criterion. A comparison with other methods revealed that MSCC regions are detected at image locations left out by the other methods; thus they are complementary to them. This complementarity is the key property of the new detector, as it allows an effective combination with other current state-of-the-art detectors.

3D piece-wise planar world map: The proposed piece-wise planar world map incorporates a higher degree of structural information in the world representation than other methods, e.g. [56, 96]. Landmarks are defined by a small plane patch (6DOF), a SIFT descriptor and the original appearance from the image. New methods for wide-baseline region matching and piece-wise planar scene reconstruction were developed to build the piece-wise planar world map.

Global localization from a single landmark: The piece-wise planar world representation, including plane structures, allowed the development of a new localization algorithm which enables pose estimation from a single landmark match. Accurate pose estimation is already possible from an image region with an area of only 400 pixels. This allows global localization to deal with a high level of occlusion, as is necessary for crowded environments.


Map building and localization experiments demonstrated the capabilities of the proposed approach. Map building was successfully shown for two indoor environments, the "Office" and the "Hallway" environment. The "Hallway" environment represents a large, challenging environment of 30m × 8m. By comparison with a ground truth created by laser mapping, the localization experiments showed accuracies competitive with current state-of-the-art methods, while using only a single landmark match for pose estimation.

In summary, global localization as proposed in this thesis results in accurate pose estimates, even despite heavy occlusions and few landmark matches. Finally, Table 9.1 shows how the key aspects of the new method compare to the current state-of-the-art.

Authors | World map | Sensor system | Map features | Landmark matching | Map building | Global localization (#landmarks*) | Pose representation

Se, Lowe, Little [96] | sparse metric | stereo | 3D points + SIFT | feature matching | SLAM | tri-angulation, map-alignment (>= 10) | 2D (3DOF)
Karlsson et al. [56] | sparse metric | monocular | 3D points + SIFT + appearance | feature matching | SLAM | 3D-2D (>= 4) | 2D (3DOF)
Davison et al. [21] | sparse metric | active stereo | 3D points | correlation | SLAM | tri-angulation (>= 3) | 3D (6DOF)
Bosse et al. [10] | sparse metric | omnidirectional | 3D points + 3D lines + vanishing points | nearest neighbor | batch | map matching (approx. 30) | 3D (6DOF)
Goedeme et al. [39] | topological | monocular, omnidirectional | 2D lines + color descriptor + intensity descriptor | feature matching | batch | line matching and voting | topological location
Kosaka et al. [59] | sparse metric, CAD-model | monocular | 3D lines | nearest neighbor | manual | - | 2D (3DOF)
Hayet et al. [45] | sparse metric | monocular | quadrangular 3D planes + PCA descriptor | feature matching | batch | 3D-2D (1) | 3D (6DOF)
Fraundorfer | piece-wise planar metric | monocular | unconstrained 3D planes + SIFT + appearance | feature matching, registration | batch | 3D-2D (1) | 3D (6DOF)

Table 9.1: Main characteristics of the current state-of-the-art approaches compared to the proposed approach. (* necessary landmark matches for robust pose estimation)

9.1 Future work

The methods developed in this thesis provide a strong basis for future work in several interesting directions. Let me describe four of them in more detail. The first interesting direction is to couple visual odometry with global localization; the second, much more challenging, is to integrate the presented method for global localization into a complete probabilistic SLAM framework. As a third interesting direction we discuss the integration of point and line features into the piece-wise planar map. Finally, we discuss the use of geometric constraints for landmark matching provided by the piece-wise planar map.

Coupling with visual odometry: A mobile robot relying solely on the presented global localization will run into trouble if it faces an environment which lacks features that can be used as landmarks. This easily happens, e.g., when the robot comes close to a plain wall: no distinct landmarks can be extracted and thus no landmark matches are available for pose estimation. Such situations can be overcome by the use of visual odometry as described in [84]. Visual odometry computes the motion of the robot from two subsequent frames. Point correspondences for epipolar geometry estimation between two subsequent frames can easily be detected by tracking, e.g. KLT tracking [107]. The current robot position is then computed from the last known position obtained from global localization and the frame-to-frame motion sequence from visual odometry. The motion sequence from visual odometry is computed by adding up all the small frame-to-frame motions. This inevitably accumulates the small errors of the frame-to-frame motion estimation and results in an error proportional to the length of the motion sequence. However, visual odometry is only necessary to navigate the robot back to a position where a landmark for global localization can be spotted again, and for such a short time the visual odometry will be accurate enough. Furthermore, global localization and visual odometry could simply run in parallel and the robot position could be computed by fusing both measurements. Global localization then only needs to be carried out from time to time to correct the pose estimate from visual odometry. For fusing odometry and global localization poses the method developed by Smith et al. [100] can be used. In their approach the pose is represented using exponential maps; this representation eases the probabilistic propagation and fusion of different measurements. Fusing global localization and visual odometry in such a way will result in a very robust localization method.
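A minimal sketch of such a frame-to-frame visual odometry loop is given below, here using OpenCV's KLT tracker and essential-matrix-based relative pose as a generic stand-in rather than the exact pipeline of [84]. The calibration matrix K and the image list are assumed to be given, the translation is only known up to scale, and the chaining convention is one common choice, not a definitive implementation.

```python
import cv2
import numpy as np

def visual_odometry(frames, K):
    """Accumulate frame-to-frame motion; frames is a list of grayscale images."""
    R_acc, t_acc = np.eye(3), np.zeros((3, 1))       # pose relative to the first frame
    prev = frames[0]
    prev_pts = cv2.goodFeaturesToTrack(prev, maxCorners=500, qualityLevel=0.01, minDistance=8)
    for cur in frames[1:]:
        # KLT tracking of the previous corners into the current frame.
        cur_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev, cur, prev_pts, None)
        good = status.ravel() == 1
        p0, p1 = prev_pts[good], cur_pts[good]
        # Relative motion from the epipolar geometry of the two frames.
        E, inliers = cv2.findEssentialMat(p0, p1, K, method=cv2.RANSAC, threshold=1.0)
        _, R, t, _ = cv2.recoverPose(E, p0, p1, K, mask=inliers)
        # Chain the small frame-to-frame motions; errors and scale drift accumulate.
        t_acc = t_acc + R_acc @ t
        R_acc = R @ R_acc
        prev, prev_pts = cur, cv2.goodFeaturesToTrack(cur, 500, 0.01, 8)
    return R_acc, t_acc
```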

Integration into a SLAM framework: Integration of the global localization into a probabilistic SLAM framework is a straightforward but challenging task. The first challenge is to implement map and landmark updates into the so far off-line map building process. This requires an uncertainty representation for the map structures. Planes in 3D are described by a 3D point and a 3D normal vector, in sum 6 parameters. The uncertainty of the 3D point can be described by a 3 × 3 covariance matrix. The uncertainty of the 3D normal vector can be described similarly: a 3 × 3 covariance matrix can represent the angular uncertainties of the vector. The needed representation is identical to the case of uncertainty propagation for a camera position, as a camera is defined by the image plane and the principal point. The uncertainty propagation developed for cameras can therefore be applied directly to the plane structures of the map. In [100] the exponential map is used to propagate the uncertainty of camera positions and to update the positions with additional measurements. An initial uncertainty of the reconstructed planes can be derived from the distribution of the 3D points defining the 3D plane. Beside the uncertainty representation of the planes, the uncertainty propagation of the pose estimation algorithm has to be derived. However, having solved these problems, the piece-wise planar world representation and global localization can be integrated into a probabilistic SLAM framework as proposed in [78].

Integrating points, lines and planes into a common world map: Integration of points, lines and planes into a common world representation would prove very beneficial. Point and line features add extra value to map areas where no planar landmarks have been detected, so global localization can also make use of points and lines for pose estimation. However, the integration of points and lines can be much more than simply storing them in the map database. One requirement of the integration would be the consistency of the different feature types. When adding a line feature to the world map it can be checked whether the line originates from the intersection of two planes, which allows a refinement of the line position. Conversely, line features can be used to get an exact delineation of the map planes. For point features located on a map plane this information can be used to refine the point so that it is positioned exactly on the plane. Global localization can then use either points, lines or planes or, much more interesting, combinations of the feature types for pose estimation. Map planes also introduce a visibility criterion which can be used to detect landmark mis-matches: tentative landmark matches which lie behind a scene plane, and thus are not visible from the current robot position, can be discarded as mis-matches. One of the biggest benefits of such an integration, however, is that it allows localization from another feature type when, e.g., no planes are visible.

Geometric constraints for landmark matching: As already stated, the detection of landmark correspondences is a key problem and very hard to solve. In the presented approach corresponding landmarks are identified based on their appearance. Although this approach is very reliable, mis-matches may still occur, especially when multiple landmarks with identical appearance exist. However, geometric constraints allow the location of neighboring landmarks to be predicted for a tentative landmark match. For planar landmarks the appearance of a landmark from a different viewpoint can also be computed by a projective transformation. Thus a tentative landmark match can be verified by checking the location and appearance of the neighboring landmarks. This will significantly improve the reliability of landmark matching.
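To illustrate the appearance prediction mentioned above: for a planar landmark, the homography induced by the landmark plane between the stored view and a hypothesised new view can be composed from the relative pose and the plane parameters and used to warp the stored appearance. The sketch below uses the common plane-induced homography formula H = K (R − t nᵀ/d) K⁻¹; all inputs, names and the sign convention for (n, d) are assumptions for illustration, not the thesis implementation.

```python
import cv2
import numpy as np

def predict_landmark_appearance(patch, K, R, t, n, d):
    """Warp the stored landmark patch into a hypothesised new viewpoint.

    patch : stored appearance of the planar landmark (image patch from the first view)
    K     : 3x3 camera calibration matrix
    R, t  : relative rotation and translation from the stored view to the new view
    n, d  : plane normal (3-vector) and plane offset in the stored view's frame,
            following the plane convention assumed by H = K (R - t n^T / d) K^-1
    """
    # Plane-induced homography between the two views of the landmark plane.
    H = K @ (R - np.outer(t, n) / d) @ np.linalg.inv(K)
    return cv2.warpPerspective(patch, H, (patch.shape[1], patch.shape[0]))
```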


Appendix A

Projective transformation of ellipses

A.1 Projective ellipse transfer

The method for detector evaluation described in Chapter 4 needs to transfer ellipses from one image to an image viewed from a different viewpoint. In this section we therefore discuss how an ellipse transforms when it is viewed from a different viewpoint. We consider the planar case, i.e. the ellipse lies on a plane in the scene. In this case we can analytically calculate the shape of the ellipse (determine the ellipse parameters) in the image from the other viewpoint. We will see that it comes down to applying a projective transformation, in the form of a matrix multiplication, to our ellipse representation.

The mapping between two different views of a plane is described by a perspectivity [11]. In computer vision a perspectivity is usually denoted as a homography, sometimes a collineation or projective transformation; all the terms are synonymous. A homography can be represented as a 3 × 3 matrix. Consider two image planes I and I'. Let the mapping between both planes be the homography H. One can now calculate the position of a point x' in I' by

x' = Hx    (A.1)

where x is the point position in I. x and x' are homogeneous 3-vectors of the form x = [x y 1]^T, composed of the x, y-coordinates in the image coordinate system. Based on this transformation rule we can deduce the transformation rule for an ellipse.

An ellipse is a conic, as are a parabola and a hyperbola. Conics arise as conic sections when a cone is intersected by a plane. A conic can be represented by the following inhomogeneous equation:

ax^2 + bxy + cy^2 + dx + ey + f = 0.    (A.2)

Putting this into homogeneous form, i.e. by replacing x → x_1/x_3, y → x_2/x_3, it reads as follows:

a x_1^2 + b x_1 x_2 + c x_2^2 + d x_1 x_3 + e x_2 x_3 + f x_3^2 = 0.    (A.3)

The conic equation can also be written in matrix form

x^T C x = 0.    (A.4)

We call C the conic coefficient matrix and it is given by

C = \begin{bmatrix} a & b/2 & d/2 \\ b/2 & c & e/2 \\ d/2 & e/2 & f \end{bmatrix}.    (A.5)



If we now apply the transformation x' = Hx to the conic C, this results in the conic C' = H^{-T} C H^{-1}. This transformation rule can be shown easily:

x^T C x = x'^T [H^{-1}]^T C H^{-1} x'    (A.6)
        = x'^T H^{-T} C H^{-1} x'.    (A.7)

Writing C' = H^{-T} C H^{-1} reduces the relation to x'^T C' x' = 0, which is the transformation rule for a conic.

The most important fact for us here is that a conic transformed by a projective transformation H still results in a conic. That means that (except for degenerate transforms) an ellipse transferred into another image stays an ellipse.

This property is illustrated in Figure A.1. The figure shows the effects of transforming the original ellipse in Figure A.1(a) by various perspective transformations with increasing perspectivity. The original image also contains two tangents to the ellipse, which are transformed by the same perspectivity; they help to make the effects of the projective transformation better visible. In addition the ellipse is overlaid with single points that are more or less equally distributed. The points are transformed with the same projective transformation as the ellipse and the lines. In the transformed image the points still lie on the ellipse, but the initial uniform distribution has changed: the points move along the ellipse perimeter in the direction of the fore-shortening. This is also observable at the intersections with the tangents.

An important observation is that the center point of the original ellipse and the center point of the transformed ellipse are not connected by the transforming perspectivity. In other words, applying the point transform (homography) to the original ellipse center does not yield the center of the transformed ellipse.

To calculate a projectively transformed conic with the previous method the conic must be represented in its matrix form. In the case of ellipses, however, two other representations are very common, the parameter form and the second moment matrix. In the parameter form one usually specifies the ellipse by a 5-vector E = [x, y, a, b, α]. In this representation x, y are the coordinates of the ellipse center, the values a, b are the lengths of the major and minor semi-axes, and α is the rotation of the ellipse. One can convert this representation into the matrix form by first setting up a matrix for the canonic ellipse form and then applying a translation for the center point and a rotation for the angle. The canonic representation C_C can be set up as follows:

C_C = \begin{bmatrix} a^{-2} & 0 & 0 \\ 0 & b^{-2} & 0 \\ 0 & 0 & -1 \end{bmatrix}.    (A.8)

Applying rotation and translation leads to the matrix form C,

C = T^T R^T C_C R T    (A.9)

where R is a 3 × 3 2D rotation matrix

R = \begin{bmatrix} \cos\alpha & -\sin\alpha & 0 \\ \sin\alpha & \cos\alpha & 0 \\ 0 & 0 & 1 \end{bmatrix}    (A.10)

and T is a 3 × 3 2D translation matrix

T = \begin{bmatrix} 1 & 0 & -x \\ 0 & 1 & -y \\ 0 & 0 & 1 \end{bmatrix}.    (A.11)
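The construction in Eqs. (A.8)-(A.11) and the transfer rule C' = H^{-T} C H^{-1} from Eqs. (A.6)-(A.7) translate directly into a few lines of numpy. This is only an illustrative sketch, not code from the thesis.

```python
import numpy as np

def conic_from_params(x, y, a, b, alpha):
    """Build the conic matrix C = T^T R^T C_C R T from E = [x, y, a, b, alpha] (Eqs. A.8-A.11)."""
    C_C = np.diag([a ** -2, b ** -2, -1.0])
    c, s = np.cos(alpha), np.sin(alpha)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    T = np.array([[1.0, 0.0, -x], [0.0, 1.0, -y], [0.0, 0.0, 1.0]])
    return T.T @ R.T @ C_C @ R @ T

def transfer_conic(C, H):
    """Map a conic into the second view: C' = H^{-T} C H^{-1} (Eqs. A.6-A.7)."""
    Hinv = np.linalg.inv(H)
    return Hinv.T @ C @ Hinv
```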


Figure A.1: (a) Original ellipse. (b-e) Transformed ellipses.

The conic representation in matrix form is independent of scaling: multiplying the conic matrix C by some non-zero scalar s still represents the same conic.

To calculate the ellipse parameters E = [x, y, a, b, α] from the matrix representation one can go the inverse way of the construction. For this, let us write down the conic construction in more detail.

C = T^T R^T C_C R T    (A.12)
  = \begin{bmatrix} I & 0 \\ t^T & 1 \end{bmatrix} \begin{bmatrix} r^T & 0 \\ 0^T & 1 \end{bmatrix} \begin{bmatrix} c_c & 0 \\ 0^T & -1 \end{bmatrix} \begin{bmatrix} r & 0 \\ 0^T & 1 \end{bmatrix} \begin{bmatrix} I & t \\ 0^T & 1 \end{bmatrix}    (A.13)
  = \begin{bmatrix} r^T c_c r & r^T c_c r\, t \\ t^T r^T c_c r & t^T r^T c_c r\, t - 1 \end{bmatrix}    (A.14)
  = \begin{bmatrix} c_{11} & c_{12} & c_{13} \\ c_{21} & c_{22} & c_{23} \\ c_{31} & c_{32} & c_{33} \end{bmatrix}.    (A.15)

Here I is the 2 × 2 identity matrix, t = [-x, -y]^T is a 2-vector representing the translation to the ellipse center, r = \begin{bmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{bmatrix} is a 2 × 2 rotation matrix, and c_c = \begin{bmatrix} a^{-2} & 0 \\ 0 & b^{-2} \end{bmatrix} is the upper 2 × 2 part of the canonic conic matrix.

From Eq. (A.12) it is evident that the translation vector t can be calculated from the conic matrix C with simple matrix arithmetic:

\begin{pmatrix} c_{13} \\ c_{23} \end{pmatrix} = r^T c_c r\, t    (A.16)

t = \left( r^T c_c r \right)^{-1} \begin{pmatrix} c_{13} \\ c_{23} \end{pmatrix}    (A.17)
  = \begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix}^{-1} \begin{pmatrix} c_{13} \\ c_{23} \end{pmatrix}    (A.18)

It is a very nice property that the translation can be extracted from an arbitrarily scaled conic matrix. Consider a scaling s; Eq. (A.18) is then rewritten as

t = \begin{bmatrix} s c_{11} & s c_{12} \\ s c_{21} & s c_{22} \end{bmatrix}^{-1} \begin{pmatrix} s c_{13} \\ s c_{23} \end{pmatrix}    (A.19)
  = \frac{s}{s} \begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix}^{-1} \begin{pmatrix} c_{13} \\ c_{23} \end{pmatrix}    (A.20)

and one can see that the scaling s cancels out. The ellipse angle α can be extracted from the conic matrix using the following equation¹:

\alpha = \frac{1}{2} \arctan \frac{2 c_{21}}{c_{22} - c_{11}}    (A.21)

Eq. (A.21) can be verified by taking a closer look at the coefficients of the conic matrix:

c_{11} = \frac{1}{a^2} \cos^2\alpha + \frac{1}{b^2} \sin^2\alpha    (A.22)

c_{22} = \frac{1}{a^2} \sin^2\alpha + \frac{1}{b^2} \cos^2\alpha    (A.23)

¹ This formula returns the correct angles for an interval from 0 to π/2. By returning α modulo π the angle is correct for an interval from 0 to π. That is sufficient because a conic in matrix form is defined uniquely only for an interval from 0 to π: constructing a conic matrix for an angle α + π gives the same matrix as for α.


c_{21} = -\frac{1}{a^2} \cos\alpha \sin\alpha + \frac{1}{b^2} \cos\alpha \sin\alpha    (A.24)
       = -\frac{1}{a^2} \frac{1}{2} \sin 2\alpha + \frac{1}{b^2} \frac{1}{2} \sin 2\alpha    (A.25)
       = \frac{1}{2} \sin 2\alpha \left( \frac{1}{b^2} - \frac{1}{a^2} \right)    (A.26)

c_{22} - c_{11} = \cos^2\alpha \left( \frac{1}{b^2} - \frac{1}{a^2} \right) + \sin^2\alpha \left( \frac{1}{a^2} - \frac{1}{b^2} \right)    (A.27)
               = \left( \frac{1}{b^2} - \frac{1}{a^2} \right) \left( \cos^2\alpha - \sin^2\alpha \right)    (A.28)
               = \left( \frac{1}{b^2} - \frac{1}{a^2} \right) \left( \tfrac{1}{2} + \tfrac{1}{2}\cos 2\alpha - \tfrac{1}{2} + \tfrac{1}{2}\cos 2\alpha \right)    (A.29)
               = \left( \frac{1}{b^2} - \frac{1}{a^2} \right) \cos 2\alpha    (A.30)

\frac{c_{21}}{c_{22} - c_{11}} = \frac{\frac{1}{2}\sin 2\alpha \left( \frac{1}{b^2} - \frac{1}{a^2} \right)}{\cos 2\alpha \left( \frac{1}{b^2} - \frac{1}{a^2} \right)} = \frac{1}{2} \tan 2\alpha    (A.31)

\alpha = \frac{1}{2} \arctan \left( \frac{2 c_{21}}{c_{22} - c_{11}} \right)    (A.32)

With α we can calculate the rotation matrix R, which is needed to extract the last parameters a and b. Before we come to this, it is necessary to remove the arbitrary scale from the conic matrix. Unlike translation and rotation, the calculation of the axes is sensitive to arbitrary scaling. From Eq. (A.14) we can see that the upper 2 × 2 part of the conic matrix is equal to r^T c_c r. The matrix c_c contains the desired values of the axes a and b, and they can be recovered by removing the applied rotations. However, an arbitrary scaling multiplies directly into the axis lengths, and therefore we first have to calculate the scaling factor and remove it (if it is not equal to 1). It is possible to recover the scaling s from the coefficient c_{33} of the conic matrix. The equation for c_{33} with an unknown scaling factor s is

c_{33} = s \left( t^T r^T c_c r\, t - 1 \right)    (A.33)
       = s \left( t^T \frac{1}{s} \begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix} t - 1 \right)    (A.34)
       = t^T \begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix} t - s    (A.35)

which leads to the following equation for s:

s = t^T \begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix} t - c_{33}.    (A.36)

Now we have all ingredients to recover the matrix c_c:

c_c = \frac{1}{s}\, r \begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix} r^T.    (A.37)


Finally, the axis lengths a and b are

a = \frac{1}{\sqrt{c_{c,11}}}    (A.38)

b = \frac{1}{\sqrt{c_{c,22}}}.    (A.39)
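The extraction of Eqs. (A.18), (A.21) and (A.36)-(A.39) can be summarized in a short numpy routine. Again this is only an illustrative sketch (using arctan2 instead of the plain arctan of Eq. (A.21) to avoid a division by zero), not the thesis implementation.

```python
import numpy as np

def params_from_conic(C):
    """Recover E = [x, y, a, b, alpha] from a (possibly scaled) conic matrix C."""
    A = C[:2, :2]                                   # scaled version of r^T c_c r (Eq. A.14)
    tvec = np.linalg.solve(A, C[:2, 2])             # Eq. (A.18); equals [-x, -y], scale cancels
    x, y = -tvec[0], -tvec[1]
    alpha = 0.5 * np.arctan2(2.0 * C[1, 0], C[1, 1] - C[0, 0])   # Eq. (A.21)
    s = tvec @ A @ tvec - C[2, 2]                   # Eq. (A.36): the unknown scale
    c, sn = np.cos(alpha), np.sin(alpha)
    r = np.array([[c, -sn], [sn, c]])
    c_c = (r @ A @ r.T) / s                         # Eq. (A.37): de-rotate and de-scale
    a = 1.0 / np.sqrt(c_c[0, 0])                    # Eq. (A.38)
    b = 1.0 / np.sqrt(c_c[1, 1])                    # Eq. (A.39)
    return x, y, a, b, alpha
```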

A.2 Affine approximation of ellipse transfer

In this section we discuss the ellipse transfer method used in the evaluation method of Mikolajczyk and Schmid [74]. It approximates the projective transformation with an affine transformation. The method works by transforming the ellipse shape with an affine transformation and centering the ellipse around a new ellipse center. The new ellipse center is obtained by transferring the original ellipse center to the other image with the homography. To obtain the new ellipse shape, the second moment matrix of the ellipse is transformed with an affine transformation which is an approximate estimate of the true projective transformation.

Such an approximation was chosen in [74] because the authors were interested in establishing a corresponding center point. However, one must be aware that there can be quite large approximation errors. Figure A.2 shows a comparison of the projective and affine ellipse transfer. The ellipse resulting from the projective transfer is drawn in black, the result of the affine transfer in green. In Figure A.2(b-e) one can see the differences between both methods when transforming the original ellipse in Figure A.2(a). The centers of the green ellipses are at the position of the original ellipse center transformed by the homography.


Figure A.2: Comparison of projective (in black) and affine (in green) ellipse transfer. (a) Original ellipse. (b-e) Transformed ellipses.


Appendix B

The trifocal tensor and point transfer

B.1 The trifocal tensor

The trifocal tensor encapsulates the geometry between three images. It is the analogue of the fundamental matrix for the three-view case. The trifocal tensor, its computation and its properties are described in detail in [41, 42, 101, 109]. The trifocal tensor consists of three 3×3 matrices and thus has 27 elements. However, the tensor has only 18 DOF and is determined up to an arbitrary scale factor. The trifocal tensor defines various relationships between points and lines in three views. These incidence relations are trilinear equations and are therefore often denoted as trilinearities. The incidence relations are listed in Table B.1. T_i^{jk} is the trifocal tensor in tensor notation. Point correspondences between three views are given as x ↔ x′ ↔ x″; similarly, line correspondences are given as l ↔ l′ ↔ l″. The trilinearities are the basic equations for the computation of the trifocal tensor. The trifocal tensor can be computed from point or line correspondences between three views. With the use of the trilinearities an equation system of the form At = 0 can be generated, where t contains the 27 entries of the trifocal tensor. To solve for the 27 entries of T_i^{jk} up to scale, 26 equations are necessary. With more than 26 equations a least squares solution can be computed. Using point correspondences (the point-point-point incidence relation), at least 7 point correspondences are necessary, as each point-point-point incidence gives 4 linearly independent equations (a six-point algorithm, which produces up to three possible solutions, has been proposed in [108]). Each of the trilinearities can be used to generate the equation system. The different methods for the computation of the trifocal tensor are listed and described in detail in [44].
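To make the construction of the equation system concrete, the following Python/NumPy sketch builds A from point-point-point correspondences via the corresponding trilinearity of Table B.1 and solves At = 0 in the least squares sense. The function names and the storage order T[i, p, q] = T_i^{pq} are illustrative assumptions, and data normalization as well as the enforcement of the internal tensor constraints, as discussed in [44], are omitted here.

    import numpy as np

    # Levi-Civita symbol epsilon[a, b, c]
    EPS = np.zeros((3, 3, 3))
    for a, b, c in [(0, 1, 2), (1, 2, 0), (2, 0, 1)]:
        EPS[a, b, c] = 1.0
        EPS[a, c, b] = -1.0

    def trilinearity_rows(x1, x2, x3):
        """Rows of A t = 0 contributed by one point-point-point correspondence
        x1 <-> x2 <-> x3 (homogeneous 3-vectors). Implements
        x^i (x'^j eps_jpr)(x''^k eps_kqs) T_i^{pq} = 0_rs: 9 equations per
        correspondence, 4 of them linearly independent."""
        M2 = np.einsum('j,jpr->pr', x2, EPS)   # contraction of x' with epsilon
        M3 = np.einsum('k,kqs->qs', x3, EPS)   # contraction of x'' with epsilon
        # coeff[r, s, i, p, q] multiplies the tensor entry T_i^{pq} in eq. (r, s)
        coeff = np.einsum('i,pr,qs->rsipq', np.asarray(x1, float), M2, M3)
        return coeff.reshape(9, 27)

    def linear_trifocal_tensor(pts1, pts2, pts3):
        """DLT-style linear estimate of T from >= 7 point correspondences:
        stack the rows and take the null vector of A as the 27 tensor entries."""
        A = np.vstack([trilinearity_rows(p1, p2, p3)
                       for p1, p2, p3 in zip(pts1, pts2, pts3)])
        _, _, Vt = np.linalg.svd(A)
        return Vt[-1].reshape(3, 3, 3)         # T[i, p, q] = T_i^{pq}, up to scale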

B.2 Point transfer

Knowing the trifocal tensor for a set of three images, it is possible to transfer point locations known in two of the images into the third one. This is often denoted as the point transfer property. The trilinearities again provide the basis for the point transfer; the transfer property also holds for line correspondences. In the following, the algorithm to transfer a point given in the first two views into the third view is outlined in detail (as described in [44]):

1. Extract the fundamental matrix F_21 from the trifocal tensor and correct x ↔ x′ to the exact correspondence x̂ ↔ x̂′ (by optimal triangulation and re-projection into the images).


Trilinearities

Line-line-line correspondence:      (l_r ε^{ris}) l′_j l″_k T_i^{jk} = 0^s
Point-line-line correspondence:     x^i l′_j l″_k T_i^{jk} = 0
Point-line-point correspondence:    x^i l′_j (x″^k ε_{kqs}) T_i^{jq} = 0_s
Point-point-line correspondence:    x^i (x′^j ε_{jpr}) l″_k T_i^{pk} = 0_r
Point-point-point correspondence:   x^i (x′^j ε_{jpr}) (x″^k ε_{kqs}) T_i^{pq} = 0_{rs}

Table B.1: Summary of the incidence relations (trilinearities) imposed by the trifocal tensor.

2. Next compute the line l′ through x̂′ which is perpendicular to the epipolar line of x̂, defined by l′_e = F_21 x̂. Then l′ = (l_2, −l_1, −x̂_1 l_2 + x̂_2 l_1)^T with l′_e = (l_1, l_2, l_3)^T and x̂′ = (x̂_1, x̂_2, 1)^T.

3. The transferred point is x″^k = x̂^i l′_j T_i^{jk}, where l′_j T_i^{jk} is the homography mapping H_i^k = H_13(l′).

The point transfer into the other views works similarly; the corresponding equations are given in Table B.2.

view 2,3 → 1:  l′_e = F_21^T x̂″,  l′ = (l_2, −l_1, −x̂″_1 l_2 + x̂″_2 l_1)^T,  H_13(l′) = H_i^k = l′_j T_i^{jk},  x = H_13(l′)^{-1} x″

view 1,3 → 2:  l″_e = F_31 x̂,  l″ = (l_2, −l_1, −x̂_1 l_2 + x̂_2 l_1)^T,  H_12(l″) = H_i^j = l″_k T_i^{jk},  x′ = H_12(l″) x

view 1,2 → 3:  l′_e = F_21 x̂,  l′ = (l_2, −l_1, −x̂_1 l_2 + x̂_2 l_1)^T,  H_13(l′) = H_i^k = l′_j T_i^{jk},  x″ = H_13(l′) x

Table B.2: Relations to transfer a point into each view using the trifocal tensor.
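The following Python/NumPy sketch carries out the 1,2 → 3 transfer of the table for a single correspondence. It is an illustrative helper (the function name and the indexing convention T[i, j, k] = T_i^{jk} are assumptions), it takes F_21 in the convention x′^T F_21 x = 0, and it skips the optimal correction of the correspondence from step 1.

    import numpy as np

    def transfer_point_view3(T, F21, x1, x2):
        """Transfer the correspondence x1 <-> x2 (homogeneous points in views
        1 and 2) into view 3 with the trifocal tensor T, following steps 2-3."""
        x1 = np.asarray(x1, dtype=float)
        x2 = np.asarray(x2, dtype=float)
        x2 = x2 / x2[2]                      # normalize so that x2 = (x, y, 1)
        l_e = F21 @ x1                       # epipolar line of x1 in view 2
        # line through x2 perpendicular to the epipolar line (step 2)
        l_perp = np.array([l_e[1], -l_e[0],
                           -x2[0] * l_e[1] + x2[1] * l_e[0]])
        # homography H_13(l'): H[k, i] = l'_j T_i^{jk} (step 3)
        H = np.einsum('j,ijk->ki', l_perp, T)
        x3 = H @ x1                          # transferred point x''^k
        return x3 / x3[2]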


Bibliography

[1] S. Atiya and G. Hager. Real-time vision-based robot localization. IEEE Transactions on Robotics and Automation, 9:785–800, 1993.
[2] N. Ayache and O. Faugeras. Maintaining representations of the environment of a mobile robot. IEEE Transactions on Robotics and Automation, 5(6):804–819, 1989.
[3] C. Baillard and A. Zisserman. A plane-sweep strategy for the 3d reconstruction of buildings from multiple images. In International Archives of Photogrammetry and Remote Sensing, volume 32, pages 56–62, 2000.
[4] C. Baillard, C. Schmid, A. Zisserman, and A. W. Fitzgibbon. Automatic line matching and 3d reconstruction of buildings from multiple views. In Proc. ISPRS Conference on Automatic Extraction of GIS Objects from Digital Imagery, Munich, pages 69–80, 1999.
[5] J. Bauer, K. Karner, and K. Schindler. Plane parameter estimation by edge set matching. In Proc. 26th Workshop of the Austrian Association for Pattern Recognition, Graz, Austria, pages 29–36, 2002.
[6] A. Baumberg. Reliable feature matching across widely separated views. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, Hilton Head, South Carolina, pages 774–781, 2000.
[7] P. R. Beaudet. Rotationally invariant image operators. International Joint Conference on Pattern Recognition, pages 579–583, 1978.
[8] P. Besl and N. McKay. A method for registration of 3-d shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2):239–256, 1992.
[9] J. Bigün and G. H. Granlund. Optimal orientation detection of linear symmetry. In Proc. 1st International Conference on Computer Vision, London, UK, pages 433–438, 1987.
[10] M. Bosse, P. Newman, J. Leonard, and S. Teller. An atlas framework for scalable mapping. In IEEE International Conference on Robotics and Automation, pages 1234–1240, 2003.
[11] D. A. Brannan, M. F. Esplen, and J. J. Gray. Geometry. Cambridge University Press, 1999.
[12] R. A. Brooks. Intelligence without representation. Artificial Intelligence, 47(1-3):139–159, 1991.
[13] M. Brown and D. Lowe. Invariant features from interest point groups. In Proc. 13th British Machine Vision Conference, Cardiff, UK, pages 253–262, 2002.

[14] J. Buhmann, W. Burgard, A. Cremers, D. Fox, T. Hofmann, F. Schneider, J. Strikos, and S. Thrun. The mobile robot Rhino. AI Magazine, 16(1), 1995.
[15] J. Canny. Finding edges and lines in images. In MIT AI-TR, 1983.
[16] G. Carneiro and A. Jepson. Phase-based local features. In Proc. 7th European Conference on Computer Vision, Copenhagen, Denmark, pages I: 282–296, 2002.
[17] H. Christensen, N. Kirkeby, S. Kristensen, and L. Knudsen. Model-driven vision for in-door navigation. Robotics and Autonomous Systems, 12:199–207, 1994.
[18] D. Comaniciu and P. Meer. Mean shift analysis and applications. In Proc. 7th International Conference on Computer Vision, Kerkyra, Greece, pages 1197–1203, 1999.
[19] T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. MIT Press, Cambridge MA, 1990.
[20] I. Cox. Blanche: An experiment in guidance and navigation of an autonomous robot vehicle. IEEE Transactions on Robotics and Automation, 7(2):193–204, 1991.
[21] A. Davison and D. Murray. Simultaneous localization and map-building using active vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7):865–880, 2002.
[22] A. J. Davison. Mobile Robot Navigation Using Active Vision. PhD thesis, University of Oxford, 1999.
[23] G. de Souza and A. Kak. Vision for mobile robot navigation: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2):237–267, 2002.
[24] D. DeMenthon and L. Davis. Model-based object pose in 25 lines of code. International Journal of Computer Vision, 15(1-2):123–141, 1995.
[25] R. Duda, P. Hart, and D. Stork. Pattern Classification. Wiley, 2001.
[26] P. J. Elbischger. Circular data - About the representation and calculation of oriented and directed 2D data. Technical Report ICG-TR-1, Institute for Computer Graphics and Vision, Graz University of Technology, 2003.
[27] S. P. Engelson and D. V. McDermott. Error correction in mobile robot map learning. In Proc. IEEE International Conference on Robotics and Automation, Washington D.C., US, pages 2555–2560, 1992.
[28] M. A. Fischler and R. C. Bolles. RANSAC random sampling consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of ACM, 26:381–395, 1981.
[29] J. Folkesson, P. Jensfelt, and H. Christensen. Vision slam in the measurement subspace. In Proc. IEEE International Conference on Robotics and Automation, Barcelona, Spain, pages 30–35, 2005.
[30] W. Förstner and E. Gülch. A fast operator for detection and precise location of distinct points, corners and centres of circular features. In ISPRS Intercommission Workshop, Interlaken, June 1987.

[31] F. Fraundorfer and H. Bischof. Detecting distinguished regions by saliency. In Proc. 13th Scandinavian Conference on Image Analysis, Göteborg, Sweden, pages 208–215, 2003.
[32] F. Fraundorfer and H. Bischof. Evaluation of local detectors on non-planar scenes. In Proc. 28th Workshop of the Austrian Association for Pattern Recognition, Hagenberg, Austria, pages 125–132, 2004.
[33] F. Fraundorfer and H. Bischof. A novel performance evaluation method of local detectors on non-planar scenes. In Workshop Proceedings Empirical Evaluation Methods in Computer Vision, IEEE Conference on Computer Vision and Pattern Recognition, San Diego, California, 2005.
[34] F. Fraundorfer and H. Bischof. Global localization from a single feature correspondence. In Proc. 30th Workshop of the Austrian Association for Pattern Recognition, Obergurgl, Austria, pages 151–160, 2006.
[35] F. Fraundorfer, S. Ober, and H. Bischof. Natural, salient image patches for robot localization. In Proc. International Conference on Pattern Recognition, Cambridge, UK, pages 881–884, 2004.
[36] F. Fraundorfer, M. Winter, and H. Bischof. MSCC: Maximally stable corner clusters. In Proc. 14th Scandinavian Conference on Image Analysis, Joensuu, Finland, pages 45–54, 2005.
[37] F. Fraundorfer, M. Winter, and H. Bischof. Maximally stable corner clusters: A novel distinguished region detector and descriptor. In Proc. 1st Austrian Cognitive Vision Workshop, Zell an der Pram, Austria, pages 59–66, 2005.
[38] F. Fraundorfer, K. Schindler, and H. Bischof. Piecewise planar scene reconstruction from sparse correspondences. Image and Vision Computing, 24(4):395–406, 2006.
[39] T. Goedeme, M. Nuttin, T. Tuytelaars, and L. Van Gool. Markerless computer vision based localization using automatically generated topological maps. In European Navigation Conference GNSS, Rotterdam, 2004.
[40] C. Harris and M. Stephens. A combined corner and edge detector. In Alvey Vision Conference, 1988.
[41] R. Hartley. Projective reconstruction from line correspondences. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, Seattle, Washington, pages 903–907, 1994.
[42] R. Hartley. Lines and points in three views and the trifocal tensor. International Journal of Computer Vision, 22(2):125–140, 1997.
[43] R. Hartley. Theory and practice of projective rectification. International Journal of Computer Vision, 35(2):115–127, 1999.
[44] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge, 2000.
[45] J. Hayet, F. Lerasle, and M. Devy. Planar landmarks to localize a mobile robot. In 8th International Symposium on Intelligent Robotic Systems, Reading, UK, pages 163–169, 2000.

[46] H. Hirschmüller. Accurate and efficient stereo processing by semi-global matching and mutual information. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, San Diego, California, pages II: 807–814, 2005.
[47] B. Horn. Closed form solutions of absolute orientation using unit quaternions. Journal of the Optical Society of America, 4(4):629–642, 1987.
[48] P. Hough. Method and means for recognizing complex patterns. 1962.
[49] D. Huttenlocher, G. Klanderman, and W. Rucklidge. Comparing images using the hausdorff distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(9):850–863, 1993.
[50] ICG. Giplib - general image processing library, 2005. URL http://www.icg.tu-graz.ac.at/research/ComputerVision/giplib.
[51] M. Jogan, A. Leonardis, H. Wildenauer, and H. Bischof. Mobile robot localization under varying illumination. In Proc. International Conference on Pattern Recognition, Quebec City, Canada, pages II: 741–744, 2002.
[52] T. Kadir and M. Brady. Saliency, scale and image description. International Journal of Computer Vision, 45(2):83–105, 2001.
[53] T. Kadir, A. Zisserman, and M. Brady. An affine invariant salient region detector. In Proc. 7th European Conference on Computer Vision, Prague, Czech Republic, pages I: 228–241, 2004.
[54] R. Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME: Journal of Basic Engineering, pages 35–45, 1960.
[55] D. R. Karger, P. N. Klein, and R. E. Tarjan. A randomized linear-time algorithm to find minimum spanning trees. Journal of the Association for Computing Machinery, 42(2):321–328, 1995.
[56] N. Karlsson, E. Di Bernardo, J. Ostrowski, L. Goncalves, P. Pirjanian, and M. E. Munich. The vslam algorithm for robust localization and mapping. In Proc. IEEE International Conference on Robotics and Automation, Barcelona, Spain, pages 24–29, 2005.
[57] L. Kitchen and A. Rosenfeld. Gray level corner detection. Pattern Recognition Letters, 1:95–102, 1982.
[58] V. Kolmogorov and R. Zabih. Multi-camera scene reconstruction via graph cuts. In Proc. 7th European Conference on Computer Vision, Copenhagen, Denmark, pages III: 82–96, 2002.
[59] A. Kosaka and A. Kak. Fast vision-guided mobile robot navigation using model-based reasoning and prediction of uncertainties. Computer Vision, Graphics and Image Processing, 56(3):271–329, 1992.
[60] J. Kosecka and X. Yang. Location recognition and global localization based on scale invariant features. In Workshop on Statistical Learning in Computer Vision, Proc. 7th European Conference on Computer Vision, Prague, Czech Republic, 2004.

[61] U. Köthe. Edge and junction detection with an improved structure tensor. Proc. 25th DAGM Pattern Recognition Symposium, Magdeburg, Germany, pages 25–32, 2003.
[62] Z. Lan and R. Mohr. Direct linear sub-pixel correlation by incorporation of neighbor pixels information and robust estimation of window transformation. Machine Vision and Applications, 10(5-6):256–268, 1998.
[63] T. Lindeberg. Scale-space theory: A basic tool for analysing structures at different scales. Journal of Applied Statistics, 21(2):224–270, 1994.
[64] T. Lindeberg. Feature detection with automatic scale selection. International Journal of Computer Vision, 30(2):79–116, 1998.
[65] S. Livatino. Acquisition and Recognition of Natural Landmarks for Vision-Based Autonomous Robot Navigation. PhD thesis, Aalborg University, 2003.
[66] D. Lowe. Object recognition from local scale-invariant features. In Proc. 7th International Conference on Computer Vision, Kerkyra, Greece, pages 1150–1157, 1999.
[67] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[68] C. Lu, G. Hager, and E. Mjolsness. Fast and globally convergent pose estimation from video images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(6):610–622, 2000.
[69] Q.-T. Luong and T. Vieville. Canonical representations for the geometries of multiple projective views. Computer Vision and Image Understanding, 64(2):193–229, 1996.
[70] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In Proc. 13th British Machine Vision Conference, Cardiff, UK, pages 384–393, 2002.
[71] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, Madison, Wisconsin, pages II: 257–263, 2003.
[72] K. Mikolajczyk and C. Schmid. Indexing based on scale invariant interest points. In Proc. of the 8th International Conference on Computer Vision, Vancouver, Canada, pages 525–531, 2001.
[73] K. Mikolajczyk and C. Schmid. An affine invariant interest point detector. In Proc. 7th European Conference on Computer Vision, Copenhagen, Denmark, pages I: 128–142, 2002.
[74] K. Mikolajczyk and C. Schmid. Comparison of affine-invariant local detectors and descriptors. In Proc. 12th European Signal Processing Conference, Vienna, Austria, 2004.
[75] K. Mikolajczyk and C. Schmid. Scale & affine invariant interest point detectors. International Journal of Computer Vision, 60(1):63–86, 2004.
[76] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. International Journal of Computer Vision, 65(1-2):43–72, 2005.

[77] N. Molton, A. Davison, and I. Reid. Locally planar patch features for real-time structure from motion. In Proc. 14th British Machine Vision Conference, London, UK, 2004.
[78] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit. FastSLAM: A factored solution to the simultaneous localization and mapping problem. In Proc. of the AAAI National Conference on Artificial Intelligence, Edmonton, Canada, 2002.
[79] H. Moravec. Towards automatic visual obstacle avoidance. In Proc. of the 5th International Joint Conference on Artificial Intelligence, page 584, 1977.
[80] H. Moravec. Obstacle avoidance and navigation in the real world by a seeing robot rover. In tech. report CMU-RI-TR-80-03, Robotics Institute, Carnegie Mellon University, September 1980. Available as Stanford AIM-340, CS-80-813 and republished as a Carnegie Mellon University Robotics Institute Technical Report to increase availability.
[81] H. Moravec and A. Elfes. High resolution maps from wide angle sonar. In Proc. IEEE International Conference on Intelligent Robots and Systems, pages 116–121, 1985.
[82] J. Neira, M. I. Ribeiro, and J. D. Tardos. Mobile robot localisation and map building using monocular vision. In International Symposium On Intelligent Robotics Systems, Stockholm, Sweden, 1997.
[83] D. Nister. An efficient solution to the five-point relative pose problem. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, Madison, Wisconsin, pages II: 195–202, 2003.
[84] D. Nister, O. Naroditsky, and J. Bergen. Visual odometry. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, Washington, DC, pages I: 652–659, 2004.
[85] S. Obdrzalek and J. Matas. Object recognition using local affine frames on distinguished regions. In Proc. 13th British Machine Vision Conference, Cardiff, UK, pages 113–122, 2002.
[86] C. Olson, L. Matthies, M. Schoppers, and M. Maimone. Robust stereo ego-motion for long distance navigation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, Hilton Head, South Carolina, pages II: 453–458, 2000.
[87] R. Perko. Computer Vision For Large Format Digital Aerial Cameras. PhD thesis, Graz University of Technology, 2004.
[88] M. Pollefeys, R. Koch, and L. Van Gool. Self calibration and metric reconstruction in spite of varying and unknown internal camera parameters. In Proc. 6th International Conference on Computer Vision, Bombay, India, pages 90–96, 1998.
[89] P. Pritchett and A. Zisserman. Wide baseline stereo matching. In Proc. 6th International Conference on Computer Vision, Bombay, India, pages 754–760, 1998.
[90] V. Ramachandran. In Spatial vision in humans and robotics, L. Harris, editor, Cambridge University Press, 1991.
[91] K. Rohr. Localization properties of direct corner detectors. Journal of Mathematical Imaging and Vision, 4:139–150, 1994.

[92] F. Schaffalitzky and A. Zisserman. Multi-view matching for unordered image sets, or ’How do I organize my holiday snaps?’. In Proc. 7th European Conference on Computer Vision, Copenhagen, Denmark, pages I: 414–431, 2002.
[93] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1-3):7–42, 2002.
[94] K. Schindler. Generalized use of homographies for piecewise planar reconstruction. In Proc. 13th Scandinavian Conference on Image Analysis, Göteborg, Sweden, pages 470–476, 2003.
[95] C. Schmid, R. Mohr, and C. Bauckhage. Comparing and evaluating interest points. In Proc. 6th International Conference on Computer Vision, Bombay, India, pages 230–235, 1998.
[96] S. Se, D. G. Lowe, and J. J. Little. Vision-based global localization and mapping for mobile robots. IEEE Transactions on Robotics, 21(3):364–375, 2005.
[97] R. Sedgewick. Algorithms. Addison-Wesley, 2nd edition, 1988.
[98] R. Sim and G. Dudek. Mobile robot localization from learned landmarks. In Proc. of the IEEE/RSJ Conference on Intelligent Robots and Systems, pages 1060–1065, Victoria, Canada, 1998.
[99] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In Proc. 9th IEEE International Conference on Computer Vision, Nice, France, pages 1470–1477, 2003.
[100] P. Smith, T. Drummond, and K. Roussopoulos. Computing map trajectories by representing, propagating and combining pdfs over groups. In Proc. 9th IEEE International Conference on Computer Vision, Nice, France, pages 1275–1282, 2003.
[101] M. Spetsakis and Y. Aloimonos. A multi-frame approach to visual motion perception. International Journal of Computer Vision, 6(3):245–255, 1991.
[102] J. Sun, Y. Li, S. Kang, and H. Shum. Symmetric stereo matching for occlusion handling. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, San Diego, California, pages II: 399–406, 2005.
[103] S. Thrun. Learning metric-topological maps for indoor mobile robot navigation. Artificial Intelligence, 99(1):21–71, 1998.
[104] S. Thrun, M. Bennewitz, W. Burgard, A. Cremers, F. Dellaert, D. Fox, D. Hähnel, C. Rosenberg, N. Roy, J. Schulte, and D. Schulz. MINERVA: A second generation mobile tour-guide robot. In Proc. IEEE International Conference on Robotics and Automation, Detroit, US, pages 1999–2005, 1999.
[105] S. Thrun, D. Hähnel, D. Ferguson, M. Montemerlo, R. Triebel, W. Burgard, C. Baker, Z. Omohundro, S. Thayer, and W. Whittaker. A system for volumetric robotic mapping of abandoned mines. In Proc. IEEE International Conference on Robotics and Automation, Taipei, Taiwan, pages 4270–4275, 2003.

[106] C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: A factorization method. International Journal of Computer Vision, 9(2):137–154, 1992.
[107] C. Tomasi and T. Kanade. Detection and tracking of point features. Technical Report CMU-CS-91-132, Carnegie Mellon University, 1991.
[108] P. Torr and A. Zisserman. Robust parameterization and computation of the trifocal tensor. In Proc. 7th British Machine Vision Conference, Edinburgh, UK, 1996.
[109] B. Triggs. Matching constraints and the joint image. In Proc. 5th International Conference on Computer Vision, Boston, Massachusetts, pages 338–343, 1995.
[110] T. Tsubouchi and S. Yuta. Map assisted vision system of mobile robots for reckoning in a building environment. In Proc. IEEE International Conference on Robotics and Automation, Raleigh, US, pages 1978–1984, 1987.
[111] T. Tuytelaars and L. Van Gool. Content-based image retrieval based on local affinely invariant regions. In Visual Information and Information Systems, pages 493–500, 1999.
[112] T. Tuytelaars and L. Van Gool. Matching widely separated views based on affine invariant regions. International Journal of Computer Vision, 1(59):61–85, 2004.
[113] T. Tuytelaars and L. Van Gool. Wide baseline stereo matching based on local, affinely invariant regions. In Proc. 11th British Machine Vision Conference, Bristol, UK, pages 412–422, 2000.
[114] M. Vincze, M. Ayromlou, C. Beltran, A. Gasteratos, S. Hoffgaard, O. Madsen, W. Ponweiser, and M. Zillich. A system to navigate a robot into a ship structure. Machine Vision and Applications, 14(1):15–25, 2003.
[115] E. W. Weisstein. Hessian. Eric Weisstein’s World of Mathematics. http://mathworld.wolfram.com/Hessian.html, 1999-2003.
[116] T. Werner and A. Zisserman. New techniques for automated architecture reconstruction from photographs. In Proc. 7th European Conference on Computer Vision, Copenhagen, Denmark, pages 541–555, 2002.
[117] A. Witkin. Scale-space filtering. In International Joint Conference on Artificial Intelligence, pages 1019–1022, 1983.
[118] D. C. Yuen and B. A. MacDonald. Considerations for the mobile robot implementation of panoramic stereo vision system with a single optical centre. In Proc. Image and Vision Computing New Zealand, Auckland, pages 335–340, 2002.
[119] A. Zisserman, T. Werner, and F. Schaffalitzky. Towards automated reconstruction of architectural scenes from multiple images. In Proc. 25th Workshop of the Austrian Association for Pattern Recognition, Berchtesgaden, Germany, pages 9–23, 2001.
