
Master's Thesis

Vision-based Trajectory Following for Boats and Ground Robots

Autumn Term 2012

Autonomous Systems Lab
Prof. Roland Siegwart

Author: Hendrik Erckens
Supervised by: Gregory Hitz, Dr. Cédric Pradalier


Abstract

This thesis addresses the problem of vision-based mobile robot navigation from an image memory of a previously driven path along which the robot was controlled by a human operator. The presented solution is based on Image Based Visual Servoing (IBVS), where the required velocity is calculated by using BRISK features to compare the current camera image to previously recorded snapshot images. Two approaches, called sparse and dense, are compared; they differ in the number of snapshots used to remember the path. In the dense version, the control commands given to the robot during path learning are also saved. These are used as a feed-forward control during autonomous playback, while errors are corrected by IBVS. A technique is presented that allows the robot to localize itself on the taught path. Additionally, the system is able to recover from a failed localization by reversing back to the starting position.

The system has been successfully implemented, and results of tests against ground truth on a non-holonomic, differential-drive ground robot are presented. Localization is shown to work even in situations where the current camera image is ambiguous. Additionally, the system is tested and evaluated for use on a robot boat.


Acknowledgment

I would like to thank Gregory Hitz for great supervision, valuable input, and help during testing; Cédric Pradalier for intense bursts of brilliant ideas; and Francis Colas for resurrecting BIBA after a flat mainboard battery. Many thanks to Gion-Andri Büsser and Jakob Buchheim for pointing out numerous mistakes and unclear passages.


Contents

Abstract

1 Introduction
1.1 Context
1.2 Objectives of this Thesis
1.3 Conventions Used in this Thesis
1.3.1 Coordinate Systems
1.3.2 General Methodology and Terminology
1.4 Problems to be Solved
1.5 Structure of this Report

2 Literature Review
2.1 Path Learning
2.2 Path Playback
2.2.1 Navigate to Next Node (Visual Homing)
2.2.2 Detect Arrival at Node

3 Software Structure
3.1 ROS
3.2 ROS Nodes
3.3 Path Representation Approaches
3.3.1 Sparse
3.3.2 Dense

4 Visual Homing
4.1 Theory
4.2 Implementation
4.2.1 Camera
4.2.2 Feature Detection and Matching
4.2.3 RANSAC
4.2.4 Image Jacobian
4.2.5 Coordinate Transformation

5 Supervisor
5.1 Path Learning
5.1.1 Sparse
5.1.2 Dense
5.2 Autonomous Path Playback
5.2.1 Sparse
5.2.2 Dense
5.3 Localization

6 Arbiter
6.1 Sparse
6.1.1 Y-Offset Correction
6.1.2 Filtering and Clipping
6.1.3 Correct Theta First
6.2 Dense
6.2.1 Sum of the Two Velocity Vectors
6.2.2 Y-Offset Correction

7 Experiments
7.1 BIBA Ground Robot
7.1.1 Path Tracking
7.1.2 Localization
7.2 Lizhbeth ASV

8 Conclusion and Outlook
8.1 Future Improvements

Bibliography

List of Figures


Symbols

x    x-axis
y    y-axis
z    z-axis
u    horizontal image axis
v    vertical image axis
θ    angle of rotation around the robot's z-axis
ν    velocity: ν = (v, ω) ∈ R^6
v    translational velocity: v = (vx, vy, vz) ∈ R^3
ω    rotational velocity: ω = (ωx, ωy, ωz) ∈ R^3
J_p  image Jacobian, or feature sensitivity matrix, of feature point p; J_p ∈ R^(2×6)
f    focal length of the camera
ρ    pixel size of the camera

Indices

x    x-axis
y    y-axis
z    z-axis
C    camera frame
R    robot frame
W    world frame

Acronyms and Abbreviations

ASV    Autonomous Surface Vessel
AUV    Autonomous Underwater Vehicle
BRISK  Binary Robust Invariant Scalable Keypoints
DID    Descent in Image Distance
ETH    Eidgenössische Technische Hochschule
GPS    Global Positioning System
IBVS   Image Based Visual Servoing
MFDID  Matched-Filter Descent in Image Distance
PBVS   Position Based Visual Servoing
ROS    Robot Operating System
SIFT   Scale Invariant Feature Transform
VSRR   View Sequenced Route Representation
ZNCC   Zero Mean Energy Normalized Cross Correlation


Chapter 1

Introduction

1.1 Context

Autonomous robots designed to operate in an outdoor environment typically use satellite-based GPS and a magnetic compass to localize themselves in the environment and to navigate to a given goal position and orientation. This works well in open spaces with good satellite visibility, but becomes increasingly inaccurate when the sky is occluded or when magnetic fields disturb the compass. Even under optimal conditions, this method of localization can be insufficient when great accuracy is required, for example when the robot has to maneuver around obstacles. GPS becomes completely unusable when the sky is entirely invisible, most notably in indoor or underwater environments.

To overcome this issue, many researchers are working on localization using other sensors such as cameras, laser range finders, or sonar. This localization can be done either in a metric Cartesian space or directly in the sensor space. In metric localization, the pose of the robot is described by no more than six coordinates: three for position and three for orientation. In sensor space, the robot's pose can be described by as many coordinates as there are sensor measurements available. For example, consider a robot that is equipped with only one sensor measuring the brightness of the ground under the robot. Then the sensor space is one-dimensional and the robot's pose can be described by this one measurement. Note that, as there are likely many points on the ground that have the same brightness, this localization can be ambiguous.

In addition to localization, navigation can also be done in sensor space. If the sensor to be used is a camera, the target pose and the robot's current pose can be described by camera images taken at the respective poses. Then, correlations between the target image and the current image can be found, and it can be calculated how the robot has to move to eventually reach the target. Consequently, for this kind of navigation, the robot does not necessarily need to know its pose with respect to the target pose in Cartesian coordinates. All the robot needs to know is how to move to make its camera image more similar to the image acquired at the target pose.


1.2 Objectives of this Thesis

This project aims to implement vision-based trajectory following. A robot should be able to record a path during a learning phase in which it is controlled by a human operator. It should then be able to navigate along the previously recorded path. All localization and navigation should be done in sensor space, using a camera as the primary sensor. Specifically, no metric localization is to be done.

As a first step, this method is to be implemented and tested on a differential drive ground robot operating in an indoor office environment.

As a second step, the suitability of this method for use on the limnological sampling boat Lizhbeth [1] should be determined. This autonomous surface vessel (ASV) is used for sampling the water in lakes. Under normal conditions, it can navigate via GPS. However, GPS is not accurate enough for Lizhbeth to navigate into the harbor autonomously. For this, a vision-based trajectory following method could be helpful.

Another motivation for this thesis is that, in the future, a similar approach could also be used for navigation of the AUV Noptilus (http://www.noptilus-fp7.eu). Instead of vision, sonar could be used as the main sensor. However, this is a long-term goal and not part of this thesis.

1.3 Conventions Used in this Thesis

1.3.1 Coordinate Systems

Figure 1.1 shows a top view of the robot with the position and orientation of the body coordinate frames of the robot (xR, yR, zR) and the camera (xC, yC, zC) with respect to the world frame.

1.3.2 General Methodology and Terminology

A long path can be subdivided into small portions. Each of these portions has its own goal, called a node. Consequently, the whole path can be completed by driving from one node to the next, eventually arriving at the last node. Thus, each node describes a pose (position and orientation) of the robot in the environment. As explained in section 1.1, this pose can be represented by a camera image taken at the respective pose. This camera image is called a snapshot or target image. Each node is associated with one snapshot. The process of autonomously guiding the robot towards a node by comparing the node's snapshot image with the current image is called visual homing. It is easy to see that successful visual homing is only possible if the robot is already in the vicinity of the respective snapshot pose, because at least some parts of the snapshot image have to be visible in the current view. The area around a node from where the node is reachable is called the node's catchment area. The size of the catchment area depends on the visual homing method used and also on environmental factors, such as lighting conditions and the number of distinct visual features that are detectable. The whole path, consisting of many nodes connected to each other by arcs, is referred to as a topological map. An arc between two nodes means that these nodes lie in each other's catchment area.

Figure 1.1: Coordinate frames of robot (light gray) and camera (dark gray).

1.4 Problems to be Solved

The whole problem stated in section 1.2 can be divided up into smaller sub-problems for which a solution has to be found.

Node Creation: The main problem in the path learning phase is to know when and how often a new node has to be created.

Visual Homing: During the playback phase, a method is needed that can drive the robot towards the next node.

Arrival Detection: When visual homing has brought the robot close to a node, the arrival has to be detected so that a transition to the next node can be made.

Localization: When playback is not started at the same pose that path learning was started at, the robot needs the ability to localize itself on the path.


1.5 Structure of this Report

Chapter 2 presents related work in the literature. While chapter 3 gives an overview of the whole software structure, chapters 4, 5 and 6 provide a more detailed description of the individual parts of the software. Tests and experiments of the proposed system are presented in chapter 7. Finally, a conclusion and suggestions for future improvements are given in chapter 8.


Chapter 2

Literature Review

At the beginning of this thesis, a literature review was conducted to get familiar with the topic and to gather ideas on how to approach the task. This chapter summarizes the results of that review. First, section 2.1 shows ways to approach the task of teaching the robot a path. Then, section 2.2 introduces some techniques which enable the robot to autonomously follow the previously taught path.

2.1 Path Learning

While there is extensive literature that precisely describes visual homing, details about path learning techniques are often only vaguely described. The decision on which path learning technique is appropriate depends on the method used for path playback and specifically on what kind of path representation is needed. For example, different visual homing methods have catchment areas of different sizes around each snapshot position. The most critical requirement for the path learning technique is that it has to guarantee that a new node lies well inside the catchment area of the previous node. So, a procedure is needed that decides at which point a new node has to be created.

Jones et al. constantly compare the robot's current view image to the last saved snapshot image using Zero Mean Energy Normalized Cross Correlation (ZNCC) [2]. A new node is created as soon as the ZNCC value drops below a certain threshold.

An interesting way to detect when a new node has to be created is introduced by Vardy et al. [3]: while learning the map, the robot constantly computes the homing angle to the last node. If this differs too much from the angle estimated by odometry, a new node is created.

A sophisticated approach has been presented by Motard et al. [4]. The authors introduce a method for incremental online topological map learning. The robot can explore the environment autonomously and build its own map, where each node can be connected to many other nodes, providing the possibility to close loops and to reach the destination in more than one way.


2.2 Path Playback

First, a technique is needed that guides the robot towards the next node in the topological map. This is described in section 2.2.1. Subsequently, when the robot gets close to this node's snapshot position, arrival has to be detected to initiate the topological transition to the following node. Approaches to this problem are shown in section 2.2.2.

2.2.1 Navigate to Next Node (Visual Homing)

Generally, procedures for navigating the robot to the next node can be divided into two categories: methods that require the extraction of features or the identification of landmarks, and methods that require neither [5].

Not Landmark-Based

Descent in Image Distance (DID): Zeil et al. showed that a snapshot position is defined by a clear, sharp minimum in a smooth three-dimensional image distance function, which is calculated from individual pixel intensity differences [5]. Thus, the snapshot position can be reached by a simple gradient descent method. However, to estimate the gradient of the image distance function at the robot's current position, the robot has to make exploratory movements to sample the function at different places around its current position. Möller gets around this problem by warping the current image as if the robot had made the exploratory movement [6]. This was originally introduced as Matched-Filter DID (MFDID) by Möller et al. [7].

ZNCC Peak: This is an appearance-based method, where a peak in the Zero Mean Energy Normalized Cross Correlation (ZNCC) function along the horizontal image axis defines the homing direction. Jones et al. use two divergent cameras to extract a forward velocity component alongside this directional information [2].

Landmark-Based

Homing in Scale Space is used by Churchill and Vardy in [8]. The authors make use of the scale information in the SIFT feature descriptor. Assuming a uniform distribution of keypoints in the image plane of a panoramic image, they deduce that the home direction is the center of the image region where the feature scale is greater in the snapshot view than in the current view.

Visual Servoing: A very detailed tutorial on visual servoing has been published by Corke in [9]. He differentiates between Position Based Visual Servoing (PBVS) and Image Based Visual Servoing (IBVS). In a PBVS system, the pose of the target with respect to the camera is estimated, assuming a geometrical model of the target is known. In an IBVS system, this pose estimation step is omitted and the whole control problem is formulated in the image coordinate space R^2. The camera's pose change required to get from the current position to the snapshot position is defined implicitly by the required change of feature locations in the image plane.

Cherubini and Chaumette present a visual path following technique using visual servoing, but they use only one dimension of the generally three-dimensional image Jacobian matrix used in IBVS to extract the homing direction to the next snapshot on the path [10]. Instead of the at least three matched features necessary to compute the full 6-dimensional linear and angular camera velocity required to move to the snapshot, only a single feature is needed. For this, the authors use the abscissa of the centroid of the features matched between snapshot and current view. In their approach, the robot's forward velocity is determined by a safety context value and the velocity of tracked features in the image plane between consecutive snapshots on the path. Thus, the robot slows down when the context is deemed unsafe, i.e. when the laser range finder detects an obstacle. Furthermore, the authors use an actuated camera to maintain scene visibility while avoiding the obstacles.

Metric Error Estimation: Ohno et al. use vertical lines as features [11]. Since they assume the motion constraints of a ground robot with a rigidly attached camera, it is sufficient to take into account only the horizontal position of these features in the image plane. They present an algorithm to estimate the x and y position errors as well as the θ orientation error with respect to the snapshot image. Blanc et al. similarly estimate the camera displacement by uncoupling the translation and rotation components of a homography matrix [12].

2.2.2 Detect Arrival at Node

When the visual homing process has successfully driven the robot to a position close to the current snapshot position, the system must detect arrival at the current node to make the topological transition to the next node. There are several ways to detect arrival, and the practicality of some of them depends on the visual homing method used. Vardy et al. describe four ways to detect arrival at a snapshot [3]:

1. If the distance between the nodes is known, odometry can be used to detect when the robot has exceeded the distance to the next node.

2. The image difference between the snapshot and the current image can be measured by computing the sum of squared errors (SSE). Arrival is declared when this value falls below a certain threshold.

3. The magnitude of the computed homing vector can be used as an arrival detection measure. Arrival can be declared when the homing vector's magnitude falls below a certain threshold. In [3], the authors use MFDID to compute their homing vector, because this is the method they use to drive the robot towards the next node. However, this criterion can obviously be used together with any other way of producing the homing vector.

4. Instead of only looking at the magnitude of the homing vector, this method looks at the time derivative of the sum of all magnitudes accumulated while the robot is driving towards the node. The sum of all magnitudes is considered as a function over time. After filtering this function with a length-3 triangle filter, the derivative is taken. Since the magnitude of the homing vector tends to zero, the sum stops increasing once the robot gets close to the snapshot position, so the curve flattens. Arrival is declared when the value of the derivative falls below a certain threshold.


Chapter 3

Software Structure

This chapter gives an overview of the software structure and how the individual processes work together. Chapters 4, 5 and 6 provide a more detailed discussion of the three main parts of the software.

3.1 ROS

The Robot Operating System (ROS, http://ros.org) has been developed and is actively maintained by Willow Garage (http://www.willowgarage.com). At its core, ROS provides a middleware that allows processes to communicate. In ROS terminology, the individual processes, called nodes, pass messages to one another via topics. One node publishes messages to a particular topic and another node can subscribe to this topic in order to receive the messages. Since messages are passed over a TCP network, nodes can communicate with each other even if they run on different machines. Apart from the middleware, ROS comes with many tools to help with the development of robotics software.

The software for this work has been written in C++ and Python. For all computer vision related parts of the software, the computer vision library OpenCV (http://opencv.willowgarage.com/wiki/) is used. OpenCV is integrated into ROS. The linear algebra library Eigen is used for most matrix and vector algebra.
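As an illustration of the publish/subscribe mechanism described above, the following minimal sketch shows a single ROS node that both publishes and subscribes to velocity messages. It is only a generic example: the node, topic and callback names are placeholders and not the actual topics used in this work.

    #include <ros/ros.h>
    #include <geometry_msgs/Twist.h>

    // Hypothetical subscriber callback: receives velocity messages on a topic.
    void velocityCallback(const geometry_msgs::Twist::ConstPtr& msg)
    {
      ROS_INFO("Received command: vx=%f, wz=%f", msg->linear.x, msg->angular.z);
    }

    int main(int argc, char** argv)
    {
      ros::init(argc, argv, "example_node");
      ros::NodeHandle nh;

      // Publish velocity commands on one (hypothetical) topic...
      ros::Publisher pub = nh.advertise<geometry_msgs::Twist>("cmd_vel", 10);
      // ...and subscribe to another topic carrying the same message type.
      ros::Subscriber sub = nh.subscribe("desired_velocity", 10, velocityCallback);

      ros::Rate rate(10);             // 10 Hz loop
      while (ros::ok()) {
        geometry_msgs::Twist cmd;     // zero-initialized velocity command
        pub.publish(cmd);
        ros::spinOnce();              // process incoming messages
        rate.sleep();
      }
      return 0;
    }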

3.2 ROS Nodes

The software has been divided up into three separate processes, or nodes:

Visual Homing is at the center of the whole structure. It takes the current view from the camera and a snapshot image. From this input it computes a desired velocity, which will move the robot closer to the snapshot position. Visual Homing outputs a velocity vector with full 6 degrees of freedom, making no assumption about the type of robot to be controlled.

Supervisor has the main purpose of building and maintaining the map that represents the taught path. It localizes the robot in the map and provides the Visual Homing node with the next snapshot that the robot should drive towards.

Arbiter is the process that actually gives commands to the robot's motor controller. It is called Arbiter because it decides which commands can be safely forwarded to the robot. Also, if commands from the joystick come in, it decides which process is allowed to command the robot. The Arbiter filters and clips the velocity commands and adapts them to the particular type of robot that is used.

3.3 Path Representation Approaches

Over the course of this work, two approaches have been developed and compared. They are called dense and sparse. The Visual Homing process is the same for both approaches. The difference between the dense and the sparse version is the way the path is learned and represented.

3.3.1 Sparse

Learning Phase

During the learning phase in the sparse version, the Supervisor tries to save only as many snapshot images as are really necessary to guarantee that each node lies in the catchment area of its preceding node, hence the name sparse. Figure 3.1 illustrates such a learned path with six snapshot images.

Figure 3.1: Illustration of the learned path in the sparse version.

When the learning phase is initiated, the Supervisor saves the current camera view as the first snapshot. Then, it immediately passes this snapshot image to the Visual Homing process. While the human operator steers the robot away from this snapshot via joystick control, Visual Homing constantly compares the current camera view to this snapshot, computes the desired velocity that would drive the robot back to the snapshot, and passes it back to the Supervisor. If the magnitude of the desired velocity vector rises above a certain threshold, the Supervisor saves a new snapshot. This new snapshot is then passed to Visual Homing, and so on. Figure 3.2a shows the data flow between the processes during the learning phase.

Playback Phase

During the playback phase, the Supervisor passes the first snapshot of the path to Visual Homing. Comparing this snapshot to the current image from the camera, Visual Homing computes the desired velocity required to guide the robot to the snapshot pose. The Arbiter filters the desired velocity to avoid jerky behavior, accounts for the non-holonomic properties of the particular robot and passes a velocity command to the robot's motor controllers. Detection of arrival is done as presented by Vardy et al. [3]. While the robot is driving towards the snapshot, the Supervisor monitors the desired velocity produced by Visual Homing. If the magnitude of the desired velocity vector falls below a certain threshold, the Supervisor assumes that the robot is close to the snapshot, transitions to the next node on the path and publishes that node's snapshot image to Visual Homing. This way, the robot drives from node to node until it has completed the whole path. Figure 3.2b shows the data flow between the processes during the playback phase.

Figure 3.2: Data flow between the ROS nodes in the sparse version: (a) learning, (b) playback.

3.3.2 Dense

The dense version has been inspired by the notion of a Sensori-Motor Trajectory in the work of Pradalier et al. [13].

Learning Phase

In contrast to the path learning technique in the sparse version, the Supervisor in the dense version saves as many snapshots as it can get from the camera. The result is a dense array of nodes, as illustrated in figure 3.3. In addition to the snapshot images, the Supervisor also saves the velocity command that was given to the robot by the human operator during the learning phase. Thus, each node is associated with a snapshot image and a velocity command. The Visual Homing process is idle during this phase. Figure 3.4a shows the data flow between the processes during the learning phase.

Figure 3.3: Illustration of the learned path in the dense version.

Playback Phase

During the playback phase, the Supervisor passes a snapshot image to Visual Homing and the corresponding recorded velocity command to the Arbiter. Visual Homing again computes a desired velocity that guides the robot towards the snapshot. The Arbiter then passes a superposition of the recorded velocity command and Visual Homing's desired velocity to the robot. The Supervisor advances through the path nodes at the same speed at which they were saved during the learning phase. The idea is that by passing the same velocity commands to the robot as during the learning phase, the robot should already drive a similar trajectory. If it happens to be perfectly on the taught path, then Visual Homing will produce a desired velocity close to zero. When outside disturbances cause the robot to deviate from the taught path, Visual Homing will compute a desired velocity that corrects the robot's deviation from the path. The data flow is illustrated in figure 3.4b.

Figure 3.4: Data flow between the ROS nodes in the dense version: (a) learning, (b) playback.


22 3.3. Path Representation Approaches


Chapter 4

Visual Homing

The Visual Homing process compares the current camera image to the snapshot image. From this it computes a desired velocity that guides the robot towards the pose at which the snapshot image was acquired. To do this, it uses Image Based Visual Servoing (IBVS).

Section 4.1 describes the theory behind IBVS. The C++ implementation developed in this work is described in section 4.2.

4.1 Theory

For the purpose of introducing the IBVS algorithm, the control problem used in this section is defined as finding a velocity that drives the camera to a target pose. Under the assumption that the camera is rigidly attached to the robot, the transition to the problem of driving the robot to a target pose is a coordinate transformation and is covered in section 4.2.5.

A thorough tutorial on IBVS and a more detailed derivation is given by Corke in [9]. Most of the explanation and notation in this section follows the derivations in Corke's book.

In IBVS, the control problem of driving the camera to a target pose in world coordinate space (R^6 for position and orientation) is reformulated completely in the image coordinate space R^2. Instead of estimating the camera's pose with respect to the target and controlling the pose change directly, the desired camera pose is defined implicitly by the image feature positions at this desired camera pose. The control problem can therefore be reformulated as finding a camera velocity that moves the feature points to desired positions in the image plane.

Consider a camera with intrinsic properties of focal length f, pixel width ρu, pixel height ρv and principal point (u0, v0). The camera moves with a velocity ν = (v, ω) = (vx, vy, vz, ωx, ωy, ωz) and observes a world point P with coordinates P = (X, Y, Z) relative to the camera. P is projected onto the image plane at pixel coordinates p = (u, v) and can be considered as an image feature. Corke shows in [9] that, with ū = u − u0 and v̄ = v − v0, the feature velocity in the image plane can be written in terms of pixel coordinates with respect to the principal point as

$$
\dot{p} = \begin{pmatrix} \dot{u} \\ \dot{v} \end{pmatrix} =
\underbrace{\begin{pmatrix}
-\dfrac{f}{\rho_u Z} & 0 & \dfrac{\bar{u}}{Z} & \dfrac{\rho_u \bar{u}\bar{v}}{f} & -\dfrac{f^2 + \rho_u^2 \bar{u}^2}{\rho_u f} & \bar{v} \\[2mm]
0 & -\dfrac{f}{\rho_v Z} & \dfrac{\bar{v}}{Z} & \dfrac{f^2 + \rho_v^2 \bar{v}^2}{\rho_v f} & -\dfrac{\rho_v \bar{u}\bar{v}}{f} & -\bar{u}
\end{pmatrix}}_{J_p}
\begin{pmatrix} v_x \\ v_y \\ v_z \\ \omega_x \\ \omega_y \\ \omega_z \end{pmatrix}
\qquad (4.1)
$$

where J_p is the image Jacobian matrix for a point feature, sometimes also called the feature sensitivity matrix. The image Jacobian matrix essentially indicates how a feature point (u, v) would move in the image plane if the camera moves with velocity ν. The image Jacobian can be calculated for any pixel in the image plane, with the only assumption being that the depth Z from the camera to the world point P is known. See section 4.2.1 for details on the depth estimation.

For more than one feature point, the Jacobians can be stacked on top of each other. In particular, for the case of three points,

$$
\begin{pmatrix} \dot{u}_1 \\ \dot{v}_1 \\ \dot{u}_2 \\ \dot{v}_2 \\ \dot{u}_3 \\ \dot{v}_3 \end{pmatrix} =
\begin{pmatrix} J_{p1} \\ J_{p2} \\ J_{p3} \end{pmatrix} \nu
\qquad (4.2)
$$

the stacked Jacobian matrix is of dimension 6 × 6 and is non-singular as long as the three points are neither coincident nor collinear. Thus, to solve the control problem of driving the camera towards the target pose, equation 4.2 can be inverted,

$$
\nu = \begin{pmatrix} J_{p1} \\ J_{p2} \\ J_{p3} \end{pmatrix}^{-1}
\begin{pmatrix} \dot{u}_1 \\ \dot{v}_1 \\ \dot{u}_2 \\ \dot{v}_2 \\ \dot{u}_3 \\ \dot{v}_3 \end{pmatrix}
\qquad (4.3)
$$

to solve for the desired camera velocity given the feature velocities of three points. To determine suitable feature velocities ṗ* from the known desired feature positions p* in the snapshot image and the actual feature positions p in the current view image, a simple proportional gain controller

$$
\dot{p}^* = \lambda (p^* - p) \qquad (4.4)
$$

can be used, where λ is the proportional gain factor. Combined with equation 4.3, the control law can be written as

$$
\nu = \lambda \begin{pmatrix} J_{p1} \\ J_{p2} \\ J_{p3} \end{pmatrix}^{-1} (p^* - p)
\qquad (4.5)
$$

This controller will drive the camera in such a way that the feature points move to their desired positions in the image plane.

For the general case of more than three features, all Jacobians can be stacked and the camera velocity can be calculated by taking the pseudo-inverse,

$$
\nu = \lambda \begin{pmatrix} J_{p1} \\ \vdots \\ J_{pN} \end{pmatrix}^{+} (p^* - p)
\qquad (4.6)
$$

which minimizes the position error of all feature points.

4.2 Implementation

Figure 4.1 gives an overview of the data flow inside the Visual Homing process. First, features are detected and feature descriptors are extracted from the current camera image. The process then tries to match these features with features in the snapshot image coming from the Supervisor. More details on feature detection and matching are given in section 4.2.2. The process can only continue if more than three matches are found. If this is the case, a RANSAC algorithm is used to get rid of any wrong matches that would result in a false velocity computation. The RANSAC algorithm is described in section 4.2.3. With the remaining matched features, the stacked image Jacobian is composed and the feature position errors are computed. The Jacobian is then inverted to compute the desired camera velocity (section 4.2.4). After a coordinate transformation from the camera frame to the robot frame (section 4.2.5), the desired velocity is returned.

4.2.1 Camera

In section 4.1 it was noted that the feature depth Z is needed to compute the image Jacobian. It has been shown by some researchers that visual servoing is remarkably tolerant of errors in depth in many real-world applications [9, 10]. Most of these researchers suggest setting Z to a constant value tuned by the user depending on the workspace configuration. However, results from experiments conducted during this thesis suggest that the depth is a crucial factor that significantly affects performance, especially in outdoor environments with great variation in depth (see chapter 7).

For this work, the Microsoft Kinect camera was used. It was chosen because, in addition to the camera image, it also provides a depth estimate, and because of its good integration in the ROS framework.

Limitations

The Kinect sensor is a cheap and easy solution to get a camera image and a depth estimate for each pixel in the image. However, there are a few limitations:

Depth: The depth estimation only works in a range of up to a few meters. Anything farther away has an undefined depth value. This limited range is reasonable for a typical indoor environment, but not very helpful in most outdoor environments. The constant depth value that has to be assumed for any point for which the Kinect cannot estimate a depth can significantly affect performance.

Automatic Exposure: Like most consumer-grade webcams, the Kinect's RGB camera has a built-in automatic exposure time adjustment. In very bright scenes, the camera automatically reduces the exposure time. This can result in degraded matching performance, because features may have a significantly different appearance.

Figure 4.1: Data flow in the Visual Homing process.

Figure 4.2: The Microsoft Kinect sensor.

4.2.2 Feature Detection and Matching

There is currently a great effort in the research community to find faster and more reliable algorithms for feature detection and for the extraction of a descriptor for the detected features. Notable developments are the Scale Invariant Feature Transform (SIFT) by Lowe [14] and Speeded Up Robust Features (SURF) by Bay et al. [15]. A more recent development is the Binary Robust Invariant Scalable Keypoints algorithm (BRISK) by Leutenegger et al. [16]. In this work, BRISK is used because it promises better computational efficiency than SIFT and SURF while still being comparable in terms of matching performance [16]. The reference C++ implementation (http://www.asl.ethz.ch/people/lestefan/personal/BRISK) that Leutenegger et al. published alongside their paper was used here.
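To make the detection and matching step concrete, the following sketch shows BRISK detection and brute-force Hamming matching using the cv::BRISK class that later became part of OpenCV. This work used the authors' reference implementation, so the exact API shown here is an illustrative assumption, not the code used in the thesis.

    #include <opencv2/core.hpp>
    #include <opencv2/features2d.hpp>
    #include <vector>

    // Detect BRISK keypoints in the current view and the snapshot and match
    // their binary descriptors with a brute-force Hamming matcher.
    std::vector<cv::DMatch> matchBrisk(const cv::Mat& current, const cv::Mat& snapshot)
    {
      cv::Ptr<cv::BRISK> brisk = cv::BRISK::create();

      std::vector<cv::KeyPoint> kpCur, kpSnap;
      cv::Mat descCur, descSnap;
      brisk->detectAndCompute(current,  cv::noArray(), kpCur,  descCur);
      brisk->detectAndCompute(snapshot, cv::noArray(), kpSnap, descSnap);

      // Binary descriptors are compared with the Hamming distance;
      // cross-checking discards asymmetric matches.
      cv::BFMatcher matcher(cv::NORM_HAMMING, /*crossCheck=*/true);
      std::vector<cv::DMatch> matches;
      matcher.match(descCur, descSnap, matches);
      return matches;
    }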

4.2.3 RANSAC

Sometimes the feature matching process fails and matches two features that only have a similar appearance but do not represent the same point in the environment. Such a wrong match leads to a wrong desired velocity calculated by the visual servo algorithm. The effect gets worse if there are only a few other correctly matched features, or if the feature position error for the falsely matched feature is large compared to the correctly matched features. For this reason it is desirable to detect and ignore these incorrect matches. One way to do this is the Random Sample Consensus algorithm (RANSAC) [17]. RANSAC can be used to estimate the parameters of a mathematical model from a set of observed data which contains outliers. In this case, the mathematical model that is to be found is simply the desired velocity produced by the visual servo algorithm, the observed data are the matched feature points between current view and snapshot, and the outliers are incorrect matches.
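The following outline sketches the generic RANSAC loop in this setting, where the fitted model is a candidate velocity and the data are feature matches. The helper functions, sample size, threshold and iteration count are all illustrative placeholders, not the parameters used in this work.

    #include <cstdlib>
    #include <functional>
    #include <vector>

    // Generic RANSAC outline. computeVelocity fits a candidate velocity to a
    // minimal sample of matches; residual measures how well a single match
    // agrees with that velocity. Both are hypothetical helpers.
    template <typename Match, typename Velocity>
    std::vector<Match> ransacInliers(
        const std::vector<Match>& matches,
        const std::function<Velocity(const std::vector<Match>&)>& computeVelocity,
        const std::function<double(const Velocity&, const Match&)>& residual,
        int iterations = 100, double threshold = 2.0, std::size_t sampleSize = 3)
    {
      std::vector<Match> best;
      if (matches.size() < sampleSize) return best;

      for (int it = 0; it < iterations; ++it) {
        // 1. Draw a random minimal sample of matches.
        std::vector<Match> sample;
        for (std::size_t i = 0; i < sampleSize; ++i)
          sample.push_back(matches[std::rand() % matches.size()]);

        // 2. Fit the model (here: a candidate desired velocity) to the sample.
        const Velocity v = computeVelocity(sample);

        // 3. Collect all matches that agree with this model (the consensus set).
        std::vector<Match> inliers;
        for (const Match& m : matches)
          if (residual(v, m) < threshold)
            inliers.push_back(m);

        // 4. Keep the largest consensus set found so far.
        if (inliers.size() > best.size())
          best = inliers;
      }
      return best;
    }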

4.2.4 Image Jacobian

The image Jacobian is computed for all matched feature points as described in section 4.1 and then stacked to get a combined image Jacobian of dimension 2N × 6, where N is the number of matched features. The desired velocity can then be computed by taking the pseudo-inverse of the stacked Jacobian as in equation 4.6. The pseudo-inverse is computed via singular value decomposition using

    jacobiSvd(Eigen::ComputeThinU | Eigen::ComputeThinV).solve()

where jacobiSvd() is a member of the Eigen matrix classes and solve() is called on the returned decomposition.
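A minimal Eigen sketch of this step is given below: the per-feature Jacobian rows follow equation 4.1, the rows are stacked into a 2N × 6 matrix, and the desired velocity is obtained via the SVD-based pseudo-inverse as in equation 4.6. The data structure, parameter names and gain are placeholders, not the thesis' actual code.

    #include <Eigen/Dense>
    #include <vector>

    // One matched feature: current pixel position, desired (snapshot) position
    // and estimated depth Z. Names are illustrative only.
    struct Feature { double u, v, uStar, vStar, Z; };

    // Desired camera velocity nu = lambda * J^+ * (p* - p), cf. equation (4.6).
    Eigen::Matrix<double, 6, 1> ibvsVelocity(const std::vector<Feature>& feats,
                                             double f, double rho_u, double rho_v,
                                             double u0, double v0, double lambda)
    {
      const int N = static_cast<int>(feats.size());
      Eigen::MatrixXd J(2 * N, 6);     // stacked image Jacobian, 2N x 6
      Eigen::VectorXd e(2 * N);        // stacked feature position error p* - p

      for (int i = 0; i < N; ++i) {
        const double ub = feats[i].u - u0;   // \bar{u}
        const double vb = feats[i].v - v0;   // \bar{v}
        const double Z  = feats[i].Z;

        // Row pair of equation (4.1) for this feature.
        J.row(2 * i)     << -f / (rho_u * Z), 0.0, ub / Z,
                             rho_u * ub * vb / f,
                            -(f * f + rho_u * rho_u * ub * ub) / (rho_u * f), vb;
        J.row(2 * i + 1) << 0.0, -f / (rho_v * Z), vb / Z,
                             (f * f + rho_v * rho_v * vb * vb) / (rho_v * f),
                            -rho_v * ub * vb / f, -ub;

        e(2 * i)     = feats[i].uStar - feats[i].u;
        e(2 * i + 1) = feats[i].vStar - feats[i].v;
      }

      // Least-squares solution via SVD, equivalent to the pseudo-inverse in (4.6).
      return lambda * J.jacobiSvd(Eigen::ComputeThinU | Eigen::ComputeThinV).solve(e);
    }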


4.2.5 Coordinate Transformation

By default, the visual servo algorithm as written in section 4.1 calculates the desired velocity CνC of the camera, expressed in the camera's coordinate frame. This has to be transformed into the desired velocity RνR of the robot, expressed in the robot frame, which can be done under the assumption that the camera is rigidly attached to the robot. Figure 1.1 shows the coordinate frames of camera and robot. The rotation matrix that transforms vectors expressed in the camera frame into vectors in the robot frame is denoted as ARC. The translational velocity v and the rotational velocity ω are transformed separately. First, both are transformed from the camera frame into the robot frame:

$$
{}^{R}v_C = A_{RC}\,{}^{C}v_C \qquad (4.7)
$$

and

$$
{}^{R}\omega_C = A_{RC}\,{}^{C}\omega_C. \qquad (4.8)
$$

Now, since the rotational velocity is the same in all points of a rigid body, the rotational velocity of the robot is the same as the rotational velocity of the camera:

$$
{}^{R}\omega_R = {}^{R}\omega_C \qquad (4.9)
$$

The translational velocity of the robot can be computed using the rigid body equation for velocities,

$$
{}^{R}v_R = {}^{R}v_C + {}^{R}\omega_C \times {}^{R}r_{RC} \qquad (4.10)
$$

where {}^{R}r_{RC} is the vector from the origin of the robot frame R to the origin of the camera frame C, expressed in the robot frame. Finally, the desired velocity of the robot expressed in the robot frame is

$$
{}^{R}\nu_R = \begin{pmatrix} {}^{R}v_R \\ {}^{R}\omega_R \end{pmatrix}. \qquad (4.11)
$$

RωR


Chapter 5

Supervisor

The Supervisor builds and maintains the topological map that represents the learned path. During the learning phase, it decides when to create a new topological node in the map. This is further described in section 5.1. During autonomous playback, the Supervisor passes snapshot images to the Visual Homing process. It has to decide when a snapshot image has been successfully reached and transition to the next snapshot in line. The behavior during playback is described in section 5.2. Finally, when playback is first initiated, the Supervisor has the ability to localize the robot on the path to determine at which snapshot it should start the playback (see section 5.3).

Instead of saving the full images, the Supervisor internally stores the snapshot images as feature vectors. A feature vector describes one image and contains the descriptors of all BRISK features detected in the image. This is done both to decrease the memory needed to store the map and to save the computation time of detecting and extracting the features from the same image multiple times.
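A possible in-memory layout for such a node is sketched below; the type and field names are illustrative and not the thesis' actual data structures. The velocity command and timestamp fields are only used by the dense version described in sections 5.1.2 and 5.2.2.

    #include <opencv2/core.hpp>
    #include <vector>

    // Illustrative layout of one map node: the snapshot is kept as keypoints
    // plus BRISK descriptors rather than the full image.
    struct MapNode {
      std::vector<cv::KeyPoint> keypoints;   // feature positions in the snapshot
      cv::Mat descriptors;                   // one BRISK descriptor per keypoint
      double vx = 0.0, wz = 0.0;             // recorded command (dense version only)
      double stamp = 0.0;                    // acquisition time in seconds (dense version only)
    };

    using TopologicalMap = std::vector<MapNode>;   // nodes in path order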

5.1 Path Learning

5.1.1 Sparse

In the sparse version, the Supervisor tries to save as few snapshots as possible while still being able to guarantee that the snapshots are reachable from one another, i.e. that they lie in each other's catchment area. When the human operator starts driving the robot along the path, the Supervisor saves the camera image as the first snapshot. It then passes this first snapshot image to the Visual Homing process. Now, while the robot is driven away from the first snapshot, the magnitude of the homing vector produced by the Visual Homing process steadily increases. Once it exceeds a certain threshold, the Supervisor saves a new snapshot image and the process repeats itself.
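A minimal sketch of this decision rule is given below; the function name and threshold value are illustrative only.

    #include <Eigen/Dense>

    // Sparse node-creation rule: while the operator drives the robot, a new
    // snapshot is saved as soon as the homing vector back to the last saved
    // snapshot exceeds a threshold.
    bool shouldCreateNewNode(const Eigen::Matrix<double, 6, 1>& homingVector,
                             double creationThreshold)
    {
      return homingVector.norm() > creationThreshold;
    }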

5.1.2 Dense

In contrast to the sparse version, the Supervisor in the dense version saves a lot more data in the map. It saves as many snapshots as technically possible for the camera, which is typically around 10 Hz. (While the Kinect camera could provide 640 × 480 images at a frequency of 30 Hz, extra time is needed to detect and extract the BRISK features from the camera images; the achievable frequency therefore depends on the number of features that are detected.) Along with these snapshots it also saves the velocity commands given by the human operator. Additionally, a timestamp is saved for each snapshot, so that during playback the Supervisor knows when to publish the next snapshot.

5.2 Autonomous Path Playback

5.2.1 Sparse

The trickiest part of playback in the sparse version is to know when to make the transition to the next snapshot. When playback is first initiated, the Supervisor passes the first snapshot image to Visual Homing, and the robot starts moving towards this snapshot. As in path learning, the Supervisor looks at the magnitude of the homing vector, or desired velocity, produced by the Visual Homing process. This magnitude decreases as the robot gets closer to the snapshot. The next snapshot is published as soon as the magnitude falls below a certain threshold.

5.2.2 Dense

During playback in the dense version, the Supervisor publishes the snapshots at the same rate as they were acquired, much like a video player playing back a movie. The recorded velocity commands that were stored alongside the snapshot images are passed to the Arbiter, where they are combined with the output from Visual Homing (see section 6.2). Thus, the recorded control inputs are used as a feed-forward control. On top of that, Visual Homing provides a closed-loop control to correct small errors. That means that if the robot is too slow, the velocity is increased so it can catch up to the desired position on the path. If it is too fast, the velocity is decreased.

A problem arises if, for any reason, the robot lags behind too much. Since the robot's speed is limited, it might not be able to catch up. Even worse, if the system fails to find any correspondences between the current view and the snapshot image, it will not be able to correct the error at all. The Supervisor detects this risk by, again, looking at the magnitude of the homing vector produced by Visual Homing. If and only if the magnitude rises above a certain threshold and the homing vector points in the same direction as the recorded velocity command, the Supervisor pauses the playback, i.e. it delays the transition to the next snapshot. This is much like hitting the pause button on the video player: the image, in this case the snapshot image, is still there; only the next image in line is not shown. Thus, the robot still drives with the same recorded velocity command associated with this snapshot, and Visual Homing still tries to reduce the error. Once the magnitude of the homing vector falls below the threshold, path progression is continued.
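The pause rule can be summarized in a few lines. The following sketch uses a dot product as the "same direction" test, which is an assumption about how that check could be implemented; the threshold is a placeholder.

    #include <Eigen/Dense>

    // Dense-playback pause rule: path progression is paused only if the homing
    // vector is large AND points in the same direction as the recorded command
    // (i.e. the robot is lagging behind, not running ahead).
    bool shouldPausePlayback(const Eigen::Matrix<double, 6, 1>& homingVector,
                             const Eigen::Matrix<double, 6, 1>& recordedCommand,
                             double pauseThreshold)
    {
      const bool largeError    = homingVector.norm() > pauseThreshold;
      const bool sameDirection = homingVector.dot(recordedCommand) > 0.0;
      return largeError && sameDirection;
    }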

5.3 Localization

So far, the Supervisor has assumed that when path playback is first initiated, the robot starts at the same place where path learning was started. Localization functionality was added to the Supervisor to be able to put the robot at any position on the path and let it go from there. When playback is initiated, the Supervisor first tries to find the most likely position of the robot on the path and then starts playback from this position. For this, the current camera image is compared to each snapshot of the learned path. This is done by passing one snapshot image after the other to the Visual Homing process. For each of them, Visual Homing produces a homing vector. The snapshot for which it produces the homing vector with the smallest non-zero magnitude is assumed to be the most likely position of the robot. Figure 5.1 shows the magnitudes of all homing vectors, with a clear minimum that represents the most likely position of the robot.

Figure 5.1: Magnitude of the homing vectors from the current position to each snapshot on the path, with one distinct minimum.
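A minimal sketch of this selection step is shown below; it assumes that a zero magnitude indicates a failed comparison (for example, no matches found) and therefore skips such snapshots.

    #include <cstddef>
    #include <vector>

    // Given the homing-vector magnitudes computed against every snapshot on
    // the path, pick the snapshot with the smallest non-zero magnitude as the
    // most likely position (cf. figure 5.1).
    std::size_t localize(const std::vector<double>& homingMagnitudes)
    {
      std::size_t best = 0;
      double bestValue = -1.0;
      for (std::size_t i = 0; i < homingMagnitudes.size(); ++i) {
        const double m = homingMagnitudes[i];
        if (m <= 0.0) continue;                    // skip failed comparisons
        if (bestValue < 0.0 || m < bestValue) {
          bestValue = m;
          best = i;
        }
      }
      return best;
    }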

One could think of a situation where two locations on the path have a similar appearance. If the robot is standing at one of these locations, there will be two local minima, as seen in figure 5.2.

Figure 5.2: Magnitude of the homing vectors from the current position to each snapshot on the path, with two minima.

If the Supervisor ends up picking the wrong one of the two minima, the robot might cause a collision. To avoid this, a double check has been implemented. During the initial localization, the two most likely positions are saved. Then, it is first assumed that the robot is at the more likely of the two positions. From there, the Supervisor starts playing back the path. After a certain number of snapshots the robot is stopped and localization is done again.

If the most likely position in this second localization points to approximately the expected position, then the first localization is assumed to be correct and playback is restarted. If, however, the second localization points to an unexpected location, the Supervisor makes the robot reverse back to the original starting position and starts playback from the second most likely position.

To be able to reverse back from the double-check location to the original starting position, two things have to be done:

1. The path from the original starting position up to the point where the double check takes place has to be saved. Otherwise, the robot would not know how to get back. This is done by the same path learning technique as during the regular path learning. For the dense version that means also recording the velocity commands given to the robot, this time given not by the human operator but by the Arbiter.

2. The Supervisor has to play the path backwards, so the order of snapshots is reversed. Furthermore, in the dense version, the recorded velocity commands have to be inverted by changing the sign of each component of the velocity vector.


Chapter 6

Arbiter

The Arbiter process got its name because it has the power to decide which desired velocity is allowed to be sent to the robot at a given time. For example, in path learning mode, the joystick's command is sent to the robot, while in playback mode, the commands from the Visual Homing process are forwarded. However, before sending the velocity commands on to the robot, the Arbiter also adapts the velocity command to the particular robot. In the following sections, only the procedure during playback is shown, because the adaptation of the joystick's command during path learning is trivial.

As is the case for the Supervisor, there are again two versions of the Arbiter, a sparse and a dense one. These are detailed in sections 6.1 and 6.2, respectively.
6.1 Sparse<br />

Figure 6.1a shows the data flow of the Arbiter in the sparse version. The<br />

individual steps are described in the following sections.<br />

6.1.1 Y-Offset Correction<br />

The desired velocity, or homing vector, produced by the Visual Homing process<br />

has full six degrees of freedom. Since the two kinds of robots considered<br />

in this work – a differential drive ground robot and a differential drive boat<br />

– cannot control all of these degrees of freedom, the dimensionality has to<br />

be reduced. Only vx (translation in x direction) and ωz (rotation around<br />

the z axis) can be controlled directly (see figure 1.1). This is the velocity<br />

command that can be sent to the robot. Additionally, vy (translation in y<br />

direction) can be controlled by a combination of vx and ωz.<br />

When the robot faces in a direction parallel to the path from the last to<br />

the next node, vy indicates the robot’s offset in y-direction from the path.<br />

To reduce this y-offset, the robot’s heading θ must be changed towards the<br />

desired path. However, since Visual Homing tries to control θ back to the<br />

value it thinks is right, a correction value cθ can be added to the desired<br />

33


34 6.1. Sparse<br />

(a) Sparse (b) Dense<br />

Figure 6.1: Data flow in the Arbiter process in the sparse and the dense<br />

version.


Chapter 6. Arbiter 35<br />

velocity ωz such that Visual Homing and cθ counteract each other. This<br />

can be done by the following algorithm:<br />

cθ ← 0<br />

while (vx, vy, ωz) ← new desired velocity do<br />

cθ ← clip (cθ + ki ∗ clip(vy, k1), k2)<br />

ωz ← ωz + cθ ∗ clip(k3 ∗ vx, k4)<br />

end while<br />

The constants ki, k1, k2, k3 and k4 are tuning parameters and clip(x, xmax)<br />

is a function that limits the absolute value of x to xmax:<br />

function clip(x, xmax)<br />

if x > xmax then<br />

return xmax<br />

else if x < −xmax then<br />

return −xmax<br />

else<br />

return x<br />

end if<br />

end function<br />

Two things should be noted here:<br />

1. cθ is an integral, because it keeps it’s value between iterations of the<br />

while-loop. This is necessary, because the vy value produced by Visual<br />

Homing decreases once the robot turns towards the path. ωz does not<br />

keep its value between iterations, but is recalculated every iteration<br />

by Visual Homing.<br />

2. In the update step of ωz, the correction value cθ is scaled by vx. This<br />

ensures that, when the robot gets close to the snapshot image, the<br />

influence of cθ decreases so that it lets Visual Homing reduce the θ<br />

error. This is necessary for the Supervisor to be able to detect arrival<br />

at the snapshot.<br />
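For completeness, the following is a straightforward C++ rendering of the pseudocode above; the tuning constants keep their roles, but their values are left as placeholders.

    #include <algorithm>

    // Limits the absolute value of x to xmax, as in the clip() pseudocode.
    double clip(double x, double xmax)
    {
      return std::max(-xmax, std::min(x, xmax));
    }

    // Sparse y-offset correction: c_theta is integrated across iterations and
    // added to the commanded omega_z, scaled by the forward velocity.
    struct YOffsetCorrector {
      double ki, k1, k2, k3, k4;   // tuning parameters (placeholder values)
      double c_theta = 0.0;        // integrated correction value

      // Called for every new desired velocity; returns the corrected omega_z.
      double correct(double vx, double vy, double wz)
      {
        c_theta = clip(c_theta + ki * clip(vy, k1), k2);
        return wz + c_theta * clip(k3 * vx, k4);
      }
    };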

6.1.2 Filtering and Clipping

The desired velocity produced by Visual Homing can be very noisy and, when some matches are wrong, can have very large spikes. To make the robot drive more smoothly, the rate of change of the individual components of the velocity vector is limited. Furthermore, to keep the robot safe, the absolute value of each velocity component is limited by the clip() function.

6.1.3 Correct Theta First

While the Visual Servo algorithm guarantees that the camera eventually reaches the target pose, it does not guarantee to get there via the shortest path. This is most noticeable when only an error in θ is present. Instead of just giving an ωz command, visual servoing also produces a forward or backward velocity. This problem is known in the literature as the camera retreat problem. To get around this problem, the Arbiter ensures that any error in θ is controlled first, so the forward velocity vx is suppressed when ωz is large. This is done by updating vx with

vx ← vx ∗ cvx(ωz)

The coefficient function cvx(ωz) is described by

cvx(ωz) = 1 / (1 + exp(k5 · (|ωz| − k6)))    (6.1)

where k5 and k6 are tuning parameters that determine up to which error in θ the robot is still allowed to drive in x direction. The resulting function is depicted in figure 6.2.

Figure 6.2: The coefficient function cvx(ωz) that suppresses forward velocity when a θ error is present. Here, k5 = 30 and k6 = 0.45.
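Equation 6.1 translates directly into code. The following short Python sketch uses the parameter values shown in figure 6.2; the example velocities are invented for illustration.

import math

def c_vx(wz, k5=30.0, k6=0.45):
    # Equation 6.1: close to 1 for small |wz| and close to 0 once |wz|
    # exceeds roughly k6 (parameter values taken from figure 6.2).
    return 1.0 / (1.0 + math.exp(k5 * (abs(wz) - k6)))

# Suppress forward motion while a large heading error is being corrected.
vx, wz = 0.3, 0.6
vx *= c_vx(wz)  # vx is now strongly attenuated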

6.2 Dense

Figure 6.1b shows the data flow of the Arbiter in the dense version. Most steps in the dense version are the same as in the sparse version. The steps that are different are described in the following sections.

6.2.1 Sum of the Two Velocity Vectors

In addition to the desired velocity input from the Visual Homing process, the dense Arbiter also receives the recorded velocity commands from the Supervisor. These two velocities have to be fused in order to obtain a single velocity command that can be passed on to the robot. This is done by adding the two velocity vectors together. Thus, if the two vectors point in the same direction, the robot accelerates; if they point in opposite directions, it slows down.


6.2.2 Y-Offset Correction

As in the sparse version, a correction value cθ needs to be added to ωz to turn the robot towards the path. However, there are two reasons why y-offset correction is less complicated in the dense version:

1. Generally, the error in x direction is far smaller in the dense version. This helps the correction of the y-offset, because it greatly reduces the issue where the vy produced by visual servoing decreases when the robot turns towards the path. Therefore, cθ can be calculated anew in every iteration and does not need to be integrated.

2. Since the Supervisor’s decision to transition to the next snapshot is independent of the homing vector, the robot does not need to reduce the heading error when it gets close to a snapshot.

Thus, the ωz-update step simplifies to

while (vx, vy, ωz) ← new desired velocity do
    cθ ← clip(k7 ∗ vx ∗ vy, k8)
    ωz ← ωz + cθ
end while

The constants k7 and k8 are tuning parameters. This step happens after the combination of the two velocity commands, so vx, vy and ωz already contain both the desired velocity from Visual Homing and the recorded velocity command from the Supervisor.
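The combination of the two velocity commands and the simplified y-offset correction can be sketched in a few lines of Python. The tuple-based interface and the gain values are assumptions for illustration and do not reflect the actual ROS interface.

def clip(x, x_max):
    # Limit the absolute value of x to x_max.
    return max(-x_max, min(x_max, x))

def dense_arbiter_step(v_homing, v_recorded, k7=1.0, k8=0.3):
    # One iteration of the dense Arbiter (sections 6.2.1 and 6.2.2).
    # v_homing and v_recorded are (vx, vy, wz) tuples from Visual Homing
    # and from the Supervisor's recorded commands.
    vx = v_homing[0] + v_recorded[0]   # 6.2.1: add the feed-forward command
    vy = v_homing[1] + v_recorded[1]   #        and the visual correction
    wz = v_homing[2] + v_recorded[2]
    c_theta = clip(k7 * vx * vy, k8)   # 6.2.2: non-integrating correction
    return vx, wz + c_theta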




Chapter 7

Experiments

7.1 BIBA Ground Robot

The whole system was developed and tested on the differential drive ground robot BIBA, shown in figure 7.1. Tests against ground truth were done with a Vicon optical motion capture system (http://www.vicon.com).

Figure 7.1: The differential drive ground robot BIBA with the Kinect sensor mounted on top.

During all tests, the robot was first steered by a human operator via a joystick to teach a path. Where compared against each other, both the sparse and the dense Supervisor learned the same path.



7.1.1 Path Tracking

Figure 7.2 shows the trajectories of the robot, viewed from above, during path learning, sparse playback and dense playback. It also shows where topological nodes were created by the Supervisor and where the Supervisor actually transitioned from one node to the next during sparse playback. It reveals the problems of the sparse playback. While tracking of θ and x is very accurate, y-offset correction is very inefficient, especially on a curvy path like this. This is due to the fact that θ is controlled first: the robot first aligns its heading and only then drives forward. A different problem of the sparse version is only noticeable when looking at the robot in real life: it drives in a stop-and-go motion. The robot has to slow down until it almost stops at each node before the Supervisor detects arrival and it can head on to the next node. So, while the sparse version gets close to the destination, the way it gets there is not very pretty and is potentially dangerous if the path is close to obstacles. The dense version tracks the learned path a lot better. When observing the robot in real life, it is almost impossible to tell the difference between the human operator driving during path learning and autonomous playback.

Figure 7.2: Path tracking of the sparse and dense version compared against the learned path. (Axes x and y in meters; the plot shows the learned path, the dense playback, the sparse playback, the sparse nodes and the sparse transitions.)

A more systematic evaluation of y-offset correction in the dense version is shown in figure 7.3. The learned path is a straight line and playback is started with an offset in y. The system is able to correct the offset quickly with little overshoot.

In general, the dense version looks much nicer and it is much more likely to get the robot to the destination, especially in environments where very few features can be detected. Because of the feed-forward control via recorded velocity commands, the dense version is able to get across regions where visual servoing cannot produce a desired velocity. In those regions, the sparse version fails.



Figure 7.3: y-offset correction in the dense version. (Axes x and y in meters; the plot shows the learned path and the dense playback.)

7.1.2 Localization

Generally, localization works very well, even when the playback is started with a large y-offset from the path. As expected, it is less reliable in environments with a low number of visual features. However, it was very hard to test the recovery after a failed localization, since a procedure had to be found to reliably confuse the system. This was done with the setup shown in figure 7.4. Two identical movable walls were created with strong visual features, so that a path could be recorded that has two locations with identical camera views. The taught path is pictured in figure 7.5. The robot was driven towards wall 1, turned right, driven towards wall 2 and turned left. During the learning stage, wall 2 was approached a little bit closer than wall 1. Then, the robot was put in front of wall 1, but a little bit closer than while recording the path. This procedure effectively tricks the system into thinking that it is standing in front of wall 2 when, in fact, it is standing in front of wall 1.

Figure 7.5 shows the path during dense playback. Localization was initiated at the starting position. Since localization failed and the system thought it was in front of wall 2, the robot first turned left. After driving for 20 snapshots, the Supervisor stops playback and localizes again. Since this does not yield the expected location of 20 snapshots behind the starting point, the system realizes that its first localization was wrong. It can now recover from this situation, because the path from the original starting position to the double check position was recorded. It reverses back to the starting position and starts playback from the second most likely solution of the initial localization. This one is correct and the robot can complete the path.

As mentioned, it was very hard to confuse the localization. It is very unlikely to encounter a situation like this in real life. However, in the future, a similar approach could be used with a less reliable sensor than vision. This is why this test is to be seen as a proof of concept. The important point here is that, because the system is able to record the path from the original starting


location up until the point where it notices that it is wrong, it is able to recover from this situation by reversing back to the start.

Figure 7.4: Setup to test recovery from a failed localization with two identical movable walls.

7.2 Lizhbeth ASV

The ASV Lizhbeth, shown in figure 7.6, is a differential drive catamaran boat. Therefore, it has very similar motion characteristics to BIBA. The biggest difference is that motor speed does not directly translate to boat speed but to acceleration. This makes the feed-forward control used in the dense version much less effective. The tests on the boat showed that the dense version performed much worse than the sparse version in this application. However, the sparse version, even though it was better than the dense version, did not work very well either. Only in perfect conditions – no wind and almost no waves – was the system able to complete paths. Nevertheless, the tests with Lizhbeth helped to identify the problems that cause the system to perform worse than on land.

Lighting Problems. Figure 7.7 shows two camera images from the same video, taken about one second apart from each other. Due to the camera’s built-in exposure adjustment, details inside the boat house are nicely visible in one image and completely underexposed in the other, and vice versa for landmarks across the lake. This poses a great problem for visual homing, because image features that are used to control the boat might be completely invisible a moment later. This issue could be diminished by using a camera that allows manual control of the shutter time. That way, during the path learning phase, the Supervisor could record the shutter time used for capturing each snapshot. During playback, the camera exposure could always be set to the setting that corresponds to the current snapshot.
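The proposed exposure replay could look roughly like the following sketch. The camera accessors get_exposure() and set_exposure() are purely hypothetical stand-ins for the driver interface of a manually controllable camera and are not part of the present system.

class ExposureMemory:
    # Store the exposure used for each snapshot during learning and restore
    # it during playback (hypothetical sketch, see the note above).
    def __init__(self):
        self.exposure_by_snapshot = {}

    def record(self, snapshot_id, camera):
        self.exposure_by_snapshot[snapshot_id] = camera.get_exposure()

    def apply(self, snapshot_id, camera):
        if snapshot_id in self.exposure_by_snapshot:
            camera.set_exposure(self.exposure_by_snapshot[snapshot_id])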



Figure 7.5: Path of the robot with recovery from a failed localization with two identical movable walls. (The plot is annotated with the starting position, the double check position, wall 1, wall 2, the learned path and the dense playback.)

Depth Estimation Problems. As mentioned in section 4.1, the visual servoing algorithm needs the depth of each feature point to be able to compute the image Jacobian matrix. The Kinect sensor gives this information only up to a range of a few meters. Points for which the Kinect cannot estimate a depth are assumed to be at a constant depth. This is sufficient for an indoor environment, but outdoor landmarks are often farther away than a few meters. While visual servoing is surprisingly tolerant of errors in the depth estimation, there are limitations. For example, when driving towards the boat house, a constant depth assumption of 10 meters might be sensible. But if the boat is to drive out of the harbor, points on the other side of the lake may easily be several kilometers away, so the assumption of 10 meters would grossly underestimate the depth. While this underestimation does not have an effect on the rotational velocity component ω of the homing vector, it decreases the magnitude of the translational velocity v. This can be seen in equation 4.1. A solution to this problem could be to use stereo vision to estimate depth. While stereo vision cannot accurately estimate the depth of points that are far away either, at least it would be able to indicate that the points are far away.
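A minimal sketch of the constant-depth fallback described above is given below. The convention for missing readings and the 10 m fallback value are assumptions for illustration.

import numpy as np

def feature_depths(kinect_depths, fallback_depth=10.0):
    # Replace missing Kinect depth readings with a constant assumption.
    # Outdoors, such a constant can grossly underestimate the true depth
    # and thereby shrink the translational velocity command.
    d = np.asarray(kinect_depths, dtype=float).copy()
    invalid = ~np.isfinite(d) | (d <= 0.0)  # NaN or zero = no depth estimate
    d[invalid] = fallback_depth
    return d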


Figure 7.6: The ASV Lizhbeth.

Figure 7.7: Two frames (a) and (b) from the same video, captured about 1 second apart from each other. Automatic exposure adjustment of the camera causes lighting problems.


Chapter 8

Conclusion and Outlook

In this thesis, a software system was developed that can learn a path from a stream of images. A human operator can steer the robot, thereby teaching a path to the robot. Later, the robot can be put anywhere near the taught path and is able to complete the previously taught path. Two approaches based on Image Based Visual Servoing have been developed: a sparse version, where the system tries to save as few images as possible and drives on the path using only vision to control the robot, and a dense version, where the full video stream is saved alongside the control commands given during the path learning phase. In the dense version, the recorded control commands drive the robot in a feed-forward manner, while IBVS is used to correct errors. Furthermore, a method has been presented that allows the robot to localize itself on the previously taught path. If localization fails and the robot starts driving in a wrong direction, the system is able to recover by, again, recording the incorrectly driven path and reversing back to the original starting position.

The entire system has been developed using a differential drive ground robot equipped with a Kinect camera and depth sensor. It was then tested against ground truth using a Vicon optical motion capture system. Tests show that, while both the sparse and the dense version get the robot close to its destination, the dense version does so in a much nicer, safer, more accurate and more reliable way. The tests show the system’s ability to compensate for the non-holonomic properties of the differential drive robot, especially in the dense version. The recovery from a failed localization has been shown to work.

Further tests of the system have been conducted on a differential drive catamaran ASV on the lake of Zurich. While the sparse version was able to complete paths in optimal environmental conditions, the tests have shown that the presented approach will not work reliably on the boat with the Kinect sensor. The main reasons why the Kinect is insufficient are lighting problems due to the camera’s automatic exposure adjustment and the limited range of depth estimation.


8.1 Future Improvements

For outdoor use, a stereo vision approach could be more suitable for providing depth information. Although not very accurate at long distances, it at least gives the information that a point is far away. The Kinect can only report that it does not have any depth information for a far away point.

Feature matching performance could be improved by using a camera that allows manual control of exposure. This would reduce the problems caused by lighting changes in the image.

Localization reliability could be improved by suitable filtering of the homing vector magnitudes that are used for localization. Consider the plot shown in figure 8.1. The correct localization in this situation would be at snapshot number 33. Indeed, snapshot 33 does produce the homing vector with the lowest magnitude. However, the localization almost failed due to some outliers around snapshot number 230. For example, a median filter could be used to remove the worst outliers.
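Such a filter could be applied as in the following sketch, where the window length is an assumption and the snapshot with the smallest filtered magnitude is selected:

import numpy as np
from scipy.signal import medfilt

def localize(homing_magnitudes, kernel_size=5):
    # Median-filter the per-snapshot homing vector magnitudes to suppress
    # outliers, then pick the snapshot with the smallest filtered value.
    filtered = medfilt(np.asarray(homing_magnitudes, dtype=float), kernel_size)
    return int(np.argmin(filtered))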

Figure 8.1: Magnitudes of the homing vectors to each snapshot used for localization. (Snapshot index on the x axis, homing vector magnitude on the y axis.)




List of Figures

1.1 Coordinate frames of robot (light gray) and camera (dark gray).
3.1 Illustration of the learned path in the sparse version.
3.2 Data flow between the ROS nodes in the sparse version: (a) learning, (b) playback.
3.3 Illustration of the learned path in the dense version.
3.4 Data flow between the ROS nodes in the dense version: (a) learning, (b) playback.
4.1 Data flow in the Visual Homing process.
4.2 The Microsoft Kinect sensor.
5.1 Magnitude of the homing vectors from the current position to each snapshot on the path, with one distinct minimum.
5.2 Magnitude of the homing vectors from the current position to each snapshot on the path, with two minima.
6.1 Data flow in the Arbiter process in the sparse and the dense version: (a) sparse, (b) dense.
6.2 The coefficient function that suppresses forward velocity when a θ error is present. Here, k5 = 30 and k6 = 0.45.
7.1 The differential drive ground robot BIBA with the Kinect sensor mounted on top.
7.2 Path tracking of the sparse and dense version compared against the learned path.
7.3 y-offset correction in the dense version.
7.4 Setup to test recovery from a failed localization with two identical movable walls.
7.5 Path of the robot with recovery from a failed localization with two identical movable walls.
7.6 The ASV Lizhbeth.
7.7 Two frames from the same video, captured about 1 second apart from each other. Automatic exposure adjustment of the camera causes lighting problems.
8.1 Magnitudes of homing vectors to each snapshot used for localization.
