Master Thesis

Vision-based Trajectory Following for Boats and Ground Robots

Autumn Term 2012

Autonomous Systems Lab
Prof. Roland Siegwart

Author: Hendrik Erckens
Supervised by: Gregory Hitz, Dr. Cédric Pradalier
Abstract

This thesis addresses the problem of vision-based mobile robot navigation from an image memory of a previously driven path along which the robot was controlled by a human operator. The presented solution is based on Image Based Visual Servoing (IBVS), where the required velocity is calculated by using BRISK features to compare the current camera image to previously recorded snapshot images. Two approaches, sparse and dense, are compared; they differ in the number of snapshots used to remember the path. In the dense version, the control commands given to the robot during path learning are also saved. These are used as feed-forward control during autonomous playback, while errors are corrected by IBVS. A technique is presented that allows the robot to localize itself on the taught path. Additionally, the system is able to recover from a failed localization by reversing back to the starting position.

The system has been successfully implemented, and results of tests against ground truth on a non-holonomic, differential drive ground robot are presented. Localization is shown to work even in situations where the current camera image is ambiguous. Additionally, the system is tested and evaluated for use on a robot boat.
Acknowledgment

I would like to thank Gregory Hitz for great supervision, valuable input, and help during testing; Cédric Pradalier for intense bursts of brilliant ideas; and Francis Colas for resurrecting BIBA after a flat mainboard battery. Great thanks to Gion-Andri Büsser and Jakob Buchheim for pointing out many mistakes and incomprehensibilities.
Contents

Abstract

1 Introduction
  1.1 Context
  1.2 Objectives of this Thesis
  1.3 Conventions Used in this Thesis
    1.3.1 Coordinate Systems
    1.3.2 General Methodology and Terminology
  1.4 Problems to be Solved
  1.5 Structure of this Report

2 Literature Review
  2.1 Path Learning
  2.2 Path Playback
    2.2.1 Navigate to Next Node (Visual Homing)
    2.2.2 Detect Arrival at Node

3 Software Structure
  3.1 ROS
  3.2 ROS Nodes
  3.3 Path Representation Approaches
    3.3.1 Sparse
    3.3.2 Dense

4 Visual Homing
  4.1 Theory
  4.2 Implementation
    4.2.1 Camera
    4.2.2 Feature Detection and Matching
    4.2.3 RANSAC
    4.2.4 Image Jacobian
    4.2.5 Coordinate Transformation

5 Supervisor
  5.1 Path Learning
    5.1.1 Sparse
    5.1.2 Dense
  5.2 Autonomous Path Playback
    5.2.1 Sparse
    5.2.2 Dense
  5.3 Localization

6 Arbiter
  6.1 Sparse
    6.1.1 Y-Offset Correction
    6.1.2 Filtering and Clipping
    6.1.3 Correct Theta First
  6.2 Dense
    6.2.1 Sum of the Two Velocity Vectors
    6.2.2 Y-Offset Correction

7 Experiments
  7.1 BIBA Ground Robot
    7.1.1 Path Tracking
    7.1.2 Localization
  7.2 Lizhbeth ASV

8 Conclusion and Outlook
  8.1 Future Improvements

Bibliography
List of Figures
Symbols

x    x-axis
y    y-axis
z    z-axis
u    horizontal image axis
v    vertical image axis
θ    angle of rotation around the robot's z-axis
ν    velocity: ν = (v, ω) ∈ R⁶
v    translational velocity: v = (v_x, v_y, v_z) ∈ R³
ω    rotational velocity: ω = (ω_x, ω_y, ω_z) ∈ R³
J_p  image Jacobian, or feature sensitivity matrix, of feature point p; J_p ∈ R^(2×6)
f    focal length of the camera
ρ    pixel size of the camera

Indices

x    x-axis
y    y-axis
z    z-axis
C    camera frame
R    robot frame
W    world frame

Acronyms and Abbreviations

ASV    Autonomous Surface Vessel
AUV    Autonomous Underwater Vehicle
BRISK  Binary Robust Invariant Scalable Keypoints
DID    Descent in Image Distance
ETH    Eidgenössische Technische Hochschule
GPS    Global Positioning System
IBVS   Image Based Visual Servoing
MFDID  Matched-Filter Descent in Image Distance
PBVS   Position Based Visual Servoing
ROS    Robot Operating System
SIFT   Scale Invariant Feature Transform
VSRR   View Sequenced Route Representation
ZNCC   Zero Mean Energy Normalized Cross Correlation
Chapter 1

Introduction

1.1 Context

Autonomous robots designed to operate in an outdoor environment typically use satellite-based GPS and a magnetic compass to localize themselves in the environment and to navigate to a given goal position and orientation. This works well in open spaces with good satellite visibility, but becomes increasingly inaccurate when the sky is occluded or when magnetic fields disturb the compass. Even in optimal conditions, this method of localization can be insufficient when great accuracy is required, for example when the robot has to maneuver around obstacles. GPS becomes completely unusable when the sky is entirely invisible, most notably in indoor or underwater environments.

To overcome this issue, many researchers are working on localization using other sensors such as cameras, laser range finders, or sonar. This localization can be done either in a metric Cartesian space or directly in the sensor space. In metric localization, the pose of the robot is described by at most six coordinates: three for position and three for orientation. In sensor space, the robot's pose can be described by as many coordinates as there are sensor measurements available. For example, consider a robot equipped with only one sensor measuring the brightness of the ground under the robot. Then the sensor space is one-dimensional and the robot's pose can be described by this one measurement. Note that, since there are likely many points on the ground that have the same brightness, this localization can be ambiguous.

In addition to localization, navigation can also be done in sensor space. If the sensor to be used is a camera, the target pose and the robot's current pose can be described by camera images taken at the respective poses. Correlations between the target image and the current image can then be found, and it can be calculated how the robot has to move to eventually reach the target. Consequently, for this kind of navigation, the robot does not necessarily need to know its pose with respect to the target pose in Cartesian coordinates. All the robot needs to know is how to move to make its camera image more similar to the image acquired at the target pose.
1.2 Objectives of this Thesis

This project aims to implement vision-based trajectory following. A robot should be able to record a path during a learning phase in which it is controlled by a human operator. It should then be able to navigate along the previously recorded path. All localization and navigation should be done in sensor space, using a camera as the primary sensor. Specifically, no metric localization is to be done.

As a first step, this method is to be implemented and tested on a differential drive ground robot operating in an indoor office environment.

As a second step, the suitability of this method for use on the limnological sampling boat Lizhbeth [1] should be determined. This autonomous surface vessel (ASV) is used for sampling the water in lakes. Under normal conditions, it can navigate via GPS. However, GPS is not accurate enough for Lizhbeth to navigate into the harbor autonomously. For this, a vision-based trajectory following method could be helpful.

Another motivation for this thesis is that, in the future, a similar approach could also be used for navigation of the AUV Noptilus (http://www.noptilus-fp7.eu). Instead of vision, sonar could be used as the main sensor. However, this is a long-term goal and not part of this thesis.
1.3 Conventions Used in this Thesis

1.3.1 Coordinate Systems

Figure 1.1 shows a top view of the robot with the position and orientation of the body coordinate frames of the robot (xR, yR, zR) and the camera (xC, yC, zC) with respect to the world frame.

Figure 1.1: Coordinate frames of robot (light gray) and camera (dark gray).

1.3.2 General Methodology and Terminology

A long path can be subdivided into small portions, each of which has its own goal, called a node. Consequently, the whole path can be completed by driving from one node to the next, eventually arriving at the last node. Each node thus describes a pose (position and orientation) of the robot in the environment. As explained in section 1.1, this pose can be represented by a camera image taken at the respective pose. This camera image is called a snapshot or target image. Each node is associated with one snapshot. The process of autonomously guiding the robot towards a node by comparing the node's snapshot image with the current image is called visual homing. Successful visual homing is only possible if the robot is already in the vicinity of the respective snapshot pose, because at least some parts of the snapshot image have to be visible in the current view. The area around a node from which the node is reachable is called the node's catchment area. The size of the catchment area depends on the visual homing method used and also on environmental factors, such as lighting conditions and the number of distinct visual features that are detectable. The whole path, consisting of many nodes connected to each other by arcs, is referred to as a topological map. An arc between two nodes means that these nodes lie in each other's catchment area.
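This node and arc structure can be sketched as a minimal data structure. The class and field names below are illustrative, not taken from the thesis software, and a toy string stands in for the snapshot image:

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of the topological map: a pose represented purely by
    the snapshot image taken there (no metric coordinates)."""
    snapshot: object                                   # camera image, e.g. a numpy array
    neighbors: list = field(default_factory=list)      # arcs to nodes within catchment range


class TopologicalMap:
    def __init__(self):
        self.nodes = []

    def add_node(self, snapshot):
        node = Node(snapshot)
        if self.nodes:
            # Connect consecutive nodes with an arc: the learning phase
            # guarantees that each new node lies inside the catchment
            # area of its predecessor.
            self.nodes[-1].neighbors.append(node)
            node.neighbors.append(self.nodes[-1])
        self.nodes.append(node)
        return node
```

A linear chain of nodes like this is exactly what the learning phase of chapter 3 produces; richer maps (as in Motard et al.) would simply add more arcs per node.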
1.4 Problems to be Solved

The whole problem stated in section 1.2 can be divided into smaller sub-problems for which a solution has to be found.

Node Creation  The main problem in the path learning phase is to know when and how often a new node has to be created.

Visual Homing  During the playback phase, a method is needed that can drive the robot towards the next node.

Arrival Detection  When visual homing has brought the robot close to a node, the arrival has to be detected so that a transition to the next node can be made.

Localization  When playback is not started at the same pose at which path learning was started, the robot needs the ability to localize itself on the path.
1.5 Structure of this Report

Chapter 2 presents related work from the literature. While chapter 3 gives an overview of the whole software structure, chapters 4, 5, and 6 provide a more detailed description of the individual parts of the software. Tests and experiments of the proposed system are presented in chapter 7. Finally, a conclusion and suggestions for future improvements are given in chapter 8.
Chapter 2

Literature Review

At the beginning of this thesis, a literature review was conducted to become familiar with the topic and to gather ideas on how to approach the task. This chapter summarizes the results. First, section 2.1 shows ways to approach the task of teaching the robot a path. Then, section 2.2 introduces techniques that enable the robot to autonomously follow the previously taught path.
2.1 Path Learning

While there is extensive literature that precisely describes visual homing, path learning techniques are often only vaguely described. The choice of an appropriate path learning technique depends on the method used for path playback, and specifically on what kind of path representation is needed. For example, different visual homing methods produce differently sized catchment areas around each snapshot position. The most critical requirement of a path learning technique is that it must guarantee that each new node lies well inside the catchment area of the previous node. So, a procedure is needed that decides at which point a new node has to be created.

Jones et al. constantly compare the robot's current view image to the last saved snapshot image using Zero Mean Energy Normalized Cross Correlation (ZNCC) [2]. A new node is created as soon as the ZNCC value drops below a certain threshold.
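This ZNCC-based node creation rule can be sketched as follows. The threshold value is illustrative, and the real system works on camera frames rather than toy arrays:

```python
import numpy as np


def zncc(a, b):
    """Zero-mean normalized cross correlation of two equally sized
    grayscale images: 1.0 for identical images, -1.0 for inverted ones."""
    a = a.astype(float) - a.mean()
    b = b.astype(float) - b.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0


def needs_new_node(current_view, last_snapshot, threshold=0.7):
    """Jones et al. [2]: create a new node once similarity between the
    current view and the last snapshot drops below a threshold."""
    return zncc(current_view, last_snapshot) < threshold
```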
An interesting way to detect when a new node has to be created is introduced by Vardy et al. [3]: while learning the map, the robot constantly computes the homing angle to the last node. If this differs too much from the angle estimated by odometry, a new node is created.

A sophisticated approach has been presented by Motard et al. [4]. The authors introduce a method for incremental online topological map learning. The robot can explore the environment autonomously and build its own map, where each node can be connected to many other nodes, providing the possibility to close loops and to reach the destination by more than one route.
2.2 Path Playback

First, a technique is needed that guides the robot towards the next node in the topological map. This is described in section 2.2.1. Subsequently, when the robot gets close to this node's snapshot position, arrival has to be detected to initiate the topological transition to the following node. Approaches to this problem are shown in section 2.2.2.
2.2.1 Navigate to Next Node (Visual Homing)

Generally, procedures for navigating the robot to the next node can be divided into two categories: methods that require the extraction of features or identification of landmarks, and methods that require neither [5].
Not Landmark-Based

Descent in Image Distance (DID)  Zeil et al. showed that a snapshot position is defined by a clear, sharp minimum in a smooth three-dimensional image distance function, which is calculated from individual pixel intensity differences [5]. Thus, the snapshot position can be reached by a simple gradient descent method. However, to estimate the gradient of the image distance function at its current position, the robot has to make exploratory movements to sample the function at different places around that position. Möller gets around this problem by warping the current image as if the robot had made the exploratory movement [6]. This was originally introduced as Matched-Filter DID (MFDID) by Möller et al. [7].

ZNCC Peak  This is an appearance-based method, where a peak in the Zero Mean Energy Normalized Cross Correlation (ZNCC) function along the horizontal image axis defines the homing direction. Jones et al. use two divergent cameras to extract a forward velocity component alongside this directional information [2].
Landmark-Based

Homing in Scale Space  is used by Churchill and Vardy in [8]. The authors make use of the scale information in the SIFT feature descriptor. Assuming a uniform distribution of keypoints in the image plane of a panoramic image, they deduce that the home direction is the center of the image region where feature scale is greater in the snapshot view than in the current view.
Visual Servoing  A very detailed tutorial on visual servoing has been published by Corke in [9]. He differentiates between Position Based Visual Servoing (PBVS) and Image Based Visual Servoing (IBVS). In a PBVS system, the pose of the target with respect to the camera is estimated, assuming a geometrical model of the target is known. In an IBVS system, this pose estimation step is omitted and the whole control problem is formulated in the image coordinate space R². The camera's pose change required to get from the current position to the snapshot position is defined implicitly by the required change of feature locations in the image plane.
Cherubini and Chaumette present a visual path following technique using visual servoing, but they use only one dimension of the image Jacobian matrix used in IBVS to extract the homing direction to the next snapshot on the path [10]. Instead of the at least three matched features necessary to compute the full 6-dimensional linear and angular camera velocity required to move to the snapshot, only one single feature is needed. For this, the authors use the abscissa of the centroid of the features matched between snapshot and current view. In their approach, the robot's forward velocity is determined by a safety context value and the velocity of tracked features in the image plane between consecutive snapshots on the path. Thus, the robot slows down when the context is deemed unsafe, i.e. when the laser range finder detects an obstacle. Furthermore, the authors use an actuated camera to maintain scene visibility while avoiding obstacles.
Metric Error Estimation  Ohno et al. use vertical lines as features [11]. Since they assume the motion constraints of a ground robot with a rigidly attached camera, it is sufficient to take into account only the horizontal position of these features in the image plane. They present an algorithm to estimate the x and y position errors as well as the θ orientation error with respect to the snapshot image. Blanc et al. similarly estimate the camera displacement by decoupling the translation and rotation components of a homography matrix [12].
2.2.2 Detect Arrival at Node

When the visual homing process has successfully driven the robot to a position close to the current snapshot position, the system must detect arrival at the current node to make the topological transition to the next node. There are several ways to detect arrival, and the practicality of some of them depends on the visual homing method used. Vardy et al. describe four ways to detect arrival at a snapshot [3].

1. If the distance between the nodes is known, odometry can be used to detect when the robot has exceeded the distance to the next node.

2. The image difference between snapshot and current image can be measured by computing the sum of squared errors (SSE). Arrival is declared when this value falls below a certain threshold.

3. The magnitude of the computed homing vector can be used as an arrival detection measure. Arrival is declared when the homing vector's magnitude falls below a certain threshold. In [3], the authors use MFDID to compute their homing vector, because this is the method they use to drive the robot towards the next node. However, this criterion can be used together with any other way of producing the homing vector.
4. Instead of only looking at the magnitude of the homing vector, this method looks at the time derivative of the sum of all magnitudes accumulated while the robot is driving towards the node. The sum of all magnitudes is considered as a function over time. After filtering this function with a length-3 triangle filter, the derivative is taken. Since the magnitude of the homing vector tends to zero, the sum stops increasing once the robot gets close to the snapshot position, so the curve flattens. Arrival is declared when the value of the derivative falls below a certain threshold.
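Method 4 can be sketched as follows, assuming the homing-vector magnitudes are collected in a sequence over time. The filter weights and threshold value here are illustrative:

```python
import numpy as np


def arrival_by_flattening(magnitudes, triangle=(0.25, 0.5, 0.25), eps=1e-3):
    """Sketch of Vardy's fourth arrival criterion: accumulate the
    homing-vector magnitudes over time, smooth the cumulative sum with
    a length-3 triangle filter, and declare arrival when the derivative
    of the smoothed curve falls below eps (i.e. the curve flattens)."""
    s = np.cumsum(magnitudes)                        # sum of magnitudes over time
    smoothed = np.convolve(s, triangle, mode="valid")
    deriv = np.diff(smoothed)                        # discrete time derivative
    return len(deriv) > 0 and deriv[-1] < eps
```

While the robot is far from the node, the homing-vector magnitude stays large, so the cumulative sum keeps growing and the derivative stays above the threshold; near the node, the magnitudes tend to zero and the curve flattens.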
Chapter 3

Software Structure

This chapter gives an overview of the software structure and how the individual processes work together. Chapters 4, 5, and 6 provide a more detailed discussion of the three main parts of the software.
3.1 ROS

The Robot Operating System (ROS, http://ros.org) has been developed and is actively maintained by Willow Garage (http://www.willowgarage.com). At its core, ROS provides a middleware that allows processes to communicate. In ROS terminology, the individual processes, called nodes, pass messages to one another via topics. One node publishes messages to a particular topic, and another node can subscribe to this topic in order to receive the messages. Since messages are passed over a TCP network connection, nodes can communicate with each other even if they run on different machines. Apart from the middleware, ROS comes with many tools that help with the development of robotics software.

The software for this work has been written in C++ and Python. For all computer vision related parts of the software, the computer vision library OpenCV (http://opencv.willowgarage.com/wiki/) is used; OpenCV is integrated into ROS. The linear algebra library Eigen is used for most matrix and vector algebra.
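The publish/subscribe mechanism can be illustrated with a toy in-process sketch. This is not the actual ROS API (real ROS additionally handles message typing, serialization, and TCP transport between machines); it only shows the decoupling that topics provide:

```python
class Topic:
    """Toy illustration of the ROS publish/subscribe mechanism: one
    node publishes messages to a named topic, and any number of
    subscribers receive each message via a callback. Publisher and
    subscribers never reference each other directly."""

    def __init__(self, name):
        self.name = name
        self.callbacks = []

    def subscribe(self, callback):
        self.callbacks.append(callback)

    def publish(self, message):
        for callback in self.callbacks:
            callback(message)
```

In this work, for example, a velocity-command topic connects the Arbiter to the robot's motor controller, while the Supervisor and Visual Homing exchange snapshot images and desired velocities over their own topics.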
3.2 ROS Nodes

The software has been divided into three separate processes, or nodes:

Visual Homing is at the center of the whole structure. It takes a current view from the camera and a snapshot image. From this input it computes a desired velocity that will move the robot closer to the snapshot position. Visual Homing outputs a velocity vector with full 6 degrees of freedom, making no assumption about the type of robot to be controlled.

Supervisor has the main purpose of building and maintaining the map that represents the taught path. It localizes the robot in the map and provides the Visual Homing node with the next snapshot that the robot should drive towards.

Arbiter is the process that actually gives commands to the robot's motor controller. It is called Arbiter because it decides which commands can be safely forwarded to the robot. Also, if commands from the joystick come in, it decides which process is allowed to command the robot. The Arbiter filters and clips the velocity commands and adapts them to the particular type of robot that is used.
3.3 Path Representation Approaches

Over the course of this work, two approaches have been developed and compared, called dense and sparse. The Visual Homing process is the same for both approaches; the difference between the dense and sparse versions is the way the path is learned and represented.
3.3.1 Sparse

Learning Phase

During the learning phase in the sparse version, the Supervisor tries to save only as many snapshot images as are really necessary to guarantee that each node lies in the catchment area of its preceding node, hence the name sparse. Figure 3.1 illustrates such a learned path with six snapshot images.

Figure 3.1: Illustration of the learned path in the sparse version.
When the learning phase is initiated, the Supervisor saves the current camera view as the first snapshot. Then, it immediately passes this snapshot image to the Visual Homing process. While the human operator steers the robot away from this snapshot via joystick control, Visual Homing constantly compares the current camera view to the snapshot, computes the desired velocity that would drive the robot back to the snapshot, and passes it back to the Supervisor. If the magnitude of the desired velocity vector rises above a certain threshold, the Supervisor saves a new snapshot. This new snapshot is then passed to Visual Homing, and so on. Figure 3.2a shows the data flow between the processes during the learning phase.
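The Supervisor's side of this loop can be sketched as follows. The class name and threshold value are illustrative; the real node works on camera images and the 6-DOF velocity computed by Visual Homing:

```python
import numpy as np


class SparseLearner:
    """Sketch of the Supervisor's sparse learning loop: a new snapshot
    is saved whenever the magnitude of the homing velocity back to the
    last snapshot exceeds a threshold."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.snapshots = []

    def start(self, first_view):
        # The current camera view becomes the first snapshot.
        self.snapshots.append(first_view)

    def update(self, current_view, homing_velocity):
        """homing_velocity is the velocity Visual Homing would use to
        drive the robot back to the last snapshot; a large magnitude
        means the robot has moved far enough that a new node is needed."""
        if np.linalg.norm(homing_velocity) > self.threshold:
            self.snapshots.append(current_view)
            return True
        return False
```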
Playback Phase

During the playback phase, the Supervisor passes the first snapshot of the path to Visual Homing. Comparing this snapshot to the current image from the camera, Visual Homing computes the desired velocity required to guide the robot to the snapshot pose. The Arbiter filters the desired velocity to avoid jerky behavior, accounts for the non-holonomic properties of the particular robot, and passes a velocity command to the robot's motor controllers. Detection of arrival is the same as presented by Vardy in [3]. While the robot is driving towards the snapshot, the Supervisor monitors the desired velocity produced by Visual Homing. If the magnitude of the desired velocity vector falls below a certain threshold, the Supervisor assumes that the robot is close to the snapshot, transitions to the next node on the path, and publishes that node's snapshot image to Visual Homing. This way, the robot drives from node to node until it has completed the whole path. Figure 3.2b shows the data flow between the processes during the playback phase.

(a) Learning (b) Playback

Figure 3.2: Data flow between the ROS nodes in the sparse version.
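The playback loop can be sketched like this, with an illustrative arrival threshold (the real criterion corresponds to method 3 of section 2.2.2):

```python
import numpy as np


class SparsePlayback:
    """Sketch of the sparse playback loop: drive towards the current
    node's snapshot, and advance to the next node once the magnitude
    of the homing velocity falls below a threshold."""

    def __init__(self, snapshots, arrival_threshold=0.05):
        self.snapshots = snapshots
        self.index = 0
        self.arrival_threshold = arrival_threshold

    def current_snapshot(self):
        # Snapshot the Supervisor currently publishes to Visual Homing.
        return self.snapshots[self.index]

    def update(self, homing_velocity):
        """Returns True while there are still nodes left on the path."""
        if np.linalg.norm(homing_velocity) < self.arrival_threshold:
            self.index += 1          # arrived: transition to the next node
        return self.index < len(self.snapshots)
```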
3.3.2 Dense

The dense version has been inspired by the notion of a Sensori-Motor Trajectory in the work of Pradalier et al. [13].
Learning Phase

In contrast to the path learning technique in the sparse version, the Supervisor in the dense version saves as many snapshots as it can get from the camera. The result is a dense array of nodes, as illustrated in figure 3.3. In addition to the snapshot images, the Supervisor also saves the velocity command that was given to the robot by the human operator during the learning phase. Thus, each node is associated with a snapshot image and a velocity command. The Visual Homing process is idle during learning. Figure 3.4a shows the data flow between the processes during the learning phase.

Figure 3.3: Illustration of the learned path in the dense version.
Playback Phase

During the playback phase, the Supervisor passes a snapshot image to Visual Homing and the corresponding recorded velocity command to the Arbiter. Visual Homing again computes a desired velocity that guides the robot towards the snapshot. The Arbiter then passes a superposition of the recorded velocity command and Visual Homing's desired velocity to the robot. The Supervisor advances through the path nodes at the same speed at which they were saved during the learning phase. The idea is that by passing the same velocity commands to the robot as during the learning phase, the robot should already drive a similar trajectory. If it happens to be perfectly on the taught path, Visual Homing will produce a desired velocity close to zero. When outside disturbances cause the robot to deviate from the taught path, Visual Homing will compute a desired velocity that corrects the robot's deviation from the path. The data flow is illustrated in figure 3.4b.
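The superposition the Arbiter applies can be sketched as a simple sum. The feedback gain below is an illustrative tuning knob, not a parameter named in the text:

```python
import numpy as np


def dense_playback_command(recorded_velocity, homing_velocity, gain=1.0):
    """Sketch of the dense playback command: the recorded operator
    command acts as feed-forward, and the IBVS desired velocity as a
    feedback correction. On the taught path the correction is near
    zero, so the feed-forward term dominates."""
    return np.asarray(recorded_velocity) + gain * np.asarray(homing_velocity)
```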
(a) Learning (b) Playback

Figure 3.4: Data flow between the ROS nodes in the dense version.
Chapter 4

Visual Homing

The Visual Homing process compares the current camera image to the snapshot image. From this it computes a desired velocity that guides the robot towards the pose at which the snapshot image was acquired. To do this, it uses Image Based Visual Servoing (IBVS).

Section 4.1 describes the theory behind IBVS. The C++ implementation developed in this work is described in section 4.2.
4.1 Theory

For the purpose of introducing the IBVS algorithm, the control problem used in this section is defined as finding a velocity that drives the camera to a target pose. Under the assumption that the camera is rigidly attached to the robot, the transition to the problem of driving the robot to a target pose is a coordinate transformation and is covered in section 4.2.5.

A detailed tutorial on IBVS with a more thorough derivation is given by Corke in [9]. Most of the explanation and notation in this section follows the derivations in Corke's book.

In IBVS, the control problem of driving the camera to a target pose in world coordinate space (R⁶ for position and orientation) is reformulated completely in the image coordinate space R². Instead of estimating the camera's pose with respect to the target and controlling the pose change directly, the desired camera pose is defined implicitly by the image feature positions at this desired camera pose. The control problem can therefore be reformulated as finding a camera velocity that moves the feature points to desired positions in the image plane.
Consider a camera with intrinsic properties of focal length f, pixel width<br />
ρu, pixel height ρv and principal point (u0, v0). The camera moves with a<br />
velocity ν = (v, ω) = (vx, vy, vz, ωx, ωy, ωz) and observes a world point P<br />
with coordinates P = (X, Y, Z) relative to the camera. P is projected onto<br />
the image plane at pixel coordinates p = (u, v) and can be considered as<br />
an image feature. Corke shows in [9] that, with ū = u − u0 and v̄ = v − v0,
the feature velocity in the image plane can be written in terms of pixel<br />
coordinates with respect to the principal point as<br />
\[
\dot{p} = \begin{pmatrix} \dot{u} \\ \dot{v} \end{pmatrix} =
\underbrace{\begin{pmatrix}
-\dfrac{f}{\rho_u Z} & 0 & \dfrac{\bar{u}}{Z} & \dfrac{\rho_u \bar{u}\bar{v}}{f} & -\dfrac{f^2 + \rho_u^2 \bar{u}^2}{\rho_u f} & \bar{v} \\[1ex]
0 & -\dfrac{f}{\rho_v Z} & \dfrac{\bar{v}}{Z} & \dfrac{f^2 + \rho_v^2 \bar{v}^2}{\rho_v f} & -\dfrac{\rho_v \bar{u}\bar{v}}{f} & -\bar{u}
\end{pmatrix}}_{J_p}
\begin{pmatrix} v_x \\ v_y \\ v_z \\ \omega_x \\ \omega_y \\ \omega_z \end{pmatrix}
\qquad (4.1)
\]
where J p is the image Jacobian matrix for a point feature, also sometimes<br />
called feature sensitivity matrix. The image Jacobian matrix essentially<br />
indicates how a feature point (u, v) would move in the image plane if the<br />
camera moves with velocity ν. The image Jacobian can be calculated for<br />
any pixel in the image plane with the only assumption that the depth Z<br />
from the camera to the world point P is known. See section 4.2.1 for details<br />
on the depth estimation.<br />
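As a concrete illustration of equation 4.1, the 2 × 6 point-feature Jacobian can be evaluated directly from the pixel coordinates, the intrinsics and the estimated depth. This is a minimal sketch, not the thesis implementation; all function and parameter names are illustrative.

```cpp
#include <array>

// Evaluate the 2x6 point-feature image Jacobian of equation 4.1.
// f: focal length, rho_u/rho_v: pixel dimensions, (u0, v0): principal
// point, Z: estimated depth of the observed world point.
using Jacobian2x6 = std::array<std::array<double, 6>, 2>;

Jacobian2x6 imageJacobian(double u, double v, double Z,
                          double f, double rho_u, double rho_v,
                          double u0, double v0) {
    const double ub = u - u0;  // u-bar: pixel coordinate w.r.t. principal point
    const double vb = v - v0;  // v-bar
    return {{
        {-f / (rho_u * Z), 0.0, ub / Z,
         rho_u * ub * vb / f,
         -(f * f + rho_u * rho_u * ub * ub) / (rho_u * f),
         vb},
        {0.0, -f / (rho_v * Z), vb / Z,
         (f * f + rho_v * rho_v * vb * vb) / (rho_v * f),
         -rho_v * ub * vb / f,
         -ub}
    }};
}
```

For a feature sitting exactly on the principal point, the rotational columns reduce to the familiar pure-rotation terms, which makes the sketch easy to check by hand.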
For more than one feature point the Jacobians can be stacked on top of<br />
each other. In particular, for the case of three points<br />
\[
\begin{pmatrix} \dot{u}_1 \\ \dot{v}_1 \\ \dot{u}_2 \\ \dot{v}_2 \\ \dot{u}_3 \\ \dot{v}_3 \end{pmatrix}
= \begin{pmatrix} J_{p1} \\ J_{p2} \\ J_{p3} \end{pmatrix} \nu
\qquad (4.2)
\]
the stacked Jacobian matrix is of dimension 6 × 6 and is non-singular so<br />
long as the three points are neither coincident nor collinear. Thus, to solve<br />
the control problem of driving the camera towards the target pose, equation<br />
4.2 can be inverted<br />
\[
\nu = \begin{pmatrix} J_{p1} \\ J_{p2} \\ J_{p3} \end{pmatrix}^{-1}
\begin{pmatrix} \dot{u}_1 \\ \dot{v}_1 \\ \dot{u}_2 \\ \dot{v}_2 \\ \dot{u}_3 \\ \dot{v}_3 \end{pmatrix}
\qquad (4.3)
\]
to solve for the desired camera velocity given the feature velocities of three<br />
points. To determine suitable feature velocities ṗ* from the known desired
feature positions p* in the snapshot image and the actual feature positions
p in the current view image, a simple proportional gain controller
\[
\dot{p}^{*} = \lambda (p^{*} - p) \qquad (4.4)
\]
can be used, where λ is the proportional gain factor. Combined with equation<br />
4.3 the control law can be written as<br />
\[
\nu = \lambda \begin{pmatrix} J_{p1} \\ J_{p2} \\ J_{p3} \end{pmatrix}^{-1} (p^{*} - p)
\qquad (4.5)
\]
This controller will drive the camera in such a way that the feature points<br />
move to their desired positions in the image plane.<br />
For the general case of more than three features, all Jacobians can be stacked
and the camera velocity can be calculated by taking the pseudo-inverse<br />
\[
\nu = \lambda \begin{pmatrix} J_{p1} \\ \vdots \\ J_{pN} \end{pmatrix}^{+} (p^{*} - p)
\qquad (4.6)
\]
which minimizes the position error of all feature points.<br />
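The thesis implementation applies equation 4.6 through Eigen's SVD (section 4.2.4). As a self-contained sketch of the same control law, the least-squares solution can equivalently be obtained from the normal equations JᵀJν = λJᵀ(p* − p) whenever JᵀJ is invertible, i.e. with at least three non-degenerate feature points. All names are illustrative; this is not the thesis code.

```cpp
#include <vector>
#include <array>
#include <cmath>

// Solve the 6x6 augmented system [A | b] by Gauss-Jordan elimination
// with partial pivoting; column 6 holds the right-hand side.
std::array<double, 6> solve6(std::array<std::array<double, 7>, 6> A) {
    for (int c = 0; c < 6; ++c) {
        int piv = c;
        for (int r = c + 1; r < 6; ++r)
            if (std::fabs(A[r][c]) > std::fabs(A[piv][c])) piv = r;
        std::swap(A[c], A[piv]);
        for (int r = 0; r < 6; ++r) {
            if (r == c) continue;
            double m = A[r][c] / A[c][c];
            for (int k = c; k < 7; ++k) A[r][k] -= m * A[c][k];
        }
    }
    std::array<double, 6> x{};
    for (int i = 0; i < 6; ++i) x[i] = A[i][6] / A[i][i];
    return x;
}

// IBVS control law of equation 4.6 via the normal equations.
// rowsJ: the 2N x 6 stacked Jacobian (two rows per feature),
// err: the 2N feature position errors (p* - p), lambda: gain.
std::array<double, 6> ibvsVelocity(const std::vector<std::array<double, 6>>& rowsJ,
                                   const std::vector<double>& err,
                                   double lambda) {
    std::array<std::array<double, 7>, 6> A{};  // [J^T J | lambda J^T e]
    for (std::size_t r = 0; r < rowsJ.size(); ++r)
        for (int i = 0; i < 6; ++i) {
            for (int j = 0; j < 6; ++j) A[i][j] += rowsJ[r][i] * rowsJ[r][j];
            A[i][6] += lambda * rowsJ[r][i] * err[r];
        }
    return solve6(A);
}
```

A quick sanity check: generating the error vector from a known camera velocity through a square, non-singular stacked Jacobian must recover exactly that velocity with λ = 1.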
4.2 Implementation<br />
Figure 4.1 gives an overview of the data flow inside the Visual Homing<br />
process. First, the features are detected and feature descriptors extracted<br />
from the current camera image. The process then tries to match these<br />
features with features in the snapshot image coming from the Supervisor.<br />
More details on feature detection and matching are given in section 4.2.2.<br />
The process can only continue if there are more than three matches found. If<br />
this is the case, a RANSAC algorithm is used to get rid of any wrong matches<br />
that would result in a false velocity computation. The RANSAC algorithm<br />
is described in section 4.2.3. With the remaining matched features, the<br />
stacked image Jacobian is composed and the feature position errors are<br />
computed. This is inverted to compute the desired camera velocity (section<br />
4.2.4). After a coordinate transformation from the camera frame to the<br />
robot frame (section 4.2.5) the desired velocity is returned.<br />
4.2.1 Camera<br />
In section 4.1 it was noted that the feature depth Z is needed to compute the<br />
image Jacobian. It has been shown by some researchers that visual servoing<br />
is remarkably tolerant of errors in depth in many real world applications<br />
[9, 10]. Most of these researchers suggest setting Z to a constant value<br />
tuned by the user depending on workspace configuration. However, results<br />
from experiments conducted during this thesis suggest that the depth is a<br />
crucial factor that significantly affects performance, especially in outdoor<br />
environments with great variation in depth (see chapter 7).<br />
For this work, the Microsoft Kinect camera was used. It was chosen, because<br />
in addition to the camera image it also provides a depth estimation and<br />
because of its good integration in the ROS framework.<br />
Limitations<br />
The Kinect sensor is a cheap and easy solution to get a camera image and<br />
a depth estimate for each pixel in the image. However, there are a few<br />
limitations:<br />
Depth The depth estimation only works in a range of up to a few meters.<br />
Anything farther away has an undefined depth value. This limited<br />
range is reasonable for a typical indoor environment, but not very<br />
helpful in most outdoor environments. The constant depth value that
has to be assumed for any point that the Kinect cannot estimate a
depth for can significantly affect performance.

Figure 4.1: Data flow in the Visual Homing process.
Automatic Exposure Like most consumer-grade webcams, the Kinect’s<br />
RGB camera has a built-in automatic exposure time adjustment. In<br />
very bright scenes, the camera automatically reduces the exposure<br />
time. This can result in degraded matching performance, because<br />
features may have a significantly different appearance.<br />
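The depth limitation above is handled by substituting a constant, user-tuned value wherever the sensor reports no depth. A minimal sketch, assuming the ROS convention of encoding missing depth as NaN; the fallback constant is a made-up tuning value, not a number from the thesis.

```cpp
#include <cmath>

// Assumed workspace depth in meters; illustrative tuning value used
// when the Kinect cannot provide a depth estimate for a pixel.
constexpr double kDefaultDepth = 3.0;

double featureDepth(double kinectDepth) {
    // Kinect depth is undefined (NaN) beyond a few meters or on
    // IR-absorbing surfaces; fall back to the constant.
    return std::isnan(kinectDepth) ? kDefaultDepth : kinectDepth;
}
```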
4.2.2 Feature Detection and Matching<br />
There is currently a great effort in the research community to find faster and<br />
more reliable algorithms for feature detection and extraction of a descriptor<br />
for the detected features. Notable developments are the Scale Invariant<br />
Feature Transform (SIFT) by Lowe [14] and Speeded Up Robust Features<br />
(SURF) by Bay et al. [15]. A more recent development is the Binary Robust
Figure 4.2: The Microsoft Kinect sensor.<br />
Invariant Scalable Keypoints algorithm (BRISK) by Leutenegger et al. [16].<br />
In this work, BRISK is used, because it promises better computational<br />
efficiency than SIFT and SURF while still being comparable in terms of<br />
matching performance [16]. The reference implementation 1 in C++ that<br />
Leutenegger et al. published alongside their paper was used here.<br />
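BRISK descriptors are binary strings (512 bits in the reference implementation) that are compared by Hamming distance. The following self-contained sketch illustrates the brute-force nearest-neighbor matching principle with a 64-bit toy descriptor; it is not the reference implementation, and the distance threshold is an illustrative tuning value.

```cpp
#include <vector>
#include <cstdint>
#include <bitset>

// Toy binary descriptor; a real BRISK descriptor is 512 bits.
using Descriptor = std::uint64_t;

int hammingDistance(Descriptor a, Descriptor b) {
    return static_cast<int>(std::bitset<64>(a ^ b).count());
}

// For each query descriptor, return the index of the closest train
// descriptor under the Hamming distance, or -1 if the best distance
// exceeds maxDist.
std::vector<int> matchDescriptors(const std::vector<Descriptor>& query,
                                  const std::vector<Descriptor>& train,
                                  int maxDist) {
    std::vector<int> matches;
    for (Descriptor q : query) {
        int best = -1, bestDist = maxDist + 1;
        for (std::size_t i = 0; i < train.size(); ++i) {
            int d = hammingDistance(q, train[i]);
            if (d < bestDist) { bestDist = d; best = static_cast<int>(i); }
        }
        matches.push_back(best);
    }
    return matches;
}
```

Binary descriptors are one reason BRISK is fast: the Hamming distance reduces to an XOR and a popcount, which is far cheaper than the Euclidean distances needed for SIFT or SURF descriptors.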
4.2.3 RANSAC<br />
Sometimes the feature matching process fails and matches two features that<br />
only have similar appearance but do not represent the same point in the<br />
environment. This wrong match leads to a wrong desired velocity calculated<br />
by the visual servo algorithm. The effect gets worse if there are only<br />
few other correctly matched features or if the feature position error for the<br />
falsely matched feature is large compared to the correctly matched features.<br />
For this reason it is desirable to detect and ignore these incorrect matches.<br />
One way to do this is the Random Sample Consensus algorithm (RANSAC)<br />
[17]. RANSAC can be used to estimate parameters of a mathematical model<br />
from a set of observed data which contains outliers. In this case, the mathematical<br />
model that is to be found is simply the desired velocity produced by<br />
the visual servo algorithm. The observed data are matched feature points<br />
between current view and snapshot and outliers are incorrect matches.<br />
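In the thesis the RANSAC model is the homing velocity itself. To stay self-contained, the sketch below substitutes a simpler stand-in model, a 2D image-space translation between matched feature positions, and, because the minimal sample size is one match, tries every match as the hypothesis exhaustively instead of sampling at random. The tolerance parameter is illustrative.

```cpp
#include <vector>
#include <cmath>

// One matched feature, represented by its position error p* - p.
struct Match { double dx, dy; };

// Simplified RANSAC consensus: for each hypothesis match, count how
// many other matches agree with its translation within tol, and keep
// the largest consensus set. Returns the indices of the inliers.
std::vector<int> ransacInliers(const std::vector<Match>& matches, double tol) {
    std::vector<int> best;
    for (const Match& s : matches) {  // minimal sample = one match
        std::vector<int> inliers;
        for (std::size_t i = 0; i < matches.size(); ++i) {
            double ex = matches[i].dx - s.dx;
            double ey = matches[i].dy - s.dy;
            if (std::hypot(ex, ey) < tol)
                inliers.push_back(static_cast<int>(i));
        }
        if (inliers.size() > best.size()) best = inliers;
    }
    return best;
}
```

An incorrect match with a large position error, as described above, simply fails to join the majority consensus and is dropped before the velocity computation.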
4.2.4 Image Jacobian<br />
The image Jacobian is computed for all matched feature points as described<br />
in section 4.1 and then stacked to get a combined image Jacobian of dimension<br />
2N × 6, where N is the number of matched features. Then, the desired<br />
velocity can be computed by taking the pseudo-inverse of the stacked Jacobian<br />
as in equation 4.6. The pseudo-inverse is applied via singular value
decomposition using
jacobiSvd(Eigen::ComputeThinU | Eigen::ComputeThinV).solve()
where jacobiSvd() is a member of the Eigen::Matrix class and solve() is
provided by the returned Eigen::JacobiSVD object.
1 http://www.asl.ethz.ch/people/lestefan/personal/BRISK
4.2.5 Coordinate Transformation<br />
By default, the visual servo algorithm as it is written in section 4.1 calculates<br />
the desired velocity CνC of the camera expressed in the camera’s coordinate<br />
frame. This has to be transformed into desired velocity RνR of the robot<br />
expressed in the robot frame, which can be done under the assumption that<br />
the camera is rigidly attached to the robot. Figure 1.1 shows the coordinate<br />
frames of camera and robot. The rotation matrix that transforms vectors<br />
expressed in the camera frame into vectors in the robot frame is denoted as<br />
ARC. The translational velocity v and rotational velocity ω are transformed<br />
separately. First, both are transformed from the camera into the robot<br />
frame:
\[
{}^{R}v_C = A_{RC}\,{}^{C}v_C \qquad (4.7)
\]
and
\[
{}^{R}\omega_C = A_{RC}\,{}^{C}\omega_C \qquad (4.8)
\]
Now, since the rotational velocity is the same in all points of a rigid body,<br />
the rotational velocity of the robot is the same as the rotational velocity of<br />
the camera:<br />
\[
{}^{R}\omega_R = {}^{R}\omega_C \qquad (4.9)
\]
The translational velocity of the robot can be computed using the rigid<br />
body equation for velocities<br />
\[
{}^{R}v_R = {}^{R}v_C + {}^{R}\omega_C \times {}^{R}r_{RC} \qquad (4.10)
\]
where RrRC is the vector from the origin of the robot frame R to the origin<br />
of the camera frame C expressed in the robot frame. Finally, the desired<br />
velocity of the robot expressed in the robot frame is<br />
\[
{}^{R}\nu_R = \begin{pmatrix} {}^{R}v_R \\ {}^{R}\omega_R \end{pmatrix} \qquad (4.11)
\]
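The transformation chain of equations 4.7 to 4.11 can be sketched directly in code: rotate both velocity components into the robot frame, keep the angular velocity, and shift the linear velocity by the rigid-body term. Names are illustrative stand-ins for the thesis implementation.

```cpp
#include <array>

using Vec3 = std::array<double, 3>;
using Mat3 = std::array<Vec3, 3>;

Vec3 matVec(const Mat3& A, const Vec3& x) {
    return {A[0][0]*x[0] + A[0][1]*x[1] + A[0][2]*x[2],
            A[1][0]*x[0] + A[1][1]*x[1] + A[1][2]*x[2],
            A[2][0]*x[0] + A[2][1]*x[1] + A[2][2]*x[2]};
}

Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a[1]*b[2] - a[2]*b[1],
            a[2]*b[0] - a[0]*b[2],
            a[0]*b[1] - a[1]*b[0]};
}

// A_RC rotates camera-frame vectors into the robot frame; r_RC is the
// robot-to-camera offset expressed in the robot frame.
void cameraToRobotVelocity(const Mat3& A_RC, const Vec3& r_RC,
                           const Vec3& v_cam, const Vec3& w_cam,
                           Vec3& v_robot, Vec3& w_robot) {
    Vec3 v = matVec(A_RC, v_cam);       // eq. 4.7
    w_robot = matVec(A_RC, w_cam);      // eq. 4.8 and 4.9
    Vec3 shift = cross(w_robot, r_RC);  // rigid-body term of eq. 4.10
    for (int i = 0; i < 3; ++i) v_robot[i] = v[i] + shift[i];
}
```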
Chapter 5<br />
Supervisor<br />
The Supervisor builds and maintains the topological map that represents<br />
the learned path. During the learning phase, it decides when to create a<br />
new topological node in the map. This will be further described in section<br />
5.1. During autonomous playback, the Supervisor passes snapshot images<br />
to the Visual Homing process. It has to decide when a snapshot image has<br />
been successfully reached and transition to the next snapshot in line. The<br />
behavior during Playback is described in section 5.2. Finally, when playback<br />
is first initiated, the Supervisor has the ability to localize the robot on the<br />
path to determine at which snapshot it should start the playback (see section<br />
5.3).<br />
Instead of saving the full images, the Supervisor internally stores the snapshot<br />
images as feature vectors. A feature vector describes one image and<br />
contains the descriptors of all BRISK features detected in the image. This<br />
is done to both decrease the memory needed to store the map and also to<br />
save the computation time for detecting and extracting the features from<br />
the same image multiple times.<br />
5.1 Path Learning<br />
5.1.1 Sparse<br />
In the sparse version, the Supervisor tries to save as few snapshots as possible<br />
while still being able to guarantee that the snapshots are reachable from<br />
one another, i.e. that they lie in each other’s catchment area. When the human<br />
operator starts driving the robot along the path, the Supervisor saves<br />
the camera image as the first snapshot. It then passes this first snapshot<br />
image to the Visual Homing process. Now, while the robot is driven away<br />
from the first snapshot, the magnitude of the homing vector produced by<br />
the Visual Homing process steadily increases. Once it is bigger than a certain<br />
threshold, the Supervisor saves a new snapshot image and the process
repeats itself.
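The snapshot-saving rule above reduces to a threshold test on the homing vector magnitude. A minimal sketch; the class, types and threshold value are illustrative stand-ins for the ROS implementation.

```cpp
#include <vector>
#include <cstddef>

struct Snapshot { int id; };  // placeholder for the stored feature vector

class SparseLearner {
public:
    explicit SparseLearner(double threshold) : threshold_(threshold) {}

    // Called with the magnitude of the current homing vector towards
    // the last saved snapshot; returns true when a new snapshot was
    // saved (the process then repeats with the new reference).
    bool update(double homingMagnitude) {
        if (homingMagnitude > threshold_) {
            snapshots_.push_back({nextId_++});
            return true;
        }
        return false;
    }

    std::size_t snapshotCount() const { return snapshots_.size(); }

private:
    double threshold_;
    int nextId_ = 0;
    std::vector<Snapshot> snapshots_;
};
```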
5.1.2 Dense<br />
In contrast to the sparse version, the Supervisor in the dense version saves a<br />
lot more data in the map. It saves snapshots at the highest rate that is
technically possible for the camera, which is typically around 10 Hz 1 . Along with these snapshots
it also saves the velocity commands which are given by the human operator.<br />
Also, a timestamp is saved for each snapshot, so that during playback, the<br />
Supervisor knows when to publish the next snapshot.<br />
5.2 Autonomous Path Playback<br />
5.2.1 Sparse<br />
The trickiest part of playback in the sparse version is to know when to make<br />
the transition to the next snapshot. When playback is first initiated, the<br />
Supervisor passes the first snapshot image to Visual Homing. Subsequently,<br />
the robot will start moving towards this snapshot image. Similarly as in<br />
path learning, the Supervisor looks at the magnitude of the homing vector,<br />
or desired velocity, produced by the Visual Homing process. This magnitude<br />
decreases when the robot gets closer to the snapshot. The next snapshot is<br />
published as soon as the magnitude falls below a certain threshold.<br />
5.2.2 Dense<br />
During playback in the dense version, the Supervisor publishes the snapshots<br />
at the same rate as they were acquired. This behavior is much like<br />
a video player would play back a movie. The recorded velocity commands<br />
that were stored alongside the snapshot images are passed to the Arbiter,<br />
where they will be combined with the output from Visual Homing (see section<br />
6.2). Thus, the recorded control inputs are used as a feed-forward
control. On top of that, Visual Homing provides a closed-loop control
to correct small errors. That means, if the robot is too slow, the velocity is<br />
increased so it can catch up to the desired position on the path. If it is too<br />
fast, velocity is decreased.<br />
A problem arises if, for any reason, the robot lags behind too much. Since<br />
the robot’s speed is limited, it might not be able to catch up. Even worse,<br />
if the system fails to find any correspondences between current view and<br />
snapshot image, it will not be able to correct the error at all. The Supervisor<br />
detects this risk by – again – looking at the magnitude of the homing vector<br />
produced by Visual Homing. If and only if the magnitude rises above a<br />
certain threshold and the homing vector points into the same direction as<br />
the recorded velocity command, the Supervisor pauses the playback, i.e. it<br />
delays the transition to the next snapshot. This is much like hitting the<br />
pause button on the video player: The image – in this case the snapshot<br />
image – is still there, only the next image in line is not shown. Thus, the<br />
robot still drives with the same recorded velocity command associated with<br />
1 While the Kinect camera could provide 640 × 480 images with a frequency of 30 Hz,<br />
the extra time is needed to detect and extract the BRISK features from the camera<br />
images. Therefore, the frequency depends on the number of features that are detected.
this snapshot and Visual Homing still tries to reduce the error. Once the<br />
magnitude of the homing vector falls below the threshold, path progression<br />
is continued.<br />
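The pause condition described above combines a magnitude test with a direction test. A minimal sketch; the velocity type and threshold value are illustrative.

```cpp
#include <array>
#include <cmath>

using Vel = std::array<double, 3>;  // (vx, vy, wz)

double dot(const Vel& a, const Vel& b) {
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2];
}

double norm(const Vel& a) { return std::sqrt(dot(a, a)); }

// Pause playback if and only if the homing error is large AND the
// homing vector points in the same direction as the recorded command,
// i.e. the robot is lagging behind the taught trajectory.
bool shouldPausePlayback(const Vel& homing, const Vel& recorded,
                         double threshold) {
    return norm(homing) > threshold && dot(homing, recorded) > 0.0;
}
```

The direction test matters: a large homing vector pointing against the recorded command means the robot is ahead, not behind, and pausing would make the error worse.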
5.3 Localization<br />
So far, the Supervisor has assumed that when path playback is first initiated,<br />
the robot starts at the same place where path learning was started before.<br />
Localization functionality was added to the Supervisor to be able to put the<br />
robot at any position on the path and let it go from there. When playback<br />
is initiated, the Supervisor first tries to find the most likely position of the<br />
robot on the path and then starts playback from this position. For this, the<br />
current camera image is compared to each snapshot of the learned path.<br />
This is done by passing one snapshot image after the other to the Visual<br />
Homing process. For each of them, Visual Homing produces a homing<br />
vector. The snapshot for which it produces the homing vector with the<br />
smallest non-zero magnitude is assumed to be the most likely position of<br />
the robot. Figure 5.1 shows the magnitudes of all homing vectors with a<br />
clear minimum that represents the most likely position of the robot.<br />
Figure 5.1: Magnitude of the homing vectors from current position to each
snapshot on the path with one distinct minimum.
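The selection rule amounts to an argmin over the snapshot indices. A minimal sketch; treating a magnitude of exactly zero as "no valid homing vector" is one interpretation of the "non-zero" caveat above, not the thesis code.

```cpp
#include <vector>
#include <cstddef>

// Return the index of the snapshot with the smallest non-zero homing
// vector magnitude, or -1 if no snapshot yields a valid homing vector.
int localize(const std::vector<double>& homingMagnitudes) {
    int best = -1;
    for (std::size_t i = 0; i < homingMagnitudes.size(); ++i) {
        double m = homingMagnitudes[i];
        if (m <= 0.0) continue;  // no correspondences for this snapshot
        if (best < 0 || m < homingMagnitudes[best])
            best = static_cast<int>(i);
    }
    return best;
}
```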
One could think of a situation where two locations on the path have a similar<br />
appearance. If the robot is standing at one of these locations, there will be<br />
two local minima as seen in figure 5.2.<br />
If the Supervisor ends up picking the wrong one of the two minima, the robot
might cause a collision. To avoid this, a double check has been implemented.<br />
During the initial localization, the two most likely positions are saved. Then,<br />
it is first assumed that the robot is at the first most likely position. From<br />
there, the Supervisor starts playing back the path. After a certain number<br />
of snapshots the robot is stopped and localization is done again. If the
Figure 5.2: Magnitude of the homing vectors from current position to each
snapshot on the path with two minima.
most likely position in this second localization points to approximately the<br />
expected position, then the first localization is assumed to be correct and<br />
playback is restarted. If, however, the second localization points to an<br />
unexpected location, the Supervisor makes the robot reverse back to the<br />
original starting position and starts playback from the second most likely<br />
position.<br />
To be able to reverse back from the double check location to the original<br />
starting position, two things have to be done:<br />
1. The path from the original starting position up to the point where the<br />
double check takes place has to be saved. Otherwise, the robot would<br />
not know how to get back. This is done by the same path learning<br />
technique as during the regular path learning. For the dense version<br />
that means also recording the velocity commands given to the robot,<br />
this time not given by the human operator, but by the Arbiter.<br />
2. The Supervisor has to play the path backwards, so the order of snapshots<br />
is reversed. Furthermore, in the dense version, the recorded<br />
velocity commands have to be inversed by changing the sign of each<br />
component of the velocity vector.
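The two reversal steps can be sketched compactly: reverse the snapshot order and negate every component of each recorded velocity command. Types are illustrative stand-ins.

```cpp
#include <algorithm>
#include <array>
#include <vector>

using Vel = std::array<double, 3>;  // (vx, vy, wz)

// Prepare the recorded path segment for reverse playback: snapshots
// are visited in reverse order and, in the dense version, each
// recorded velocity command is inverted component-wise.
void reversePath(std::vector<int>& snapshotIds, std::vector<Vel>& commands) {
    std::reverse(snapshotIds.begin(), snapshotIds.end());
    std::reverse(commands.begin(), commands.end());
    for (Vel& c : commands)
        for (double& x : c) x = -x;  // change the sign of each component
}
```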
Chapter 6<br />
Arbiter<br />
The Arbiter process got its name because it has the power to decide which
desired velocity is allowed to be sent to the robot at a given time. For<br />
example, in path learning mode, the joystick’s command is sent to the robot,<br />
while in playback mode, the commands from the Visual Homing process are<br />
forwarded. However, before sending the velocity commands on to the robot,<br />
the Arbiter also adapts the velocity command to the particular robot. In<br />
the following sections, only the procedure during playback is shown, because<br />
the adaptation of the joystick’s command during path learning is trivial.<br />
As is the case for the Supervisor, there are again two versions of the Arbiter<br />
– a sparse and a dense version. These are detailed in sections 6.1 and 6.2,<br />
respectively.<br />
6.1 Sparse<br />
Figure 6.1a shows the data flow of the Arbiter in the sparse version. The<br />
individual steps are described in the following sections.<br />
6.1.1 Y-Offset Correction<br />
The desired velocity, or homing vector, produced by the Visual Homing process<br />
has all six degrees of freedom. Since the two kinds of robots considered
in this work – a differential drive ground robot and a differential drive boat<br />
– cannot control all of these degrees of freedom, the dimensionality has to<br />
be reduced. Only vx (translation in x direction) and ωz (rotation around<br />
the z axis) can be controlled directly (see figure 1.1). This is the velocity<br />
command that can be sent to the robot. Additionally, vy (translation in y<br />
direction) can be controlled by a combination of vx and ωz.<br />
When the robot faces in a direction parallel to the path from the last to<br />
the next node, vy indicates the robot’s offset in y-direction from the path.<br />
To reduce this y-offset, the robot’s heading θ must be changed towards the<br />
desired path. However, since Visual Homing tries to control θ back to the<br />
value it thinks is right, a correction value cθ can be added to the desired<br />
(a) Sparse (b) Dense<br />
Figure 6.1: Data flow in the Arbiter process in the sparse and the dense<br />
version.
velocity ωz such that Visual Homing and cθ counteract each other. This<br />
can be done by the following algorithm:<br />
cθ ← 0<br />
while (vx, vy, ωz) ← new desired velocity do<br />
cθ ← clip (cθ + ki ∗ clip(vy, k1), k2)<br />
ωz ← ωz + cθ ∗ clip(k3 ∗ vx, k4)<br />
end while<br />
The constants ki, k1, k2, k3 and k4 are tuning parameters and clip(x, xmax)<br />
is a function that limits the absolute value of x to xmax:<br />
function clip(x, xmax)<br />
if x > xmax then<br />
return xmax<br />
else if x < −xmax then<br />
return −xmax<br />
else<br />
return x<br />
end if<br />
end function<br />
Two things should be noted here:<br />
1. cθ is an integral, because it keeps its value between iterations of the
while-loop. This is necessary, because the vy value produced by Visual<br />
Homing decreases once the robot turns towards the path. ωz does not<br />
keep its value between iterations, but is recalculated every iteration<br />
by Visual Homing.<br />
2. In the update step of ωz, the correction value cθ is scaled by vx. This<br />
ensures that, when the robot gets close to the snapshot image, the<br />
influence of cθ decreases so that it lets Visual Homing reduce the θ<br />
error. This is necessary for the Supervisor to be able to detect arrival<br />
at the snapshot.<br />
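The correction loop and clip() function above translate almost directly into C++. The gain values below are made-up placeholders; the thesis leaves them as tuning parameters.

```cpp
#include <algorithm>
#include <cmath>

// Limit the absolute value of x to xmax (the clip() function above).
double clip(double x, double xmax) {
    return std::max(-xmax, std::min(x, xmax));
}

struct YOffsetCorrector {
    // Tuning parameters ki, k1..k4 from the algorithm above;
    // the numeric values here are illustrative only.
    double ki = 0.1, k1 = 0.5, k2 = 0.3, k3 = 2.0, k4 = 1.0;
    double cTheta = 0.0;  // integral state, kept between iterations

    // One iteration of the while-loop: takes the new desired velocity
    // (vx, vy, wz) and returns the corrected wz.
    double correct(double vx, double vy, double wz) {
        cTheta = clip(cTheta + ki * clip(vy, k1), k2);
        return wz + cTheta * clip(k3 * vx, k4);  // scale correction by vx
    }
};
```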
6.1.2 Filtering and Clipping<br />
The desired velocity produced by Visual Homing can be very noisy and,<br />
when some matches are wrong, can have very large spikes. To make the<br />
robot drive more smoothly, the rate of change of the individual components<br />
of the velocity vector is limited. Furthermore, to keep the robot safe, the<br />
absolute value of each velocity component is limited by the clip() function.<br />
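Per component, this stage is a slew-rate limiter followed by an absolute clamp. A minimal sketch; maxStep and maxAbs are illustrative tuning values.

```cpp
#include <algorithm>

// Limit the absolute value of x to xmax.
double clipAbs(double x, double xmax) {
    return std::max(-xmax, std::min(x, xmax));
}

// Smooth one velocity component: limit its rate of change relative to
// the previous iteration's value, then limit its absolute value.
double filterComponent(double prev, double desired,
                       double maxStep, double maxAbs) {
    double step = clipAbs(desired - prev, maxStep);  // slew-rate limit
    return clipAbs(prev + step, maxAbs);             // safety clamp
}
```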
6.1.3 Correct Theta First<br />
While the Visual Servo algorithm guarantees that the camera eventually<br />
reaches the target pose, it does not guarantee to get there via the shortest<br />
path. This is most noticeable when only an error in θ is present. Instead
of just giving a ωz command, visual servoing also produces a forward or<br />
backward velocity. This problem is known in literature as the camera retreat<br />
problem. To get around this problem, the Arbiter ensures that any error in
θ is controlled first, so forward velocity vx is suppressed when ωz is large.<br />
This is done by updating vx with<br />
\[
v_x \leftarrow v_x \cdot c_{v_x}(\omega_z)
\]
The coefficient function c_{v_x}(ω_z) is described by
\[
c_{v_x}(\omega_z) = \frac{1}{1 + \exp\left(k_5 \, (|\omega_z| - k_6)\right)} \qquad (6.1)
\]
where k5 and k6 are tuning parameters that determine up to which error in<br />
θ the robot is still allowed to drive in x direction. The resulting function is<br />
depicted in figure 6.2.<br />
cvx (ωz)<br />
1<br />
0.75<br />
0.5<br />
0.25<br />
0 0.25 0.5 0.75 1<br />
Figure 6.2: The coefficient function that suppresses forward velocity when<br />
a θ error is present. Here, k5 = 30 and k6 = 0.45.<br />
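Equation 6.1 is a logistic gate on the rotational error. A one-line sketch, using the parameter values from figure 6.2 as defaults:

```cpp
#include <cmath>

// Coefficient of equation 6.1: close to 1 for small |wz| (drive
// forward freely), close to 0 for large |wz| (correct theta first).
// Default k5, k6 follow figure 6.2.
double forwardCoefficient(double wz, double k5 = 30.0, double k6 = 0.45) {
    return 1.0 / (1.0 + std::exp(k5 * (std::fabs(wz) - k6)));
}
```

k6 sets the |ωz| at which the gate sits at one half, and k5 sets how sharply it switches.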
6.2 Dense<br />
Figure 6.1b shows the data flow of the Arbiter in the dense version. Most<br />
steps in the dense version are the same as in the sparse version. The steps<br />
that are different are described in the following sections.<br />
6.2.1 Sum of the Two Velocity Vectors<br />
Additionally to the desired velocity input from the Visual Homing process,<br />
the dense Arbiter also receives the recorded velocity commands from the<br />
Supervisor. These two velocities have to be fused in order to get a single<br />
velocity command that can be passed on to the robot. This is done by<br />
adding the two velocity vectors together. Thus, if the two vectors point<br />
into the same direction, the robot accelerates, if they point into opposite<br />
directions, it slows down.<br />
6.2.2 Y-Offset Correction<br />
As in the sparse version, a correction value cθ needs to be added to ωz<br />
to turn the robot towards the path. However, there are two reasons why<br />
y-offset correction is less complicated in the dense version:<br />
1. Generally, the error in x direction is far smaller in the dense version.<br />
This helps correction of y-offset, because it greatly reduces the issue<br />
where the vy produced by visual servoing decreases when the robot<br />
turns towards the path. Therefore, cθ can be calculated anew in every
iteration and does not need to be an integrator.
2. Since the Supervisor’s decision to transition to the next snapshot is<br />
independent of the homing vector, the robot does not need to reduce<br />
the heading error when it gets close to a snapshot.<br />
Thus, the ωz-update step simplifies to<br />
while (vx, vy, ωz) ← new desired velocity do<br />
cθ ← clip (k7 ∗ vx ∗ vy, k8)<br />
ωz ← ωz + cθ<br />
end while<br />
The constants k7 and k8 are tuning parameters. This step happens after<br />
the combination of the two velocity commands, so vx, vy and ωz already<br />
contain both the desired velocity from Visual Homing and the recorded<br />
velocity command from the Supervisor.
Chapter 7<br />
Experiments<br />
7.1 BIBA Ground Robot<br />
The whole system was developed and tested on the differential drive ground<br />
robot BIBA, shown in figure 7.1. Tests against ground truth were done with<br />
a Vicon optical motion capture system 1 .<br />
Figure 7.1: The differential drive ground robot BIBA with the Kinect sensor<br />
mounted on top.<br />
During all tests, the robot was first steered by a human operator via a<br />
joystick to teach a path. Where compared against each other, both the<br />
sparse and the dense Supervisor learned the same path.<br />
1 http://www.vicon.com<br />
7.1.1 Path Tracking<br />
Figure 7.2 shows the trajectories viewed from above of the robot during<br />
path learning, sparse playback and dense playback. It also shows where<br />
topological nodes were created by the Supervisor and where the Supervisor<br />
actually transitioned from one node to the next during sparse playback. It<br />
reveals the problems of the sparse playback. While tracking of θ and x is<br />
very accurate, y-offset correction is very inefficient, especially on a curvy<br />
path like this. This is due to the fact that θ is controlled first: The robot<br />
first aligns its heading and then it drives forward. A different problem of<br />
the sparse version is only noticeable when looking at the robot in real life: It<br />
drives in a stop-and-go motion. The robot has to slow down until it almost<br />
stops at each node, before the Supervisor detects arrival and it can head on<br />
to the next node. So, while the sparse version gets close to the destination,<br />
the way it gets there is not very pretty and potentially dangerous if the<br />
path is close to obstacles. The dense version tracks the learned path a lot<br />
better. When observing the robot in real life it is almost impossible to tell<br />
the difference between the human operator driving during path learning and<br />
autonomous playback.<br />
Figure 7.2: Path tracking of the sparse and dense version compared against
the learned path.
A more systematic evaluation of y-offset correction in the dense version is<br />
shown in figure 7.3. The learned path is a straight line and playback is<br />
started with an offset in y. The system is able to correct the offset quickly<br />
with little overshoot.<br />
In general, the dense version looks much nicer and it is much more likely to<br />
get the robot to the destination, especially in environments where very few<br />
features can be detected. Because of the feed-forward control via recorded<br />
velocity commands, the dense version is able to get across regions where<br />
visual servoing cannot produce a desired velocity. In those regions, the<br />
sparse version fails.
Figure 7.3: y-offset correction in the dense version.

7.1.2 Localization
Generally, localization works very well, even when the playback is started<br />
with a large y-offset from the path. As expected, it is less reliable in
environments with a low number of visual features. However, it was very<br />
hard to test the recovery after a failed localization, since a procedure had<br />
to be found to reliably confuse the system. This was done by the setup<br />
shown in figure 7.4. Two identical movable walls were created with strong<br />
visual features so that a path could be recorded that has two locations with<br />
identical camera views. The taught path is pictured in figure 7.5. The robot<br />
was driven towards wall 1, turned right, driven towards wall 2 and turned<br />
left. During the learning stage, wall 2 was approached a little bit closer<br />
than wall 1. Then, the robot was put in front of wall 1, but a little bit<br />
closer than while recording the path. This procedure effectively tricks the<br />
system into thinking that it is standing in front of wall 2 when, in fact, it<br />
is standing in front of wall 1.<br />
Figure 7.5 shows the path during dense playback. Localization was initiated<br />
at the starting position. Since localization failed and the system thought<br />
it was in front of wall 2, the robot first turned left. After driving for 20<br />
snapshots, the Supervisor stopped playback and localized again. Since this<br />
did not yield the expected location 20 snapshots beyond the assumed starting<br />
point, the system realized that its first localization had been wrong. It could<br />
recover from this situation because the path from the original starting<br />
position to the double-check position had been recorded. It reversed back to<br />
the starting position and started playback from the second most likely solution<br />
of the initial localization. This one was correct and the robot completed<br />
the path.<br />
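The double-check procedure just described can be sketched in Python. The function names and the ranked-candidate interface are hypothetical stand-ins for the actual Supervisor and its ROS nodes; only the 20-snapshot check distance is taken from the experiment above.<br />

```python
def robust_localize(localize, drive_segment, reverse_segment, check_dist=20):
    """Double-checked localization on the taught path (sketch).

    localize() is assumed to return candidate snapshot indices ranked from
    most to least likely; drive_segment(n) plays back n snapshots while
    recording the driven path; reverse_segment() drives that recording
    backwards to the starting position. All names are illustrative.
    """
    candidates = localize()
    for start in candidates:
        drive_segment(check_dist)          # drive on, recording the path
        check = localize()
        if check and check[0] == start + check_dist:
            return start + check_dist      # hypothesis confirmed
        reverse_segment()                  # wrong wall: reverse to the start
    raise RuntimeError("no localization hypothesis could be confirmed")
```

In the wall experiment, the first hypothesis (wall 2) fails the check after 20 snapshots, the robot reverses to the start, and the second-ranked hypothesis then passes.<br />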
As mentioned, it was very hard to confuse the localization. It is very unlikely<br />
to encounter a situation like this in real life. However, in the future, a similar<br />
approach could be used with a less reliable sensor than vision. This is why<br />
this test is to be seen as a proof of concept. The important point here is<br />
that, because the system is able to record the path from the original starting<br />
location up until the point where it notices that it is wrong, it is able to<br />
recover from this situation by reversing back to the start.<br />
Figure 7.4: Setup to test recovery from a failed localization with two identical<br />
movable walls.<br />
7.2 Lizhbeth ASV<br />
The ASV Lizhbeth shown in figure 7.6 is a differential drive catamaran<br />
boat and therefore has motion characteristics very similar to BIBA's. The<br />
biggest difference is that motor speed does not translate directly into boat<br />
speed but into acceleration. This makes the feed-forward control used<br />
in the dense version much less effective. The tests on the boat showed that<br />
the dense version performed much worse than the sparse version in this<br />
application. However, even the sparse version did not work very well:<br />
only in perfect conditions, with no wind and almost no waves, was the<br />
system able to complete paths. Nevertheless, the tests with Lizhbeth<br />
helped to identify problems that cause the system to perform worse than<br />
on land.<br />
Lighting Problems Figure 7.7 shows two camera images from the same<br />
video, taken about one second apart. Due to the camera’s built-in exposure<br />
adjustment, details inside the boat-house are clearly visible in one image<br />
and completely underexposed in the other, and vice versa for landmarks<br />
across the lake. This poses a great problem for visual homing, because<br />
image features that are used to control the boat may be completely invisible<br />
a moment later. The issue could be mitigated by using a camera that allows<br />
manual control of the shutter time. That way, during the path learning<br />
phase, the Supervisor could record the shutter time used for capturing each<br />
snapshot; during playback, the camera exposure could always be set to the<br />
value that corresponds to the current snapshot.
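This exposure-recording scheme could look like the following sketch. The camera object and its get/set shutter calls are hypothetical; the Kinect driver used in this work does not expose them.<br />

```python
class ExposureMemory:
    """Sketch of the proposed fix: remember the shutter time used for each
    snapshot during learning and restore it during playback. The camera
    object and its get_shutter()/set_shutter() methods are hypothetical."""

    def __init__(self, camera):
        self.camera = camera
        self.shutter_times = []                # one entry per snapshot

    def record_snapshot(self):
        # learning phase: store the shutter time the camera is using now
        self.shutter_times.append(self.camera.get_shutter())

    def play_snapshot(self, index):
        # playback phase: force the exposure that matched this snapshot
        self.camera.set_shutter(self.shutter_times[index])
```

During playback, the feature matcher would then always compare images taken under the same exposure that was in effect when the corresponding snapshot was learned.<br />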
Figure 7.5: Path of the robot with recovery from a failed localization with<br />
two identical movable walls.<br />
Depth Estimation Problems As mentioned in section 4.1, the visual<br />
servoing algorithm needs the depth of each feature point to be able to compute<br />
the image Jacobian matrix. The Kinect sensor gives this information<br />
only up to a range of a few meters. Points that the Kinect cannot estimate<br />
a depth for are assumed to be at a constant depth. This is enough for an<br />
indoor environment, but outdoor landmarks are often farther away than a<br />
few meters. While visual servoing is surprisingly tolerant of errors in the<br />
depth estimate, there are limits. For example, when driving toward the<br />
boat house, a constant depth assumption of 10 meters might be sensible.<br />
But if the boat is to drive out of the harbor, points on the other side of the<br />
lake may easily be several kilometers away, so the assumption of 10 meters<br />
would grossly underestimate the depth. While this underestimation does<br />
not affect the rotational velocity component ω of the homing vector, it<br />
decreases the magnitude of the translational velocity v, as can be seen in<br />
equation 4.1. A solution to this problem could be to use stereo vision to<br />
estimate depth. While stereo vision cannot accurately estimate the depth<br />
of far-away points either, it would at least be able to indicate that the<br />
points are far away.
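Equation 4.1 is not reproduced in this chapter, but the effect can be illustrated with the standard point-feature interaction matrix from the visual-servoing literature, reduced to the two velocities the robot can control. In this sketch (a textbook formulation, not the thesis' exact implementation), the least-squares forward velocity scales linearly with the assumed depth Z, while the rotational component does not depend on Z at all.<br />

```python
import numpy as np

def homing_velocity(features, error, Z, gain=1.0):
    """Least-squares IBVS for a planar robot (sketch).

    Unknowns: forward velocity v along the camera's optical axis and yaw
    rate w. Each normalized image point (x, y) at assumed depth Z
    contributes the standard interaction-matrix rows
        x_dot = (x / Z) * v - (1 + x**2) * w
        y_dot = (y / Z) * v - x * y * w
    so the v-column scales with 1/Z: assuming too small a depth makes the
    computed v too small, while the w-column contains no Z.
    """
    rows = []
    for x, y in features:
        rows.append([x / Z, -(1.0 + x * x)])
        rows.append([y / Z, -x * y])
    L = np.array(rows)
    v, w = gain * np.linalg.lstsq(L, np.asarray(error, float), rcond=None)[0]
    return v, w
```

Solving the same image error with an assumed depth of 10 m when the true depth is 1000 m yields a forward velocity 100 times too small, with ω unchanged, which is the underestimation effect described above.<br />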
Figure 7.6: The ASV Lizhbeth.<br />
Figure 7.7: Two frames from the same video, captured about 1 second apart<br />
from each other. Automatic exposure adjustment of the camera causes<br />
lighting problems.
Chapter 8<br />
Conclusion and Outlook<br />
In this thesis, a software system was developed that can learn a path from a<br />
stream of images. A human operator steers the robot, thereby teaching it<br />
the path; afterwards, the robot can be placed anywhere near the taught path<br />
and will complete it autonomously. Two approaches based on Image Based<br />
Visual Servoing have been developed: a sparse version, where the system<br />
saves as few images as possible and follows the path using only vision to<br />
control the robot, and a dense version, where the full video stream is saved<br />
alongside the control commands given during the path learning phase. During<br />
playback, the recorded commands drive the robot in feed-forward fashion<br />
while IBVS is used to correct errors. Furthermore, a method has been<br />
presented that allows the robot to localize itself on the previously taught<br />
path. If localization fails and the robot starts driving in a wrong direction,<br />
the system recovers by recording the incorrectly driven path as well and<br />
reversing it back to the original starting position.<br />
The entire system has been developed using a differential drive ground robot<br />
equipped with a Kinect camera and depth sensor. It was then tested against<br />
ground truth using a Vicon optical motion capture system. The tests show<br />
that, while both the sparse and the dense version bring the robot close to its<br />
destination, the dense version does so in a much smoother, safer, more accurate<br />
and more reliable way. They also demonstrate the system's ability to compensate<br />
for the non-holonomic properties of the differential drive robot, especially in<br />
the dense version, and the recovery from a failed localization has been shown<br />
to work.<br />
Further tests of the system have been conducted on a differential drive<br />
catamaran ASV on Lake Zurich. While the sparse version was able to<br />
complete paths in optimal environmental conditions, the tests have shown<br />
that the presented approach will not work reliably on the boat with the<br />
Kinect sensor. The main reasons why the Kinect is insufficient are lighting<br />
problems due to the camera’s automatic exposure adjustment and the<br />
limited range of depth estimation.<br />
8.1 Future Improvements<br />
For outdoor use, a stereo vision approach could be more suitable for providing<br />
depth information. Although not very accurate at long range, stereo vision<br />
at least indicates that a point is far away, whereas the Kinect can<br />
only report that it has no depth information for a far-away point.<br />
Feature matching performance could be improved by using a camera that<br />
allows manual control of exposure. This would reduce the problems caused<br />
by lighting changes in the image.<br />
Localization reliability could be improved by suitable filtering of the homing<br />
vector magnitudes that are used for localization. Consider the plot shown<br />
in figure 8.1. The correct localization in this situation would be at snapshot<br />
number 33. Indeed, snapshot 33 does produce the homing vector with the<br />
lowest magnitude. However, the localization almost failed due to some<br />
outliers around snapshot number 230. For example, a median filter could<br />
be used to remove the worst outliers.<br />
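One possible form of such filtering, sketched below with illustrative names and thresholds, replaces each magnitude that deviates strongly from its local median. An isolated spurious dip, like the one near snapshot 230, is rejected, while a broad, genuine minimum, like the one at snapshot 33, survives.<br />

```python
import numpy as np

def localize_filtered(magnitudes, kernel=5, thresh=0.5):
    """Pick the best-matching snapshot after median-based outlier
    rejection of the homing-vector magnitudes (sketch).

    Values that differ from their local median by more than `thresh`
    are treated as outliers and replaced by that median; the snapshot
    with the smallest cleaned magnitude is returned.
    """
    m = np.asarray(magnitudes, dtype=float)
    half = kernel // 2
    padded = np.pad(m, half, mode="edge")
    medians = np.array([np.median(padded[i:i + kernel])
                        for i in range(len(m))])
    cleaned = np.where(np.abs(m - medians) > thresh, medians, m)
    return int(np.argmin(cleaned))
```

A genuine minimum is surrounded by similarly low magnitudes (compare figure 5.1), so it passes the median test, whereas a single spurious dip does not.<br />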
Figure 8.1: Magnitudes of homing vectors to each snapshot used for localization.<br />
List of Figures<br />
1.1 Coordinate frames of robot (light gray) and camera (dark<br />
gray). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11<br />
3.1 Illustration of the learned path in the sparse version. . . . . . 18<br />
3.2 Data flow between the ROS nodes in the sparse version. . . . 19<br />
(a) Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 19<br />
(b) Playback . . . . . . . . . . . . . . . . . . . . . . . . . . 19<br />
3.3 Illustration of the learned path in the dense version. . . . . . 20<br />
3.4 Data flow between the ROS nodes in the dense version. . . . 21<br />
(a) Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 21<br />
(b) Playback . . . . . . . . . . . . . . . . . . . . . . . . . . 21<br />
4.1 Data flow in the Visual Homing process. . . . . . . . . . . . . 26<br />
4.2 The Microsoft Kinect sensor. . . . . . . . . . . . . . . . . . . 27<br />
5.1 Magnitude of the homing vectors from current position to<br />
each snapshot on the path with one distinct minimum. . . . 31<br />
5.2 Magnitude of the homing vectors from current position to<br />
each snapshot on the path with two minima. . . . . . . . . . 32<br />
6.1 Data flow in the Arbiter process in the sparse and the dense<br />
version. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34<br />
(a) Sparse . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34<br />
(b) Dense . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34<br />
6.2 The coefficient function that suppresses forward velocity when<br />
a θ error is present. Here, k5 = 30 and k6 = 0.45. . . . . . . . 36<br />
7.1 The differential drive ground robot BIBA with the Kinect<br />
sensor mounted on top. . . . . . . . . . . . . . . . . . . . . . 39<br />
7.2 Path tracking of the sparse and dense version compared against<br />
the learned path. . . . . . . . . . . . . . . . . . . . . . . . . . 40<br />
7.3 y-offset correction in the dense version. . . . . . . . . . . . . . 41<br />
7.4 Setup to test recovery from a failed localization with two<br />
identical movable walls. . . . . . . . . . . . . . . . . . . . . . 42<br />
7.5 Path of the robot with recovery from a failed localization<br />
with two identical movable walls. . . . . . . . . . . . . . . . . 43<br />
7.6 The ASV Lizhbeth. . . . . . . . . . . . . . . . . . . . . . . . . 44<br />
7.7 Two frames from the same video, captured about 1 second<br />
apart from each other. Automatic exposure adjustment of<br />
the camera causes lighting problems. . . . . . . . . . . . . . . 44<br />
(a) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44<br />
(b) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44<br />
8.1 Magnitudes of homing vectors to each snapshot used for localization.<br />
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46