UNIVERSITÀ DEGLI STUDI DI CATANIA
Faculty of Engineering
Master's degree course in Computer Engineering

Filippo Bannò

STEREOSCOPIC AUGMENTED REALITY
TO ASSIST ROBOT TELEOPERATION

Master's thesis
Academic year 2008/2009

Supervisor: Prof. Ing. G. Muscato
Co-supervisor: Dr. S. Livatino
Contents

1 Introduction
2 Background
  2.1 Augmented reality
  2.2 Pinhole camera model
  2.3 Stereoscopic visualization
3 Augmented reality visual interfaces in robot teleoperation
  3.1 A sensor fusion based user interface for vehicle teleoperation
  3.2 Fusion of laser and visual data for robot motion planning and collision avoidance
  3.3 Using augmented reality to interact with an autonomous mobile platform
  3.4 Improved interfaces for human-robot interaction in urban search and rescue
  3.5 Ecological interfaces for improving mobile robot teleoperation
  3.6 Egocentric and exocentric teleoperation interface using real-time, 3D video projection
  3.7 Summary and analysis
4 Previous work on 3MORDUC teleoperation
  4.1 The 3MORDUC platform
  4.2 Mobile robotic teleguide based on video images
  4.3 Depth-enhanced mobile robot teleguide based on laser images
  4.4 Augmented reality stereoscopic visualization for intuitive robot teleguide
  4.5 Summary and analysis
5 Proposed method: AR stereoscopic visualization
  5.1 Core idea and motivation
  5.2 Research development strategy
6 Effective multi-sensor visual representation
  6.1 Visualization of laser data through AR features
  6.2 Detection of discontinuities
  6.3 Testing
7 Laser-camera alignment and calibration
  7.1 Laser-camera model
  7.2 Feedback-based calibration procedure
  7.3 Comparison with automatic calibration
  7.4 Testing
8 Integrating 3D graphics with image processing
  8.1 Edge detection algorithm
  8.2 Nearest edges discovery
  8.3 Improving alignment with edges
  8.4 Improving reliability with edges
  8.5 Testing
9 Stereoscopic augmented reality
  9.1 Stereo AR alignment
  9.2 NEP correspondence and suppression
  9.3 Testing
10 Conclusions
References
1 Introduction

Robot teleoperation is a solution for many problems that can be solved neither by a robot alone nor by human intervention alone. Teleoperation of a robotic manipulator is widely used for tasks that require high precision of movement, or where the scale of the task forbids direct human intervention, as in robotic surgery [1, 2]. In addition, robots can be teleoperated to execute exploration or manipulation tasks in unknown, inaccessible, or dangerous environments where human beings could not operate safely, e.g. in deep waters, in planetary or volcano exploration, in USAR (Urban Search And Rescue) applications, or for bomb detection and deactivation [3–7].

Figure 1: Telerobotics applications: robotic surgery, exploration of volcanoes, deep waters, planets.
On the other hand, as techniques for managing telerobotic systems grow ever more sophisticated, it remains clear to those familiar with control technologies that complex robotic tasks are unlikely to be achievable by fully autonomous robotic systems, especially in highly unstructured and dynamically varying environments. In these cases, human cognition is irreplaceable, because of the high operational accuracy required as well as the deep environment understanding and fast decision-making involved [8].
When piloting a mobile robot, accurate navigation is essential. Errors and collisions must be minimized, since the robot could suffer unpredictable damage, and in most cases repairs would be difficult if not impossible (a representative example is space/planetary exploration). The same holds for tasks where the robot has to interact physically with people, since careless teleoperation may cause them harm.

The accuracy and reactivity of a robot teleoperator can be improved by enhancing the operator's sense of presence in the remote environment. A crucial aspect of a telerobotic system is therefore the user interface, which must be designed to be as immersive as possible.
Since vision is the dominant human sensory modality, considerable attention has been paid in the literature to the visualization aspect. The video sensor is an essential part of most telerobotic systems, since it provides a considerable amount of high-contrast information in a form that is easy for the user to assimilate. However, a number of other sensors can complement visual sensor output well, e.g. range sensors (laser-based, sonar-based), odometric sensors, bumpers, etc. Numerous works (see for example [9–13]) study interface design and propose methods to effectively display visual and sensor data in a teleoperation interface.
This work proposes a novel approach to the visualization of video and sensor data in a teleguide interface. The proposed approach exploits augmented reality and stereoscopic visualization to assist the tele-navigation of a mobile robot.
Augmented reality consists of enhancing a real-world representation with virtual graphical additions. It makes it possible to display sensor data together with visual data in an intuitive and quickly comprehensible way. To date, AR has found application in several fields. It can be used in the medical and manufacturing fields for intuitive training and for assistance during precision tasks, or to display annotations over the real workspace in collaborative applications. It is frequently used in military applications (e.g. Head-Up Displays for aircraft and helicopter pilots) and commercial ones, e.g. to enhance sporting events on television [14, 15]. Numerous applications of augmented reality in robotics are found in the literature. It has frequently been used to introduce visual aids into telemanipulation tasks [16–18], to facilitate robot programming [19–21], or to assist mobile robot teleguide [9, 22–26].
Stereoscopic visualization is well known today thanks to the spread of “3D movies”. Stereoscopy is a group of technologies that reproduce, on a two-dimensional display, the three-dimensional depth effect given by binocular vision. Several works demonstrate that stereoscopic visualization may provide a teleoperator with a higher sense of presence in remote environments because of improved depth perception [27–32]. This leads to a better comprehension of distance, as well as of aspects related to it, e.g. ambient and obstacle layout.
The proposed visualization approach has been implemented at the 3D Visualization and Robotics Lab at the University of Hertfordshire, United Kingdom. It has been tested by teleoperating the 3MORDUC platform, a wheeled mobile robot located at DIEES (Dipartimento di Ingegneria Elettrica, Elettronica e dei Sistemi) at the University of Catania, Italy, more than 2500 km away from the operator's location.
This thesis is structured as follows. Section 2 introduces some preliminary notions about augmented reality and stereoscopic visualization. Section 3 describes the state of the art in visualization interfaces for mobile robot teleguide and presents the strengths and weaknesses of the proposed approaches. Section 4 describes past work on the teleoperation of the 3MORDUC robotic platform, presenting results and limitations. Section 5 presents the proposed stereoscopic augmented reality approach, outlining the adopted development strategy. Sections 6 to 9 describe the various steps of the implementation in detail and present test results. Section 10 draws conclusions and introduces further developments.
2 Background<br />
2.1 Augmented reality<br />
Augmented reality (AR) is a term for a live, direct or indirect view of<br />
a physical real-world environment whose elements are augmented by<br />
virtual computer-generated imagery [33].<br />
Figure 2: Example of AR: a graphical model is rendered on a real fiducial marker<br />
[34].<br />
Many definitions of AR have been proposed in the literature. Azuma et al. [14] define AR as a variant of Virtual Reality (VR). VR technologies completely immerse a user inside a synthetic environment. In contrast, AR allows the user to see the real world, with virtual objects superimposed upon or composited with it. AR therefore supplements reality, rather than completely replacing it. Ideally, it would appear to the user that the virtual and real objects coexist in the same space. Azuma et al. [14] state that the main requirements for a visualization interface to fall within the AR category are:
• to combine real and virtual;<br />
• to be interactive in real-time;<br />
• to be registered in 3D (that is, virtual overlays are integrated in<br />
3D with real world).<br />
Milgram and Kishino [35] devised the Reality-Virtuality (RV) continuum (figure 3) to draw a coherent definition of VR and AR environments. VR and real environments constitute the two ends of the continuum.

Figure 3: Mixed reality display continuum [35].

The commonly held view of a VR environment is one in which the participant-observer is totally immersed in a completely synthetic world, which may or may not mimic the properties of a real-world environment, but which may also exceed the bounds of physical reality. In contrast, a strictly real-world environment clearly must be constrained by the laws of physics. All the environments between these two extremes are considered forms of Mixed Reality (MR). AR, which consists of the addition of virtual overlays to a real environment, is considered a form of MR near the “real” end. The reverse of AR is augmented virtuality (AV), which consists of the addition of (video or texture-mapped) elements from a real environment to a virtual, totally synthetic environment.
2.1.1 Alignment and registration<br />
Augmented reality does not simply mean the superimposition of a graphic object over a real-world scene; that is technically an easy task. One significant difficulty in augmenting reality is the need to maintain accurate registration of the virtual objects with the real-world image. This often requires detailed knowledge of the relationship between the frames of reference of the real world, the camera, and the user. Correct registration must also be maintained while the user (or the user's viewpoint) moves within the real environment. Discrepancies or changes in the apparent registration range from distracting to physically disturbing for the user, making the system unusable. AR demands much more accurate registration than VR, because humans are much more sensitive to visual differences between virtual and real objects than to inconsistencies between vision and other senses [36].
According to [14], sources of registration errors can be divided into two types: static and dynamic. Static sources are those that cause registration errors even when the user's viewpoint and the objects in the environment remain completely still. Dynamic sources are those that have no effect until either the viewpoint or the objects begin moving.

Static errors are usually caused by distortions in the optics, tracking errors, mechanical misalignments in the employed hardware, and/or incorrect estimation of the viewing parameters. Distortions and inaccurate viewing parameters usually cause systematic errors, which can be estimated and corrected. The other factors can cause errors that are difficult to predict and correct, so it is advisable to take precautions against them during the development phase (for example, through careful design of the tracking system and careful alignment of the hardware devices).
Dynamic errors occur essentially because of system delays in the rendering of the overlays. If the user's viewpoint is in motion and a significant delay exists between the moment when the viewpoint position/orientation is sampled and the moment when the virtual overlay is rendered, the virtual objects will not “move” in sync with the real objects, causing misalignments. Dynamic errors can be reduced by reducing system delay, or by predicting the future position/orientation of the viewpoint and rendering the corresponding part of the virtual overlay in advance. In video-based AR systems (i.e. when the user does not see the real world directly, but through a camera) it is possible to eliminate dynamic errors by synchronizing the video stream with the rendering of the overlay. This is the case in teleoperation systems, where the real world is seen through a camera mounted on the robot.
Vision-based techniques are often used to detect the viewpoint position from the real view, and then correctly register the overlay with the image. Usually, these approaches use fiducials: well-known objects whose position and orientation can be easily recognized within an image [37–39].
2.2 Pinhole camera model<br />
A camera model determines a projection function from scene points<br />
(points of the 3D real world viewed by the camera) to image points<br />
(points within the 2D camera image). Correspondence between scene<br />
points and image points is needed in computer graphics to know where<br />
on the screen virtual 3D objects have to be rendered.<br />
The most popular and simplest camera model is the pinhole model. A pinhole camera is a camera with no lens and a single, very small aperture. Simply put, it is a light-proof box with a small hole in one side (figure 4).

Figure 4: Simple representation of a pinhole camera [40].
Light from a scene passes through this single point and projects an<br />
inverted image on the opposite side of the box. Cameras using small<br />
apertures and the human eye in bright light both act like a pinhole<br />
camera [40].<br />
The pinhole camera model is based on the principle of collinearity,<br />
where each point in the object space is projected by a straight line<br />
through the projection center into the image plane. Figure 5 shows the<br />
geometric model of a pinhole camera.

Figure 5: Geometric model of the pinhole camera. [40]

The camera coordinate system
(O, X1, X2, X3) has its origin at the camera aperture (which is consid-<br />
ered infinitely small, coincident with a point). Axis X3 is pointing in<br />
the viewing direction of the camera and is referred to as the optical<br />
axis. The plane which intersects with axes X1 and X2 is the front side<br />
of the camera, or principal plane.<br />
The image plane is where the 3D world is projected through the<br />
aperture of the camera. It is parallel to axes X1 and X2 and is located<br />
at distance f from the origin O in the negative direction of the optical<br />
axis. f is also referred to as the focal length of the pinhole camera.<br />
The point R at the intersection of the optical axis and the image plane<br />
is referred to as the principal point of the camera, or center of the<br />
image. The 2D image coordinate system (R, Y1, Y2) has the origin at<br />
the principal point and the axes parallel to X1 and X2.<br />
For each point P = (x1, x2, x3) such that x3 > 0 a projection<br />
Q = (y1, y2) is defined on the image plane.

Figure 6: Geometric model of the pinhole camera as seen from the X2 axis. [40]

It is easy to calculate the coordinates of the projection from those of the original point using similar triangles (see figure 6 for clarity):

−y1 : f = x1 : x3  →  y1 = −f · x1/x3
−y2 : f = x2 : x3  →  y2 = −f · x2/x3
Since a coordinate is lost during the projection, it is not possible to retrieve the original 3D coordinates of P from the image coordinates of its projection. In fact, a point in the image corresponds to a line in space (see the green line in figures 5 and 6).
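As an illustration (not part of the thesis software), the projection equations above translate directly into code:

```python
def project_pinhole(point, f):
    """Project a 3D point given in camera coordinates (x1, x2, x3)
    onto the image plane of a pinhole camera with focal length f,
    using y1 = -f*x1/x3 and y2 = -f*x2/x3 (the image is inverted,
    hence the sign)."""
    x1, x2, x3 = point
    if x3 <= 0:
        raise ValueError("the point must lie in front of the camera (x3 > 0)")
    return (-f * x1 / x3, -f * x2 / x3)

# Doubling the distance halves the projected coordinates: the familiar
# perspective foreshortening.
near = project_pinhole((1.0, 2.0, 2.0), f=0.05)
far = project_pinhole((1.0, 2.0, 4.0), f=0.05)
```

Note that any point along the ray through the aperture projects to the same image point, which is the loss of information discussed above.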
When rendering on the screen, the image coordinates of the projection are converted to pixel coordinates by discretizing them and adding an offset (since the origin of the screen coordinates is usually in the upper left corner of the screen, rather than at the center of the image).
The focal length and the size of the image are referred to as the intrinsic parameters of the camera, since they depend only on the specific camera and on nothing else. In the general case, the coordinates of scene points are defined with respect to a world coordinate system, which is different from the camera system. In this case, it is necessary to convert the coordinates to their expression in the camera system before calculating the projection on the image plane. Therefore, a set of extrinsic parameters (position and orientation of the camera with respect to the world system) is needed to obtain image coordinates. Given the extrinsic parameters, it is possible to determine the transformation matrix between the world and camera systems, and thus the function that maps points of the world system to the camera system.
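The full world-to-image pipeline can be sketched as follows, under the assumption that the extrinsics are given as a rotation matrix R and a translation vector t with the convention p_cam = R·p_world + t (conventions vary between formulations):

```python
def world_to_camera(p_world, R, t):
    """Apply the extrinsic parameters: rotate and translate a world
    point into the camera coordinate system (p_cam = R @ p_world + t,
    one common convention)."""
    return tuple(
        sum(R[i][j] * p_world[j] for j in range(3)) + t[i]
        for i in range(3)
    )

def world_to_image(p_world, R, t, f):
    """Chain the extrinsic transformation with the pinhole projection
    y1 = -f*x1/x3, y2 = -f*x2/x3 of the previous paragraphs."""
    x1, x2, x3 = world_to_camera(p_world, R, t)
    return (-f * x1 / x3, -f * x2 / x3)

# With identity rotation and zero translation the world and camera
# systems coincide, so this reduces to the plain pinhole projection.
IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
y = world_to_image((1.0, 0.0, 2.0), IDENTITY, (0.0, 0.0, 0.0), f=0.05)
```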
The pinhole model does not consider camera distortions, so it accurately corresponds to reality only in those cases where distortions are negligible (typically, in good-quality cameras or in the central zone of the image). Several camera models exist that include distortion factors in the projection function. For example, the model used by Heikkila and Silven [41] differs from the pinhole model in two respects:

• the position of the principal point can differ from the center of the image;

• radial and tangential distortion (as defined in [42]) are present, and apply a non-linear transformation to the final projection coordinates.

Other models [43] consider the possibility of the camera axes being non-orthogonal; others [44, 45] introduce a prism distortion due to imperfect camera manufacturing.
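The effect of radial distortion, the dominant term in most of the models cited, can be sketched as a polynomial displacement of the ideal pinhole coordinates (the coefficient values below are purely illustrative, not taken from any of the cited models):

```python
def apply_radial_distortion(y1, y2, k1, k2=0.0):
    """Displace an ideal image-plane point along its radius by the
    factor (1 + k1*r^2 + k2*r^4), the usual polynomial form of radial
    distortion. k1 < 0 gives barrel distortion, k1 > 0 pincushion."""
    r2 = y1 * y1 + y2 * y2
    factor = 1.0 + k1 * r2 + k2 * r2 * r2
    return (y1 * factor, y2 * factor)

# The principal point is unaffected; points towards the image border
# are displaced more strongly, as r^2 grows.
center = apply_radial_distortion(0.0, 0.0, k1=-0.2)
edge = apply_radial_distortion(0.1, 0.0, k1=-0.2)
```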
2.3 Stereoscopic visualization<br />
A stereoscopic image presents the left and right eyes of the viewer with<br />
different perspective viewpoints, just as the viewer sees the real world.<br />
From these two slightly different views, the eye-brain synthesizes an<br />
image of the world with stereoscopic depth [46].<br />
When a human looks at an object in space, the eyes converge on that object, i.e. they rotate until both optical axes cross the object. It is easy to notice that, when the eyes converge on something, objects much nearer or further than the convergence point appear double (figure 7).

Figure 7: (a) Eyes converge on the thumb; the flag, which is further, appears double. (b) Eyes converge on the flag; the thumb, which is nearer, appears double. [46]

This is because the images projected on the left and right retinae are slightly different, since they correspond to two slightly different viewpoints. If the retinal images are overlaid, corresponding points are separated by a horizontal offset, referred to as retinal disparity. Points of the retinal images corresponding to an object on which the eyes converge have zero disparity. Nearer objects have negative disparity, while further objects have positive disparity. Retinal disparity is interpreted by the brain to produce a sense of depth, through a process called stereopsis.
Stereopsis works together with monocular depth cues to produce<br />
depth perception. Monocular cues are elements of a 2D image which can<br />
provide depth information. Some monocular cues are motion parallax,<br />
perspective, occlusion, relative size of objects [47].<br />
Stereoscopic displays obtain a depth effect by displaying a parallax value for each image pixel. Given two views of the same scene from slightly different side-by-side viewpoints, parallax is the horizontal offset, measured on the display, between corresponding pixels in the left and right images. It produces a directly proportional disparity on the retinae.

Figure 8: (a) Zero parallax. (b) Positive parallax. (c) Negative parallax. (d) Divergent parallax. [46]

Pixels having zero parallax (figure 8a) produce zero disparity on the retinae, and are seen as lying on the plane of the display. Pixels having positive parallax (figure 8b) produce positive disparity, and are seen as if they were behind the display. Vice versa, pixels having negative parallax (figure 8c) produce negative disparity and are seen as if they were in front of the screen. Finally, pixels having divergent parallax, i.e. parallax greater than the distance between the viewer's eyes (figure 8d), have no valid corresponding disparity value. Trying to fuse objects having divergent parallax requires an unusual muscular effort, and often results in discomfort.

Only horizontal parallax/disparity produces a sense of depth. Vertical disparity between left and right images is not natural, and has effects analogous to divergent disparity (eye strain, discomfort). Therefore, it should be avoided in the generation of stereoscopic images.
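The relation between perceived depth and screen parallax follows from similar triangles between the eye baseline and the screen plane. A minimal sketch of one common textbook formulation (the 6.5 cm eye separation and the screen distance are assumed values, and distances are in metres):

```python
def screen_parallax(z, eye_sep=0.065, screen_dist=0.6):
    """Horizontal parallax on the display for a point perceived at
    distance z from the viewer, by similar triangles. Zero at the
    screen plane, positive behind it, negative in front of it; it
    approaches eye_sep (the divergence limit) as z grows."""
    if z <= 0:
        raise ValueError("z must be positive")
    return eye_sep * (z - screen_dist) / z

p_screen = screen_parallax(0.6)   # on the display plane: zero parallax
p_behind = screen_parallax(1.2)   # positive: seen behind the display
p_front = screen_parallax(0.3)    # negative: seen in front of the display
```

Because the parallax never exceeds eye_sep for any finite positive depth, divergent parallax can only result from rendering or alignment errors, which is why it signals a malformed stereo pair.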
2.3.1 Visualization devices<br />
Numerous technologies have been developed for the visualization of<br />
parallax on planar displays. Stereo visualization devices are mainly<br />
divided into:<br />
• passive glasses;<br />
• active glasses;<br />
• autostereoscopic displays.<br />
Passive stereo technologies are based on the use of very simple glasses with no electronics. The cheapest kind of passive stereo is anaglyph stereo. It consists of filtering the two images with opposite colors, and viewing them through special glasses with oppositely colored lenses, so that each eye sees only the corresponding image.

The most common color pair is red-cyan. Anaglyph stereo does not require a special display, and anaglyph glasses are very cheap, but the resulting image quality is rather low. Moreover, traditional anaglyph cannot display the full visible color range (although a patented technique has been developed to provide perceived full-color viewing with simple colored glasses [49]).
Figure 9: Paper anaglyph glasses [48].<br />
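The channel filtering behind red-cyan anaglyphs can be sketched in a few lines (an illustrative toy, with images represented as nested lists of RGB tuples):

```python
def anaglyph(left, right):
    """Compose a red-cyan anaglyph: the red channel of every pixel is
    taken from the left image, while green and blue come from the
    right image, so red-cyan lenses deliver each view to the
    corresponding eye. This is the simplest variant; refined versions
    rebalance the channels to reduce retinal rivalry."""
    return [
        [(l[0], r[1], r[2]) for l, r in zip(lrow, rrow)]
        for lrow, rrow in zip(left, right)
    ]

# One-pixel "images": a reddish left view and a teal-ish right view.
out = anaglyph([[(200, 10, 10)]], [[(30, 120, 90)]])
```

The loss of color fidelity mentioned above is visible directly in the code: two of the left view's channels and one of the right view's are simply discarded.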
A more complex and better-performing passive stereo technology is based on differently polarized light. Two projectors display the two images using orthogonally polarized light, and the images are viewed through glasses with orthogonally polarized lenses. Each lens lets through light having the same direction of polarization, while filtering out all light whose polarization is orthogonal. Thus, each eye sees the corresponding image, in full color but at half its brightness. Polarized glasses are relatively cheap and the resulting image has good quality. For these reasons, polarized stereo is commonly used in cinemas for 3D movies.
Active stereo is based on the use of more complex visualization devices. The two most notable examples are shutter glasses and Head-Mounted Displays (HMDs). Shutter glasses work by alternately displaying the left and right images on the same display at a very high frequency, while alternately occluding the left and right eyes in sync with the display. This way, each eye sees only the corresponding image. If the alternating frequency is sufficiently high, the brain fuses the images into two continuous streams.
Stereo-enabled HMDs use a separate display for each eye, so that the eyes actually see two different video streams. Active stereo devices usually provide an image quality superior to passive stereo, although they are much more expensive.

Figure 10: (a) CristalEyes shutter glasses. (b) Emagin Z800 HMD. [50]
Some technologies have been developed to build autostereoscopic displays, which do not require the user to wear glasses in order to view stereo images [51]. However, some of these technologies are still very expensive, while the others provide very low image quality.
3 Augmented reality visual interfaces in robot teleoperation

Examples of the use of augmented reality in telerobotics are numerous in the literature, regarding both telemanipulation and mobile robot teleguide.

This section contains a review of the current state of the art in visualization techniques for teleguide interfaces based on augmented reality and sensor fusion. Major contributions are summarized, and their main points are highlighted and discussed.
3.1 A sensor fusion based user interface for vehicle teleoperation

The work of Meier et al. [24] describes a sensor fusion technique for mobile teleoperation which uses different sensors in a complementary manner, balancing their respective strengths and weaknesses.
A brief analysis of sensor fusion for teleoperation is carried out. Sensor fusion needs to be human-oriented, and the representation of the data has to be accessible and understandable. Fusing data in a single display, rather than representing each sensor in a different display, makes perception quicker and reduces cognitive workload. The most important kind of information that sensor fusion can show to an operator driving a mobile robot is depth information. This work considers color intensity the most efficient way to deliver this kind of information to a human; in particular, the HSV color model [52] is considered the one that best mimics human color perception.
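Such a color coding can be sketched as a mapping from distance to an HSV hue ramp from red (near) to blue (far); the hue range and clipping distance below are illustrative choices, not the parameters used in [24]:

```python
import colorsys

def depth_to_rgb(distance, d_max=5.0):
    """Map a distance in metres to an RGB colour by sweeping the HSV
    hue from red (hue 0, near) to blue (hue 2/3, far) at full
    saturation and value. d_max is an assumed clipping distance."""
    d = max(0.0, min(distance, d_max))
    hue = (d / d_max) * (2.0 / 3.0)
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    return (int(round(r * 255)), int(round(g * 255)), int(round(b * 255)))

near = depth_to_rgb(0.0)   # red: an obstacle right in front of the robot
far = depth_to_rgb(5.0)    # blue: at or beyond the clipping distance
```

Varying the hue while keeping saturation and value constant leaves the underlying video brightness legible, which is one reason hue ramps are popular for depth overlays.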
The teleoperation system described in the paper uses a stereo vision system, a ring of ultrasonic sonars, and odometric sensors; sensor data are processed by a Kalman filter. The teleoperation interface contains a display showing the video stream from the robot cameras and a two-dimensional map of the environment, gradually created as an occupancy grid using Histogramic In-Motion Mapping [53].

Depth information is overlaid on the video display as a layer composed of differently colored pixels. Since stereo has a higher angular resolution than sonar, it is used by default to create the overlay. Sonar is used instead in regions of the image where stereo disparity is not reliable, i.e. regions with scarce texturing or where the sonar detects very close objects. A grid is projected on the region of the image identified as ground, in order to improve distance estimation (figure 11a). The two-dimensional map (figure 11b) is created by combining sonar distance data with stereo disparity; disparity is calculated along a horizontal line taken at a chosen height in the stereo images.

Figure 11: (a) Image display processing. (b) Two-dimensional map of the environment. [24]
Main points:

• Using color as an immediate and efficient means to convey information

• Using geometric overlays to enhance distance estimation

• Sensor fusion balances the weaknesses of individual sensors
3.2 Fusion of laser and visual data for robot motion planning and collision avoidance

The paper by Baltzakis et al. [54] proposes a SLAM (Simultaneous Localization and Mapping) algorithm based on the fusion of 2D laser range data and stereo visual data. The proposed method uses stereo disparity to correct laser measurements where they are evidently wrong.
The algorithm initially creates a 3D model of the environment as a series of vertical walls based on the 2D laser scan. This model necessarily omits all objects that do not intersect the plane of the laser scan, since they cannot be detected by the laser sensor.

Then, the pixels of one of the stereo images are ray-traced to the 3D model, and the 3D coordinates corresponding to each pixel are obtained. Finally, the algorithm re-projects each pixel onto the second image. If the attributes (color, intensity, etc.) of the pixel in the second image are similar to those of the corresponding pixel in the first image, the value measured by the laser for that pixel is assumed to be correct. If instead the attributes of the pixels differ, the pixels are assumed to belong to an object that is nearer or further than the distance measured by the laser. In this case, a distance estimate based on the disparity between the images is computed. Range estimates are accumulated on a 2D occupancy grid (in order to reduce the inaccuracy deriving from image noise or lack of texture).
A simple collision avoidance algorithm, using the mapping method<br />
just exposed, is presented. The algorithm is tested both in artificial and<br />
in real environments, showing to have good results in aiding navigation.<br />
Main points:<br />
• 3D map of the environment based on range sensors and video<br />
• Using stereo to correct and integrate data from range sensors<br />
3.3 Using augmented reality to interact with an<br />
autonomous mobile platform<br />
The work of Giesler et al. [21] presents an AR- and speech-based technique to quickly and intuitively program paths for a mobile robot in a large environment.
The operator who programs the robot needs an HMD and a tool (a “magic wand”), both of which have to be tracked around the environment where the paths will be set up. The operator may define and view paths in the form of nodes, which correspond to points on the ground, and edges, straight lines which connect pairs of nodes (figure 12). Nodes and edges are created by pointing to the ground with the wand and issuing verbal commands (e.g. “Connect this node...”, “...with this node”) and are visualized in the HMD worn by the operator.
Figure 12: The robot follows AR path nodes and redirects when an obstacle is in the way. [21]
The operator may issue commands to the mobile robot in the same manner. It is possible to command the robot to move from one node of the graph to another, or to move autonomously between two given points on the ground. When the robot has to navigate within the graph, it automatically calculates the shortest sequence of edges between the start node and the end node. If it detects an obstacle on the path it has just chosen, it calculates an alternative path through the edges of the graph. Once the robot has chosen a path, the nodes and edges belonging to it are depicted with a different color, so that the operator can see which path the robot is going to take.
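The re-planning behaviour described above can be sketched as a standard shortest-path search over the node graph; the graph structure, function names and the modelling of obstacles as blocked edges are illustrative assumptions, not details of Giesler et al.'s implementation:

```python
import heapq

def shortest_path(graph, start, goal, blocked=frozenset()):
    """Dijkstra over the node graph; edges in `blocked` (frozensets of
    their endpoints) are skipped, modeling re-planning around obstacles.
    `graph` maps node -> {neighbor: edge_length}."""
    queue, seen = [(0.0, start, [start])], set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return path
        if node in seen:
            continue
        seen.add(node)
        for nxt, length in graph[node].items():
            if frozenset((node, nxt)) not in blocked:
                heapq.heappush(queue, (cost + length, nxt, path + [nxt]))
    return None  # no path through the remaining edges
```

When an obstacle is detected on an edge, the same search re-run with that edge blocked yields the alternative path that the interface would then recolor for the operator.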
Main points:<br />
• AR used as an efficient method to convey data about the environment (e.g. node positions)
• AR used as an efficient method to exchange information with the<br />
robot<br />
3.4 Improved interfaces for human-robot interaction<br />
in urban search and rescue<br />
The work of Baker et al. [9] proposes several modifications to the INEEL interface [22, 55] for telerobotics in urban search and rescue (USAR). The modifications are designed to decrease the complexity and increase the usability of the interface for non-experienced users. This work is based on the results of several past works by the same authors [10, 11], which analyse several teleguide interfaces used in international competitions and outline their strengths and weaknesses.
Most of the modifications aim to reduce the cognitive workload imposed by the interface. For example, the pan and tilt angles of the camera are indicated by the position of a light cross overlaid on the video display, rather than by separate meters. Proximity/collision indicators are visualized as colored blocks around the video display, and each of them becomes visible only when an obstacle in the corresponding direction is sufficiently near. Rarely consulted information (e.g. battery charge) is treated as a system alert and visualized only when necessary. The environment map is placed at the same level as the video display, so that shifting attention from the video display to the map and vice versa is not tiring for the operator (figure 13). As future work, the authors indicate the possibility of fusing heat, sound and CO2 sensor data into a color map overlaid on the video display.
Since it has been shown that most collisions happen at the rear of the robot, a rear camera is included and its video stream is displayed above the main video display (like a rear-view mirror in a car).
Figure 13: Modified INEEL interface. [9]
Main points:
• Integration of sensor data in the same window to reduce cognitive workload
• Sensor data representation should be:
– non-invasive;
– quickly comprehensible (e.g. resembling known/conventional symbols).
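The visibility logic of the proximity/collision indicators can be sketched as follows; the direction names and distance thresholds are illustrative assumptions, not values from the paper:

```python
# Hedged sketch of the proximity indicators in Baker et al.: a colored
# block around the video display becomes visible only when an obstacle
# in the corresponding direction is sufficiently near.
WARN_DIST = 1.0    # meters: indicator appears below this distance (assumed)
DANGER_DIST = 0.4  # meters: indicator turns red below this distance (assumed)

def proximity_indicators(distances):
    """Map {direction: distance_m} to {direction: color} for the
    indicators that should currently be drawn; far obstacles yield
    no indicator at all, keeping the display uncluttered."""
    shown = {}
    for direction, dist in distances.items():
        if dist < DANGER_DIST:
            shown[direction] = "red"
        elif dist < WARN_DIST:
            shown[direction] = "yellow"
    return shown
```

Hiding indicators for distant obstacles follows the paper's principle of showing rarely needed information only when it becomes relevant.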
3.5 Ecological interfaces for improving mobile robot<br />
teleoperation<br />
The work of Nielsen et al. [26] describes an interface for mobile robot teleoperation based on ecological interface design [56] and augmented virtuality. Different versions of the same interface are compared, showing that integrating sensor data gives better results for navigation than displaying data separately.
The presented interface displays a map of the environment reconstructed from range sensors (laser, sonar) together with a video image from the remote site. The 2D version of the interface shows video and map side by side; the 3D version instead shows a 3D model of the robot within a 3D representation of the map. The 3D map is created by elevating obstacles to a fixed height. The viewpoint is positioned a little behind the robot, and video data is visualized in a window in front of the robot model (figure 14).
Figure 14: 3D interface presented in [26].<br />
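The fixed-height elevation of obstacles described above can be sketched as a simple extrusion of occupied grid cells into boxes; the cell size and wall height are assumed values, not those of Nielsen et al.'s system:

```python
# Minimal sketch: each occupied cell of a 2D occupancy map is elevated
# to a fixed-height box, producing the walls of the 3D representation.
CELL = 0.25        # meters per grid cell (assumed)
WALL_HEIGHT = 1.0  # meters, fixed elevation of obstacles (assumed)

def extrude_map(grid):
    """grid: list of rows of 0/1 occupancy values. Returns axis-aligned
    boxes as (xmin, ymin, zmin, xmax, ymax, zmax) tuples, one per
    occupied cell, ready to be rendered as walls around the robot model."""
    boxes = []
    for j, row in enumerate(grid):
        for i, occupied in enumerate(row):
            if occupied:
                boxes.append((i * CELL, j * CELL, 0.0,
                              (i + 1) * CELL, (j + 1) * CELL, WALL_HEIGHT))
    return boxes
```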
Tests performed on the different versions prove that the 3D version of the interface consistently yields better results than the 2D version. Moreover, it is shown that operators who use the 2D version do not benefit from having both video and map, since the two displays compete for their attention. The difference in performance is explained by the 3D version complying with three important principles of HRI: 1) presenting a common reference frame; 2) providing visual support for the correlation between action and response; 3) allowing an adjustable perspective.
Main points:<br />
• 3D map of the environment based on range sensors<br />
• Integration of sensor data in the same window reduces cognitive<br />
workload<br />
• More information does not necessarily imply better performance<br />
3.6 Egocentric and exocentric teleoperation interface<br />
using real-time, 3D video projection<br />
The paper of Ferland et al. [25] presents an augmented-virtuality-based interface for mobile robot teleoperation. Like the one described in [26], it displays data from range and video sensors. In addition, it makes use of different projection methods for the video image in order to increase the quality of the information provided.
The sensors used consist of a laser range sensor and a pair of stereo cameras. The laser is used to build a global 2D map of the environment. The operator interface displays the map as a 3D environment, visualizing the obstacles detected by the laser as fixed-height walls. A 3D model of the robot is displayed within the 3D environment. Two viewpoints are available to the operator: an egocentric viewpoint, coincident with the position of the stereo camera (figure 15a), and an exocentric viewpoint, freely positionable in the zone behind and above the robot (figure 15b).
Figure 15: Egocentric (a) and exocentric (b) viewpoints of the interface presented<br />
in [25].<br />
The video image is mapped onto the 3D environment using one of two projection methods. The laser-based method first projects the 3D mesh of the environment onto the left video image frame, then simply maps the single left image onto the resulting vertices. The stereoscopic method uses the disparity values from the stereo camera to project the stereo image into 3D space, then maps it onto the mesh using a set of OpenGL Shading Language [57] fragment shaders.
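The first step of the laser-based method (projecting mesh vertices onto the image to obtain texture coordinates) can be sketched with the pinhole model of section 2.2; the intrinsic parameters below are illustrative values, not the calibrated camera's:

```python
import numpy as np

# Hedged sketch: each 3D vertex of the environment mesh, expressed in
# the camera frame (z pointing forward), is projected onto the left
# image plane; the resulting pixel coordinates serve as that vertex's
# texture coordinates. Intrinsics are assumed values.
FX = FY = 400.0        # focal lengths in pixels (assumed)
CX, CY = 320.0, 240.0  # principal point (assumed)

def texture_coords(vertices_cam):
    """vertices_cam: (N, 3) array of points in the camera frame.
    Returns an (N, 2) array of pixel coordinates (u, v)."""
    v = np.asarray(vertices_cam, dtype=float)
    u = FX * v[:, 0] / v[:, 2] + CX
    w = FY * v[:, 1] / v[:, 2] + CY
    return np.stack([u, w], axis=1)
```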
Testing results show that both the egocentric and the exocentric points of view are considered useful by most users. Most of the time, viewpoints positioned a little behind the robot are used; vertical, bird's-eye-like viewpoints are preferred in tight navigation situations or to obtain a global view of the map. The laser mapping proves to be the most useful source of information for navigation; laser-based projection is also considered useful, unlike stereoscopic projection, which is too sensitive to the quality of disparity data.
Main points:<br />
• 3D map of the environment based on range sensors<br />
• It is necessary to design a reliable method for image projection<br />
in the virtual workspace<br />
3.7 Summary and analysis<br />
The works presented above highlight several benefits provided by sensor fusion. Sensor fusion techniques help to balance the strengths and weaknesses of different types of sensors and to retrieve more reliable information from the robot and the surrounding environment [24, 54]. Besides, fused sensor data can be displayed to the user in a unified form.
Unified sensor representation has many advantages with respect to visualization in separate displays. Presenting data inside a single display, within a common reference frame, avoids competition for the user's attention. Interfaces that visualize different sensor data separately force operators to continuously switch between different displays, reference frames and visualization modalities. A unified representation prevents this switching, thus strongly reducing the user's cognitive workload [9, 26].
Augmented reality is a form of unified representation which presents a further advantage: visualizing complex data (such as positions and paths in [21]) as a graphic overlay on an image of the real world permits a faster and more intuitive interpretation by a human operator.
Several approaches to AR-based representation of visual and range data in telerobotics have been described. Some of them ([9, 24]) use bidimensional augmentations of the video image, exploiting the color of these overlays as a quick and effective way to communicate a distance measure to the user. However, since bidimensional overlays display information only on a single plane, their capacity to communicate a depth value is intrinsically limited.
Other approaches [25, 26] create a bidimensional map of the environment using laser data, and display a 3D representation of the map by elevating virtual 3D walls. This approach has several advantages with respect to using 2D overlays. First, a 3D map usually looks more realistic, and can communicate depth in a more intuitive way because of monocular depth cues. Besides, while 2D overlays display raw range information and leave to the user the responsibility of deducing the shape of the environment, the 3D approach relieves the user of this work by presenting range data in a more quickly understandable form, namely a 3D map.
The drawback of the described approaches is the poor integration between laser and visual data in the user interface. In [26] the video image is visualized on the display, but no correspondence between elements in the image and the laser-generated map is established; the user must therefore manually associate obstacles in the video image with obstacles in the laser map. In [25], instead, the correspondence between laser and video is automatically calculated through projection (see section 3.6). However, the quality of the projection depends strongly on laser-camera calibration, on the correctness of laser measurements and, in the case of stereoscopic projection, on disparity data. Since laser-camera calibration always involves a certain degree of inaccuracy, the laser sensor can miss some objects (e.g. low or transparent objects), and disparity data depends strongly on environment features, we consider these requirements too strict to be enforced in the general case.
4 Previous work on 3MORDUC teleoperation<br />
The 3MORDUC (3rd version of the MObile Robot DIEES University of Catania) is a wheeled mobile robot located at the DIEES (Dipartimento di Ingegneria Elettrica, Elettronica e dei Sistemi) of the University of Catania [58]. It has been used for several years to perform research work in the field of telerobotics and teleguide visual interfaces.
This section gives a brief description of the robotic platform and reviews past work in which it has been involved; the main issues emerging from this work are then outlined.
4.1 The 3MORDUC platform<br />
The 3MORDUC uses two Maxon F2260 motors (40 W DC) for movement. The motors are connected to two rubber wheels through a shaft. A castor wheel is employed to facilitate turning. Two lead batteries (12 V/18 Ah) provide an autonomy of about 30-40 minutes.
Figure 16: The 3MORDUC platform<br />
Several sensors on board monitor the workspace and the robot state.<br />
Here we give a brief description of these sensors.<br />
Laser scanner A Sick LMS200 laser measurement sensor system (figure 17) is mounted on the front part of the 3MORDUC. The LMS operates by emitting a pulsed laser beam in a given direction. The reflected pulse is received and registered, and the distance between the robot and the obstacle which reflected the pulse is estimated by measuring the time of flight of the laser light. The procedure is repeated for several different directions on a plane, generating a scan of the surroundings of the sensor. It is possible to configure the angular resolution (0.25°, 0.5°, 1°) and the maximum scan angle (100°/180°). Each scan is executed in clockwise mode. Measurement data are available in real time for further evaluation via an RS232/RS422 serial interface.
Figure 17: The Sick LMS200 laser sensor.
Laser sensors are usually very accurate (each distance measure has an accuracy of a few millimeters) and reliable. However, they can be deceived by transparent or very dark surfaces, which do not adequately reflect the laser light back to the receiver and thus generate outliers. Besides, laser sensors obviously cannot detect objects which do not intersect their scan plane.
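The time-of-flight principle and the polar interpretation of a scan can be sketched as follows; the scan geometry (start angle, step) is an illustrative configuration, not necessarily the one used on the robot:

```python
import math

# Sketch of how an LMS200 scan can be interpreted: time of flight gives
# the range, and the configured angular resolution gives each beam's
# bearing within the scan.
SPEED_OF_LIGHT = 299_792_458.0  # m/s

def tof_to_range(t_seconds):
    """The pulse travels to the obstacle and back, hence the factor 2."""
    return SPEED_OF_LIGHT * t_seconds / 2.0

def scan_to_points(ranges, start_deg=-90.0, step_deg=1.0):
    """Convert a list of ranges into (x, y) points in the sensor frame,
    assuming a sweep starting at `start_deg` with `step_deg` resolution."""
    pts = []
    for i, r in enumerate(ranges):
        a = math.radians(start_deg + i * step_deg)
        pts.append((r * math.cos(a), r * math.sin(a)))
    return pts
```

A 20 ns round trip, for instance, corresponds to an obstacle about 3 m away, which illustrates the timing precision such sensors require.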
Stereo cameras The STH-MDCS2-VAR-C (figure 18) is a low-power compact digital stereo camera system, which can be connected to a PC via IEEE 1394. Each camera has a resolution of 1.3 megapixels and is equipped with a fixed-focus lens (4.5 mm). The CCD sensors of the cameras provide good noise immunity. Capture parameters (e.g. exposure gain, frame rate, resolution) are adjustable.
The cameras are mounted on a rigid support, which permits setting the camera baseline to any value in the range 5-20 cm. Their optical axes are kept parallel. The cameras are positioned on the top layer of the robot, about 95 cm above the ground. They point in the direction in front of the robot and are slightly tilted towards the ground.
Figure 18: The STH-MDCS2-VAR-C stereo cameras.<br />
Encoders An incremental rotary encoder with a resolution of 500 pulses/turn is placed on each wheel of the robot. Incremental encoders convert movement into a sequence of digital pulses. The movement/rotation of the robot with respect to a given start position/orientation can be calculated by counting the pulses generated by each encoder.
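This pulse-counting computation can be sketched with the standard differential-drive odometry update; the wheel radius and track width below are assumed values, not the 3MORDUC's actual dimensions:

```python
import math

# Hedged sketch of odometry from the incremental encoders: pulse counts
# are converted to wheel travel, then a differential-drive update gives
# the new robot pose. Geometric parameters are illustrative.
PULSES_PER_TURN = 500
WHEEL_RADIUS = 0.10  # meters (assumed)
TRACK = 0.40         # distance between the wheels in meters (assumed)

def update_pose(x, y, theta, pulses_left, pulses_right):
    """Integrate one odometry step; returns the new (x, y, theta)."""
    dl = 2 * math.pi * WHEEL_RADIUS * pulses_left / PULSES_PER_TURN
    dr = 2 * math.pi * WHEEL_RADIUS * pulses_right / PULSES_PER_TURN
    d = (dl + dr) / 2.0          # distance travelled by the robot center
    dtheta = (dr - dl) / TRACK   # rotation around the center
    return (x + d * math.cos(theta + dtheta / 2.0),
            y + d * math.sin(theta + dtheta / 2.0),
            theta + dtheta)
```

With equal pulse counts on both wheels the robot advances in a straight line; unequal counts produce the rotation term.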
Proximity sensors A belt of 8 SRF08 sonar sensors is positioned around the robot. Sonars measure the distance from obstacles by calculating the time of flight of a reflected sonic signal originally produced by the vibration of a piezoelectric transducer. The sonar field of view has a conic shape, so the sensitive area increases with the distance from the robot. For this reason, sonars have a far lower angular resolution than the laser sensor. Furthermore, an inhibition time is necessary between the generation of the sonic signal and its reception, which introduces a lower limit on measurable distances.
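As a small worked example of this lower limit (the inhibition time used below is an illustrative value, not the SRF08's actual figure):

```python
# Sonar ranging: the emitted pulse travels to the obstacle and back, so
# the one-way distance is half the round trip. An echo arriving while
# the receiver is still inhibited is lost, which bounds the minimum range.
SPEED_OF_SOUND = 343.0  # m/s in air at about 20 °C

def sonar_range(t_flight):
    """Round-trip time of flight to one-way distance in meters."""
    return SPEED_OF_SOUND * t_flight / 2.0

def minimum_range(t_inhibit):
    """Shortest measurable distance given the receiver inhibition time."""
    return sonar_range(t_inhibit)
```

With an illustrative inhibition time of 1 ms, obstacles closer than about 17 cm could not be measured.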
A belt of bumpers (16 switches) is mounted around the entire perimeter of the robot base, just above wheel level. These sensors recognize collisions and reduce the resulting damage.
4.2 Mobile robotic teleguide based on video images<br />
The work of Livatino et al. [59] performs a systematic evaluation of the impact of different stereoscopic visualization modes on performance in telerobotics tasks. The paper describes the design of the evaluation experiment and presents and analyses its results.
The experiment involved 12 participants. Each of them executed a simple teleguide task, which consisted in teleoperating the 3MORDUC platform (located at DIEES, Catania, Italy) from Aalborg University (Denmark). The participants were able to visualize the video data from the 3MORDUC cameras. Each participant executed the task using two different visualization setups: a 15” laptop and a 2 × 2 projected wall display. Besides, the task was executed twice for each setup, using monoscopic and stereoscopic visualization respectively. In the laptop setup, stereoscopic visualization used colored anaglyphs, while in the wall display setup it used polarized projection.
A set of both qualitative and quantitative parameters was evaluated during the trials. A two-way ANOVA (ANalysis Of VAriance) was performed to measure the statistical significance of the quantitative results. The results show that stereo visualization introduces a significant reduction of the collision rate. This is because stereo visualization strongly enhances the sense of depth of the visualized scene. Furthermore, realism and the user's sense of presence in the remote environment are higher than with monoscopic visualization.
As regards the comparison between the laptop and wall display setups, it has been shown that in the laptop setup users benefit from a stronger depth perception and obtain a lower number of collisions. The wall display, instead, causes a wider use of peripheral vision; it thus generates a higher sense of presence and confidence, which leads to higher mean speeds.
Main points:<br />
• Stereoscopic visualization enhances collision avoidance<br />
• Stereoscopic visualization increases realism and sense of presence<br />
4.3 Depth-enhanced mobile robot teleguide based<br />
on laser images<br />
The work of Livatino et al. [60] performs a systematic evaluation analogous to the one described in [59]. However, the evaluated teleguide interface displays synthetic images generated from laser scans instead of real camera images. In this telerobotic system, laser data is processed on the robot to construct, in real time, 2D maps of the workspace surrounding the robot. A 3D representation is extrapolated from the 2D maps by elevating wall lines and obstacle posts. A current front view of the robot workspace is then generated and displayed to the user using graphical software (figure 19). The teleoperation task to be executed by the participants and the visualization setups used during the experiment were the same used in [59].
Figure 19: The process of generating 3D graphical environment views from laser range information. The top-left image shows a 2D floor map generated by the laser sensor. The bottom-left image shows a 3D extrapolation of a portion of it. The right image shows a portion of the workspace visible to a user during navigation. [60]
As regards the stereo-mono and laptop-wall comparisons for the laser-based visualization interface, the results obtained are analogous to the ones exposed in [59]. It can be deduced that stereoscopic visualization permits a significant decrease in collision rates regardless of whether the interface is visual-based or laser-based. Besides, it is shown that participants using the laser-based interface perform better in terms of completion time. This is supposed to be due to the real-time performance provided by the laser-based interface. In fact, since visual data requires a significant bandwidth to be transmitted, the average delay between the display of two consecutive video images is about one second; as exposed in [61], this strongly decreases teleoperation performance. Laser data, instead, requires a much smaller bandwidth than visual data, so a teleoperation client can receive and process it in real time, thus increasing the operator's performance in terms of average speed.
Main points:<br />
• Stereoscopic visualization benefits are present also in the case of<br />
laser-generated images<br />
• Laser data can be used in real time to increase performance
4.4 Augmented reality stereoscopic visualization<br />
for intuitive robot teleguide<br />
The paper of Livatino et al. [13] proposes a methodology for fusion of<br />
laser and visual data in a teleoperation interface. This methodology<br />
exploits augmented reality to realize a coherent and intuitive visualiza-<br />
tion of integrated data, and uses stereoscopy to increase teleoperation<br />
efficiency.<br />
The interface presented in this work represents laser data as virtual<br />
overlays on the video images received by the robot cameras. Three<br />
different kinds of virtual overlays are used:<br />
• proximity planes, semi-transparent colored layers superimposed<br />
on the objects within the scene (figure 20a);<br />
• rays, colored lines departing approximately from the camera po-<br />
sition and reaching the closest objects (figure 20b);<br />
• distance values, indications of the absolute distance between the<br />
robot and the objects (figure 20b).<br />
The virtual overlays are colored depending on the distance between the robot and the real objects to which they correspond: red overlays correspond to the nearest objects, yellow overlays to objects at medium distances, and green overlays to the furthest objects.
The laser measures are linearly mapped to image pixels between the left and the right margins of the image. A semi-automatic calibration permits the user to adjust the first and the last mapped angles. Then, edge detection is executed on the image in order to locate the bases of the objects in the image (by taking the first edge pixels from the bottom of the image) and to vertically align the virtual overlays with the real objects.
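Two elements of this mapping can be sketched as follows; the margin angles, image width and distance thresholds are illustrative values, not those of the calibrated system in [13]:

```python
# Hedged sketch of (a) the linear mapping of laser beam angles to image
# columns between the two calibrated margin angles, and (b) the
# distance-to-color coding used for the overlays.

def angle_to_column(angle, first_angle, last_angle, width):
    """Linearly map a laser angle to an image column in [0, width-1]."""
    t = (angle - first_angle) / (last_angle - first_angle)
    return int(round(t * (width - 1)))

def overlay_color(distance, near=0.5, far=2.0):
    """Red for the nearest objects, yellow for medium distances,
    green for the furthest, as in the proximity planes of [13];
    the `near`/`far` thresholds are assumed values."""
    if distance < near:
        return "red"
    if distance < far:
        return "yellow"
    return "green"
```

The semi-automatic calibration described above amounts to letting the user adjust `first_angle` and `last_angle` while watching the overlay move over the live image.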
Figure 20: (a) Proximity planes overlaid on the image. (b) Rays and distance values<br />
overlaid on the image. [13]<br />
A pilot test has been carried out by teleoperating the 3MORDUC from the 3D Visualization and Robotics Lab at the University of Hertfordshire, using both monoscopic and stereoscopic visualization. Although the results have been encouraging as regards the use of augmented reality and the semi-automatic calibration, the system has proven not to be ready for stereoscopic visualization yet. Since edge detection is performed on the left and right images independently, it generates different results (especially if the quality of the images is low and artifacts are present). This often causes the drawing of non-corresponding virtual overlays, which need to be detected and deleted/recomputed before rendering.
Main points:<br />
• effective methodology to integrate laser and visual data in a co-<br />
herent representation<br />
• use of different AR features to highlight distances from objects<br />
• necessity of a technique for excluding non-corresponding measures
4.5 Summary and analysis<br />
In [59] and [60] two different approaches to visualization interfaces for<br />
mobile robot teleoperation are described. The visual-based approach<br />
consists in displaying a (mono or stereo) video stream from the remote<br />
site, while the laser-based approach consists in displaying a synthetic<br />
view of the robot workspace generated from laser range data.<br />
Each approach has its strengths and weaknesses. Visual data is rich in contrast, and provides a large amount of information and a wide field of view to the operator; however, the massive quantity of data to be transmitted implies a delay between the visualization of consecutive frames when the available bandwidth is low. Laser-generated images are much poorer in detail than video, but they can be generated and visualized in real time.
The works described above show that both visualization methods greatly benefit from stereoscopic visualization. Stereo increases the operator's sense of presence in the remote environment and the perceived sense of depth, thus increasing driving accuracy.
Livatino et al. [13] introduce an innovative methodology to join the advantages of these two approaches by using augmented reality. Colored overlays are used to fuse visual and laser data into a unique, coherent and intuitive representation. However, applying stereoscopic visualization to this methodology has proven not to be straightforward: a technique has to be developed to reconcile the results of left and right image processing in order to obtain a correct stereo rendering.
5 Proposed method: AR stereoscopic visualization<br />
5.1 Core idea and motivation<br />
The purpose of this work has been to design and implement a laser-and-vision-based visualization approach for mobile robot teleguide. The proposed approach is meant to fully exploit the benefits provided by augmented reality and stereoscopic visualization described in the previous sections in order to assist robot navigation.
The proposed visualization approach has been developed with the<br />
following aims:
1. the approach should appropriately communicate distance information from a laser sensor to the user; laser data should be represented in a way as intuitive and easy to interpret as possible;
2. the approach should be fully applicable to both monoscopic and stereoscopic visualization setups; it should be flexible and perform well even using a single camera image; at the same time, it should be designed to take advantage of stereo visualization features, where stereo is available;
3. the approach should avoid any useless increase of the operator's cognitive workload; the interface design should be such that the operator does not need to frequently shift attention between different elements, and teleguide should be as little tiring as possible;
4. the approach should be robust with respect to sensor inaccuracy and errors; the visualization method should perform well even when sensor data is noisy or incorrect (e.g. in the case of invalid disparity or laser outliers).
In order to achieve the above aims, techniques described in the literature can be used and improved. Specifically, the developed interface is based on:
Augmented reality As described in section 3, augmented reality is an extremely convenient method for the representation of sensor data. Since it integrates sensor and visual data inside a single display, competition for user attention is avoided. Moreover, if sensor data are represented as immersed in the real workspace, correlating sensor data with real objects becomes easy and intuitive. In the case of a unified laser-video representation, laser distance data can be visualized directly on the corresponding zones of the camera image, thus giving a depth dimension to the image.
3D overlays Differently from bidimensional overlays, 3D objects can be rendered so as to look nearer to or further from the viewpoint. The depth of 3D graphical objects can be represented through stereo visualization or, where a single camera is available, through monocular depth cues (e.g. perspective, occlusion). Therefore, 3D objects are ideal for communicating depth information.
Colors Since colors are a very effective means to convey information to humans, they can be used to make data interpretation faster and more intuitive. As in [9, 13, 24], the proposed visualization approach associates different colors to different distance values.
Image processing As described in [54], image processing can be used to retrieve distance information. This information can be integrated with laser measures to increase range data reliability.
5.2 Research development strategy<br />
The visualization approach introduced in the previous section has been implemented within the MOSTAR (MORDUC teleguide through STereoscopic Augmented Reality) interface for teleoperation of the 3MORDUC platform.
The development of the MOSTAR interface has been divided into four main steps. This section gives a brief overview of these steps and of their main issues. Sections 6 to 9 describe each step in detail.
Figure 21: Diagram of development and implementation steps: design of the AR-based sensor fusion visualization (choice of 3D objects and color range, design of the edge detection and nearest-objects detection algorithms), definition of the laser-camera model and calibration procedure, design of the stereo alignment method and stereo correspondence algorithm, development of the AR-based interface and its extension for stereoscopic visualization, and testing.
5.2.1 Design of an AR-based sensor fusion visualization<br />
The first step in the development of the MOSTAR interface has been the definition of a method to convert laser data into a set of graphical objects to be overlaid onto the camera images.
Colored 3D virtual objects have been chosen to represent laser data on the image. Virtual objects are positioned within the virtual 3D workspace according to the laser measures, and they are rendered as a semi-transparent overlay above the camera image. It has been necessary to select appropriate 3D objects to represent laser data, and colors suitable for mapping distance in an intuitive way (section 6.1).
As shown in section 3, determining the most effective visualization method for complex data like those provided by a mobile robot is not a straightforward issue. It is necessary to take into account numerous factors, among which the specific application context and the user's particular preferences. For example, a 2D bird's-eye-view map of the robot workspace can be a very effective visualization method for the exploration of an environment, but it would usually not be sufficient for obstacle avoidance manoeuvres. Therefore, several representation modes have been designed for the MOSTAR interface, and several brief tests have been performed in order to determine the strengths and weaknesses of each mode.
Furthermore, an algorithm to refine the graphical appearance of the<br />
overlay by detecting potential laser outliers has been developed (section<br />
6.2).<br />
5.2.2 Definition of a laser-camera model and a calibration<br />
procedure<br />
A calibration procedure is clearly needed in order to correctly align<br />
virtual objects defined in section 6 with the real objects in the camera<br />
images.<br />
Several approaches that permit automatic determination of the intrinsic<br />
parameters of a camera, and of its extrinsic parameters with respect to<br />
a world coordinate system, have been proposed in the literature [62, 63].<br />
These approaches calculate the parameters by analysing a set of chosen<br />
calibration images, and guarantee the optimality of the parameters in<br />
terms of accuracy through several statistical techniques. However, they<br />
require rather long and complicated calibration procedures, which have<br />
to be repeated every time the camera and/or the laser sensor is moved,<br />
if the alignment accuracy is to be maintained.<br />
For these reasons, a semi-automatic feedback-based calibration has<br />
been preferred for the MOSTAR interface. This kind of calibration<br />
consists in manually varying a restricted set of parameters while seeing<br />
the results of these variations in real time. In other words, while the<br />
calibration parameters are adjusted, the virtual overlay is redrawn<br />
according to the currently selected values. The user can gradually align<br />
the AR overlay with the objects in the image, before or during the<br />
teleguide, and can reach a good degree of accuracy in a few minutes<br />
of adjustments, without any particular effort. Section 7 describes the<br />
developed feedback-based calibration procedure.<br />
The alignment precision is slightly inferior to what may be obtained<br />
with an automatic procedure, but it has proven to meet the requirements<br />
in most cases, and can be increased by exploiting image processing (see<br />
section 8).<br />
5.2.3 Development of a method for integration of image features<br />
Image processing has been employed within the MOSTAR interface for<br />
two different purposes: to improve the alignment of the overlay with<br />
the camera image, and to increase the reliability of the sensor data by<br />
detecting possible erroneous measurements of the laser sensor.<br />
The image processing technique chosen for the MOSTAR interface<br />
is edge detection. Analysis of the edges inside the images of the robot<br />
workspace has been used to detect walls and potential obstacles. A<br />
technique has been implemented for finding the nearest objects that<br />
the robot is facing, by identifying the borders of the bases of these<br />
objects (section 8.2).<br />
Once the edges of nearest objects are detected, it is necessary to<br />
integrate them with laser data somehow. Two techniques of integration<br />
have been developed and implemented:<br />
• a technique employing edges to improve camera alignment (sec-<br />
tion 8.3);<br />
• a technique employing edges to correct wrong laser measurements<br />
and identify obstacles which are invisible to the laser (section 8.4).<br />
5.2.4 Extension of the AR interface to stereoscopic visualization<br />
As explained in section 4, stereoscopic visualization helps to reduce<br />
collisions during robot teleguide by enhancing the user's depth<br />
estimation. Since stereo can positively influence the visualization of<br />
both real and synthetic images (as proven in [59, 60]), the MOSTAR<br />
interface can definitely benefit from it.<br />
Stereoscopic visualization has been easy to implement in the MOSTAR<br />
interface. The MORDUC cameras already provide a synchronized<br />
stereo pair of images, which can be directly displayed to the user.<br />
On the other hand, 3D virtual objects can be easily rendered from two<br />
different viewpoints, and each view of the objects can be overlaid on<br />
the corresponding real image.<br />
The main issue in the stereo extension of the MOSTAR visualization<br />
interface has been guaranteeing a suitable disparity level between the<br />
left and the right image. Real and synthetic images must be displayed<br />
so that corresponding pixels in the two images have no vertical parallax,<br />
and their horizontal parallax is correct (i.e. non-divergent) and<br />
comfortable for the user. Moreover, the pair of real images and the<br />
pair of synthetic images must be correctly aligned. Section 9.1 explains<br />
how these issues have been managed.<br />
Furthermore, since the left and right camera images differ from each<br />
other, different edges are usually detected within them. Edge detection<br />
being intrinsically imperfect, some edges are detected in one image but<br />
not in the other. Therefore, a method has been implemented within<br />
the MOSTAR interface to deal with non-corresponding edges (section<br />
9.2).<br />
5.2.5 Implementation and testing<br />
The MOSTAR interface has been implemented as a Visual C++ .NET<br />
application for Microsoft Windows. 3D rendering has been realized<br />
by means of OpenGL [64], and the GLUT library [65] has been used<br />
for windows and input handling. Image processing operations have<br />
been performed using functions from the OpenCV library [66]. The<br />
HTTP protocol and the WinHTTP library have been used to exchange<br />
driving commands and sensor data with the server program resident on<br />
the MORDUC platform.<br />
The MOSTAR interface has been subjected to several offline and<br />
online tests. During offline tests, the MOSTAR interface was used to<br />
display visual and laser data collected during previous teleguide ses-<br />
sions. During online tests, the MOSTAR interface was used to actively<br />
teleoperate the MORDUC platform in real-time. Online teleguide tests<br />
were performed from the 3D Visualization and Robotics Lab at the<br />
University of Hertfordshire, United Kingdom.<br />
During all tests, the MORDUC laser sensor was configured to sample<br />
181 distance values in the zone in front of the robot, with an angular<br />
resolution of 1°. The STH-MDCS2-VAR-C stereo cameras were used<br />
as visual sensors during most of the tests, using an image resolution<br />
of 640 × 480 pixels (per single image). During some of the tests, two<br />
Microsoft Lifecam Show webcams were used, mounted in a stereo con-<br />
figuration with a slightly different position and vertical inclination from<br />
the original setup.<br />
The visualization interface was run on a mid-range laptop (Intel<br />
Core 2 Duo T7500 processor, 2 GB RAM, ATI Mobility Radeon<br />
HD2600 graphics card). The timing values reported in the next sections<br />
refer to this configuration.<br />
Several informal tests were conducted during the various implemen-<br />
tation steps by the developers, in order to validate design choices. A<br />
pilot test was conducted on the final version of the interface. There were<br />
four participants, all with moderate knowledge of augmented reality and<br />
stereoscopic visualization, and with no experience in robot teleoperation.<br />
The developers observed the performance of each participant,<br />
collecting impressions and comments. The results of the tests are re-<br />
ported in sections 6 to 9, depending on the aspect of the interface they<br />
are related to.<br />
6 Effective multi-sensor visual representation<br />
We describe here the set of augmented reality features developed for<br />
joint visualization of laser and video data. The features have been<br />
designed to assist the user during navigation and obstacle avoidance.<br />
Visualization methods for the other MORDUC sensors are under<br />
development, but have not been implemented yet.<br />
6.1 Visualization of laser data through AR features<br />
The MORDUC laser sensor provides a precise estimate of the distance<br />
between the robot and the surrounding obstacles. The MOSTAR in-<br />
terface uses AR to visualize this estimate in a way that facilitates im-<br />
mediate comprehension.<br />
Each single set of laser measurements is processed independently by the<br />
laser visualization algorithm. Given each point p detected by the laser<br />
sensor on its plane, the 2D coordinates of p with respect to the laser<br />
origin are calculated:<br />
x_p = d_p cos(α_p)<br />
z_p = d_p sin(α_p)<br />
where d_p and α_p are, respectively, the distance value and the laser<br />
rotation corresponding to the measurement of point p.<br />
Each laser point is assigned a particular color depending on its distance<br />
value. As in [13], nearer points are assigned a color with a higher<br />
red component and a lower green component, while farther points are<br />
assigned a color with a higher green component and a lower red<br />
component. A minimum and a maximum distance, depending on the<br />
application, are set. Points with distance equal to or lower than the minimum<br />
distance will be pure red, points with distance equal to or higher than<br />
the maximum distance will be pure green. Distances between the two<br />
extremes are linearly mapped to the red-green range (figure 22).<br />
Figure 22: Assigning colors to a set of laser points. The red line and the green line<br />
represent, respectively, the minimum and maximum distance limits.<br />
As stated in [24], the human eye is more sensitive to variations in the HSV<br />
color space than in the RGB space; however, colors in the red-yellow-green<br />
range have a stronger impact on the user than other colors, since they are<br />
conventionally associated with danger, caution and safety [67]. Since the<br />
perception of distance through this range of colors is supposed to be more<br />
intuitive, and since we consider immediacy of interpretation more important<br />
than a very high resolution in highlighting distances, our choice has fallen<br />
on this range rather than on the HSV range.<br />
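The coordinate conversion and the red-green mapping just described can be sketched in a few lines. This is an illustrative Python sketch (the actual MOSTAR implementation is in Visual C++); the function names `laser_to_xz` and `distance_color` are our own, and the distance limits in the usage example are arbitrary:

```python
import math

def laser_to_xz(d, alpha_deg):
    """Convert a laser measurement (distance, rotation angle) to 2D
    coordinates on the laser plane: x = d*cos(a), z = d*sin(a)."""
    a = math.radians(alpha_deg)
    return d * math.cos(a), d * math.sin(a)

def distance_color(d, d_min, d_max):
    """Linearly map a distance to the red-green range: distances at or
    below d_min give pure red, at or above d_max pure green."""
    ratio = (d - d_min) / (d_max - d_min)
    ratio = min(max(ratio, 0.0), 1.0)                    # clamp to [0, 1]
    return int(255 * (1 - ratio)), int(255 * ratio), 0   # (R, G, B)
```

For example, with limits of 1000 mm and 7000 mm, a point at 1000 mm or closer maps to (255, 0, 0) and a point at 7000 mm or farther to (0, 255, 0); a point straight ahead (α = 90°) has x ≈ 0 and z equal to its distance.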
Colored laser points are used to create semi-transparent tridimen-<br />
sional objects, which are aligned with the image (as described in section<br />
7) and rendered onto it as an overlay. Two kinds of AR objects have<br />
been implemented:<br />
• virtual walls are created between each pair of consecutive points,<br />
and elevated from the ground level to a fixed height; the color of<br />
each wall is determined by the laser points which delimit it (figure<br />
23c); optionally, vertical lines are drawn in correspondence with<br />
the laser points (figure 23d);<br />
• virtual rays are drawn on the ground from the base of the robot<br />
(coincident with the projection of the laser origin on the ground)<br />
to the laser points, at regular intervals, and each of them takes<br />
the color of the corresponding point (figure 23e).<br />
Virtual walls position themselves over walls and obstacles in the robot<br />
workspace, highlighting object depth with their colors. Virtual rays<br />
point out the bases of the obstacles, and give the user a hint for<br />
estimating the distance between the robot and them. Several concentric<br />
circles, each of which is at a fixed distance from the previous one<br />
(0.5 m), are also drawn at the ground level, and serve as a further hint<br />
for estimating distances (figure 23 c-e).<br />
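As an illustration of how the virtual-wall geometry can be built from the laser points, the sketch below (hypothetical helper, our own naming; the real interface builds equivalent OpenGL primitives) creates one quad per pair of consecutive points, from ground level up to a fixed height:

```python
def build_wall_quads(points_xz, height):
    """Given consecutive laser points (x, z) on the ground plane, return
    one quad (four 3D vertices) per pair of consecutive points, elevated
    from y = 0 (the ground) to y = height."""
    quads = []
    for (x1, z1), (x2, z2) in zip(points_xz, points_xz[1:]):
        quads.append([(x1, 0.0, z1), (x2, 0.0, z2),
                      (x2, height, z2), (x1, height, z1)])
    return quads
```

Each quad would then be rendered semi-transparently, with per-vertex colors taken from the distance-to-color mapping of its two delimiting laser points.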
The rendering of tridimensional objects by means of the OpenGL<br />
library is much more effective than the methods used in [24] and [13],<br />
which are based solely on the drawing of bidimensional aids. In fact, as<br />
explained in section 5, tridimensional rendering can represent depth in<br />
a more intuitive way than bidimensional overlays.<br />
The methods of representing laser data together with video data<br />
described in [26] and [25] are similar to the one just described. However,<br />
they are based on augmented virtuality rather than on augmented<br />
reality. As described in section 3.7, those approaches present some<br />
limitations which the approach proposed here does not. In fact,<br />
differently from [26], since our approach directly superimposes virtual<br />
objects above the corresponding real objects, correlation is easily and<br />
automatically established by the operator. Besides, while the approach<br />
Figure 23: (a) Plain camera image. (b) Laser map of the environment surrounding<br />
the robot. (c) Virtual walls overlaid onto the camera image. (d) Virtual walls and<br />
laser lines overlaid onto the camera image. (e) Virtual rays overlaid onto the camera<br />
image.<br />
of [25] is sensitive to calibration inaccuracies and to the validity of<br />
disparity data, the method described here is independent of disparity<br />
and relatively robust to calibration inaccuracies.<br />
6.2 Detection of discontinuities<br />
Virtual walls are a valid hint for estimating depth of the objects within<br />
the workspace, but, if rendered as they are, they have a drawback.<br />
Connecting each pair of consecutive laser points with a wall implies the<br />
creation of a unique, large surface surrounding the robot. This means<br />
that separate objects are represented as “melted” with each other (fig-<br />
ure 24a). In order to minimise this problem, a simple discontinuity<br />
Figure 24: (a) The walls and the box are covered by the same virtual wall, giving<br />
the user the wrong clue that they constitute a unique surface. (b) Discontinuity<br />
detection separates different objects.<br />
detection algorithm is executed on the laser points before rendering.<br />
For each wall between a pair of points, a slope coefficient is<br />
calculated:<br />
slope_p = (dist_{p−1} − dist_p) / (dist_{p−1} + dist_p)<br />
where dist_{p−1} and dist_p are the distance values corresponding to the<br />
two points which delimit the wall. It can be observed that the slope<br />
can assume values in the [−1, 1] interval. If the slope of a wall is<br />
different from the slope of the adjacent walls by a quantity higher<br />
than a parameterizable threshold (slopeTh), that wall is marked as a<br />
discontinuity and excluded from the rendering (figure 24b). Virtual<br />
rays are always drawn in correspondence of a discontinuity, in order to<br />
point out the edges of the objects in the workspace.<br />
Discontinuity detection is also used to detect potential laser outliers<br />
(figure 25). Isolated laser outliers often have a distance value strongly<br />
different from that of their immediate neighbors; therefore, the pair<br />
of walls which contain a laser outlier is very likely to be marked as<br />
discontinuities. This fact is exploited by searching for consecutive discontinuities.<br />
As the discontinuity detection algorithm is performed, laser points be-<br />
tween two consecutive discontinuities are labeled as potential outliers.<br />
In section 8 a method will be described to separate almost certain<br />
outliers from correct measurements.<br />
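The discontinuity test and the consecutive-discontinuity outlier labelling can be sketched as follows. This is an illustrative Python sketch; in particular, flagging a wall when its slope differs from the slope of every adjacent wall by more than slopeTh is our reading of the description above:

```python
def find_discontinuities(dist, slope_th=0.05):
    """Wall i connects laser points i-1 and i (i >= 1). A wall is marked
    as a discontinuity when its slope differs from the slope of every
    adjacent wall by more than slope_th."""
    n = len(dist)
    # slope[i] is defined for walls i = 1 .. n-1; index 0 is a dummy.
    slope = [0.0] + [(dist[i - 1] - dist[i]) / (dist[i - 1] + dist[i])
                     for i in range(1, n)]
    disc = set()
    for i in range(1, n):
        neighbors = [j for j in (i - 1, i + 1) if 1 <= j < n]
        if neighbors and all(abs(slope[i] - slope[j]) > slope_th
                             for j in neighbors):
            disc.add(i)
    return disc

def potential_outliers(dist, slope_th=0.05):
    """Laser points lying between two consecutive discontinuity walls."""
    disc = find_discontinuities(dist, slope_th)
    return {i for i in disc if i + 1 in disc}
```

With a single spike of 3000 mm in the middle of a run of 1000 mm readings, the two walls containing the spike get slopes of −0.5 and +0.5, both walls are flagged, and the point between them is labelled a potential outlier.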
6.3 Testing<br />
Participants in the pilot test confirmed virtual objects as a valuable<br />
aid for navigation. All the participants found it easier to estimate<br />
distances and to understand the conformation of the workspace when<br />
the virtual overlay was enabled.<br />
Virtual walls were judged the most useful kind of virtual objects, since<br />
they clearly highlighted walls and obstacles within the environment.<br />
Thus, they made it easier for users to detect the position and size of<br />
obstacles. By contrast, virtual rays were considered a little confusing,<br />
especially when they were not reinforced by virtual walls and when the<br />
alignment with real objects was not precise. One of the participants<br />
also found that they were not clearly visible, especially in comparison<br />
with the virtual walls.<br />
Virtual circles on the ground were found useful for determining<br />
distances. Still, participants felt the need for some hint indicating<br />
absolute distances.<br />
Figure 25: (a) Visual artifacts created by some outliers (caused by the black squares<br />
of the chessboard, which do not reflect laser beams). (b) The artifacts are partially<br />
eliminated by the discontinuity detection.<br />
The discontinuity detection algorithm gave excellent results, even<br />
when objects with an irregular shape (e.g. people) were in the range<br />
of the laser sensor. A fixed value of 0.05 for slopeTh proved to work<br />
well in most cases.<br />
Calculation and rendering of the overlays showed excellent timing<br />
performance. Processing of laser data and rendering of the virtual<br />
objects always took less than 10 ms, which is a perfectly acceptable<br />
delay for AR applications according to [36].<br />
7 Laser-camera alignment and calibration<br />
The 3D virtual objects described in the previous section have<br />
coordinates defined with respect to the laser sensor origin. Laser-camera<br />
alignment makes it possible to determine where these coordinates are<br />
to be drawn within the camera image.<br />
The alignment is performed by adjusting a set of intrinsic and extrinsic<br />
parameters in order to replicate those of the real camera, then using<br />
these parameters to define a virtual viewpoint on the virtual scene<br />
overlaid onto the image. This way, the virtual camera looks at the<br />
virtual workspace with approximately the same position and orientation<br />
as the real camera with respect to the real workspace, so that the<br />
rendered scene coincides with the camera scene.<br />
7.1 Laser-camera model<br />
In order to keep the calibration procedure simple, the alignment<br />
algorithm is based on the undistorted pinhole camera model, and poses<br />
several constraints on the orientation of the camera.<br />
Since the dimensions of the camera image are fixed (they depend on the<br />
camera output resolution), and since the pinhole camera model does not<br />
consider distortions, the only variable intrinsic parameter for camera<br />
calibration is the focal length. The focal length of a camera in millimeters<br />
is often available among the manufacturer's specifications, therefore it<br />
is usually easy to retrieve. However, for calibration it is necessary to<br />
convert its value into pixels. It is possible to perform the conversion<br />
from the value in millimeters, but this requires knowing the dimensions<br />
of the CCD/CMOS camera sensor, which are not usually published by<br />
manufacturers.<br />
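When the sensor width is known, the conversion follows directly from the pinhole model: the focal length expressed in pixels and in millimeters measure the same physical length in two different units. A sketch (our own symbols; the numbers in the usage example are merely plausible, e.g. a 6 mm lens on a sensor 4.8 mm wide):

```python
def focal_mm_to_px(focal_mm, image_width_px, sensor_width_mm):
    """Convert a focal length from millimeters to pixels:
    f[px] = f[mm] * (image width in pixels) / (sensor width in mm)."""
    return focal_mm * image_width_px / sensor_width_mm
```

For instance, a 6 mm lens with a 640-pixel-wide image on a 4.8 mm-wide sensor gives a focal length of 800 pixels.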
As regards the orientation of the camera in respect to the laser<br />
system of coordinates, three assumptions are made:<br />
• the x axis of the camera system of coordinates is parallel to the<br />
x axis of the laser system of coordinates; that is, neither panning<br />
movements nor roll rotations of the camera in respect to the laser<br />
are permitted;<br />
• the x axis of the camera system and the x axis of the laser system<br />
point towards the same direction;<br />
• tilt movements of the camera (i.e. rotations around the x axis)<br />
with respect to the laser system of coordinates are confined between<br />
−90° and 90°; that is, the camera optical axis and the laser z axis<br />
"look" approximately in the same direction.<br />
These assumptions are satisfied by the MORDUC platform (the stereo<br />
cameras are parallel and oriented in the direction of the laser z axis,<br />
and they have a slight tilt angle towards the ground), and are generally<br />
reasonable.<br />
The variable calibration parameters left by the approximations described<br />
above are only four: the position coordinates of the camera center with<br />
respect to the laser origin (x, y and z) and the tilt angle of the camera<br />
(figure 26). In the case of a panning-enabled camera it is possible to<br />
add a fifth parameter, namely the pan angle of the camera.<br />
The resulting laser-camera model therefore presents five (six, if panning<br />
is possible) parameters to configure. They are sufficient for the<br />
definition of the OpenGL viewing transforms for the rendering of the<br />
overlay on the image: camera position and tilt are used to position and<br />
orient the point of view, while focal length and image aspect ratio (which<br />
is known a priori) are used to determine the perspective projection<br />
matrix.<br />
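Under this model, projecting a laser point into the image reduces to a translation by the camera position, a rotation by the tilt angle, and a pinhole projection. The following sketch is our own explicit formulation (y axis up, camera looking along +z, tilt sign a convention of ours); the real interface delegates these transforms to the OpenGL pipeline:

```python
import math

def project_point(p_laser, cam_pos, tilt_deg, f_px, cx, cy):
    """Project a 3D point given in laser coordinates onto the image of a
    camera offset by cam_pos and tilted by tilt_deg about the x axis.
    f_px is the focal length in pixels; (cx, cy) is the image center."""
    # Translate into a frame centered on the camera.
    x = p_laser[0] - cam_pos[0]
    y = p_laser[1] - cam_pos[1]
    z = p_laser[2] - cam_pos[2]
    # Apply the inverse of the camera tilt (rotation about the x axis).
    t = math.radians(tilt_deg)
    yc = y * math.cos(t) + z * math.sin(t)
    zc = -y * math.sin(t) + z * math.cos(t)
    # Pinhole projection; the image v axis points downwards.
    u = cx + f_px * x / zc
    v = cy - f_px * yc / zc
    return u, v
```

With zero offsets and zero tilt, a point straight ahead projects to the image center, and a point 1 m to the right at 2 m depth with f = 500 px lands 250 px to the right of it.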
The next section describes a simple procedure for the manual cali-<br />
bration of the parameters.<br />
Figure 26: Graphical representation of the laser-camera system and of calibration<br />
parameters.<br />
7.2 Feedback-based calibration procedure<br />
Given the model and the configuration parameters described in the<br />
previous section, it is possible to define a sequence of steps to obtain<br />
a satisfying set of values for those parameters through feedback-based<br />
calibration.<br />
1. Adjust focal length of the virtual camera until the level of zoom<br />
on the overlay is equal to the level of zoom on the real image. Take<br />
as a reference the furthest object within the scene, and modify<br />
the parameter until the horizontal dimension of the corresponding<br />
virtual wall matches (figure 27b).<br />
2. Adjust the y camera coordinate and tilt angle until the "virtual<br />
floor" is aligned with the real floor (figure 27c). Adjusting the<br />
y coordinate moves the virtual camera up and down in the 3D<br />
space (that is, it moves the overlay down/up with respect to the<br />
real camera image). Virtual rays and circles can be used as aids<br />
to obtain a good alignment.<br />
3. Adjust z camera coordinate until the horizontal dimensions of<br />
both far and near objects match with the correspondent virtual<br />
walls (figure 27d). Adjusting the z coordinate moves the vir-<br />
tual camera forward and backward (that is, it moves the overlay<br />
backward/forward with respect to the real camera image). While<br />
modifying the focal length controls the dimension of near and far<br />
parts of the overlay in the same way, modifying the z coordinate<br />
of the virtual camera has a strong effect on near virtual objects<br />
and almost no influence on far ones. Therefore, it is suggested to<br />
set the focal length first, using a very far object as a reference (as<br />
in point 1), then adjust the z coordinate using a very near object<br />
as a reference.<br />
4. Adjust x camera coordinate to eliminate the horizontal offset<br />
between the overlay and the real image (figure 27e). Adjusting<br />
the x coordinate moves the virtual camera to the left and to the<br />
right (that is, it moves the overlay to the right/left with respect<br />
to the image).<br />
This specific order of calibration minimizes the interference of the<br />
adjustment of one calibration parameter with the others, so it avoids<br />
the need for the user to retrace his steps and adjust the same parameters<br />
again and again. However, if the final result is not satisfying, the user is<br />
free to modify any parameter at any moment, even during the teleguide.<br />
7.3 Comparison with automatic calibration<br />
The ease of calibration of the MOSTAR interface is due to the reduced<br />
number of parameters the user has to deal with. Therefore, it is<br />
Figure 27: (a) Overlay at the beginning of the calibration, (b) after the adjustment<br />
of the focal length, (c) after the adjustment of the y coordinate and the tilt angle,<br />
(d) after the adjustment of the z coordinate, (e) after the adjustment of the x<br />
coordinate.<br />
directly dependent on the approximations and constraints on the laser-<br />
camera model described in section 7.1. The feedback-based calibration<br />
procedure presented here could be also applied to more general cases,<br />
but at the expense of its simplicity. Automatic calibration may be<br />
preferable in the most general cases, or when high accuracy is needed.<br />
On the other hand, the strength of the feedback-based calibration is<br />
that it achieves an accuracy amply sufficient in most cases, without<br />
requiring significant time or effort from the user.<br />
The feedback calibration procedure described in this section has been<br />
implemented in the MOSTAR interface, but the sensor representation<br />
logic described in section 6 is independent of the specific calibration<br />
procedure. In fact, since (as described at the beginning of this section)<br />
the alignment between the virtual overlay and the image is performed<br />
simply by setting the OpenGL viewpoint parameters (specifically the<br />
frustum shape and size, and the position and orientation of the<br />
viewpoint), there is no constraint on the method used to calculate these<br />
parameters. The OpenGL camera model does not model lens distortion;<br />
however, several methods exist to simulate distortion by texture<br />
mapping [68] or by the OpenGL Shading Language [57, 69].<br />
Consequently, the AR features of the MOSTAR interface are also<br />
applicable to the case of a general laser-camera model, and can be used<br />
together with any calibration method.<br />
7.4 Testing<br />
The feedback-based calibration met expectations, showing good<br />
performance with both camera setups (STH-MDCS2-VAR-C stereo<br />
cameras and Microsoft Lifecams). With a few minutes of calibration<br />
it was possible to obtain a sufficient degree of alignment. Subtle<br />
misalignments could not be completely eliminated, but usually they did<br />
not bother users.<br />
Participants who knew in advance the meaning of the calibration<br />
parameters found the calibration procedure easy and effective. However,<br />
participants who did not have any advance knowledge about the camera<br />
model and the meaning of the calibration parameters found it slightly<br />
counterintuitive. In fact, it was not easy to deduce the nature of a<br />
parameter only by watching the effect of its adjustments. Therefore, the<br />
possibility of adding some (perhaps graphical) hints to the interface has<br />
been taken into consideration, in order to make the nature and effect of<br />
each calibration parameter clearer to users.<br />
Finally, participants discovered that virtual rays and lines on virtual<br />
walls may assist calibration. In fact, their ends mark the precise<br />
position of each laser point. Participants found it easier to position the<br />
overlay over the camera image when they knew precisely the point hit<br />
by each laser beam.<br />
66
8 Integrating 3D Graphics with image<br />
processing<br />
This section describes the technique used within the MOSTAR interface<br />
to integrate information from the edges in the camera image with laser<br />
data. Briefly, the edges of the objects occupying the robot's field of view<br />
are located inside the image; then, edge pixels are unprojected to the 3D<br />
world coordinate system and their position is calculated. The distance<br />
between the robot and each of these points is calculated and compared<br />
with the corresponding laser measurements, so that the correctness of<br />
each laser measurement can be double-checked.<br />
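For the simplest configuration, unprojecting an edge pixel that lies on the ground plane reduces to intersecting the pixel's viewing ray with the floor. The sketch below is our own simplification, assuming an untilted camera (the actual interface must also account for the camera tilt of its model); it is the inverse of the pinhole projection:

```python
def unproject_ground_pixel(u, v, f_px, cx, cy, cam_height):
    """Intersect the viewing ray of pixel (u, v) with the ground plane,
    for an untilted camera at cam_height above the floor (image v axis
    pointing down). Returns the ground point (x, z), or None for pixels
    at or above the horizon, whose rays never reach the ground."""
    dv = v - cy
    if dv <= 0:                      # ray parallel to or above the floor
        return None
    z = f_px * cam_height / dv       # depth along the optical axis
    x = (u - cx) * cam_height / dv   # lateral offset on the ground
    return x, z
```

For example, with f = 500 px and a camera 0.5 m above the floor, a pixel 125 rows below the image center unprojects to a ground point 2 m ahead; the distance to that point can then be compared with the laser reading at the same bearing.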
The technique used by the MOSTAR interface to test laser<br />
measurements is analogous to the one described in [54], which uses<br />
disparity information retrieved from a stereo pair to identify points of<br />
the images whose depth does not correspond to the value detected by<br />
the laser. However, the technique described here is based on edge<br />
detection rather than on stereoscopic disparity calculation. As a<br />
consequence, the image processing technique used in the MOSTAR<br />
interface is applicable even where only one camera is available. In<br />
addition, edge detection algorithms are usually faster in terms of<br />
performance than stereo disparity calculation algorithms.<br />
8.1 Edge detection algorithm<br />
The process used for edge detection in camera images is divided into<br />
two steps.<br />
First, the image is converted to grayscale and preprocessed by a<br />
contrast stretching function. Two different gray value thresholds<br />
(lowTh < highTh) are applied to the image: the intensity of pixels<br />
whose original value is lower than lowTh is set to the minimum value<br />
(black), while the intensity of pixels whose original value is higher than<br />
highTh is set to the maximum value (white). Intensity values of all the<br />
other pixels are linearly mapped to the range between the minimum and<br />
maximum gray values (figure 28). This preprocessing step has two benefits. First,<br />
Figure 28: Contrast stretching function used before actual edge detection.<br />
it improves the quality of the image for edge detection, by suppressing<br />
gradients in very bright or very dark areas and increasing contrast in<br />
the rest of the image. Secondly, if the floor of the working environment<br />
is much brighter (or darker) than the rest of the image, the function can<br />
be used to neatly separate the intensity range of the floor from the<br />
intensity range of the workspace objects, simplifying the identification<br />
of edges on the ground, which correspond to the bases of obstacles.<br />
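The contrast stretching function of figure 28 can be sketched in a few lines (an illustrative NumPy version; the interface applies the equivalent operation to OpenCV images):

```python
import numpy as np

def contrast_stretch(img, low_th, high_th):
    """Map pixel intensities: values <= low_th become 0 (black), values
    >= high_th become 255 (white), and values in between are linearly
    rescaled to the full [0, 255] range."""
    out = (img.astype(np.float64) - low_th) / (high_th - low_th) * 255.0
    return np.clip(out, 0, 255).astype(np.uint8)
```

For instance, with lowTh = 50 and highTh = 150, intensities 10, 100 and 200 map to 0, 127 and 255 respectively.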
The second step is processing the contrast-stretched grayscale image<br />
with the Canny edge-detection algorithm [70]. The Canny algorithm<br />
is a simple and very popular algorithm for edge-detection, based on an<br />
intensity hysteresis process. First, the Sobel operator [71] is applied<br />
to the image along the horizontal and vertical directions. The Sobel<br />
operator has two purposes: averaging intensity values of the pixels along<br />
one direction - so reducing image noise by blurring - and calculating<br />
the gradient of the pixels along the perpendicular direction. This way,<br />
a gradient value and an edge direction are retrieved for each pixel.<br />
Then, a non-maximum suppression is executed, by checking whether<br />
each pixel has the maximum gradient among its neighbors taken along<br />
its edge direction. Non-maximum pixels are excluded from the edge<br />
detection. Finally, gradient values of remaining pixels are compared<br />
with a pair of thresholds (th1 < th2):<br />
• pixels with a gradient value higher than th2 are immediately<br />
marked as edge pixels;<br />
• pixels with a gradient value lower than th2 but higher than th1<br />
are marked as edge pixels only if they are encountered along an<br />
edge which contains pixels whose gradient value is higher than<br />
th2; otherwise, they are excluded from the edge detection;<br />
• pixels with a gradient value lower than th1 are always excluded<br />
from the edge detection.<br />
The advantage of the hysteresis process with respect to a single-<br />
threshold approach is that it retains only reliable edges (pixels<br />
with a high gradient, and pixels with a low gradient which are likely<br />
to belong to a real edge because they are connected to a strong pixel).<br />
This avoids a typical problem of single-threshold approaches, namely the<br />
creation of discontinuous edges; this happens when some pixels along<br />
an edge have a gradient value slightly higher than the threshold, while<br />
others have a gradient slightly lower than it.<br />
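The hysteresis step can be illustrated on a one-dimensional chain of gradient values (a deliberate<br />
simplification: the real Canny algorithm tracks connectivity in two dimensions). This sketch is<br />
illustrative only, not the OpenCV implementation.<br />

```cpp
#include <cstddef>
#include <vector>

// Double-threshold hysteresis along a 1-D chain of pixels (sketch of the
// Canny linking step). Returns a mask that is true where a pixel is kept.
std::vector<bool> hysteresis(const std::vector<double>& grad,
                             double th1, double th2) {
    const std::size_t n = grad.size();
    std::vector<bool> edge(n, false);
    // Pass 1: pixels with gradient above th2 are immediately marked.
    for (std::size_t i = 0; i < n; ++i)
        if (grad[i] > th2) edge[i] = true;
    // Pass 2: weak pixels (gradient above th1) are kept only if connected
    // to an already-marked pixel; repeat until no pixel changes.
    bool changed = true;
    while (changed) {
        changed = false;
        for (std::size_t i = 0; i < n; ++i) {
            if (edge[i] || grad[i] <= th1) continue;
            if ((i > 0 && edge[i - 1]) || (i + 1 < n && edge[i + 1])) {
                edge[i] = true;
                changed = true;
            }
        }
    }
    return edge;
}
```

For example, with th1 = 3 and th2 = 8, the gradient chain {1, 5, 9, 5, 1} keeps the three central<br />
pixels (the weak 5s are connected to the strong 9), while {5, 1, 5} keeps nothing, since no pixel<br />
exceeds th2.<br />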
Optimal threshold values for contrast stretching and the Canny<br />
algorithm are strongly dependent on the features of the captured images<br />
(illumination, object and floor textures, etc.). No tried-and-tested<br />
approach to their determination exists yet. In the MOSTAR interface, it<br />
is possible to choose a value for each threshold by another<br />
feedback-based procedure. The edge detection calibration feature allows<br />
the user to vary each parameter while visualizing the resulting<br />
contrast-stretched grayscale image and edge image (figure 29).<br />
Figure 29: (a) Original image. (b) Contrast-stretched grayscale and edge image,<br />
visualized during edge detection parameters calibration.<br />
The edge detection algorithm has been implemented using the<br />
OpenCV library functions.<br />
8.2 Nearest edges discovery<br />
After an edge image is extracted by the method described in the previ-<br />
ous section, the NED (Nearest Edges Discovery) algorithm is executed<br />
on it. The aim of the NED algorithm is to detect the nearest objects<br />
within the area viewed by the robot camera through the analysis of<br />
edges present in the camera image.<br />
The NED algorithm begins by processing each of the laser measurements.<br />
For each laser point, the corresponding virtual ray (the line lying on<br />
the ground between the laser origin and the point itself) is projected<br />
onto the camera image. Each virtual ray corresponds to a two-dimensional<br />
line on the image plane, though usually only a part (or none) of it will<br />
lie inside the actual border of the image (figure 30).<br />
For each ray, the corresponding segment on the image is located (if it<br />
exists) with the help of the gluProject function of the OpenGL Utility<br />
Figure 30: Projection of a virtual ray onto the camera image.<br />
library (GLU). The input arguments required by the gluProject function<br />
are a point in 3D space and the OpenGL camera transformation<br />
parameters. The function calculates the coordinate transformation<br />
of the point, and returns its pixel coordinates and depth value. It is<br />
subsequently possible to invert the projection and recover the original<br />
3D coordinates of the point, since the pixel depth value eliminates<br />
the ambiguity.<br />
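The transformation performed by gluProject can be reproduced with a few lines of matrix arithmetic:<br />
the point is transformed by the modelview and projection matrices (column-major, as OpenGL stores<br />
them), divided by its w component, and mapped to the viewport. The sketch below is a self-contained<br />
reimplementation of this math, not a call into GLU; gluUnProject applies the inverse of the same<br />
chain, which is why the returned depth value makes the inversion unambiguous.<br />

```cpp
#include <array>

using Mat4 = std::array<double, 16>;  // column-major, as in OpenGL
using Vec4 = std::array<double, 4>;

const Mat4 kIdentity{1, 0, 0, 0,  0, 1, 0, 0,  0, 0, 1, 0,  0, 0, 0, 1};

// Multiply a column-major 4x4 matrix by a 4-vector.
Vec4 mul(const Mat4& m, const Vec4& v) {
    Vec4 r{0, 0, 0, 0};
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            r[row] += m[col * 4 + row] * v[col];
    return r;
}

// What gluProject computes: object coordinates -> eye -> clip -> NDC ->
// window coordinates {winX, winY} plus a depth value winZ in [0, 1].
// viewport is {x, y, width, height}.
std::array<double, 3> project(const std::array<double, 3>& obj,
                              const Mat4& modelview, const Mat4& projection,
                              const std::array<int, 4>& viewport) {
    const Vec4 eye  = mul(modelview, {obj[0], obj[1], obj[2], 1.0});
    const Vec4 clip = mul(projection, eye);
    const double x = clip[0] / clip[3];  // perspective divide -> NDC
    const double y = clip[1] / clip[3];
    const double z = clip[2] / clip[3];
    return {viewport[0] + (x * 0.5 + 0.5) * viewport[2],
            viewport[1] + (y * 0.5 + 0.5) * viewport[3],
            z * 0.5 + 0.5};  // depth value, kept for later unprojection
}
```

With identity modelview and projection matrices and a 640×480 viewport, the origin projects to the<br />
viewport center (320, 240) with depth 0.5.<br />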
Each of the pixels composing the retrieved segment corresponds<br />
to a 3D point on the virtual ray between the robot and a<br />
given laser point. Points further than the laser point along the<br />
same direction are also included, by extending the segment in the<br />
image. Then, the segment is scanned along this direction until a pixel<br />
corresponding to an edge is found.<br />
Once the nearest edge pixel along the virtual ray projection is found,<br />
its image coordinates are used to retrieve the 3D coordinates of the<br />
corresponding workspace point, by means of the gluUnProject<br />
function (figure 31). The retrieved workspace point will be a point on the<br />
ground, usually corresponding to the base of an obstacle, at a<br />
certain distance from the robot.<br />
Figure 31: Unprojection of the edge pixel to the 3D space, through pixel coordinates<br />
and depth value.<br />
At the end of the NED algorithm, a set of nearest edge points<br />
(NEPs) will have been calculated, each of which has a corresponding<br />
laser point. The distance between each NEP and its corresponding<br />
laser point is calculated and compared to a parameterizable<br />
threshold edgeTh:<br />
• if the distance between the NEP and the laser point is less<br />
than edgeTh, it is assumed that the NEP and the laser point<br />
correspond to the same real object;<br />
• if the distance between the NEP and the laser point is greater<br />
than edgeTh, it is assumed that the NEP and the laser point<br />
correspond to different objects.<br />
NEPs which fall into the first category are used for overlay align-<br />
ment. NEPs which fall into the second category are further divided into<br />
those which are nearer to the robot than the corresponding laser point<br />
and those which are further from the robot, and are used to detect<br />
obstacles missed by the laser and possible laser outliers.<br />
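The comparison against edgeTh and the resulting classification can be condensed into a few lines.<br />
This is a sketch of the rule described above; the type and function names are assumptions, not the<br />
MOSTAR code.<br />

```cpp
#include <cmath>

// Outcome of comparing a NEP with its corresponding laser point.
enum class NepClass {
    Aligned,         // |nepDist - laserDist| < edgeTh: same object,
                     // the NEP is used for overlay alignment
    NearerObstacle,  // NEP well before the laser point: possible obstacle
                     // missed by the laser (or a false edge)
    FartherIgnored   // NEP well beyond the laser point: possible laser
                     // outlier or missed edge
};

// Distances are measured from the robot along the virtual ray.
NepClass classifyNep(double nepDist, double laserDist, double edgeTh) {
    if (std::fabs(nepDist - laserDist) < edgeTh) return NepClass::Aligned;
    return nepDist < laserDist ? NepClass::NearerObstacle
                               : NepClass::FartherIgnored;
}
```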
8.3 Improving alignment with edges<br />
In an ideal case, assuming perfect calibration and a completely<br />
distortion-free camera, virtual object borders would be perfectly aligned<br />
with the corresponding real object edges. However, a slightly imprecise<br />
calibration and/or small differences between the ideal camera model<br />
and the real camera often cause small misalignments.<br />
NEPs which are located near their corresponding laser points can<br />
be used to correct these misalignments. For each of these NEPs, the<br />
coordinates of the corresponding laser point are corrected so as to<br />
coincide with the NEP coordinates. This way, when the rendering of<br />
the overlay is performed, the virtual object based on that laser point<br />
will be precisely aligned with the edge of the real object underneath<br />
(figure 32).<br />
8.4 Improving reliability with edges<br />
If an edge is detected in a point much nearer to the robot than the laser<br />
point, this can mean that:<br />
• an object is present in the workspace which has not been detected<br />
by the laser, and its base contains the NEP, or<br />
• a false edge has been detected in a point between the robot and<br />
the laser point.<br />
There is no trivial way to determine whether the NEP is a true edge<br />
or not, and whether it indicates an object which could be an obsta-<br />
cle for the robot or not (it could be, for example, a drawing or some<br />
pattern on the floor). Although the safest decision for the teleopera-<br />
tion would be to assume that an obstacle is present on the NEP, it is<br />
not actually convenient to consider each NEP an obstacle, since false<br />
edges are rather common even in structured environments, even<br />
Figure 32: (a) Alignment using feedback-based calibration only: the margin of the<br />
overlay is slightly detached from the base of the box. (b) Alignment using feedback-<br />
based calibration and NEPs: position of the laser points is corrected with NEPs in<br />
order to coincide with the base of the box in the image.<br />
after a careful tuning of edge detection parameters. Indeed, informa-<br />
tion retrieved from edge detection is far less reliable than information<br />
obtained by the laser sensor. Therefore, NEPs interpreted as potential<br />
obstacles are still indicated by an overlay, but in a different way from<br />
laser data. Specifically, a single colored point is rendered above each<br />
NEP which could indicate an obstacle (figure 33).<br />
Figure 33: NEPs nearer to the robot than the corresponding laser points are highlighted<br />
with colored dots.<br />
If an edge is detected in a point much further from the robot than<br />
the laser point, this can mean that:<br />
• an object is present in the workspace which has been detected by<br />
the laser but it is high above the ground (therefore, since its base<br />
edge is higher than it is expected to be, it is interpreted as further<br />
than it actually is), or<br />
• the edge detection algorithm has missed the real edge of the ob-<br />
ject, or<br />
• the laser measure is wrong and lower than it should be.<br />
Since the last case is rather unlikely, the best (and safest)<br />
decision is to trust the laser measure; therefore the NEP is simply<br />
ignored.<br />
In both cases, the presence of a NEP which disagrees with the<br />
corresponding laser point casts doubt on the validity of the corresponding<br />
laser measurement. Therefore, if the laser point had previously been<br />
marked as a potential outlier (see section 6.2), it will be considered<br />
a probable outlier and excluded from the rendering (figure 34).<br />
8.5 Testing<br />
The integration of image processing in the AR system yielded good results<br />
in terms of alignment improvement and laser data correction.<br />
The edge detection algorithm described in section 8.1 correctly<br />
identifies object bases in cases where the floor has a plain texture.<br />
Parameter tuning is necessary in cases where the floor presents a faint<br />
pattern, in order to suppress the range of gray values of the floor and<br />
highlight object borders. The algorithm is not expected to perform well<br />
in cases where the floor presents strongly-contrasted patterns<br />
(e.g. black-and-white tiles). In fact, in such cases contrast stretching<br />
would not be able to suppress floor edges, which would interfere with<br />
object base detection. For those cases, a more sophisticated contrast<br />
stretching and intensity suppression function should be used.<br />
Edge data integration for overlay alignment performed well overall.<br />
However, it proved to be fairly sensitive to the quality of<br />
the edge detection. In cases where false edges were detected near the<br />
border with which the 3D overlay should have been aligned, the<br />
overlay tended to follow the false edges, producing unpleasant artifacts.<br />
On the other hand, NEPs not corresponding to laser data proved<br />
to be a very effective visual aid. In fact, objects invisible to the laser<br />
tend to generate many highlighted NEPs of the same color along a line<br />
Figure 34: (a) Outliers are detected as discontinuities, but they are still rendered.<br />
(b) Outliers are confirmed and excluded from the rendering.<br />
(see the box on the left of figure 33). On the contrary, false edges<br />
tend to generate isolated NEPs (see the NEPs in front of the box on the<br />
right of figure 33). The attention of the user is usually drawn by clus-<br />
ters of similar dots rather than by isolated dots; therefore, while real<br />
objects are strongly enhanced by NEPs, false edges remain relatively<br />
inconspicuous. This helps operators to focus their attention on real<br />
obstacles without distracting them with striking false edges.<br />
Several values have been tested for the edgeTh parameter. The<br />
higher the value of edgeTh, the more NEPs are considered consistent<br />
with the corresponding laser measures. Therefore, when a high value is<br />
used, more NEPs will be used for overlay alignment, and the 3D overlay<br />
will closely follow image edges. Instead, when a low value is used, more<br />
NEPs will be used for laser correction, so the overlay will follow laser<br />
values and more NEPs will be highlighted as inconsistent with the laser<br />
data. Values around 10 cm for edgeTh generally perform well.<br />
Timing performance was good. Contrast stretching and edge detec-<br />
tion on a single frame had an average duration of 45 ms. Since camera<br />
image and virtual overlay are displayed together at the same time, this<br />
does not cause dynamic registration errors, but it limits the maximum<br />
framerate to 25 fps. This value is sufficient for the teleguide of the<br />
MORDUC, which does not require very quick manoeuvres. Besides,<br />
during the tests the framerate was already limited to about 2 fps by<br />
network delay, therefore the delay introduced by image processing was<br />
negligible.<br />
9 Stereoscopic augmented reality<br />
The techniques presented in the previous sections perform well<br />
when used on a single video image, but their efficacy can<br />
be increased by combining them with stereoscopic capturing<br />
and visualization. However, using a stereo pair of images raises<br />
several issues, deriving from potential inconsistencies between the left<br />
and right images (see, for example, [13]). This section describes these<br />
issues and the methods used in the MOSTAR interface to solve them.<br />
Furthermore, it shows how stereo information can be used to improve<br />
the results achieved by the algorithms described above.<br />
9.1 Stereo AR alignment<br />
The AR alignment problem in the case of a stereo pair of cameras<br />
is substantially analogous to the single-camera case. The difference is<br />
that in the stereo case the aims of the alignment procedure are three<br />
instead of one:<br />
• align the left camera image with the right camera image, so that<br />
the disparity of corresponding pixels is correct and comfortable;<br />
• align the AR overlay on the left image with the one on the right<br />
image, so that they are visualized with the correct disparity and<br />
they are seen by the human operator as a single 3D overlay;<br />
• align the stereoscopic pair of overlays with the stereoscopic pair<br />
of images, so that virtual objects are correctly positioned over<br />
real objects (which is the same aim as in monoscopic AR).<br />
In ideal conditions (left and right cameras are perfectly identical,<br />
parallel and at the same height) the first aim would be automatically<br />
satisfied. Unfortunately, things are different in the real case: cameras<br />
belonging to the same model may have slightly different intrinsic<br />
parameters, and they may not be perfectly positioned. Therefore, the<br />
MOSTAR interface makes it possible to introduce a certain horizontal/<br />
vertical offset between the images, in order to correct inaccuracies<br />
in the camera positions or differences in the camera principal points.<br />
Depending on the values of the offset, the images are shifted in opposite<br />
directions with respect to each other, and the parts which lack a<br />
counterpart in the other image are cropped (figure 35).<br />
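The vertical component of this correction can be sketched as a pair of complementary row crops:<br />
both images lose the same number of rows, taken from opposite sides. This is an illustration under<br />
the assumption that a positive offset means the left image content sits higher than the right; the<br />
names are not taken from the MOSTAR code.<br />

```cpp
#include <cstdlib>

// Half-open row ranges [first, last) to keep in each image so that a
// vertical offset of dy pixels between the two images is removed.
// Row 0 is the top row; both cropped images end up height - |dy| tall.
struct CropRows { int leftFirst, leftLast, rightFirst, rightLast; };

CropRows verticalCrop(int height, int dy) {
    const int d = std::abs(dy);
    if (dy >= 0)  // left image higher: drop its top rows, right's bottom rows
        return {d, height, 0, height - d};
    return {0, height - d, d, height};  // symmetric case
}
```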
The second aim is automatically achieved by exploiting the features<br />
of the OpenGL library. In fact, it is sufficient to render the left and right<br />
overlays as two identical viewpoints on the same virtual scene, parallel<br />
to each other and one horizontally shifted with respect to the other,<br />
to obtain a stereoscopic pair of virtual viewpoints.<br />
The third aim is achieved through the same feedback-based<br />
calibration procedure used in the monoscopic case. The user can adjust<br />
the parameters of the virtual cameras while the stereoscopic overlay is<br />
rendered on the images, and correct their values depending on what he<br />
sees. Although each virtual camera should have a separate set of intrin-<br />
sic and extrinsic parameters, most of them (specifically, focal length, y<br />
and z coordinates and rotation around the x axis) are kept equal for<br />
both, in order to preserve the correctness of the stereoscopic virtual<br />
couple. Only the x coordinates of the cameras are independent. As a<br />
general rule, the offset between the two final x values should be equal<br />
to the baseline between the real stereo cameras.<br />
9.2 NEP correspondence and suppression<br />
The NED algorithm described in section 8 is designed to be used on<br />
a single image. Executing it on both images independently is likely to<br />
give results conflicting with each other: in fact, weak edges can easily<br />
be recognized in one image and missed in the other, while one image<br />
Figure 35: (a) Original images; the left image is a few pixels higher with respect to<br />
the right one. (b) The left image is shifted downward (and its upper part is cropped),<br />
while the right image is shifted upward (and its lower part is cropped), so that they<br />
are vertically aligned.<br />
could contain artifacts which the other does not (especially if the<br />
quality of the captured images is low). Therefore, it has been necessary<br />
to implement a method to reconcile the NEPs retrieved from the left and<br />
the right images.<br />
A NESC (Nearest Edges Stereo Correspondence) algorithm is run<br />
by the MOSTAR interface after the NED algorithm has been performed<br />
on both images, and before the NEPs are rendered. The NESC algo-<br />
rithm is based on a simple assumption: the real edges of the objects<br />
within the robot workspace are likely to be rather strong features and<br />
appear in both images, while false edges, like the ones produced by<br />
image artifacts, are likely to appear in only one image. Therefore, the<br />
algorithm searches for NEPs which appear in both images, and are<br />
likely to represent the same point in the space.<br />
The algorithm iterates over the laser points which have a<br />
corresponding NEP in one or both images. For each laser point, there<br />
are three possible alternatives:<br />
1. if the laser point has a corresponding NEP in only one of the<br />
images, that NEP is counted as an unreliable NEP;<br />
2. if the laser point has a corresponding NEP in both images, and<br />
the distance between the left and right NEPs is less than a<br />
parameterizable threshold (stereoTh), the NEPs are considered as<br />
corresponding to the same point in space: therefore, a reliable<br />
NEP is counted, and its coordinates are set to the middle point<br />
between the left and the right NEP;<br />
3. if the laser point has a corresponding NEP in both images, but<br />
the distance between the left and right NEPs is greater than stereoTh,<br />
the NEPs are considered as corresponding to two different points<br />
in space, and are both counted as unreliable NEPs.<br />
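The per-laser-point decision among the three cases above can be sketched as follows. Cases 1 and 3<br />
both produce unreliable NEPs, which are kept individually; only case 2 merges the two NEPs into one<br />
reliable point. The names and structures here are assumptions, not the MOSTAR code.<br />

```cpp
#include <cmath>
#include <optional>

struct Point3 { double x, y, z; };

enum class NepStatus { Reliable, Unreliable };
struct NescOut { NepStatus status; Point3 nep; };  // nep valid when Reliable

double nepDistance(const Point3& a, const Point3& b) {
    return std::sqrt((a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y) +
                     (a.z - b.z) * (a.z - b.z));
}

// NESC decision for one laser point, given its NEP in each image (if any).
NescOut nescMerge(const std::optional<Point3>& left,
                  const std::optional<Point3>& right, double stereoTh) {
    if (left && right && nepDistance(*left, *right) < stereoTh) {
        // Case 2: same point in space -> reliable NEP at the midpoint.
        return {NepStatus::Reliable,
                {(left->x + right->x) / 2, (left->y + right->y) / 2,
                 (left->z + right->z) / 2}};
    }
    // Case 1 (seen in one image only) or case 3 (NEPs too far apart):
    // the NEP(s) are kept, but marked unreliable.
    return {NepStatus::Unreliable, {}};
}
```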
At the end of the iteration, laser points which fall into the first and<br />
second categories above will have one corresponding NEP<br />
(reliable or unreliable), while those belonging to the third will have<br />
two NEPs. In the latter case, only the NEP which corresponds to the laser<br />
measure (that is, whose distance from the laser point is less than edgeTh),<br />
if any, is considered. In the unlikely case that a laser point has two different<br />
unreliable NEPs, both corresponding to the laser measure (this can happen<br />
if the chosen threshold values are such that stereoTh < 2edgeTh),<br />
only the NEP closer to the robot is considered, as a safety measure.<br />
After the reliability of the NEPs has been assessed through the NESC<br />
algorithm, reliable NEPs are used to refine alignment and double-check<br />
laser measures as described in sections 8.3 and 8.4. Instead, NEPs<br />
which have proved to be unreliable are used only if they agree with the<br />
corresponding laser measure, but they are disregarded if they contradict<br />
the laser sensor. Therefore, while reliable NEPs are used for both<br />
overlay alignment and laser correction, unreliable NEPs are used only<br />
for alignment. The idea is that if a NEP is unreliable (i.e. it appears<br />
in only one of the images), but it is corroborated by some other factor<br />
(in our case, the laser measure), it is likely to correspond to a real<br />
edge, so it can be used for alignment; instead, if it disagrees with other<br />
measures, it is likely to be a false edge, so it should be neglected.<br />
9.3 Testing<br />
Evaluation of stereoscopic visualization went as expected. Participants<br />
found the stereoscopic modality of MOSTAR interface more realistic<br />
than the monoscopic modality, and felt an increased sense of awareness<br />
of the remote environment. No quantitative results were collected,<br />
though a systematic evaluation has been planned for the future.<br />
The stereo alignment technique has proven helpful for correcting<br />
small misalignments between camera images. It has been especially<br />
useful for eliminating vertical disparity between images (for the<br />
STH-MDCS2-VAR-C cameras, vertical disparity was due to a slight<br />
difference in the position of the principal point; for the Microsoft<br />
Lifecams, it was due to inaccurate positioning).<br />
The rendering of virtual objects also benefited from stereo.<br />
Participants observed that virtual walls appeared much more realistic when<br />
observed with stereoscopic visualization. On the other hand, virtual<br />
rays and lines on walls were found confusing and tiresome to look at.<br />
This is probably due to the fact that rays and lines were not clearly<br />
visible. However, as stated in section 7, it has been observed that<br />
virtual rays and lines were useful during laser-camera calibration, so<br />
participants usually preferred to visualize them during calibration.<br />
The application of the NESC algorithm was successful. As can<br />
be seen in figure 36, after applying the NESC algorithm several<br />
stray NEPs are eliminated, while reliable NEPs (those coincident<br />
with the borders of real objects) are left relatively untouched.<br />
The NESC algorithm's results were influenced by the value of the<br />
stereoTh parameter. Given a laser point having a corresponding NEP in<br />
the left image and another in the right image, the value of stereoTh<br />
determines how close (in 3D space) the two NEPs have to be in order to be<br />
considered reliable (i.e. coincident). Higher values of stereoTh force<br />
the NESC algorithm to “trust” the edge detection and to output more<br />
reliable edges, which means that more NEPs will ultimately be used<br />
during the rendering of the overlay. Therefore, high values of stereoTh<br />
should be used when edge detection gives reliable results. During tests,<br />
a value of edgeTh/2 was used for stereoTh, with excellent results.<br />
Since in stereoscopic mode the image processing algorithms<br />
operated on two different images, the delay introduced was doubled (about<br />
90 ms). However, as stated in section 8, it was negligible with respect<br />
to the delay introduced by the communication over the network.<br />
Figure 36: (a) Highlighted NEPs in the left image. (b) Highlighted NEPs in the<br />
left image after applying the NESC algorithm.<br />
10 Conclusions<br />
This work has presented a new approach to visualization of video and<br />
sensor data in a teleguide interface. The approach is based on aug-<br />
mented reality and further enhanced by stereoscopic visualization. The<br />
approach has been implemented within the MOSTAR interface, and<br />
tested by teleoperating the MORDUC mobile robot from a distance of<br />
over 2500 km.<br />
The proposed approach displays visual and range data from a laser<br />
scanner in a unified, AR-based representation. The aim is to assist<br />
mobile robot navigation by providing depth information to the<br />
operator, in an intuitive and effective way. Colored three-dimensional<br />
virtual objects, built using laser data, are overlaid on the video image to<br />
highlight obstacles. Virtual objects are registered with real objects thanks<br />
to a simple and effective semi-automatic calibration procedure. Edge<br />
detection is used to identify nearest edge points (NEPs), which in<br />
turn are used to refine the AR registration and to point out obstacles<br />
which the laser is not aware of. The approach can be used in both<br />
monoscopic and stereoscopic display solutions. If stereo cameras are<br />
available, stereo information is used to verify reliability of edge detec-<br />
tion.<br />
The proposed approach has been implemented and a pilot test has<br />
been performed to assess its validity. The test had excellent results.<br />
Virtual objects have proven to be a valuable aid for distance estimation<br />
and for acquiring awareness of the remote environment. Semi-automatic<br />
calibration was sufficient to obtain good alignment in the vast majority<br />
of cases. Edge detection highlighted obstacles invisible to the laser,<br />
generating only a few negligible false positives; however, the alignment<br />
correction feature proved to be too sensitive to noisy edges. The<br />
approach performed well in both monoscopic and stereoscopic modes.<br />
Tests showed that it is possible to significantly reduce the<br />
number of highlighted false edges by using stereo information.<br />
Planned further developments include the refinement of features<br />
which performed poorly. Specifically, we intend to investigate<br />
computer vision methods to reliably detect object bases even in the<br />
presence of strong patterns on the floor. Besides, a method is being<br />
designed to make feedback-based calibration more intuitive. Finally, a<br />
systematic evaluation of the approach as in [59, 60] has been planned,<br />
in order to quantify the performance increase introduced by the<br />
approach.<br />
References<br />
[1] B. Davies. A review of robotics in surgery. Proceedings of the In-<br />
stitution of Mechanical Engineers, Part H: Journal of Engineering<br />
in Medicine, 214(1):129–140, 2000.<br />
[2] A.R. Lanfranco, A.E. Castellanos, J.P. Desai, and W.C. Meyers.<br />
Robotic surgery: a current perspective. Annals of Surgery, 239(1):<br />
14, 2004.<br />
[3] P. Arena, P. Di Giamberardino, L. Fortuna, F. La Gala, S. Monaco,<br />
G. Muscato, A. Rizzo, and R. Ronchini. Toward a mobile au-<br />
tonomous robotic system for Mars exploration. Planetary and<br />
Space Science, 52(1-3):23–30, 2004.<br />
[4] G. Astuti, G. Giudice, D. Longo, C.D. Melita, G. Muscato, and<br />
A. Orlando. An Overview of the “Volcan Project”: An UAS for<br />
Exploration of Volcanic Environments. Journal of Intelligent and<br />
Robotic Systems, 54(1):471–494, 2009.<br />
[5] RR Murphy. Human-robot interaction in rescue robotics. IEEE<br />
Transactions on Systems, Man, and Cybernetics, Part C: Applica-<br />
tions and Reviews, 34(2):138–153, 2004.<br />
[6] G. Muscato, D. Caltabiano, S. Guccione, D. Longo, M. Coltelli,<br />
A. Cristaldi, E. Pecora, V. Sacco, P. Sim, GS Virk, et al. ROBO-<br />
VOLC: a robot for volcano exploration result of first test campaign.<br />
Industrial Robot: An International Journal, 30(3):231–242, 2003.<br />
[7] Z. Zhang, S. Ma, Z. Lu, and B. Cao. Communication Mechanism<br />
Study of a Multi-Robot Planetary Exploration System. In IEEE<br />
International Conference on Robotics and Biomimetics (ROBIO),<br />
pages 49–54, 2006.<br />
[8] P. Milgram, S. Yin, and J.J. Grodski. An augmented reality based<br />
teleoperation interface for unstructured environments. In Proc.<br />
American Nuclear Society 7th Topical Meeting on Robotics and<br />
Remote Systems, 1997.<br />
[9] M. Baker, R. Casey, B. Keyes, and H.A. Yanco. Improved inter-<br />
faces for human-robot interaction in urban search and rescue. In<br />
Proceedings of the IEEE Conference on Systems, Man and Cyber-<br />
netics, volume 3, pages 2960–2965, 2004.<br />
[10] J. Scholtz, J. Young, J. Drury, and H. Yanco. Evaluation of human-<br />
robot interaction awareness in search and rescue. In IEEE Inter-<br />
national Conference on Robotics and Automation, volume 3, pages<br />
2327–2332, 2004.<br />
[11] H.A. Yanco and J. Drury. ‘Where am I?’ Acquiring situation aware-<br />
ness using a remote robot platform. In IEEE Conference on Sys-<br />
tems, Man and Cybernetics, pages 2835–2840, 2004.<br />
[12] M.W. Kadous, R.K.M. Sheh, and C. Sammut. Effective user in-<br />
terface design for rescue robotics. In Proceedings of the 1st ACM<br />
SIGCHI/SIGART conference on Human-robot interaction, page<br />
257. ACM, 2006.<br />
[13] S. Livatino, G. Muscato, D. De Tommaso, and M. Macaluso. Aug-<br />
mented reality stereoscopic visualization for intuitive robot tele-<br />
guide. In IEEE International Symposium on Industrial Electronics<br />
(ISIE), 2010.<br />
[14] R.T. Azuma et al. A survey of augmented reality. Presence-<br />
Teleoperators and Virtual Environments, 6(4):355–385, 1997.<br />
[15] R. Azuma, Y. Baillot, R. Behringer, S. Feiner, S. Julier, and<br />
B. MacIntyre. Recent advances in augmented reality. IEEE Com-<br />
puter Graphics and Applications, pages 34–47, 2001.<br />
[16] WS Kim, PS Schenker, AK Bejczy, and S. Hayati. Advanced<br />
graphics interfaces for telerobotic servicing and inspection. In<br />
Proc. IEEE/RSJ Int’l Conf. on Intelligent Robots and Systems,<br />
Yokohama, pages 303–309, 1993.<br />
[17] P. Milgram, S. Zhai, D. Drascic, and J. Grodski. Applications of<br />
augmented reality for human-robot communication. In Proceedings<br />
of the IEEE/RSJ International Conference on Intelligent Robots<br />
and Systems (IROS)., volume 3, 1993.<br />
[18] S. Otmane, M. Mallem, A. Kheddar, and F. Chavand. Active vir-<br />
tual guides as an apparatus for augmented reality based telemanip-<br />
ulation system on the Internet. In Annual Simulation Symposium,<br />
volume 33, pages 185–191, 2000.<br />
[19] JWS Chong, SK Ong, AYC Nee, and K. Youcef-Youmi. Robot<br />
programming using augmented reality: An interactive method for<br />
planning collision-free paths. Robotics and Computer Integrated<br />
Manufacturing, 25(3):689–701, 2009.<br />
[20] T.H.J. Collett and B.A. MacDonald. Developer oriented vis-<br />
ualisation of a robot program. In Proceedings of the 1st<br />
ACM SIGCHI/SIGART conference on Human-robot interaction,<br />
page 56. ACM, 2006.<br />
[21] B. Giesler, T. Salb, P. Steinhaus, and R. Dillmann. Using aug-<br />
mented reality to interact with an autonomous mobile platform.<br />
In Proceedings of IEEE International Conference on Robotics and<br />
Automation (ICRA), volume 1, 2004.<br />
[22] D.J. Bruemmer, D.D. Dudenhoeffer, and J. Marble. Dynamic au-<br />
tonomy for urban search and rescue. In Proceedings of the AAAI<br />
Mobile Robot Workshop, 2002.<br />
[23] V. Brujic-Okretic, J.Y. Guillemaut, LJ Hitchin, M. Michielen, and<br />
GA Parker. Remote vehicle manoeuvring using augmented reality.<br />
In International Conference on Visual Information Engineering<br />
(VIE), pages 186–189, 2003.<br />
[24] R. Meier, T. Fong, C. Thorpe, and C. Baur. A sensor fusion<br />
based user interface for vehicle teleoperation. In Proceedings of<br />
the IEEE International Conference on Field and Service Robotics<br />
(FSR), 1999.<br />
[25] F. Ferland, F. Pomerleau, C.T. Le Dinh, and F. Michaud. Ego-<br />
centric and exocentric teleoperation interface using real-time, 3D<br />
video projection. In Proceedings of the 4th ACM/IEEE interna-<br />
tional conference on Human robot interaction, pages 37–44. ACM,<br />
2009.<br />
[26] C.W. Nielsen, M.A. Goodrich, and R.W. Ricks. Ecological inter-<br />
faces for improving mobile robot teleoperation. IEEE Transactions<br />
on Robotics, 23(5):927, 2007.<br />
[27] C. Demiralp, CD Jackson, DB Karelitz, S. Zhang, and DH Laid-<br />
law. Cave and fishtank virtual-reality displays: A qualitative and<br />
quantitative comparison. IEEE Transactions on Visualization and<br />
Computer Graphics, 12(3):323–330, 2006.<br />
[28] D. Drascic. Skill acquisition and task performance in teleoperation<br />
using monoscopic and stereoscopic video remote viewing. In Hu-<br />
man Factors and Ergonomics Society Annual Meeting Proceedings,<br />
volume 35, pages 1367<strong>–</strong>1371, 1991.<br />
[29] M. Ferre, R. Aracil, and M.A. Sanchez-Uran. Stereoscopic human interfaces. IEEE Robotics & Automation Magazine, 15(4):50–57, 2008.

[30] G.S. Hubona, G.W. Shirah, and D.G. Fout. The effects of motion and stereopsis on three-dimensional visualization. International Journal of Human-Computer Studies, 47(5):609–627, 1997.

[31] G. Jones, D. Lee, N. Holliman, and D. Ezra. Controlling perceived depth in stereoscopic images. In Stereoscopic Displays and Virtual Reality Systems VIII, Proceedings of SPIE, volume 4297, pages 42–53, 2001.

[32] I. Sexton and P. Surman. Stereoscopic and autostereoscopic display systems. IEEE Signal Processing Magazine, 16(3):85–99, 1999.

[33] Wikipedia. Augmented reality. http://en.wikipedia.org/wiki/Augmented_reality, 2010.

[34] M. Billinghurst, I. Poupyrev, H. Kato, and R. May. Mixing realities in shared space: An augmented reality interface for collaborative computing. In ICME 2000, pages 1641–1644, 2000.

[35] P. Milgram and F. Kishino. A taxonomy of mixed reality visual displays. IEICE Transactions on Information and Systems, E77-D(12):1321–1329, 1994.

[36] R. Azuma. Tracking requirements for augmented reality. Communications of the ACM, 36(7):51, 1993.

[37] A.J. Davison, I.D. Reid, N.D. Molton, and O. Stasse. MonoSLAM: Real-time single camera SLAM. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):1052–1067, 2007.
[38] W.A. Hoff, K. Nguyen, and T. Lyon. Computer vision-based registration techniques for augmented reality. Proceedings of Intelligent Robots and Computer Vision XV (SPIE), 2904:538–548, 1996.

[39] H. Kato and M. Billinghurst. Marker tracking and HMD calibration for a video-based augmented reality conferencing system. In Proceedings of the 2nd IEEE and ACM International Workshop on Augmented Reality, pages 85–94, 1999.

[40] Wikipedia. Pinhole camera. http://en.wikipedia.org/wiki/Pinhole_camera, 2010.

[41] J. Heikkila and O. Silven. A four-step camera calibration procedure with implicit image correction. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1106–1112, 1997.

[42] C.C. Slama, C. Theurer, and S.W. Henriksen. Manual of photogrammetry. American Society of Photogrammetry, Falls Church, Virginia, 1980.

[43] T. Melen. Geometrical modelling and calibration of video cameras for underwater navigation. PhD thesis, Institutt for teknisk kybernetikk, Universitetet i Trondheim, 1994.

[44] W. Faig. Calibration of close-range photogrammetric systems: Mathematical formulation. Photogrammetric Engineering and Remote Sensing, 41(12):1479–1486, 1975.

[45] J. Weng, P. Cohen, and M. Herniou. Camera calibration with distortion models and accuracy evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(10):965–980, 1992.
[46] L. Lipton. StereoGraphics Developers Handbook. StereoGraphics Corporation, 1991.

[47] L. Lipton. Foundations of the stereoscopic cinema: a study in depth. Van Nostrand Reinhold, 1982.

[48] Wikipedia. Anaglyph image. http://en.wikipedia.org/wiki/Anaglyph_image, 2010.

[49] S.E.B. Sorensen, P.S. Hansen, and N.L. Sorensen. Method for recording and viewing stereoscopic images in color using multichrome filters. US Patent 6,687,003, February 3, 2004.

[50] Wikipedia. Stereoscopy. http://en.wikipedia.org/wiki/Stereoscopy, 2010.

[51] M. Halle. Autostereoscopic displays and computer graphics. In ACM SIGGRAPH Courses, page 104. ACM, 2005.

[52] Wikipedia. HSL and HSV color spaces. http://en.wikipedia.org/wiki/HSL_and_HSV, 2010.

[53] J. Borenstein and Y. Koren. Histogramic in-motion mapping for mobile robot obstacle avoidance. IEEE Transactions on Robotics and Automation, 7(4):535–539, 1991.

[54] H. Baltzakis, A. Argyros, and P. Trahanias. Fusion of laser and visual data for robot motion planning and collision avoidance. Machine Vision and Applications, 15(2):92–100, 2003.

[55] D.J. Bruemmer, R.L. Boring, D.A. Few, J. Marble, and M.C. Walton. “I call shotgun!”: An evaluation of mixed-initiative control for novice users of a search and rescue robot. In Proceedings of the IEEE Conference on Systems, Man & Cybernetics, 2004.
[56] J.J. Gibson. The ecological approach to visual perception. Houghton Mifflin, Boston, 1979.

[57] R.J. Rost. OpenGL® Shading Language. Addison-Wesley Longman Publishing Co., Inc., Redwood City, CA, USA, 2004.

[58] DIEES, University of Catania. 3MORDUC. http://www.robotic.diees.unict.it/robots/morduc/morduc.htm, 2010.

[59] S. Livatino, G. Muscato, S. Sessa, C. Koffel, C. Arena, A. Pennisi, D. Di Mauro, and E. Malkondu. Mobile robotic teleguide based on video images. IEEE Robotics & Automation Magazine, 15(4):58–67, 2008.

[60] S. Livatino, G. Muscato, S. Sessa, and V. Neri. Depth-enhanced mobile robot teleguide based on laser images. Mechatronics, in press, 2010.

[61] J. Corde Lane, R. Carignan, B.R. Sullivan, D.L. Akin, T. Hunt, and R. Cohen. Effects of time delay on telerobotic control of neutral buoyancy vehicles. In Proceedings of the IEEE International Conference on Robotics and Automation, volume 3, pages 2874–2879, 2002.

[62] Y. Bok, Y. Hwang, and I.S. Kweon. Accurate motion estimation and high-precision 3D reconstruction by sensor fusion. In IEEE International Conference on Robotics and Automation, pages 4721–4726, 2007.

[63] Q. Zhang and R. Pless. Extrinsic calibration of a camera and laser range finder (improves camera calibration). In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), volume 3, 2004.
[64] OpenGL website. http://www.opengl.org, 2010.

[65] GLUT website. http://www.opengl.org/resources/libraries/glut/, 2010.

[66] OpenCV website. http://sourceforge.net/projects/opencvlibrary/, 2010.

[67] R. Williams and B. Andrews. The non-designer's design book. Peachpit Press, Berkeley, 1994.

[68] Paul Bourke. Nonlinear Lens Distortion. http://local.wasp.uwa.edu.au/~pbourke/miscellaneous/lenscorrection/#opengl, August 2000.

[69] Graphics Size Coding. Tiny distortion shader. http://sizecoding.blogspot.com/2007/10/tiny-distortion-shader.html, October 2007.

[70] J. Canny. A computational approach to edge detection. In Readings in Computer Vision: Issues, Problems, Principles, and Paradigms, page 184, 1987.

[71] I. Sobel and G. Feldman. A 3×3 isotropic gradient operator for image processing. Presentation at the Stanford Artificial Intelligence Project, 1968.