
UNIVERSITÀ DEGLI STUDI DI CATANIA
Faculty of Engineering
Master's degree course in Computer Engineering

Filippo Bannò

STEREOSCOPIC AUGMENTED REALITY
TO ASSIST ROBOT TELEOPERATION

Master's thesis
Academic year 2008/2009

Supervisor: Prof. Ing. G. Muscato
Co-supervisor: Dr. S. Livatino


Contents

1 Introduction
2 Background
  2.1 Augmented reality
  2.2 Pinhole camera model
  2.3 Stereoscopic visualization
3 Augmented reality visual interfaces in robot teleoperation
  3.1 A sensor fusion based user interface for vehicle teleoperation
  3.2 Fusion of laser and visual data for robot motion planning and collision avoidance
  3.3 Using augmented reality to interact with an autonomous mobile platform
  3.4 Improved interfaces for human-robot interaction in urban search and rescue
  3.5 Ecological interfaces for improving mobile robot teleoperation
  3.6 Egocentric and exocentric teleoperation interface using real-time, 3D video projection
  3.7 Summary and analysis
4 Previous work on 3MORDUC teleoperation
  4.1 The 3MORDUC platform
  4.2 Mobile robotic teleguide based on video images
  4.3 Depth-enhanced mobile robot teleguide based on laser images
  4.4 Augmented reality stereoscopic visualization for intuitive robot teleguide
  4.5 Summary and analysis
5 Proposed method: AR stereoscopic visualization
  5.1 Core idea and motivation
  5.2 Research development strategy
6 Effective multi-sensor visual representation
  6.1 Visualization of laser data through AR features
  6.2 Detection of discontinuities
  6.3 Testing
7 Laser-camera alignment and calibration
  7.1 Laser-camera model
  7.2 Feedback-based calibration procedure
  7.3 Comparison with automatic calibration
  7.4 Testing
8 Integrating 3D graphics with image processing
  8.1 Edge detection algorithm
  8.2 Nearest edges discovery
  8.3 Improving alignment with edges
  8.4 Improving reliability with edges
  8.5 Testing
9 Stereoscopic augmented reality
  9.1 Stereo AR alignment
  9.2 NEP correspondence and suppression
  9.3 Testing
10 Conclusions
References


1 Introduction

Robot teleoperation is a solution for many problems which can be solved neither by a robot alone nor by human intervention alone. Teleoperation of a robotic manipulator is widely used for tasks where high precision of movement is required, or when the scale of the task forbids direct human intervention, as in robotic surgery [1, 2]. Besides, robots can be teleoperated to execute exploration or manipulation tasks in unknown, inaccessible, or dangerous environments where human beings could not operate safely, e.g. in deep waters, in planetary or volcano exploration, in USAR (Urban Search And Rescue) applications, or for finding and defusing bombs [3–7].

Figure 1: Telerobotics applications: robotic surgery, exploration of volcanoes, deep waters, planets.

On the other hand, as the sophistication of techniques for managing telerobotic systems continues to grow, it is nevertheless clear to those familiar with control technologies that complex robotic tasks are unlikely to be achievable using fully autonomous robotic systems, especially in highly unstructured and dynamically varying environments. In these cases, human cognition is irreplaceable because of the high operational accuracy that is required, as well as the deep environment understanding and fast decision-making involved [8].

When piloting a mobile robot, accurate navigation is necessary. Errors and collisions must be minimized, since the robot could receive unpredictable damage, and in most cases repairs would be difficult if not impossible (a representative example is space/planetary exploration). The same holds for tasks where the robot has to physically interact with people, since careless teleoperation may cause them harm.

The accuracy and reactivity of a robot teleoperator can be improved by enhancing his sense of presence in the remote environment. Therefore, a relevant aspect of a telerobotic system is the user interface, which must be designed to be as immersive as possible.

Vision being the dominant human sensory modality, much attention has been paid in the literature to the visualization aspect. The video sensor is an essential part of most telerobotic systems, since it provides a considerable amount of highly contrasted information in a way which is easy for the user to assimilate. However, there are a number of other sensors which can well complement visual sensor output, e.g. range sensors (laser-based, sonar-based), odometric sensors, bumpers, etc. Numerous works (see for example [9–13]) study interface design and propose methods to effectively display visual and sensor data in a teleoperation interface.

This work proposes a novel approach to the visualization of video and sensor data in a teleguide interface. The proposed approach exploits augmented reality and stereoscopic visualization to assist the tele-navigation of a mobile robot.

Augmented reality consists in enhancing a real-world representation with virtual graphical additions. It makes it possible to display sensor data together with visual data in an intuitive and quickly comprehensible way. Up to now, AR has found application in several fields. It can be used in the medical and manufacturing fields for intuitive training and for assistance during precision tasks, or to display annotations over the real workspace in collaborative applications. It is frequently used in military (e.g. Head-Up Displays for aircraft and helicopter pilots) and commercial applications, e.g. to enhance sporting events on television [14, 15]. Numerous applications of augmented reality in robotics are found in the literature. It has frequently been used to introduce visual aids into telemanipulation tasks [16–18], to facilitate robot programming [19–21] or to assist mobile robot teleguide [9, 22–26].

Stereoscopic visualization is today well known thanks to the spread of “3D movies”. Stereoscopy is a group of technologies which make it possible to reproduce the three-dimensional depth effect given by binocular vision using a two-dimensional display. Several works demonstrate that stereoscopic visualization may provide a teleoperator with a higher sense of presence in remote environments because of higher depth perception [27–32]. This leads to a better comprehension of distances as well as of aspects related to them, e.g. ambient and obstacle layout.

The proposed visualization approach has been implemented at the 3D Visualization and Robotics Lab at the University of Hertfordshire, United Kingdom. It has been tested by teleoperating the 3MORDUC platform, a wheeled mobile robot located at DIEES (Dipartimento di Ingegneria Elettrica, Elettronica e dei Sistemi), University of Catania, Italy, more than 2500 km away from the operator location.

This thesis is structured as follows. Section 2 introduces some preliminary notions about augmented reality and stereoscopic visualization. Section 3 describes the state of the art in visualization interfaces for mobile robot teleguide and presents the strengths and weaknesses of the proposed approaches. Section 4 describes the past work performed on the teleoperation of the 3MORDUC robotic platform, exposing results and limitations. Section 5 presents the proposed stereoscopic augmented reality approach, outlining the adopted development strategy. Sections 6 to 9 describe the various steps of the implementation in detail and present test results. Section 10 draws the conclusions and introduces further developments.



2 Background

2.1 Augmented reality

Augmented reality (AR) is a term for a live, direct or indirect view of a physical real-world environment whose elements are augmented by virtual computer-generated imagery [33].

Figure 2: Example of AR: a graphical model is rendered on a real fiducial marker [34].

Many definitions of AR have been proposed in the literature. Azuma et al. [14] define AR as a variant of Virtual Reality (VR). VR technologies completely immerse a user inside a synthetic environment. In contrast, AR allows the user to see the real world, with virtual objects superimposed upon or composited with the real world. Therefore, AR supplements reality, rather than completely replacing it. Ideally, it would appear to the user that the virtual and real objects coexisted in the same space. Azuma et al. [14] state that the main requirements for a visualization interface to fall within the AR category are:

• to combine real and virtual;

• to be interactive in real time;

• to be registered in 3D (that is, virtual overlays are integrated in 3D with the real world).

Milgram and Kishino [35] devised the Reality-Virtuality (RV) continuum (figure 3) to draw a coherent definition of VR and AR environments. VR and real environments constitute the two ends of the continuum. The commonly held view of a VR environment is one in which the participant-observer is totally immersed in a completely synthetic world, which may or may not mimic the properties of a real-world environment, but which may also exceed the bounds of physical reality. In contrast, a strictly real-world environment clearly must be constrained by the laws of physics. All the environments between these two extremes are considered Mixed Reality (MR) forms. AR, which consists in the addition of virtual overlays to a real environment, is considered a form of MR near the “real” end. The reverse of AR is augmented virtuality (AV), which consists in the addition of (video or texture mapped) elements from a real environment to a virtual, totally synthetic environment.

Figure 3: Mixed reality display continuum [35].

2.1.1 Alignment and registration

Augmented reality does not simply mean the superimposition of a graphic object over a real-world scene. This is technically an easy task. One significant difficulty in augmenting reality is the need to maintain accurate registration of the virtual objects with the real-world image. This often requires detailed knowledge of the relationship between the frames of reference for the real world, the camera and the user. The correct registration must also be maintained while the user (or the user's viewpoint) moves within the real environment. Discrepancies or changes in the apparent registration will range from distracting to physically disturbing for the user, making the system unusable. AR demands much more accurate registration than VR, because humans are much more sensitive to visual differences between virtual and real objects than to inconsistencies between vision and other senses [36].

According to [14], sources of registration errors can be divided into two types: static and dynamic. Static sources are those that cause registration errors even when the user's viewpoint and the objects in the environment remain completely still. Dynamic sources are those that have no effect until either the viewpoint or the objects begin moving.

Static errors are usually caused by distortions in the optics, tracking errors, mechanical misalignments in the employed hardware and/or incorrect estimation of the viewing parameters. Distortions and viewing parameter inaccuracies usually cause systematic errors, which can be estimated and corrected. The other factors can cause errors which are difficult to predict and correct; it is therefore recommended to take precautions against them during the development phase (for example, by an accurate design of the tracking system and an accurate alignment of the hardware devices).

Dynamic errors occur essentially because of system delays in the rendering of the overlays. If the user's viewpoint is in motion and a significant delay is present between the moment when the viewpoint position/orientation is sampled and the moment when the virtual overlay is rendered, the virtual objects will not “move” in sync with the real objects, causing misalignments. Dynamic errors can be reduced by reducing the system delay, or by predicting the future position/orientation of the viewpoint and rendering the corresponding part of the virtual overlay in advance. In video-based AR systems (i.e. when the user does not see the real world directly, but through a camera) it is possible to eliminate dynamic errors by synchronizing the video stream with the rendering of the overlay. This is the case for teleoperation systems, where the real world is seen through a camera mounted on the robot.
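As a small illustration of this kind of synchronization, the sketch below (hypothetical names and data layout, not the implementation used in this work) pairs each incoming video frame with the sensor reading whose timestamp is closest, so that the overlay is always rendered from data captured at approximately the same instant as the frame.

```python
from bisect import bisect_left

def nearest_reading(readings, frame_ts):
    """Return the (timestamp, data) pair whose timestamp is closest to frame_ts.

    `readings` is a list of (timestamp, data) tuples sorted by timestamp.
    """
    timestamps = [t for t, _ in readings]
    i = bisect_left(timestamps, frame_ts)
    if i == 0:
        return readings[0]
    if i == len(readings):
        return readings[-1]
    before, after = readings[i - 1], readings[i]
    return after if after[0] - frame_ts < frame_ts - before[0] else before

# Example: render the overlay for a frame using the sensor snapshot closest in time.
scans = [(0.00, "scan_a"), (0.12, "scan_b"), (0.25, "scan_c")]
print(nearest_reading(scans, frame_ts=0.20))  # -> (0.25, 'scan_c')
```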

Vision-based techniques are often used to detect the viewpoint position from the real view, and then correctly register the overlay with the image. Usually, these approaches use fiducials, well-known objects whose position and orientation can be easily recognized within an image [37–39].

2.2 Pinhole camera model

A camera model determines a projection function from scene points (points of the 3D real world viewed by the camera) to image points (points within the 2D camera image). The correspondence between scene points and image points is needed in computer graphics to know where on the screen virtual 3D objects have to be rendered.

The most popular and simplest camera model is the pinhole model. A pinhole camera is a camera with no lens and a single very small aperture. Simply explained, it is a light-proof box with a small hole in one side (figure 4). Light from a scene passes through this single point and projects an inverted image on the opposite side of the box. Cameras using small apertures and the human eye in bright light both act like a pinhole camera [40].

Figure 4: Simple representation of a pinhole camera [40].

The pinhole camera model is based on the principle of collinearity, where each point in the object space is projected by a straight line through the projection center onto the image plane. Figure 5 shows the geometric model of a pinhole camera.

Figure 5: Geometric model of the pinhole camera. [40]

The camera coordinate system (O, X1, X2, X3) has its origin at the camera aperture (which is considered infinitely small, coincident with a point). Axis X3 points in the viewing direction of the camera and is referred to as the optical axis. The plane which intersects axes X1 and X2 is the front side of the camera, or principal plane.


The image plane is where the 3D world is projected through the aperture of the camera. It is parallel to axes X1 and X2 and is located at distance f from the origin O in the negative direction of the optical axis. f is also referred to as the focal length of the pinhole camera.



The point R at the intersection of the optical axis and the image plane is referred to as the principal point of the camera, or center of the image. The 2D image coordinate system (R, Y1, Y2) has its origin at the principal point and its axes parallel to X1 and X2.

For each point P = (x1, x2, x3) such that x3 > 0, a projection Q = (y1, y2) is defined on the image plane.

Figure 6: Geometric model of a pinhole camera as seen from the X2 axis. [40]

It is easy to calculate the coordinates of the projection from those of the original point using similar triangles (see figure 6 for clarity):

−y1 : f = x1 : x3  →  y1 = −f x1/x3
−y2 : f = x2 : x3  →  y2 = −f x2/x3

Since a coordinate is lost during the projection, it is not possible to retrieve the original 3D coordinates of P from the image coordinates of its projection. In fact, a point in the image corresponds to a line in space (see the green line in figures 5 and 6).

When rendering on the screen, the image coordinates of the projection are converted to pixel coordinates by discretizing them and adding an offset (since the screen coordinate origin is usually in the upper left corner of the screen, rather than at the center of the image).



The focal length and the size of the image are referred to as intrinsic parameters of the camera, since they depend only on the specific camera and on nothing else. In the general case, the coordinates of scene points are defined with respect to a world coordinate system, which is different from the camera system. In this case, it is necessary to convert the coordinates to their expression in the camera system before calculating the projection on the image plane. Therefore, a set of extrinsic parameters (position and orientation of the camera with respect to the world system) is needed to obtain image coordinates. Given the extrinsic parameters, it is possible to determine the transformation matrix between the world and camera systems, and thus the function mapping points from the world system to the camera system.
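To make this chain of operations concrete, the following sketch projects a world point onto pixel coordinates under the pinhole model described above. The focal length (expressed in pixels), the principal point and the rotation/translation values are illustrative placeholders, not the calibrated parameters used later in this work; the sketch also uses the non-inverted image plane convention common in rendering, which absorbs the minus signs of the formulas above.

```python
import numpy as np

def project_point(X_world, R, t, f, cx, cy):
    """Project a 3D world point to pixel coordinates with a pinhole camera.

    R (3x3) and t (3,) are the extrinsic parameters mapping world coordinates
    to camera coordinates; f is the focal length in pixels; (cx, cy) is the
    principal point, i.e. the pixel offset of the image center.
    """
    x1, x2, x3 = R @ X_world + t     # world frame -> camera frame
    if x3 <= 0:
        return None                  # point behind the camera: no projection
    y1 = f * x1 / x3                 # perspective division
    y2 = f * x2 / x3
    return (y1 + cx, y2 + cy)        # shift the origin to the image corner

# Example with identity extrinsics: a point 2 m in front of the camera.
print(project_point(np.array([0.1, 0.0, 2.0]), np.eye(3), np.zeros(3), 500.0, 320.0, 240.0))
```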

The pinhole model does not consider camera distortions, thus it accurately corresponds to reality only in those cases where distortions are negligible (typically, in good quality cameras or in the central zone of the image). Several camera models exist which include distortion factors in the projection function. For example, the model used in Heikkila and Silven [41] differs from the pinhole model in two aspects:

• the position of the principal point can be different from the center of the image;

• radial and tangential distortion (as defined in [42]) are present, and apply a non-linear transformation to the final projection coordinates.

Other models [43] consider the possibility for the camera axes to be non-orthogonal; others [44, 45] introduce a prism distortion due to imperfect camera manufacturing.


2.3 Stereoscopic visualization

A stereoscopic image presents the left and right eyes of the viewer with different perspective viewpoints, just as the viewer sees the real world. From these two slightly different views, the eye-brain synthesizes an image of the world with stereoscopic depth [46].

When a human looks at an object in space, his eyes converge on that object, i.e. they rotate until both their optical axes cross the object. It is easy to notice that, when the eyes converge on something, objects much nearer or further than the convergence point appear double (figure 7).

Figure 7: (a) Eyes converge on the thumb; the flag, which is further, appears double. (b) Eyes converge on the flag; the thumb, which is nearer, appears double. [46]

This is because the images projected on the left and right retinae are slightly different, since they correspond to two slightly different viewpoints. If the retinal images are overlaid, corresponding points will be separated by a horizontal offset. This offset is referred to as retinal disparity. Points of the retinal images corresponding to an object on which the eyes converge will have zero disparity. Nearer objects will have negative disparity, while further objects will have positive disparity. Retinal disparity is interpreted by the brain to produce a sense of depth, through a process called stereopsis.


Stereopsis works together with monocular depth cues to produce depth perception. Monocular cues are elements of a 2D image which can provide depth information. Some monocular cues are motion parallax, perspective, occlusion, and the relative size of objects [47].

Stereoscopic displays obtain a depth effect by displaying a parallax value for each image pixel. Given two views of the same scene from slightly different side-by-side viewpoints, parallax is the horizontal offset, measured on the display, between corresponding pixels in the left and right images. It produces a directly proportional disparity on the retinae.

Figure 8: (a) Zero parallax. (b) Positive parallax. (c) Negative parallax. (d) Divergent parallax. [46]

Pixels having zero parallax (figure 8a) will produce zero disparity on the retinae, and will be seen as lying on the plane of the display. Pixels having positive parallax (figure 8b) will produce positive disparity, and will be seen as if they were behind the display. Vice versa, pixels having negative parallax (figure 8c) will produce negative disparity and will be seen as if they were in front of the screen. Finally, pixels having divergent parallax, i.e. parallax greater than the distance between the viewer's eyes (figure 8d), do not have a valid corresponding disparity value. Trying to fuse objects having divergent parallax requires an unusual muscular effort, and often results in discomfort.

Only horizontal parallax/disparity produces a sense of depth. Vertical disparity between the left and right images is not natural, and has effects analogous to divergent disparity (eye strain, discomfort). Therefore, it should be avoided in the generation of stereoscopic images.
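A minimal numerical illustration of these parallax categories is sketched below. It uses the textbook similar-triangles relation p = e(z − d)/z between screen parallax p, eye separation e, viewing distance d and perceived object distance z; both the relation and the numeric values are illustrative assumptions, not taken from this thesis.

```python
def parallax(eye_sep, screen_dist, obj_dist):
    """Screen parallax (same units as eye_sep) of a point perceived at
    obj_dist, for eyes eye_sep apart at screen_dist from the display."""
    return eye_sep * (obj_dist - screen_dist) / obj_dist

def classify(p, eye_sep):
    if p > eye_sep:
        return "divergent parallax: uncomfortable, should be avoided"
    if p > 0:
        return "positive parallax: appears behind the display"
    if p == 0:
        return "zero parallax: appears on the display plane"
    return "negative parallax: appears in front of the display"

e, d = 0.065, 0.8  # ~6.5 cm eye separation, 80 cm viewing distance
for z in (0.4, 0.8, 2.0, 100.0):
    p = parallax(e, d, z)
    print(f"point at {z:5.1f} m -> parallax {p * 1000:6.1f} mm, {classify(p, e)}")
```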

2.3.1 Visualization devices

Numerous technologies have been developed for the visualization of parallax on planar displays. Stereo visualization devices are mainly divided into:

• passive glasses;

• active glasses;

• autostereoscopic displays.

Passive stereo technologies are based on the use of very simple glasses without any electronics. The cheapest kind of passive stereo is anaglyph stereo. This consists in filtering the two images with opposite colors, and viewing them through special glasses with oppositely colored lenses, so that each eye sees only the corresponding image.

The most common pair of colors is red-cyan. Anaglyph stereo does not require a special display, and anaglyph glasses are very cheap, but the resulting image quality is rather low. Moreover, traditional anaglyph cannot display the full visible color range (although a patented technique has been developed to provide perceived full-color viewing with simple colored glasses [49]).

Figure 9: Paper anaglyph glasses [48].

A more complex and better-performing passive stereo technology is based on differently polarized light. Two projectors are used to display the two images using orthogonally polarized light, and the images are viewed through glasses with orthogonally polarized lenses. Each lens lets through light having the same direction of polarization, while filtering out all light whose polarization is orthogonal. Thus, each eye sees the corresponding image, in full color but with half its brightness. Polarized glasses are relatively cheap and the resulting image has good quality. For these reasons, polarized stereo is commonly used in cinemas for 3D movies.

Active stereo is based on the use of more complex visualization devices. The two most notable examples are shutter glasses and Head-Mounted Displays (HMD). Shutter glasses work by alternately displaying the left and right images on the same display at a very high frequency, while alternately occluding the left and right eyes in sync with the display. This way, each eye sees only the corresponding image. If the alternating frequency is sufficiently high, the brain fuses the images into two continuous streams.

Stereo-enabled HMDs use a separate display for each of the two eyes, so that the eyes actually see two different video streams. Active stereo devices usually provide an image quality superior to passive stereo, although they are much more expensive.

Figure 10: (a) CristalEyes shutter glasses. (b) Emagin Z800 HMD. [50]

Some technologies have been developed to build autostereoscopic displays, which do not require the user to wear glasses in order to view stereo images [51]. However, some of these technologies are still very expensive, while the others provide very low image quality.


3 Augmented reality visual interfaces in robot teleoperation

Examples of the use of augmented reality in telerobotics are numerous in the literature, as regards both telemanipulation and mobile robot teleguide.

This section contains a review of the current state of the art in visualization techniques for teleguide interfaces based on augmented reality and sensor fusion. Major contributions are summarized and their main points are highlighted and discussed.


3.1 A sensor fusion based user interface for vehicle teleoperation

The work of Meier et al. [24] describes a sensor fusion technique for mobile robot teleoperation which uses different sensors in a complementary manner, balancing their respective strengths and weaknesses.

A brief analysis of sensor fusion for teleoperation is carried out. Sensor fusion needs to be human-oriented, and the representation of the data has to be accessible and understandable. Fusing data in a single display, rather than representing each sensor in a separate display, makes perception quicker and reduces cognitive workload. The most important kind of information which can be conveyed through sensor fusion to an operator who is driving a mobile robot is depth information. This work considers color intensity the most efficient way in which this kind of information can be delivered to a human; in particular, the HSV color model [52] is considered to be the one which best mimics human color perception.
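As a toy illustration of this idea (not the actual scheme used by Meier et al.), the snippet below maps a distance reading to a hue in the HSV color space, running from red for very near obstacles towards green for distant ones, and converts it to RGB for display; the distance range is an arbitrary example.

```python
import colorsys

def depth_to_rgb(distance_m, d_min=0.3, d_max=5.0):
    """Map a distance to an RGB triple: red (near) -> green (far) via the HSV hue."""
    t = min(max((distance_m - d_min) / (d_max - d_min), 0.0), 1.0)
    hue = t * (120.0 / 360.0)        # hue 0 = red, hue 1/3 = green
    return colorsys.hsv_to_rgb(hue, 1.0, 1.0)

print(depth_to_rgb(0.5))   # reddish: obstacle close to the robot
print(depth_to_rgb(4.0))   # towards green: obstacle far away
```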

The teleoperation system described in this paper uses a stereo vision system, a ring of ultrasonic sonars and odometric sensors; sensor data are processed by a Kalman filter. The teleoperation interface contains a display showing the video stream from the robot cameras and a two-dimensional map of the environment, gradually created as an occupancy grid using Histogramic In-Motion Mapping [53].

Depth information is overlaid on the video display as a layer composed of differently colored pixels. Since stereo has a higher angular resolution than the sonar, it is used by default to create the overlay. Sonar is instead used in regions of the image where stereo disparity is not reliable, i.e. regions with scarce texturing or where the sonar detects very close objects. A grid is projected onto the region of the image identified as ground, in order to improve distance estimation (figure 11a). The two-dimensional map (figure 11b) is created by combining sonar distance data with stereo disparity; disparity is calculated along a horizontal line taken at a chosen height in the stereo images.

Figure 11: (a) Image display processing. (b) Two-dimensional map of the environment. [24]

Main points:

• Using color as an immediate and efficient means to convey information

• Using geometric overlays to enhance distance estimation

• Sensor fusion balances the weaknesses of individual sensors


3.2 Fusion of laser and visual data for robot motion planning and collision avoidance

The paper by Baltzakis et al. [54] proposes a SLAM (Simultaneous Localization and Mapping) algorithm based on the fusion of 2D laser range data and stereo visual data. The proposed method uses stereo disparity to correct laser measurements where they are evidently wrong.

The algorithm initially creates a 3D model of the environment as a series of vertical walls based on the 2D laser scan. This model necessarily omits all the objects which do not intersect the plane of the laser scan, since they cannot be detected by the laser sensor.

Then, the pixels of one of the stereo images are ray-traced to the 3D model, and the 3D coordinates corresponding to each pixel are obtained. Finally, the algorithm re-projects each pixel onto the second image. If the attributes (color, intensity, etc.) of the pixel in the second image are similar to those of the corresponding pixel in the first image, then the value measured by the laser for that pixel is assumed to be correct. If, instead, the attributes of the pixels are different, the pixels are assumed to belong to an object which is nearer or further than the distance measured by the laser. In this case, a distance estimation based on the disparity between the images is performed. Range estimates are accumulated on a 2D occupancy grid (in order to decrease the inaccuracy deriving from image noise or lack of texture).
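The core of this check can be sketched as follows (a simplified, hypothetical rendition for rectified stereo images, not the authors' code): each pixel of the left image receives a depth from the laser-derived wall model, is re-projected into the right image using the stereo geometry, and is kept as "laser-consistent" only if the two pixels look alike.

```python
import numpy as np

def laser_consistency_mask(left, right, depth_from_laser, f_px, baseline_m, thresh=20):
    """Flag pixels of the rectified left image whose laser-predicted depth is
    photometrically consistent with the right image.

    left, right: HxW grayscale images (uint8); depth_from_laser: HxW depths in
    metres obtained by ray-tracing the laser wall model; f_px: focal length in
    pixels; baseline_m: stereo baseline in metres.
    """
    h, w = left.shape
    disparity = f_px * baseline_m / np.maximum(depth_from_laser, 1e-6)
    xs = np.tile(np.arange(w), (h, 1))
    xr = np.round(xs - disparity).astype(int)      # matching column in the right image
    valid = (xr >= 0) & (xr < w)
    ys = np.tile(np.arange(h)[:, None], (1, w))
    diff = np.abs(left.astype(int) - right[ys, np.clip(xr, 0, w - 1)].astype(int))
    return valid & (diff < thresh)                 # True where the laser value is trusted
```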

A simple collision avoidance algorithm, using the mapping method just exposed, is presented. The algorithm is tested both in artificial and in real environments, showing good results in aiding navigation.

Main points:

• 3D map of the environment based on range sensors and video

• Using stereo to correct and integrate data from range sensors


3.3 Using augmented reality to interact with an autonomous mobile platform

The work of Giesler et al. [21] presents an AR-based, speech-based technique to quickly and intuitively program paths for a mobile robot in a wide environment.

The operator who programs the robot needs an HMD and a tool (“magic wand”), both of which have to be tracked around the environment where the paths will be set up. The operator may define and view paths in the form of nodes, which correspond to points on the ground, and edges, straight lines which connect pairs of nodes (figure 12). Nodes and edges are created by pointing to the ground with the wand and issuing verbal commands (e.g. “Connect this node...”, “...with this node”) and are visualized by the HMD worn by the operator.

Figure 12: Robot follows AR path nodes, redirects when obstacle in way. [21]

The operator may issue commands to the mobile robot in the same manner. It is possible to command the robot to move from one node of the graph to another, or to move autonomously between two given points on the ground. When the robot has to navigate within the graph, it automatically calculates the shortest sequence of edges between the start node and the end node. If it detects an obstacle on the path it has just chosen, it calculates an alternative path through the edges of the graph. Once the robot has chosen a path, the nodes and edges belonging to it are depicted with a different color, so that the operator is able to see which path the robot is going to take.

Main points:

• AR used as an efficient method to convey data about the environment (e.g. node positions)

• AR used as an efficient method to exchange information with the robot


3.4 Improved interfaces for human-robot interaction in urban search and rescue

The work of Baker et al. [9] proposes several modifications to the INEEL interface [22, 55] for telerobotics in urban search and rescue (USAR). The modifications are designed to decrease the complexity and increase the usability of the interface for non-experienced users. This work is based on the results of several past works by the same authors [10, 11], which analyse several teleguide interfaces used in international competitions and outline their strengths and weaknesses.

Most of the modifications are aimed at reducing the cognitive workload imposed by the interface. For example, the pan and tilt angles of the camera are indicated by the position of a light cross overlaid on the video display, rather than by separate meters. Proximity/collision indicators are visualized as colored blocks around the video display, and each of them becomes visible only when an obstacle in the corresponding direction is sufficiently near. Rarely consulted information (e.g. battery charge) is treated as a system alert and visualized only when necessary. The environment map is placed at the same level as the video display, so that it is not tiring for the operator to shift his attention from the video display to the map and vice versa (figure 13). The possibility to fuse heat, sound and CO2 sensor data into a color map overlaid on the video display is indicated as future work.

Since it has been shown that most collisions happen to the rear of the robot, a rear camera is included and its video stream is displayed above the video display (like a rear-view mirror in a car).

Figure 13: Modified INEEL interface. [9]

Main points:

• Integration of sensor data in the same window to reduce cognitive workload

• Sensor data representation should be:

  – non-invasive;

  – quickly comprehensible (e.g. resemble known/conventional symbols).


3.5 Ecological interfaces for improving mobile robot teleoperation

The work of Nielsen et al. [26] describes an interface for mobile robot teleoperation based on ecological interface design [56] and augmented virtuality. Different versions of the same interface are compared, showing that integrating sensor data gives better results for navigation than displaying the data separately.

The presented interface displays a map of the environment reconstructed from range sensors (laser, sonar) together with a video image from the remote site. The 2D version of the interface shows the video and the map side by side; the 3D version instead shows a 3D model of the robot within a 3D representation of the map. The 3D map is created by elevating obstacles to a fixed height. The viewpoint is positioned a little behind the robot, and video data is visualized in a window in front of the robot model (figure 14).

Figure 14: 3D interface presented in [26].

Tests performed on the different versions show that the 3D version of the interface always generates better results than the 2D version. Moreover, it is shown that operators who use the 2D version do not benefit from having both video and map, since the two displays compete for their attention. The difference in performance is explained by the 3D version complying with three important principles of HRI: 1) presenting a common reference frame; 2) providing visual support for the correlation between action and response; 3) allowing an adjustable perspective.

Main points:

• 3D map of the environment based on range sensors

• Integration of sensor data in the same window reduces cognitive workload

• More information does not necessarily imply better performance


3.6 Egocentric and exocentric teleoperation interface using real-time, 3D video projection

The paper by Ferland et al. [25] presents an augmented-virtuality-based interface for mobile robot teleoperation. Like the one described in [26], it displays data from range and video sensors. In addition, it makes use of different projection methods for the video image in order to increase the quality of the information provided.

The sensors used consist of a laser range sensor and a pair of stereo cameras. The laser is used to build a global 2D map of the environment. The operator interface displays the map as a 3D environment, visualizing the obstacles detected by the laser as fixed-height walls. A 3D model of the robot is displayed within the 3D environment. Two viewpoints are available to the operator: an egocentric viewpoint, coincident with the position of the stereo camera (figure 15a), and an exocentric viewpoint, freely positionable in the zone behind and above the robot (figure 15b).

Figure 15: Egocentric (a) and exocentric (b) viewpoints of the interface presented in [25].

The video image is mapped onto the 3D environment using one of two projection methods. The laser-based method first projects the 3D mesh of the environment onto the left video image frame, then simply maps the single left image onto the resulting vertices. The stereoscopic method uses the disparity values from the stereo camera to project the stereo image into 3D space, then maps it onto the mesh using a set of OpenGL Shading Language [57] fragment shaders.

Testing results show that both the egocentric and the exocentric points of view are considered useful by most of the users. Most of the time, viewpoints positioned a little behind the robot are used; vertical, bird's-eye-like viewpoints are preferred in tight navigation situations or to obtain a global view of the map. The laser mapping proves to be the most useful source of information for navigation; the laser-based projection is also considered useful, unlike the stereoscopic projection, which is too sensitive to the quality of the disparity data.

Main points:

• 3D map of the environment based on range sensors

• It is necessary to design a reliable method for image projection in the virtual workspace


3.7 Summary and analysis

The works presented above highlight several benefits provided by sensor fusion. Sensor fusion techniques help to balance the strong and weak points of different types of sensors and to retrieve more reliable information from the robot and the surrounding environment [24, 54]. Besides, fused sensor data can be displayed to the user in a unified form.

Unified sensor representation has many advantages with respect to visualization in separate displays. Presenting data inside a single display, within a common reference frame, avoids competition for the user's attention. Interfaces that visualize different sensor data separately force operators to continuously switch between different displays, reference frames and visualization modalities. A unified representation instead prevents this switching, thus strongly reducing the user's cognitive workload [9, 26].

Augmented reality is a form of unified representation which presents a further advantage. Namely, visualizing complex data (such as positions and paths in [21]) as a graphic overlay on an image of the real world permits a faster and more intuitive interpretation by a human operator.

Several approaches to AR-based representation of visual and range data in telerobotics have been described. Some of them ([9, 24]) use two-dimensional augmentations of the video image. They use the color of these overlays as a quick and effective way to communicate a distance measure to the user. However, since two-dimensional overlays display information only on a single plane, their capacity to communicate a depth value is intrinsically limited.

Other approaches [25, 26] create a two-dimensional map of the environment using laser data, and display a 3D representation of the map by elevating virtual 3D walls. This approach has several advantages with respect to using 2D overlays. First, a 3D map usually looks more realistic, and can communicate depth in a more intuitive way because of monocular depth cues. Besides, while 2D overlays display raw range information and leave to the user the responsibility of deducing the shape of the environment, the 3D approach relieves the user of this work by presenting range data in a more quickly understandable form, namely as a 3D map.

The drawback of the described approaches is the poor quality of the integration between laser and visual data in the user interface. In [26] the video image is visualized on the display, but no correspondence between elements in the image and the laser-generated map is established. Therefore, the user must manually associate obstacles in the video image with obstacles in the laser map. In [25], instead, the correspondence between laser and video is automatically calculated through projection (see section 3.6). However, the quality of the projection is strongly dependent on the laser-camera calibration, on the correctness of the laser measurements and, in the case of stereoscopic projection, on the disparity data. Since laser-camera calibration always involves a certain degree of inaccuracy, since the laser sensor can miss some objects (e.g. low or transparent objects), and since disparity data is strongly dependent on environment features, we consider these requirements too strict to be enforced in the general case.


4 Previous work on 3MORDUC teleoperation

The 3MORDUC (3rd version of the MObile Robot DIEES University of Catania) is a wheeled mobile robot located at DIEES (Dipartimento di Ingegneria Elettrica, Elettronica e dei Sistemi), University of Catania [58]. It has been used over several years to carry out research work in the field of telerobotics and teleguide visual interfaces.

This section gives a brief description of the robotic platform and exposes past work in which it has been involved. Then, this past work is discussed and its main issues are outlined.


4.1 The 3MORDUC platform

The 3MORDUC uses two Maxon F2260 motors (40 W DC) for movement. The motors are connected to two rubber wheels through a shaft. A castor wheel is employed to facilitate turning. Two lead batteries (12 V/18 Ah) provide an autonomy of about 30-40 minutes.

Figure 16: The 3MORDUC platform.

Several sensors on board monitor the workspace and the robot state. Here we give a brief description of these sensors.

Laser scanner

A Sick LMS200 laser measurement sensor system (figure 17) is mounted on the front part of the 3MORDUC. The LMS operates by emitting a pulsed laser beam in a given direction. The reflected pulse is received and registered, and the distance between the robot and the obstacle which reflected the pulse is estimated by measuring the time of flight of the laser light. The procedure is repeated for several different directions on a plane, to generate a scan of the surroundings of the sensor. It is possible to configure the angular resolution (0.25°, 0.5°, 1°) and the maximum scan angle (100°/180°). Each scan is executed clockwise. Measurement data are available in real time for further evaluation via an RS232/RS422 serial interface.

Figure 17: The Sick LMS200 laser sensor.

Laser sensors are usually very accurate (each distance measurement has an accuracy of a few millimeters) and reliable. However, they can be deceived by transparent or very dark surfaces, which do not adequately reflect the laser light back to the receiver and thus generate outliers. Besides, laser sensors obviously cannot detect objects which do not intersect their scan plane.
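For reference, a scan of this kind is simply a list of (angle, range) pairs; converting it to 2D points in the sensor frame takes one line per beam, as in the sketch below (the start angle and angular step are placeholders, not the exact configuration used on the 3MORDUC).

```python
import math

def scan_to_points(ranges_m, start_deg=-90.0, step_deg=0.5):
    """Convert a planar laser scan to (x, y) points in the sensor frame.

    x points forward, y to the left; beam i lies at angle start_deg + i * step_deg.
    """
    points = []
    for i, r in enumerate(ranges_m):
        a = math.radians(start_deg + i * step_deg)
        points.append((r * math.cos(a), r * math.sin(a)))
    return points

# A 180 degree scan at 0.5 degree resolution yields 361 beams.
print(scan_to_points([1.0, 2.0, 1.5])[:2])
```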

Stereo cameras

The STH-MDCS2-VAR-C (figure 18) is a low-power compact digital stereo camera system, which can be connected to a PC via IEEE 1394. Each camera has a resolution of 1.3 megapixels and is equipped with a fixed-focus lens (4.5 mm). The CCD sensors of the cameras provide good noise immunity. Capture parameters (e.g. exposure gain, frame rate, resolution) are adjustable.

The cameras are mounted on a rigid support, which permits setting the camera baseline to any value in the range 5-20 cm. Their optical axes are kept parallel. The cameras are positioned on the top layer of the robot, about 95 cm above the ground. They point towards the direction in front of the robot and are slightly tilted towards the ground.

Figure 18: The STH-MDCS2-VAR-C stereo cameras.

Encoders

An incremental rotary encoder with a resolution of 500 pulses/turn is mounted on each wheel of the robot. Incremental encoders convert movement into a sequence of digital pulses. The movement/rotation of the robot with respect to a given start position/orientation can be calculated by counting the pulses generated by each encoder.
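A minimal example of this dead-reckoning computation for a differential-drive robot is sketched below; the wheel radius and track width are placeholder values, not the actual dimensions of the 3MORDUC.

```python
import math

TICKS_PER_TURN = 500      # encoder resolution (pulses per wheel revolution)
WHEEL_RADIUS = 0.10       # metres (placeholder value)
TRACK_WIDTH = 0.40        # distance between the two wheels, metres (placeholder)

def update_pose(x, y, theta, ticks_left, ticks_right):
    """Update the robot pose from the encoder pulses counted since the last call."""
    d_left = 2 * math.pi * WHEEL_RADIUS * ticks_left / TICKS_PER_TURN
    d_right = 2 * math.pi * WHEEL_RADIUS * ticks_right / TICKS_PER_TURN
    d_center = (d_left + d_right) / 2            # distance travelled by the robot center
    d_theta = (d_right - d_left) / TRACK_WIDTH   # change of heading
    x += d_center * math.cos(theta + d_theta / 2)
    y += d_center * math.sin(theta + d_theta / 2)
    return x, y, theta + d_theta

print(update_pose(0.0, 0.0, 0.0, ticks_left=500, ticks_right=520))
```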

Proximity sensors

A belt of 8 SRF08 sonar sensors is positioned around the robot. Sonars measure the distance from obstacles by calculating the time of flight of a reflected sonic signal originally produced by the vibration of a piezoelectric sensor. The sonar field of view has a conic shape, so the sensitive area increases proportionally with the distance from the robot. For this reason, sonars have a far lower angular resolution than the laser sensor. Furthermore, an inhibition time is necessary between the generation of the sonic signal and its reception, and this imposes a lower limit on measurable distances.

A belt of bumpers (16 switches) is mounted around the entire perimeter of the robot base, just above wheel level. These sensors recognize and reduce damage in case of a collision.


4.2 Mobile robotic teleguide based on video images

The work of Livatino et al. [59] performs a systematic evaluation of the impact of different stereoscopic visualization modes on performance in telerobotic tasks. The paper describes the design of the evaluation experiment and presents and analyses its results.

The experiment involved 12 participants. Each of them executed a simple teleguide task, which consisted in teleoperating the 3MORDUC platform (located at DIEES, Catania, Italy) from the University of Aalborg (Denmark). The participants were able to visualize the video data from the 3MORDUC cameras. Each participant executed the task using two different visualization setups: a 15” laptop and a 2 × 2 projected wall display. Besides, the task was executed twice for each setup, using monoscopic and stereoscopic visualization respectively. Within the laptop setup, stereoscopic visualization used colored anaglyph, while within the wall display setup it used polarized projection.

A set of both qualitative and quantitative parameters was evaluated during the trials. A 2-way ANOVA (ANalysis Of VAriance) was performed to measure the statistical significance of the quantitative results. The results show that stereo visualization introduces a significant reduction of the collision rate. This is because stereo visualization strongly enhances the sense of depth of the visualized scene. Furthermore, realism and the user's sense of presence in the remote environment are higher with respect to monoscopic visualization.

As regards the comparison between the laptop and wall display setups, it has been shown that in the laptop setup users benefit from a stronger depth perception and obtain a lower number of collisions. The wall display, instead, since it causes a wider use of peripheral vision, generates a higher sense of presence and confidence, which leads to higher mean speeds.

Main points:

• Stereoscopic visualization enhances collision avoidance

• Stereoscopic visualization increases realism and sense of presence


4.3 Depth-enhanced mobile robot teleguide based on laser images

The work of Livatino et al. [60] performs a systematic evaluation analogous to the one described in [59]. However, the evaluated teleguide interface displays synthetic images generated from laser scans instead of real camera images. In this telerobotic system, laser data is processed on the robot to construct 2D maps of the robot's surrounding workspace in real time. A 3D representation is extrapolated from the 2D maps by elevating wall lines and obstacle posts. A current front view of the robot workspace is then generated and displayed to the user using graphical software (figure 19). The teleoperation task to be executed by the participants and the visualization setups used during the experiment were the same as in [59].

Figure 19: The process of generating 3D graphical environment views from laser range information. The top-left image shows a 2D floor map generated by the laser sensor. The bottom-left image shows a 3D extrapolation of a portion of it. The right image shows a portion of the workspace visible to a user during navigation. [60]
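The 3D extrapolation step essentially amounts to extruding each 2D wall segment of the map into a vertical quad of fixed height, roughly as in the sketch below (hypothetical data layout, not the code used in [60]).

```python
def extrude_walls(segments_2d, height=1.0):
    """Turn 2D wall segments ((x1, y1), (x2, y2)) lying on the floor plane into
    vertical quads, each returned as four (x, y, z) corners ready for rendering."""
    quads = []
    for (x1, y1), (x2, y2) in segments_2d:
        quads.append([(x1, y1, 0.0), (x2, y2, 0.0),
                      (x2, y2, height), (x1, y1, height)])
    return quads

# One wall segment from (0, 0) to (2, 0) extruded to a 1 m high quad.
print(extrude_walls([((0.0, 0.0), (2.0, 0.0))]))
```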

As regards the stereo-mono and laptop-wall comparisons for the laser-based visualization interface, the results obtained are analogous to the ones exposed in [59]. It can be deduced that stereoscopic visualization permits a significant decrease in collision rates independently of whether the interface is visual-based or laser-based. Besides, it is shown that participants using the laser-based interface perform better in terms of completion time. This is supposed to be due to the real-time performance provided by the laser-based interface. In fact, since visual data requires a significant bandwidth to be transmitted, the average delay between the display of two consecutive video images is about one second. As exposed in [61], this strongly decreases teleoperation performance. Instead, since laser data requires a much smaller bandwidth than visual data, a teleoperation client can receive and process it in real time, thus increasing the operator's performance in terms of average speed.

Main points:

• The benefits of stereoscopic visualization are present also in the case of laser-generated images

• Laser data can be used in real time for an increase in performance


4.4 Augmented reality stereoscopic visualization for intuitive robot teleguide

The paper by Livatino et al. [13] proposes a methodology for the fusion of laser and visual data in a teleoperation interface. This methodology exploits augmented reality to realize a coherent and intuitive visualization of the integrated data, and uses stereoscopy to increase teleoperation efficiency.

The interface presented in this work represents laser data as virtual overlays on the video images received from the robot cameras. Three different kinds of virtual overlays are used:

• proximity planes, semi-transparent colored layers superimposed on the objects within the scene (figure 20a);

• rays, colored lines departing approximately from the camera position and reaching the closest objects (figure 20b);

• distance values, indications of the absolute distance between the robot and the objects (figure 20b).

The virtual overlays have a different color depending on the distance between the robot and the real objects to which they correspond. Red overlays correspond to the nearest objects, yellow overlays to objects at medium distances, and green overlays to the furthest objects.

The laser measures are linearly mapped to image pixels between the left and right margins of the image. A semi-automatic calibration permits the user to adjust the first and last mapped angles. Then, edge detection is executed on the image in order to locate the bases of the objects in the image (by taking the first edge pixels from the bottom of the image) and to vertically align the virtual overlays with the real objects.

Figure 20: (a) Proximity planes overlaid on the image. (b) Rays and distance values overlaid on the image. [13]
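A simplified sketch of this angle-to-column mapping and color coding is given below; the distance thresholds and the linear mapping endpoints are illustrative assumptions, and the vertical placement via edge detection described above is omitted.

```python
def overlay_columns(ranges_m, image_width, near=0.8, far=2.5):
    """Linearly map each laser beam to an image column and pick an overlay
    color from its range: red (near), yellow (medium), green (far)."""
    n = len(ranges_m)
    overlays = []
    for i, r in enumerate(ranges_m):
        col = round(i * (image_width - 1) / max(n - 1, 1))  # first beam -> left margin
        if r < near:
            color = "red"
        elif r < far:
            color = "yellow"
        else:
            color = "green"
        overlays.append((col, r, color))
    return overlays

print(overlay_columns([0.5, 1.2, 3.0], image_width=640))
# -> [(0, 0.5, 'red'), (320, 1.2, 'yellow'), (639, 3.0, 'green')]
```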

A pilot test has been carried out by teleoperating the 3MORDUC from the 3D Visualization and Robotics Lab at the University of Hertfordshire, using both monoscopic and stereoscopic visualization. Although the results have been encouraging as regards the use of augmented reality and the semi-automatic calibration, the system proved not yet ready for stereoscopic visualization. In fact, since edge detection is performed on the left and right images independently, it generates different results (especially if the quality of the images is low and artifacts are present). This often causes the drawing of non-corresponding virtual overlays, which need to be detected and deleted or recomputed before rendering.

Main points:<br />

• effective methodology to integrate laser and visual data in a co-<br />

herent representation<br />

• use of different AR features to highlight distances from objects<br />

• necessity of a technique for excluding non-corresponding measurements



4.5 Summary and analysis<br />

In [59] and [60] two different approaches to visualization interfaces for mobile robot teleoperation are described. The visual-based approach consists of displaying a (mono or stereo) video stream from the remote site, while the laser-based approach consists of displaying a synthetic view of the robot workspace generated from laser range data.

Each approach has its strengths and weaknesses. Visual data is rich in contrast, and provides a large amount of information and a wide field of view to the operator, but the massive quantity of data to be transmitted implies a delay between the visualization of consecutive frames if the available bandwidth is low. Laser-generated images are much poorer in detail than video, but they can be generated and visualized in real time.

The works discussed above show that both visualization methods greatly benefit from stereoscopic visualization. Stereo increases the operator's sense of presence in the remote environment and the perceived sense of depth, thus increasing driving accuracy.

Livatino et al. [13] introduce an innovative methodology to join the advantages of these two approaches by using augmented reality. Colored overlays are used to fuse visual and laser data in a unique, coherent and intuitive representation. However, applying stereoscopic visualization to this methodology has proven not to be straightforward: a technique has to be developed to reconcile the results of left and right image processing in order to obtain a correct stereo rendering.


5 Proposed method: AR stereoscopic visualization<br />

5.1 Core idea and motivation<br />

The purpose of this work has been to design and implement a laser-and-vision-based visualization approach for mobile robot teleguide. The proposed approach is meant to fully exploit the benefits provided by augmented reality and stereoscopic visualization described in the previous sections in order to assist robot navigation.

The proposed visualization approach has been developed with the following aims:

1. the approach should appropriately communicate distance information from the laser sensor to the user; laser data should be represented in a way that is as intuitive and easy to interpret as possible;

2. the approach should be fully applicable to both monoscopic and stereoscopic visualization setups; it should be flexible and perform well even using a single camera image; at the same time, it should be designed to take advantage of stereoscopic visualization features where stereo is available;

3. the approach should avoid any unnecessary increase in the operator's cognitive workload; the interface design should be such that the operator does not need to frequently shift attention between different elements, and the teleguide should be as little tiring as possible;

4. the approach should be robust to sensor inaccuracy and errors; the visualization method should perform well even when sensor data is noisy or incorrect (e.g. in case of invalid disparity or laser outliers).


In order to achieve the above aims, techniques described in the literature can be used and improved. Specifically, the developed interface is based on:

Augmented reality As described in section 3, augmented reality is an extremely convenient method for representing sensor data. Since it integrates sensor and visual data in a single display, competition for user attention is avoided. Moreover, if sensor data are represented as immersed in the real workspace, the correlation between sensor data and real objects becomes easy and intuitive. In the case of a unified laser-video representation, laser distance data can be visualized directly on the corresponding regions of the camera image, thus giving a depth dimension to the image.

3D overlays Unlike two-dimensional overlays, 3D objects can be rendered so as to look nearer to or further from the viewpoint. The depth of 3D graphical objects can be represented through stereo visualization or, where only a single camera is available, through monocular depth cues (e.g. perspective, occlusion). Therefore, 3D objects are ideal for communicating depth information.

Colors Since colors are a very effective means of conveying information to humans, they can be used to make data interpretation faster and more intuitive. As in [9, 13, 24], the proposed visualization approach associates different colors with different distance values.

Image processing As described in [54], image processing can be used to retrieve distance information. This information can be integrated with laser measures to increase the reliability of range data.


5.2 Research development strategy<br />

The visualization approach introduced in the previous section has been implemented within the MOSTAR (MORDUC teleguide through STereoscopic Augmented Reality) interface for teleoperation of the 3MORDUC platform.

The development of the MOSTAR interface has been divided into four main steps. This section gives a brief overview of these steps and of their main issues. Sections 6 to 9 describe each step in detail.


Figure 21: Diagram of development and implementation steps.<br />



5.2.1 Design of an AR-based sensor fusion visualization<br />

The first step of the development of the MOSTAR interface has been<br />

the definition of a method to convert laser data to a set of graphical<br />

objects to be overlaid onto camera images.<br />

Colored 3D virtual objects have been chosen to represent laser data<br />

on the image. Virtual objects are positioned within the virtual 3D<br />

workspace according to laser measures, and they are rendered as a semi-<br />

transparent overlay above the camera image. It has been necessary to select appropriate 3D objects to represent laser data, and colors suitable for mapping distance in an intuitive way (section 6.1).

As shown in section 3, determining which visualization method is the most effective for complex data such as those provided by a mobile robot is not a straightforward issue. It is necessary to take into account numerous factors, including the specific application context and the user's particular preferences. For example, a 2D bird's eye view map of the robot workspace can be a very effective visualization method for exploring an environment, but it would usually not be sufficient for obstacle avoidance manoeuvres. Therefore, several representation modes have been designed for the MOSTAR interface, and several brief tests have been performed in order to determine the strengths and weaknesses of each mode.

Furthermore, an algorithm to refine the graphical appearance of the<br />

overlay by detecting potential laser outliers has been developed (section<br />

6.2).<br />

5.2.2 Definition of a laser-camera model and a calibration<br />

procedure<br />

A calibration procedure is clearly needed in order to correctly align<br />

the virtual objects defined in section 6 with the real objects in the camera

images.<br />

Several approaches that allow automatic determination of the intrinsic parameters of a camera, and of its extrinsic parameters with respect to a world coordinate system, have been proposed in the literature [62, 63]. These approaches calculate the parameters by analysing a set of chosen calibration images, and guarantee the optimality of the parameters in terms of accuracy through several statistical techniques. However, they require rather long and complicated calibration procedures, which have to be repeated every time the camera and/or the laser sensor are moved if the alignment accuracy is to be maintained.

For these reasons, a semi-automatic feedback-based calibration has been preferred for the MOSTAR interface. This kind of calibration consists of manually varying a restricted set of parameters while seeing the results of these variations in real time. In other words, while the calibration parameters are adjusted, the virtual overlay is drawn according to the currently selected values. The user can gradually align the AR overlay with the objects in the image, before or during the teleguide, and can reach a good degree of accuracy in a few minutes of adjustment, without any particular effort. Section 7 describes the developed feedback-based calibration procedure.

The alignment precision is slightly inferior to that obtainable with an automatic procedure, but it has proven to meet the requirements in most cases, and can be increased by exploiting image processing (see section 8).

5.2.3 Development of a method for integration of image features<br />

Image processing has been employed within the MOSTAR interface for<br />

two different purposes: to improve the alignment of the overlay with<br />

the camera image, and to increase the reliability of the sensor data by<br />



detecting possible erroneous measurements of the laser sensor.<br />

The image processing technique chosen for the MOSTAR interface is edge detection. Analysis of the edges in the images of the robot workspace has been used to detect walls and potential obstacles. A technique has been implemented for finding the nearest objects that the robot is facing, by identifying the borders of the bases of these objects (section 8.2).

Once the edges of the nearest objects are detected, it is necessary to integrate them with the laser data. Two integration techniques have been developed and implemented:

• a technique employing edges to improve camera alignment (section 8.3);

• a technique employing edges to correct wrong laser measurements and identify obstacles that are invisible to the laser (section 8.4).

5.2.4 Extension of the AR interface to stereoscopic visualization<br />

As discussed in section 4, stereoscopic visualization helps to reduce collisions during robot teleguide by enhancing the user's depth estimation. Since stereo can positively influence the visualization of both real and synthetic images (as proven in [59, 60]), the MOSTAR interface can definitely benefit from it.

Stereoscopic visualization has been easy to implement in the MOSTAR interface. The MORDUC cameras already provide a synchronized stereo pair of images, which can be directly displayed to the user. On the other hand, 3D virtual objects can easily be rendered from two different viewpoints, and each view of the objects can be overlaid on the corresponding real image.

The main issue in the stereo extension of the MOSTAR visualization interface has been guaranteeing a suitable disparity level between the left and the right image. Real and synthetic images must be displayed so that corresponding pixels have no vertical parallax, and their horizontal parallax is correct (i.e. non-divergent) and comfortable for the user. Moreover, the pair of real images and the pair of synthetic images must be correctly aligned. Section 9.1 explains how these issues have been managed.

Furthermore, since the left and right camera images differ from each other, different edges are usually detected within them. Since edge detection is intrinsically imperfect, some edges are detected in one image but not in the other. Therefore, a method has been implemented within the MOSTAR interface to deal with non-corresponding edges (section 9.2).

5.2.5 Implementation and testing<br />

The MOSTAR interface has been implemented as a Visual C++ .NET application for Microsoft Windows. 3D rendering has been realized by means of OpenGL [64], and the GLUT library [65] has been used for window and input handling. Image processing operations have been performed using functions from the OpenCV library [66]. The HTTP protocol and the WinHTTP library have been used to exchange driving commands and sensor data with the server program running on the MORDUC platform.

The MOSTAR interface has been subjected to several offline and<br />

online tests. During offline tests, the MOSTAR interface was used to<br />

display visual and laser data collected during previous teleguide ses-<br />

sions. During online tests, the MOSTAR interface was used to actively<br />

teleoperate the MORDUC platform in real-time. Online teleguide tests<br />

were performed from the 3D Visualization and Robotics Lab at the<br />

University of Hertfordshire, United Kingdom.<br />

During all tests, the MORDUC laser sensor was configured to sample 181 distance values in the zone in front of the robot, with an angular resolution of 1°. The STH-MDCS2-VAR-C stereo cameras were used

as visual sensors during most of the tests, using an image resolution<br />

of 640 × 480 pixels (per single image). During some of the tests, two<br />

Microsoft Lifecam Show webcams were used, mounted in a stereo con-<br />

figuration with a slightly different position and vertical inclination from<br />

the original setup.<br />

The visualization interface was run on a mid-range laptop (Intel Core 2 Duo T7500 processor, 2 GB RAM, ATI Mobility Radeon HD2600 graphics card). The timing values reported in the next sections refer to this configuration.

Several informal tests were conducted during the various implemen-<br />

tation steps by the developers, in order to validate design choices. A<br />

pilot test was conducted on the final version of the interface. There were four participants, all with moderate knowledge of augmented reality and stereoscopic visualization and no experience in robot teleoperation. The developers observed the performance of each participant,

collecting impressions and comments. The results of the tests are re-<br />

ported in sections 6 to 9, depending on the aspect of the interface they<br />

are related to.<br />



6 Effective multi-sensor visual representation<br />

We describe here the set of augmented reality features developed for<br />

joint visualization of laser and video data. The features have been<br />

designed to assist the user during navigation and obstacle avoidance.<br />

Visualization methods for the other MORDUC sensors are under development, but have not been implemented yet.

6.1 Visualization of laser data through AR features<br />

The MORDUC laser sensor provides a precise estimate of the distance<br />

between the robot and the surrounding obstacles. The MOSTAR in-<br />

terface uses AR to visualize this estimate in a way that facilitates im-<br />

mediate comprehension.<br />

Each set of laser measures is processed independently by the laser visualization algorithm. Given each point p detected by the laser sensor on its plane, the 2D coordinates of p with respect to the laser origin are calculated as

x_p = d_p cos(α_p)
z_p = d_p sin(α_p)

where d_p and α_p are, respectively, the distance value and the laser angle corresponding to the measurement of point p.

Each laser point is assigned a particular color depending on its dis-<br />

tance value. As in [13] nearer points are assigned a color with a higher<br />

red component and a lower green component, while further points are<br />

assigned a color with a higher green component and a lower red compo-<br />

nent. A minimum and maximum distance, depending on the applica-<br />

tion, are set. Points with distance equal to or lower than the minimum<br />

52


distance will be pure red, points with distance equal to or higher than<br />

the maximum distance will be pure green. Distances between the two<br />

extremes are linearly mapped to the red-green range (figure 22). As<br />

Distance<br />

7000<br />

6000<br />

5000<br />

4000<br />

3000<br />

2000<br />

1000<br />

0<br />

0 50 100 150 180<br />

Angle<br />

Figure 22: Assigning colors to a set of laser points. Red line and green line represent,<br />

respectively, minimum and maximum distance limits.<br />

stated in [24], the human eye is more sensitive to variations in the HSV<br />

color space than in the RGB space; though, colors in the red-yellow-<br />

green range have a stronger impact on the user than the other colors,<br />

since they are conventionally associated to danger-caution-safety [67].<br />

Since the perception of distance through this range of color is supposed<br />

to be more intuitive, and since we consider immediacy of interpretation<br />

more important than a very high resolution in highlighting distances,<br />

our choice has fallen on this range rather than on the HSV range.<br />
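As an illustration of the mapping just described, the following C++ sketch converts one laser scan into colored points; the structure and the minDist/maxDist parameter names are illustrative and are not taken from the MOSTAR source code.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative structure for a colored laser point (not from the MOSTAR source).
struct ColoredLaserPoint {
    float x, z;        // position on the laser plane, relative to the laser origin
    float r, g, b;     // color assigned from the distance value
};

// Convert one scan (e.g. 181 distance/angle samples) into colored points.
// minDist/maxDist are the application-dependent limits described above.
std::vector<ColoredLaserPoint> colorScan(const std::vector<float>& dist,
                                         const std::vector<float>& angleRad,
                                         float minDist, float maxDist)
{
    std::vector<ColoredLaserPoint> out;
    for (std::size_t i = 0; i < dist.size(); ++i) {
        ColoredLaserPoint p;
        // Polar-to-Cartesian conversion: x_p = d_p cos(a_p), z_p = d_p sin(a_p).
        p.x = dist[i] * std::cos(angleRad[i]);
        p.z = dist[i] * std::sin(angleRad[i]);

        // Linear mapping of distance to the red-green range:
        // t = 0 -> pure red (near), t = 1 -> pure green (far).
        float t = (dist[i] - minDist) / (maxDist - minDist);
        if (t < 0.0f) t = 0.0f;
        if (t > 1.0f) t = 1.0f;
        p.r = 1.0f - t;
        p.g = t;
        p.b = 0.0f;
        out.push_back(p);
    }
    return out;
}
```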

Colored laser points are used to create semi-transparent three-dimensional objects, which are aligned with the image (as described in section 7) and rendered onto it as an overlay. Two kinds of AR objects have been implemented:

• virtual walls are created between each pair of consecutive points and elevated from the ground level to a fixed height; the color of each wall is determined by the laser points which delimit it (figure 23c); optionally, vertical lines are drawn in correspondence with the laser points (figure 23d);

• virtual rays are drawn on the ground from the base of the robot (coincident with the projection of the laser origin onto the ground) to the laser points, at regular intervals, and each of them takes the color of the corresponding point (figure 23e).

Virtual walls position themselves over walls and obstacles in the robot workspace, highlighting object depth with their colors. Virtual rays point out the bases of the obstacles, and give the user a hint for estimating their distance from the robot. Several concentric circles, each at a fixed distance (0.5 m) from the previous one, are also drawn at ground level and serve as a further hint for estimating distances (figure 23c-e).
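The virtual walls can be rendered with the legacy OpenGL pipeline as semi-transparent quads between consecutive laser points. The following is a minimal sketch under that assumption; ColoredLaserPoint is the same illustrative structure as in the previous sketch, and wallHeight and alpha are assumed parameters rather than values taken from the MOSTAR implementation.

```cpp
#include <GL/gl.h>
#include <cstddef>
#include <vector>

// Same illustrative structure as in the previous sketch.
struct ColoredLaserPoint { float x, z; float r, g, b; };

// Draw one semi-transparent colored quad per pair of consecutive laser points,
// from the ground (y = 0) up to wallHeight.
void renderVirtualWalls(const std::vector<ColoredLaserPoint>& pts,
                        float wallHeight, float alpha)
{
    glEnable(GL_BLEND);
    glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);

    glBegin(GL_QUADS);
    for (std::size_t p = 1; p < pts.size(); ++p) {
        const ColoredLaserPoint& a = pts[p - 1];
        const ColoredLaserPoint& b = pts[p];
        glColor4f(a.r, a.g, a.b, alpha); glVertex3f(a.x, 0.0f,       a.z);
        glColor4f(b.r, b.g, b.b, alpha); glVertex3f(b.x, 0.0f,       b.z);
        glColor4f(b.r, b.g, b.b, alpha); glVertex3f(b.x, wallHeight, b.z);
        glColor4f(a.r, a.g, a.b, alpha); glVertex3f(a.x, wallHeight, a.z);
    }
    glEnd();

    glDisable(GL_BLEND);
}
```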

The rendering of three-dimensional objects by means of the OpenGL libraries is much more effective than the methods used in [24] and [13], which rely solely on two-dimensional drawing aids. In fact, as explained in section 5, three-dimensional rendering can represent depth in a more intuitive way than two-dimensional overlays.

The methods for representing laser data together with video data described in [26] and [25] are similar to the one just described. However, they are based on augmented virtuality rather than on augmented reality. As described in section 3.7, those approaches present some limitations which the approach proposed here does not. Unlike [26], since our approach directly superimposes virtual objects on the corresponding real objects, the correlation is easily and automatically established by the operator. Moreover, while the approach of [25] is sensitive to calibration inaccuracies and to the validity of disparity data, the method described here is independent of disparity and relatively robust to calibration inaccuracies.

Figure 23: (a) Plain camera image. (b) Laser map of the environment surrounding the robot. (c) Virtual walls overlaid onto the camera image. (d) Virtual walls and laser lines overlaid onto the camera image. (e) Virtual rays overlaid onto the camera image.

6.2 Detection of discontinuities<br />

Virtual walls are a valid hint for estimating the depth of the objects within the workspace but, if rendered as they are, they have a drawback. Connecting each pair of consecutive laser points with a wall implies the creation of a single large surface surrounding the robot. This means that separate objects are represented as “melted” with each other (figure 24a).

Figure 24: (a) The walls and the box are covered by the same virtual wall, giving the user the wrong clue that they constitute a unique surface. (b) Discontinuity detection separates different objects.

In order to minimise this problem, a simple discontinuity detection algorithm is executed on the laser points before rendering. For each wall between a pair of points, a slope coefficient is calculated:

slope_p = (dist_{p−1} − dist_p) / (dist_{p−1} + dist_p)

where dist_{p−1} and dist_p are the distance values corresponding to the two points which delimit the wall. It can be observed that the slope can only assume values in the [−1, 1] interval. If the slope of a wall differs from the slope of the adjacent walls by more than a parameterizable threshold (slopeTh), that wall is marked as a discontinuity and excluded from the rendering (figure 24b). Virtual rays are always drawn at discontinuities, in order to point out the edges of the objects in the workspace.

Discontinuity detection is also used to detect potential laser outliers (figure 25). Isolated laser outliers often have a distance value strongly different from that of their immediate neighbors; therefore, the pair of walls which contain a laser outlier are very likely to be marked as discontinuities. This fact is exploited by searching for consecutive discontinuities. As the discontinuity detection algorithm is performed, laser points between two consecutive discontinuities are labeled as potential outliers. In section 8 a method will be described to separate almost certain outliers from correct measurements.
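A minimal C++ sketch of the discontinuity and potential-outlier labeling described above follows; comparing each wall with both of its neighbors is one possible reading of the rule, and all names are illustrative rather than taken from the MOSTAR source.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Wall p lies between laser points p-1 and p. A wall is a discontinuity if its
// slope differs from an adjacent wall's slope by more than slopeTh; a point
// enclosed by two consecutive discontinuities is a potential outlier.
void detectDiscontinuities(const std::vector<float>& dist, float slopeTh,
                           std::vector<bool>& wallIsDiscontinuity,
                           std::vector<bool>& pointIsPotentialOutlier)
{
    const std::size_t n = dist.size();
    std::vector<float> slope(n, 0.0f);
    wallIsDiscontinuity.assign(n, false);
    pointIsPotentialOutlier.assign(n, false);

    // slope_p = (dist_{p-1} - dist_p) / (dist_{p-1} + dist_p), always in [-1, 1].
    for (std::size_t p = 1; p < n; ++p)
        slope[p] = (dist[p - 1] - dist[p]) / (dist[p - 1] + dist[p]);

    for (std::size_t p = 1; p < n; ++p) {
        bool diffPrev = (p > 1)     && std::fabs(slope[p] - slope[p - 1]) > slopeTh;
        bool diffNext = (p + 1 < n) && std::fabs(slope[p + 1] - slope[p]) > slopeTh;
        if (diffPrev || diffNext)
            wallIsDiscontinuity[p] = true;   // excluded from the rendering
    }

    // Points between two consecutive discontinuity walls are labeled as
    // potential outliers (confirmed or discarded later, see section 8.4).
    for (std::size_t p = 1; p + 1 < n; ++p)
        if (wallIsDiscontinuity[p] && wallIsDiscontinuity[p + 1])
            pointIsPotentialOutlier[p] = true;
}
```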

6.3 Testing<br />

Participants in the pilot test confirmed the virtual objects as a valuable aid for navigation. All participants found it easier to estimate distances and to understand the layout of the workspace when the virtual overlay was enabled.

Virtual walls were judged the most useful kind of virtual object, since they clearly highlighted walls and obstacles within the environment; thus, they made it easier for users to detect the position and size of obstacles. Virtual rays, instead, were considered a little confusing, especially when they were not reinforced by virtual walls and when the alignment with the real objects was not precise. One of the participants also found that they were not clearly visible, especially compared to the virtual walls.

Virtual circles on the ground were found useful for determining distances. Still, participants felt the need for some hint indicating absolute distances.

Figure 25: (a) Visual artifacts created by some outliers (caused by the black squares of the chessboard, which do not reflect laser beams). (b) The artifacts are partially eliminated by the discontinuity detection.

The discontinuity detection algorithm gave excellent results, even when irregularly shaped objects (e.g. people) were in the range of the laser sensor. A fixed value of 0.05 for slopeTh proved to work well in most cases.

Calculation and rendering of the overlays had excellent timing performance. Processing of the laser data and rendering of the virtual objects always took less than 10 ms, which is a perfectly acceptable delay for AR applications according to [36].



7 Laser-camera alignment and calibration<br />

The 3D virtual objects described in the previous section have coordinates defined with respect to the laser sensor origin. Laser-camera alignment determines where these coordinates are to be drawn within the camera image.

The alignment is performed by adjusting a set of intrinsic and extrinsic parameters in order to replicate those of the real camera, and then using these parameters to define a virtual viewpoint on the virtual scene overlaid onto the image. This way, the virtual camera looks at the virtual workspace from approximately the same position/orientation as the real camera with respect to the real workspace, so that the rendered scene coincides with the camera scene.

7.1 Laser-camera model<br />

In order to keep the calibration procedure simple, the alignment algorithm is based on the undistorted pinhole camera model and imposes several constraints on the orientation of the camera.

Since the dimensions of the camera image are fixed (they depend on the camera output resolution), and since the pinhole camera model does not consider distortions, the only variable intrinsic parameter for camera calibration is the focal length. The focal length of a camera in millimeters is often available in the manufacturer's specifications, so it is usually easy to retrieve. However, for calibration it is necessary to convert its value into pixels. It is possible to perform the conversion from the value in millimeters, but this requires knowing the dimensions of the CCD/CMOS camera sensor, which are not usually published by manufacturers.

As regards the orientation of the camera with respect to the laser coordinate system, three assumptions are made:

60


• the x axis of the camera coordinate system is parallel to the x axis of the laser coordinate system; that is, neither panning movements nor roll rotations of the camera with respect to the laser are permitted;

• the x axis of the camera system and the x axis of the laser system point in the same direction;

• tilt movements of the camera (i.e. rotations around the x axis) with respect to the laser coordinate system are confined between −90° and 90°; that is, the camera optical axis and the laser z axis “look” approximately in the same direction.

These assumptions are satisfied by the MORDUC platform (the stereo cameras are parallel and oriented in the direction of the laser z axis, and they have a slight tilt angle towards the ground), and are generally reasonable.

The variable calibration parameters left by the above approximations are only four: the position coordinates of the camera center with respect to the laser origin (x, y and z) and the tilt angle of the camera (figure 26). In the case of a panning-enabled camera it is possible to add a fifth parameter, namely the pan angle of the camera.

The resulting laser-camera model therefore has five (six, if panning is possible) parameters to configure. They are sufficient to define the OpenGL viewing transforms for rendering the overlay on the image: the camera position and tilt are used to position and orient the point of view, while the focal length and the image aspect ratio (which is known a priori) are used to determine the perspective projection matrix.
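A minimal sketch of this step is given below, assuming the legacy OpenGL/GLU matrix functions used by the interface; the exact sign conventions depend on the chosen coordinate frames, and the near/far planes are placeholder values rather than figures from the MOSTAR implementation.

```cpp
#include <GL/gl.h>
#include <GL/glu.h>
#include <cmath>

// Turn the five parameters of the laser-camera model into OpenGL viewing
// transforms: focal length + image size -> projection matrix, position + tilt
// -> modelview matrix. All parameter names are illustrative.
void setupVirtualCamera(float focalPx,                 // focal length in pixels
                        int imgW, int imgH,
                        float camX, float camY, float camZ, // camera center w.r.t. laser origin
                        float tiltDeg)                  // rotation around the x axis
{
    const double PI = 3.14159265358979;

    // Projection: derive the vertical field of view from the focal length and
    // the (a priori known) image size / aspect ratio.
    const double fovyDeg = 2.0 * atan((imgH / 2.0) / focalPx) * 180.0 / PI;
    const double aspect  = static_cast<double>(imgW) / imgH;
    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    gluPerspective(fovyDeg, aspect, 0.1, 100.0);        // near/far chosen for the workspace

    // Modelview: place the virtual viewpoint at the camera position, tilted
    // around the x axis with respect to the laser frame.
    glMatrixMode(GL_MODELVIEW);
    glLoadIdentity();
    glRotatef(-tiltDeg, 1.0f, 0.0f, 0.0f);
    glTranslatef(-camX, -camY, -camZ);
}
```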

The next section describes a simple procedure for the manual cali-<br />

bration of the parameters.<br />



Figure 26: Graphical representation of the laser-camera system and of calibration<br />

parameters.<br />

7.2 Feedback-based calibration procedure<br />

Given the model and the configuration parameters described in the previous section, it is possible to define a sequence of steps to obtain a satisfactory set of values for those parameters through feedback-based calibration.

1. Adjust the focal length of the virtual camera until the level of zoom of the overlay is equal to the level of zoom of the real image. Take the furthest object within the scene as a reference, and modify the parameter until the horizontal dimension of the corresponding virtual wall matches it (figure 27b).

2. Adjust the y camera coordinate and the tilt angle until the “virtual floor” is aligned with the real floor (figure 27c). Adjusting the y coordinate moves the virtual camera up and down in 3D space (that is, it moves the overlay down/up with respect to the real camera image). Virtual rays and circles can be used as aids to obtain a good alignment.

3. Adjust the z camera coordinate until the horizontal dimensions of both far and near objects match the corresponding virtual walls (figure 27d). Adjusting the z coordinate moves the virtual camera forward and backward (that is, it moves the overlay backward/forward with respect to the real camera image). While modifying the focal length scales the near and far parts of the overlay in the same way, modifying the z coordinate of the virtual camera has a strong effect on near virtual objects and almost no influence on far ones. It is therefore suggested to set the focal length first, using a very far object as a reference (as in step 1), and then adjust the z coordinate using a very near object as a reference.

4. Adjust the x camera coordinate to eliminate the horizontal offset between the overlay and the real image (figure 27e). Adjusting the x coordinate moves the virtual camera to the left and to the right (that is, it moves the overlay to the right/left with respect to the image).

This specific calibration order minimizes the interference between adjustments of different calibration parameters, so it avoids the need for the user to retrace his steps and adjust the same parameters again and again. However, if the final result is not satisfactory, the user is free to modify any parameter at any moment, even during the teleguide.

7.3 Comparison with automatic calibration<br />

The ease of calibration of the MOSTAR interface is tied to the reduced number of parameters the user has to deal with. It is therefore directly dependent on the approximations and constraints of the laser-camera model described in section 7.1. The feedback-based calibration procedure presented here could also be applied to more general cases, but at the expense of its simplicity. Automatic calibration may be preferable in the most general cases, or when high accuracy is needed. On the other hand, the strength of feedback-based calibration is that it achieves an accuracy amply sufficient for most cases without requiring significant time or effort from the user.

Figure 27: (a) Overlay at the beginning of the calibration, (b) after the adjustment of the focal length, (c) after the adjustment of the y coordinate and the tilt angle, (d) after the adjustment of the z coordinate, (e) after the adjustment of the x coordinate.

The feedback calibration procedure described in this section has been implemented in the MOSTAR interface, but the sensor representation logic described in section 6 is independent of the specific calibration procedure. In fact, since (as described at the beginning of this section) the alignment between the virtual overlay and the image is performed simply by setting the OpenGL viewpoint parameters (specifically the frustum shape and size and the position and orientation of the viewpoint), there is no constraint on the method used to calculate these parameters. The OpenGL camera model does not model lens distortion; however, several methods exist to simulate distortion by texture mapping [68] or with the OpenGL Shading Language [57, 69]. Consequently, the AR features of the MOSTAR interface are also applicable to a general laser-camera model, and can be used together with any calibration method.

7.4 Testing<br />

The feedback-based calibration confirmed expectations, performing well with both camera setups (STH-MDCS2-VAR-C stereo cameras and Microsoft Lifecams). With a few minutes of calibration it was possible to obtain a sufficient degree of alignment. Subtle misalignments could not be completely eliminated, but usually they did not bother users.

Participants who knew in advance the meaning of the calibration parameters found the calibration procedure easy and effective. However, participants without any prior knowledge of the camera model and the meaning of the calibration parameters found it slightly counterintuitive: it was not easy to deduce the nature of a parameter just by watching the effect of its adjustments. Therefore, the possibility of adding some (perhaps graphical) hints to the interface has been considered, in order to make the nature and effect of each calibration parameter clearer to users.

Finally, participants discovered that the virtual rays and the lines on the virtual walls may assist calibration, since their ends mark the precise position of each laser point. Participants found it easier to position the overlay over the camera image when they knew precisely the point hit by each laser beam.



8 Integrating 3D Graphics with image<br />

processing<br />

This section describes the technique used within the MOSTAR interface to integrate information from the edges in the camera image with the laser data. Briefly, the edges of the objects occupying the robot's field of view are located inside the image; then, edge pixels are unprojected to the 3D world coordinate system and their positions are calculated. The distance between the robot and each of these points is calculated and compared with the corresponding laser measure, so that the correctness of each laser measure can be double-checked.

The technique used by the MOSTAR interface to test laser measurements is analogous to the one described in [54], which uses disparity information retrieved from a stereo pair to identify points of the images whose depth does not correspond to the value detected by the laser. However, the technique described here is based on edge detection rather than on stereoscopic disparity calculation. The consequence is that the image processing technique used in the MOSTAR interface is applicable even where only one camera is available. In addition, edge detection algorithms are usually faster than stereo disparity calculation algorithms.

8.1 Edge detection algorithm<br />

The process used for edge detection in camera images is divided into<br />

two steps.<br />

First, the image is converted to grayscale and preprocessed by a contrast stretching function. Two gray-value thresholds (lowTh < highTh) are applied to the image: the intensity of pixels whose original value is lower than lowTh is set to the minimum value (black), while the intensity of pixels whose original value is higher than highTh is set to the maximum value (white). Intensity values of all other pixels are linearly mapped to the range between the minimum and maximum gray values (figure 28).

Figure 28: Contrast stretching function used before actual edge detection.

This preprocessing step has two benefits. First, it improves the quality of the image for edge detection, by suppressing gradients in very bright or very dark areas and increasing contrast in the rest of the image. Secondly, if the floor of the working environment is much brighter (or darker) than the rest of the scene, contrast stretching can neatly separate the intensity range of the floor from the intensity range of the workspace objects, simplifying the identification of edges on the ground, which correspond to the bases of obstacles.

The second step is processing the contrast-stretched grayscale image with the Canny edge-detection algorithm [70]. The Canny algorithm is a simple and very popular edge-detection algorithm based on an intensity hysteresis process. First, the Sobel operator [71] is applied to the image along the horizontal and vertical directions. The Sobel operator has two purposes: averaging the intensity values of the pixels along one direction, thus reducing image noise by blurring, and calculating the gradient of the pixels along the perpendicular direction. This way, a gradient value and an edge direction are retrieved for each pixel. Then, non-maximum suppression is executed, by checking whether each pixel has the maximum gradient among its neighbors taken along its edge direction; non-maximum pixels are excluded from the edge detection. Finally, the gradient values of the remaining pixels are compared with a pair of thresholds (th1 < th2):

• pixels with a gradient value higher than th2 are immediately marked as edge pixels;

• pixels with a gradient value lower than th2 but higher than th1 are marked as edge pixels only if they are encountered along an edge which contains pixels whose gradient value is higher than th2; otherwise, they are excluded from the edge detection;

• pixels with a gradient value lower than th1 are always excluded from the edge detection.

The advantage of the hysteresis process over a single-threshold approach is that it identifies only reliable edges (pixels with a high gradient, and pixels with a low gradient which are likely to belong to a real edge because they are connected to a strong pixel). This avoids a typical problem of using a single threshold, namely the creation of discontinuous edges, which happens where some pixels along an edge have a gradient value slightly higher than the threshold while others have a gradient slightly lower than it.

Optimal threshold values for contrast stretching and for the Canny algorithm are strongly dependent on the features of the captured images (illumination, object and floor textures, etc.). No tried and tested approach to their determination exists yet. In the MOSTAR interface, it is possible to choose a value for each threshold through another feedback-based procedure: the edge detection calibration feature allows varying the individual parameters while visualizing the resulting contrast-stretched grayscale image and edge image (figure 29).

Figure 29: (a) Original image. (b) Contrast-stretched grayscale and edge image, visualized during the calibration of the edge detection parameters.

The edge detection algorithm has been implemented using OpenCV library functions.
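The following sketch reproduces the two-step edge extraction with the OpenCV C++ API (the thesis implementation used the OpenCV functions available at the time, so the exact calls may differ); the threshold names mirror the text, and their values are assumptions to be tuned with the feedback procedure described above.

```cpp
#include <opencv2/opencv.hpp>

// Grayscale conversion, contrast stretching and Canny edge detection.
cv::Mat extractEdges(const cv::Mat& bgrFrame,
                     int lowTh, int highTh, double th1, double th2)
{
    cv::Mat gray;
    cv::cvtColor(bgrFrame, gray, cv::COLOR_BGR2GRAY);

    // Contrast stretching: values below lowTh -> 0, above highTh -> 255,
    // values in between mapped linearly over the full range.
    cv::Mat lut(1, 256, CV_8U);
    for (int v = 0; v < 256; ++v) {
        if (v <= lowTh)       lut.at<uchar>(v) = 0;
        else if (v >= highTh) lut.at<uchar>(v) = 255;
        else lut.at<uchar>(v) = static_cast<uchar>(
                 255.0 * (v - lowTh) / (highTh - lowTh));
    }
    cv::Mat stretched;
    cv::LUT(gray, lut, stretched);

    // Canny edge detection with hysteresis thresholds th1 < th2.
    cv::Mat edges;
    cv::Canny(stretched, edges, th1, th2);
    return edges;
}
```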

8.2 Nearest edges discovery<br />

After an edge image is extracted by the method described in the previ-<br />

ous section, the NED (Nearest Edges Discovery) algorithm is executed<br />

on it. The aim of the NED algorithm is to detect the nearest objects<br />

within the area viewed by the robot camera through the analysis of<br />

edges present in the camera image.<br />

The NED algorithm begins by processing each of the laser measures. For each laser point, the corresponding virtual ray (the line lying on the ground between the laser origin and the point itself) is projected onto the camera image. Each virtual ray corresponds to a two-dimensional line on the image plane, though usually only a part of it (or none) will lie inside the actual border of the image (figure 30).

For each ray, the corresponding segment on the image is located (if it exists) with the help of the gluProject function of the OpenGL Utility library (GLU). The input arguments required by gluProject are a point in 3D space and the OpenGL camera transformation parameters. The function applies the coordinate transformation to the point and returns its pixel coordinates and depth value. It is subsequently possible to invert the projection and recover the original 3D coordinates of the point, since the pixel depth value eliminates the ambiguity.

Figure 30: Projection of a virtual ray onto the camera image.

Each of the pixels composing the segment just retrieved corresponds to a 3D point on the virtual ray between the robot and a given laser point. Points further than the laser point along the same direction are also included by prolonging the segment in the image. The segment is then scanned along this direction until a pixel corresponding to an edge is found.

Once the nearest edge pixel along the virtual ray projection is found, its image coordinates are used to retrieve the 3D coordinates of the corresponding workspace point, by means of the gluUnProject function (figure 31). The retrieved workspace point will be a point on the ground, usually corresponding to the base of an obstacle, and it will have a certain distance from the robot.
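The sketch below illustrates this step in a simplified form: instead of scanning image pixels and unprojecting them with gluUnProject, it samples the ground ray in 3D and projects each sample with gluProject, which is equivalent for the purpose of finding the first edge along the ray. Function and parameter names are illustrative, not taken from the MOSTAR source.

```cpp
#include <GL/gl.h>
#include <GL/glu.h>
#include <cmath>
#include <opencv2/opencv.hpp>

// Walk outward along the ground ray towards one laser point (and beyond it,
// up to maxRange), project each 3D sample to the image, and return the first
// sample whose pixel lies on an edge: the nearest edge point (NEP).
bool findNearestEdgePoint(const cv::Mat& edges,       // binary edge image
                          double rayX, double rayZ,   // laser point on the ground plane
                          double maxRange, double step,
                          double& nepX, double& nepZ)
{
    GLdouble model[16], proj[16];
    GLint view[4];
    glGetDoublev(GL_MODELVIEW_MATRIX, model);
    glGetDoublev(GL_PROJECTION_MATRIX, proj);
    glGetIntegerv(GL_VIEWPORT, view);

    const double len  = std::sqrt(rayX * rayX + rayZ * rayZ);
    const double dirX = rayX / len, dirZ = rayZ / len;

    for (double d = step; d <= maxRange; d += step) {
        GLdouble winX, winY, winZ;
        if (!gluProject(d * dirX, 0.0, d * dirZ, model, proj, view,
                        &winX, &winY, &winZ))
            continue;
        int px = static_cast<int>(winX);
        int py = edges.rows - 1 - static_cast<int>(winY); // OpenGL y is bottom-up
        if (px < 0 || px >= edges.cols || py < 0 || py >= edges.rows)
            continue;                                     // outside the image border
        if (edges.at<uchar>(py, px) != 0) {               // edge pixel found
            nepX = d * dirX;
            nepZ = d * dirZ;
            return true;
        }
    }
    return false;
}
```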



Figure 31: Unprojection of the edge pixel to the 3D space, through pixel coordinates<br />

and depth value.<br />

At the end of the NED algorithm, a set of nearest edge points (NEPs) will have been calculated, each with a corresponding laser point. The distance between each NEP and its corresponding laser point is calculated and compared to a parameterizable threshold edgeTh:

• if the distance between the NEP and the laser point is less than edgeTh, it is assumed that the NEP and the laser point correspond to the same real object;

• if the distance between the NEP and the laser point is greater than edgeTh, it is assumed that the NEP and the laser point correspond to different objects.

NEPs which fall into the first category are used for overlay alignment. NEPs which fall into the second category are further divided into those which are nearer to the robot than the corresponding laser point and those which are further from the robot, and are used to detect obstacles missed by the laser and possible laser outliers.



8.3 Improving alignment with edges<br />

In an ideal case, assuming perfect calibration and a completely distortion-free camera, the borders of the virtual objects would be perfectly aligned with the edges of the corresponding real objects. However, a slightly imprecise calibration and/or small differences between the ideal camera model and the real camera may often cause small misalignments.

NEPs which are located near the corresponding laser points can be used to correct these misalignments. For each of these NEPs, the coordinates of the corresponding laser point are corrected in order to coincide with the NEP coordinates. This way, when the overlay is rendered, the virtual object based on that laser point will be precisely aligned with the edge of the real object underneath (figure 32).

Figure 32: (a) Alignment using feedback-based calibration only: the margin of the overlay is slightly detached from the base of the box. (b) Alignment using feedback-based calibration and NEPs: the position of the laser points is corrected with NEPs in order to coincide with the base of the box in the image.

8.4 Improving reliability with edges<br />

If an edge is detected at a point much nearer to the robot than the laser

point, this can mean that:<br />

• an object is present in the workspace which has not been detected<br />

by the laser, and its base contains the NEP, or<br />

• a false edge has been detected in a point between the robot and<br />

the laser point.<br />

There is no trivial way to determine whether a NEP is a true edge or not, nor whether it indicates an object which could be an obstacle for the robot (it could be, for example, a drawing or some pattern on the floor). Although the safest decision for teleoperation would be to assume that an obstacle is present at the NEP, it is not actually convenient to consider every NEP an obstacle, since false edges are rather common even in well-structured environments and even after careful tuning of the edge detection parameters. Indeed, information retrieved from edge detection is far less reliable than information obtained from the laser sensor. Therefore, NEPs interpreted as potential obstacles are still indicated by an overlay, but in a different way from laser data. Specifically, a single colored point is rendered above each NEP which could indicate an obstacle (figure 33).

Figure 33: NEPs nearer to the robot than the corresponding laser points are highlighted with colored dots.

If an edge is detected at a point much further from the robot than

the laser point, this can mean that:<br />

• an object is present in the workspace which has been detected by<br />

the laser but it is high above the ground (therefore, since its base<br />

edge is higher than it is expected to be, it is interpreted as further<br />

than it actually is), or<br />

• the edge detection algorithm has missed the real edge of the ob-<br />

ject, or<br />

• the laser measure is wrong and lower than it should be.<br />

Since the last case is rather unlikely, the best (and safest) decision here is to trust the laser measure, and the NEP is therefore simply ignored.

In both cases, the presence of a NEP which disagrees with the corresponding laser point casts doubt on the validity of the corresponding laser measurement. Therefore, if the laser point had previously been marked as a potential outlier (see section 6.2), it will be considered a probable outlier and excluded from the rendering (figure 34).

Figure 34: (a) Outliers are detected as discontinuities, but they are still rendered. (b) Outliers are confirmed and excluded from the rendering.

8.5 Testing<br />

The integration of image processing into the AR system gave good results as regards alignment improvement and laser data correction.

The edge detection algorithm described in section 8.1 correctly identifies object bases where the floor has a plain texture. Parameter tuning is necessary where the floor presents a faint pattern, in order to suppress the range of gray values of the floor and highlight object borders. The algorithm is not expected to perform well where the floor presents strongly contrasted patterns (e.g. black-and-white tiles): in such cases contrast stretching would not be able to suppress floor edges, which would interfere with the detection of object bases. For those cases, a more sophisticated contrast stretching and intensity suppression function should be used.

Edge data integration for overlay alignment had good overall performance. However, it proved to be fairly sensitive to the quality of the edge detection. In cases where false edges were detected near the border with which the 3D overlay should have been aligned, the overlay tended to follow the false edges, producing unpleasant artifacts.

On the other hand, NEPs that do not correspond to laser data proved to be a very effective visual aid. Objects invisible to the laser tend to generate many highlighted NEPs of the same color along a line (see the box on the left of figure 33). On the contrary, false edges tend to generate isolated NEPs (see the NEPs in front of the box on the right of figure 33). The attention of the user is usually drawn by clusters of similar dots rather than by isolated dots; therefore, while real objects are strongly enhanced by NEPs, false edges remain relatively inconspicuous. This helps operators focus their attention on real obstacles without being distracted by striking false edges.

Several values have been tested for the edgeTh parameter. The higher the value of edgeTh, the more NEPs are considered consistent with the corresponding laser measures. Therefore, when a high value is used, more NEPs will be used for overlay alignment, so the 3D overlay will closely follow image edges. When a low value is used, more NEPs will be used for laser correction, so the overlay will follow the laser values and more NEPs will be highlighted as inconsistent with the laser data. Values around 10 cm for edgeTh generally perform well.

Timing performance was good. Contrast stretching and edge detection on a single frame had an average duration of 45 ms. Since the camera image and the virtual overlay are displayed at the same time, this does not cause dynamic registration errors, but it limits the maximum framerate to 25 fps. This value is sufficient for the teleguide of the MORDUC, which does not require very quick manoeuvres. Besides, during the tests the framerate was already limited to about 2 fps by network delay, so the delay introduced by image processing was negligible.



9 Stereoscopic augmented reality<br />

The techniques presented in the previous sections perform well when used on a single video image, but their efficacy can be increased by using them in conjunction with stereoscopic capture and visualization. However, using a stereo pair of images raises several issues, deriving from potential inconsistencies between the left and right image (see, for example, [13]). This section describes these issues and the methods used in the MOSTAR interface to solve them. Furthermore, it shows how stereo information can be used to improve the results achieved by the algorithms described above.

9.1 Stereo AR alignment<br />

The AR alignment problem in the case of a stereo pair of cameras is substantially analogous to the single-camera case. The difference is that in the stereo case there are three alignment aims instead of one:

• align the left camera image with the right camera image, so that the disparity of corresponding pixels is correct and comfortable;

• align the AR overlay on the left image with the one on the right image, so that they are visualized with the correct disparity and are perceived by the human operator as a single 3D overlay;

• align the stereoscopic pair of overlays with the stereoscopic pair of images, so that virtual objects are correctly positioned over real objects (the same aim as in monoscopic AR).

In ideal conditions (left and right cameras perfectly identical, parallel and at the same height) the first aim would be automatically satisfied. Unfortunately, things are different in the real case: cameras belonging to the same model may have slightly different intrinsic parameters, and they may not be perfectly positioned. Therefore, the MOSTAR interface allows introducing a horizontal/vertical offset between the images, in order to correct inaccuracies in the camera positions or differences in the cameras' principal points. Depending on the values of the offset, the images are shifted in opposite directions with respect to each other, and the parts of them which lack a counterpart in the other image are cropped (figure 35).

The second aim is automatically reached by exploiting the features of the OpenGL library: it is sufficient to render the left and right overlays from two identical viewpoints on the same virtual scene, parallel to each other and horizontally shifted with respect to one another, to obtain a stereoscopic pair of virtual viewpoints.

The third aim is achieved through the same feedback-based calibration procedure used in the monoscopic case. The user can adjust the parameters of the virtual cameras while the stereoscopic overlay is rendered on the images, and correct their values depending on what he sees. Although each virtual camera should have a separate set of intrinsic and extrinsic parameters, most of them (specifically, the focal length, the y and z coordinates and the rotation around the x axis) are kept equal for both, in order to preserve the correctness of the stereoscopic virtual pair. Only the x coordinates of the cameras are independent. As a general rule, the offset between the two final x values should be equal to the baseline of the real stereo cameras.
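A minimal sketch of the two-pass overlay rendering is given below, assuming a quad-buffered stereo OpenGL context (other stereo output modes would target different buffers or viewports); renderOverlayScene and setupVirtualCamera stand for the application's own rendering and calibration code and are illustrative names.

```cpp
#include <GL/gl.h>

// Provided elsewhere by the application (illustrative declarations).
void renderOverlayScene();                       // draws walls, rays, circles, NEP dots
void setupVirtualCamera(float focalPx, int w, int h,
                        float camX, float camY, float camZ, float tiltDeg);

// Render the same virtual scene from two parallel viewpoints whose x
// coordinates differ by the stereo baseline; all other parameters are shared.
void renderStereoOverlay(float baseline, float camX, float camY, float camZ,
                         float tiltDeg, float focalPx, int w, int h)
{
    // Left eye: shift the virtual camera half a baseline to the left.
    glDrawBuffer(GL_BACK_LEFT);
    setupVirtualCamera(focalPx, w, h, camX - baseline * 0.5f, camY, camZ, tiltDeg);
    renderOverlayScene();

    // Right eye: shift it half a baseline to the right, so that the offset
    // between the two viewpoints equals the real cameras' baseline.
    glDrawBuffer(GL_BACK_RIGHT);
    setupVirtualCamera(focalPx, w, h, camX + baseline * 0.5f, camY, camZ, tiltDeg);
    renderOverlayScene();
}
```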

9.2 NEP correspondence and suppression<br />

The NED algorithm described in section 8 is designed to be used on a single image. Executing it on both images independently is likely to give conflicting results: weak edges can easily be recognized in one image and missed in the other, while one image could contain artifacts which the other does not (especially if the quality of the captured images is low). Therefore, it has been necessary to implement a method to reconcile the NEPs retrieved from the left and the right image.

Figure 35: (a) Original images; the left image is a few pixels higher than the right one. (b) The left image is shifted downward (and its upper part is cropped), while the right image is shifted upward (and its lower part is cropped), so that they are vertically aligned.
the right image.<br />

A NESC (Nearest Edges Stereo Correspondence) algorithm is run by the MOSTAR interface after the NED algorithm has been performed on both images, and before the NEPs are rendered. The NESC algorithm is based on a simple assumption: the real edges of the objects within the robot workspace are likely to be rather strong features and to appear in both images, while false edges, such as those produced by image artifacts, are likely to appear in only one image. Therefore, the algorithm searches for NEPs which appear in both images and are likely to represent the same point in space.

The algorithm iterates over the laser points which have a corresponding NEP in one or both images. For each laser point, there are three possible alternatives:

1. if the laser point has a corresponding NEP in only one of the images, that NEP is counted as an unreliable NEP;

2. if the laser point has a corresponding NEP in both images, and the distance between the left and right NEP is less than a parameterizable threshold (stereoTh), the two NEPs are considered to correspond to the same point in space: a reliable NEP is counted, and its coordinates are set to the midpoint between the left and the right NEP;

3. if the laser point has a corresponding NEP in both images, but the distance between the left and right NEP is greater than stereoTh, the NEPs are considered to correspond to two different points in space, and both are counted as unreliable NEPs.



At the end of the iteration, laser points falling into the first and the second categories above will have one corresponding NEP (reliable or unreliable), while those belonging to the third will have two NEPs. In the latter case, only the NEP which agrees with the laser measure (that is, whose distance from the laser measure is less than edgeTh), if any, is considered. In the unlikely case that a laser point has two different unreliable NEPs which both agree with the laser measure (this can happen if the chosen threshold values are such that stereoTh < 2·edgeTh), only the NEP closer to the robot is considered, as a safety measure.
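A minimal C++ sketch of this classification step is given below; the data structures and function names are illustrative assumptions rather than MOSTAR's actual types, and the edgeTh-based selection among unreliable NEPs described above is assumed to happen in a later stage.

    #include <cmath>
    #include <vector>

    struct Point3D { double x, y, z; };

    // One laser reading together with its candidate NEPs found by the NED
    // algorithm in the left and/or right image (illustrative structure).
    struct LaserPoint {
        Point3D measure;              // 3D position returned by the laser scanner
        bool hasLeftNEP  = false;
        bool hasRightNEP = false;
        Point3D leftNEP, rightNEP;    // candidate NEPs from the two NED runs
    };

    struct ClassifiedNEP {
        Point3D position;
        bool reliable;                // true if confirmed in both images
    };

    static double dist(const Point3D& a, const Point3D& b)
    {
        return std::sqrt((a.x - b.x) * (a.x - b.x) +
                         (a.y - b.y) * (a.y - b.y) +
                         (a.z - b.z) * (a.z - b.z));
    }

    // stereoTh: maximum left/right distance for two NEPs to be merged into one reliable NEP.
    std::vector<ClassifiedNEP> runNESC(const std::vector<LaserPoint>& points, double stereoTh)
    {
        std::vector<ClassifiedNEP> out;
        for (const LaserPoint& p : points) {
            if (p.hasLeftNEP && p.hasRightNEP) {
                if (dist(p.leftNEP, p.rightNEP) < stereoTh) {
                    // Case 2: the same physical edge seen in both images ->
                    // one reliable NEP placed at the midpoint of the two candidates.
                    out.push_back({ { (p.leftNEP.x + p.rightNEP.x) / 2.0,
                                      (p.leftNEP.y + p.rightNEP.y) / 2.0,
                                      (p.leftNEP.z + p.rightNEP.z) / 2.0 }, true });
                } else {
                    // Case 3: two distinct points in space -> both kept, but unreliable.
                    out.push_back({ p.leftNEP,  false });
                    out.push_back({ p.rightNEP, false });
                }
            } else if (p.hasLeftNEP || p.hasRightNEP) {
                // Case 1: edge seen in only one image -> unreliable NEP.
                out.push_back({ p.hasLeftNEP ? p.leftNEP : p.rightNEP, false });
            }
        }
        return out;
    }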

After the reliability of the NEPs has been assessed through the NESC algorithm, reliable NEPs are used to refine the alignment and to double-check laser measures, as described in sections 8.3 and 8.4. NEPs which have proved to be unreliable, instead, are used only if they agree with the corresponding laser measure, and are disregarded if they contradict the laser sensor. Therefore, while reliable NEPs are used both for overlay alignment and for laser correction, unreliable NEPs are used only for alignment. The idea is that if a NEP is unreliable (i.e. it appears in only one of the images) but is strengthened by some other factor (in our case, the laser measure), it is likely to correspond to a real edge, so it can be used for alignment; if instead it disagrees with the other measures, it is likely to be a false edge, so it should be neglected.
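Continuing the previous sketch (and reusing its illustrative types), this usage policy could be expressed as follows; the structure and function names are again assumptions.

    // Whether a classified NEP may be used to refine the overlay alignment
    // and/or to double-check the corresponding laser measure.
    struct NEPUsage { bool useForAlignment; bool useForLaserCheck; };

    // edgeTh: maximum NEP/laser distance for the two measures to be considered in agreement.
    NEPUsage decideUsage(const ClassifiedNEP& nep, const Point3D& laserMeasure, double edgeTh)
    {
        const bool agreesWithLaser = dist(nep.position, laserMeasure) < edgeTh;
        if (nep.reliable)
            return { true, true };            // reliable: alignment and laser double-check
        return { agreesWithLaser, false };    // unreliable: alignment only, and only if
                                              // the laser measure confirms it
    }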

9.3 Testing

The evaluation of stereoscopic visualization went as expected. Participants found the stereoscopic modality of the MOSTAR interface more realistic than the monoscopic modality, and felt an increased sense of awareness of the remote environment. No quantitative results were collected, though a systematic evaluation has been planned for the future.

The stereo alignment technique proved helpful in correcting small misalignments between the camera images. It was especially useful in eliminating vertical disparity between the images (for the STH-MDCS2-VAR-C cameras, the vertical disparity was due to a slight difference in the position of the principal point; for the Microsoft LifeCams, it was due to inaccurate positioning).

The rendering of virtual objects also benefited from stereo. Participants observed that virtual walls appeared much more realistic when observed with stereoscopic visualization. On the other hand, virtual rays and lines on walls were found confusing and tiresome to look at, probably because they were not clearly visible. However, as stated in section 7, virtual rays and lines were useful during laser-camera calibration, so participants usually preferred to visualize them during calibration.

The application of the NESC algorithm was successful. As can be seen in figure 36, after the application of the NESC algorithm several stray NEPs are eliminated, while reliable NEPs (those coincident with the borders of real objects) are left relatively untouched.

The NESC algorithm results were influenced by the value of the stereoTh parameter. Given a laser point having a corresponding NEP in the left image and another in the right image, the value of stereoTh determines how far apart (in 3D space) the two NEPs may be while still being considered coincident, and hence reliable. Higher values of stereoTh force the NESC algorithm to “trust” edge detection and to output more reliable edges, which means that more NEPs will ultimately be used during the rendering of the overlay. Therefore, high values of stereoTh should be used when edge detection gives reliable results. During tests, a value of edgeTh/2 was used for stereoTh, with excellent results.

Since in stereoscopic mode the image processing algorithms operated on two different images, the delay they introduced was doubled (about 90 ms). However, as stated in section 8, this delay was negligible compared to the one introduced by the communication over the network.



Figure 36: (a) Highlighted NEPs in the left image. (b) Highlighted NEPs in the left image after applying the NESC algorithm.



10 Conclusions

This work has presented a new approach to the visualization of video and sensor data in a teleguide interface. The approach is based on augmented reality and further enhanced by stereoscopic visualization. It has been implemented within the MOSTAR interface and tested by teleoperating the MORDUC mobile robot from a distance of over 2500 km.

The proposed approach displays visual data and range data from a laser scanner in a unified, AR-based representation. The aim is to assist mobile robot navigation by providing depth information to the operator in an intuitive and effective way. Virtual colored three-dimensional objects, built from laser data, are overlaid on the video image to highlight obstacles. Virtual objects are registered with real objects thanks to a simple and effective semi-automatic calibration procedure. Edge detection is used to identify nearest edge points (NEPs), which in turn are used to refine the AR registration and to point out obstacles which the laser is not aware of. The approach can be used with both monoscopic and stereoscopic display solutions. If stereo cameras are available, stereo information is used to verify the reliability of edge detection.

The proposed approach has been implemented and a pilot test has been performed to assess its validity. The test produced excellent results. Virtual objects proved to be a valuable aid for distance estimation and for acquiring awareness of the remote environment. Semi-automatic calibration was sufficient to obtain a good degree of alignment in the vast majority of cases. Edge detection highlighted obstacles invisible to the laser, generating only a few negligible false positives; however, the alignment correction feature proved to be too sensitive to noisy edges. The approach performed well in both monoscopic and stereoscopic modes. Tests showed that it is possible to significantly reduce the number of highlighted false edges by using stereo information.

Planned further developments include the refinement of the features that performed poorly. Specifically, we intend to investigate computer vision methods to reliably detect object bases even in the presence of strong patterns on the floor. In addition, a method is being designed to make feedback-based calibration more intuitive. Finally, a systematic evaluation of the approach, as in [59, 60], has been planned in order to quantify the performance improvement introduced by the approach.



References

[1] B. Davies. A review of robotics in surgery. Proceedings of the Institution of Mechanical Engineers, Part H: Journal of Engineering in Medicine, 214(1):129–140, 2000.

[2] A.R. Lanfranco, A.E. Castellanos, J.P. Desai, and W.C. Meyers. Robotic surgery: a current perspective. Annals of Surgery, 239(1):14, 2004.

[3] P. Arena, P. Di Giamberardino, L. Fortuna, F. La Gala, S. Monaco, G. Muscato, A. Rizzo, and R. Ronchini. Toward a mobile autonomous robotic system for Mars exploration. Planetary and Space Science, 52(1-3):23–30, 2004.

[4] G. Astuti, G. Giudice, D. Longo, C.D. Melita, G. Muscato, and A. Orlando. An overview of the “Volcan Project”: An UAS for exploration of volcanic environments. Journal of Intelligent and Robotic Systems, 54(1):471–494, 2009.

[5] R.R. Murphy. Human-robot interaction in rescue robotics. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 34(2):138–153, 2004.

[6] G. Muscato, D. Caltabiano, S. Guccione, D. Longo, M. Coltelli, A. Cristaldi, E. Pecora, V. Sacco, P. Sim, G.S. Virk, et al. ROBOVOLC: a robot for volcano exploration – result of first test campaign. Industrial Robot: An International Journal, 30(3):231–242, 2003.

[7] Z. Zhang, S. Ma, Z. Lu, and B. Cao. Communication mechanism study of a multi-robot planetary exploration system. In IEEE International Conference on Robotics and Biomimetics (ROBIO), pages 49–54, 2006.


[8] P. Milgram, S. Yin, and J.J. Grodski. An augmented reality based teleoperation interface for unstructured environments. In Proc. American Nuclear Society 7th Topical Meeting on Robotics and Remote Systems, 1997.

[9] M. Baker, R. Casey, B. Keyes, and H.A. Yanco. Improved interfaces for human-robot interaction in urban search and rescue. In Proceedings of the IEEE Conference on Systems, Man and Cybernetics, volume 3, pages 2960–2965, 2004.

[10] J. Scholtz, J. Young, J. Drury, and H. Yanco. Evaluation of human-robot interaction awareness in search and rescue. In IEEE International Conference on Robotics and Automation, volume 3, pages 2327–2332, 2004.

[11] H.A. Yanco and J. Drury. ‘Where am I?’ Acquiring situation awareness using a remote robot platform. In IEEE Conference on Systems, Man and Cybernetics, pages 2835–2840, 2004.

[12] M.W. Kadous, R.K.M. Sheh, and C. Sammut. Effective user interface design for rescue robotics. In Proceedings of the 1st ACM SIGCHI/SIGART Conference on Human-Robot Interaction, page 257. ACM, 2006.

[13] S. Livatino, G. Muscato, D. De Tommaso, and M. Macaluso. Augmented reality stereoscopic visualization for intuitive robot teleguide. In IEEE International Symposium on Industrial Electronics (ISIE), 2010.

[14] R.T. Azuma et al. A survey of augmented reality. Presence: Teleoperators and Virtual Environments, 6(4):355–385, 1997.

[15] R. Azuma, Y. Baillot, R. Behringer, S. Feiner, S. Julier, and B. MacIntyre. Recent advances in augmented reality. IEEE Computer Graphics and Applications, pages 34–47, 2001.

[16] W.S. Kim, P.S. Schenker, A.K. Bejczy, and S. Hayati. Advanced graphics interfaces for telerobotic servicing and inspection. In Proc. IEEE/RSJ Int'l Conf. on Intelligent Robots and Systems, Yokohama, pages 303–309, 1993.

[17] P. Milgram, S. Zhai, D. Drascic, and J. Grodski. Applications of augmented reality for human-robot communication. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), volume 3, 1993.

[18] S. Otmane, M. Mallem, A. Kheddar, and F. Chavand. Active virtual guides as an apparatus for augmented reality based telemanipulation system on the Internet. In Annual Simulation Symposium, volume 33, pages 185–191, 2000.

[19] J.W.S. Chong, S.K. Ong, A.Y.C. Nee, and K. Youcef-Youmi. Robot programming using augmented reality: An interactive method for planning collision-free paths. Robotics and Computer-Integrated Manufacturing, 25(3):689–701, 2009.

[20] T.H.J. Collett and B.A. MacDonald. Developer oriented visualisation of a robot program. In Proceedings of the 1st ACM SIGCHI/SIGART Conference on Human-Robot Interaction, page 56. ACM, 2006.

[21] B. Giesler, T. Salb, P. Steinhaus, and R. Dillmann. Using augmented reality to interact with an autonomous mobile platform. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), volume 1, 2004.


[22] D.J. Bruemmer, D.D. Dudenhoeffer, and J. Marble. Dynamic autonomy for urban search and rescue. In Proceedings of the AAAI Mobile Robot Workshop, 2002.

[23] V. Brujic-Okretic, J.Y. Guillemaut, L.J. Hitchin, M. Michielen, and G.A. Parker. Remote vehicle manoeuvring using augmented reality. In International Conference on Visual Information Engineering (VIE), pages 186–189, 2003.

[24] R. Meier, T. Fong, C. Thorpe, and C. Baur. A sensor fusion based user interface for vehicle teleoperation. In Proceedings of the IEEE International Conference on Field and Service Robotics (FSR), 1999.

[25] F. Ferland, F. Pomerleau, C.T. Le Dinh, and F. Michaud. Egocentric and exocentric teleoperation interface using real-time, 3D video projection. In Proceedings of the 4th ACM/IEEE International Conference on Human-Robot Interaction, pages 37–44. ACM, 2009.

[26] C.W. Nielsen, M.A. Goodrich, and R.W. Ricks. Ecological interfaces for improving mobile robot teleoperation. IEEE Transactions on Robotics, 23(5):927, 2007.

[27] C. Demiralp, C.D. Jackson, D.B. Karelitz, S. Zhang, and D.H. Laidlaw. Cave and fishtank virtual-reality displays: A qualitative and quantitative comparison. IEEE Transactions on Visualization and Computer Graphics, 12(3):323–330, 2006.

[28] D. Drascic. Skill acquisition and task performance in teleoperation using monoscopic and stereoscopic video remote viewing. In Human Factors and Ergonomics Society Annual Meeting Proceedings, volume 35, pages 1367–1371, 1991.


[29] M. Ferre, R. Aracil, and M.A. Sanchez-Uran. Stereoscopic human interfaces. IEEE Robotics & Automation Magazine, 15(4):50–57, 2008.

[30] G.S. Hubona, G.W. Shirah, and D.G. Fout. The effects of motion and stereopsis on three-dimensional visualization. International Journal of Human-Computer Studies, 47(5):609–627, 1997.

[31] G. Jones, D. Lee, N. Holliman, and D. Ezra. Controlling perceived depth in stereoscopic images. In Stereoscopic Displays and Virtual Reality Systems VIII, Proceedings of SPIE, volume 4297, pages 42–53, 2001.

[32] I. Sexton and P. Surman. Stereoscopic and autostereoscopic display systems. IEEE Signal Processing Magazine, 16(3):85–99, 1999.

[33] Wikipedia. Augmented reality. http://en.wikipedia.org/wiki/Augmented_reality, 2010.

[34] M. Billinghurst, I. Poupyrev, H. Kato, and R. May. Mixing realities in shared space: An augmented reality interface for collaborative computing. In ICME 2000, pages 1641–1644, 2000.

[35] P. Milgram and F. Kishino. A taxonomy of mixed reality visual displays. IEICE Transactions on Information and Systems, E series D, 77:1321–1321, 1994.

[36] R. Azuma. Tracking requirements for augmented reality. Communications of the ACM, 36(7):51, 1993.

[37] A.J. Davison, I.D. Reid, N.D. Molton, and O. Stasse. MonoSLAM: Real-time single camera SLAM. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):1052–1067, 2007.


[38] W.A. Hoff, K. Nguyen, and T. Lyon. Computer vision-based registration techniques for augmented reality. Proceedings of Intelligent Robots and Computer Vision XV (SPIE), 2904:538–548, 1996.

[39] H. Kato and M. Billinghurst. Marker tracking and HMD calibration for a video-based augmented reality conferencing system. In Proceedings of the 2nd IEEE and ACM International Workshop on Augmented Reality, volume 99, pages 85–94, 1999.

[40] Wikipedia. Pinhole camera. http://en.wikipedia.org/wiki/Pinhole_camera, 2010.

[41] J. Heikkila and O. Silven. A four-step camera calibration procedure with implicit image correction. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1106–1112, 1997.

[42] C.C. Slama, C. Theurer, and S.W. Henriksen. Manual of Photogrammetry. American Society of Photogrammetry, Falls Church, Virginia, 1980.

[43] T. Melen. Geometrical modelling and calibration of video cameras for underwater navigation. PhD thesis, Institutt for teknisk kybernetikk, Universitetet i Trondheim, 1994.

[44] W. Faig. Calibration of close-range photogrammetry systems: Mathematical formulation. Photogrammetric Engineering and Remote Sensing, 41(12):1479–1486, 1975.

[45] J. Weng, P. Cohen, and M. Herniou. Camera calibration with distortion models and accuracy evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(10):965–980, 1992.


[46] L. Lipton. StereoGraphics Developers Handbook. StereoGraphics Corporation, 1991.

[47] L. Lipton. Foundations of the stereoscopic cinema: a study in depth. Van Nostrand Reinhold, 1982.

[48] Wikipedia. Anaglyph image. http://en.wikipedia.org/wiki/Anaglyph_image, 2010.

[49] S.E.B. Sorensen, P.S. Hansen, and N.L. Sorensen. Method for recording and viewing stereoscopic images in color using multichrome filters. US Patent 6,687,003, February 3, 2004.

[50] Wikipedia. Stereoscopy. http://en.wikipedia.org/wiki/Stereoscopy, 2010.

[51] M. Halle. Autostereoscopic displays and computer graphics. In ACM SIGGRAPH Courses, page 104. ACM, 2005.

[52] Wikipedia. HSL and HSV color spaces. http://en.wikipedia.org/wiki/HSL_and_HSV, 2010.

[53] J. Borenstein and Y. Koren. Histogramic in-motion mapping for mobile robot obstacle avoidance. IEEE Journal of Robotics and Automation, 7(4):535–539, 1991.

[54] H. Baltzakis, A. Argyros, and P. Trahanias. Fusion of laser and visual data for robot motion planning and collision avoidance. Machine Vision and Applications, 15(2):92–100, 2003.

[55] D.J. Bruemmer, R.L. Boring, D.A. Few, J. Marble, and M.C. Walton. “I call shotgun!”: An evaluation of mixed-initiative control for novice users of a search and rescue robot. In Proceedings of the IEEE Conference on Systems, Man & Cybernetics, 2004.


[56] J.J. Gibson. The ecological approach to visual perception. Houghton Mifflin, Boston, 1979.

[57] R.J. Rost. OpenGL® Shading Language. Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, USA, 2004.

[58] DIEES, University of Catania. 3MORDUC. http://www.robotic.diees.unict.it/robots/morduc/morduc.htm, 2010.

[59] S. Livatino, G. Muscato, S. Sessa, C. Koffel, C. Arena, A. Pennisi, D. Di Mauro, and E. Malkondu. Mobile robotic teleguide based on video images. IEEE Robotics & Automation Magazine, 15(4):58–67, 2008.

[60] S. Livatino, G. Muscato, S. Sessa, and V. Neri. Depth-enhanced mobile robot teleguide based on laser images. Mechatronics, in press, 2010.

[61] J. Corde Lane, R. Carignan, B.R. Sullivan, D.L. Akin, T. Hunt, and R. Cohen. Effects of time delay on telerobotic control of neutral buoyancy vehicles. In Proceedings of the IEEE International Conference on Robotics and Automation, volume 3, pages 2874–2879, 2002.

[62] Y. Bok, Y. Hwang, and I.S. Kweon. Accurate motion estimation and high-precision 3D reconstruction by sensor fusion. In IEEE International Conference on Robotics and Automation, pages 4721–4726, 2007.

[63] Q. Zhang and R. Pless. Extrinsic calibration of a camera and laser range finder (improves camera calibration). In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), volume 3, 2004.


[64] OpenGL website. http://www.opengl.org, 2010.

[65] GLUT website. http://www.opengl.org/resources/libraries/glut/, 2010.

[66] OpenCV website. http://sourceforge.net/projects/opencvlibrary/, 2010.

[67] R. Williams and B. Andrews. The non-designer's design book. Peachpit Press, Berkeley, 1994.

[68] Paul Bourke. Nonlinear Lens Distortion. http://local.wasp.uwa.edu.au/~pbourke/miscellaneous/lenscorrection/#opengl, August 2000.

[69] Graphics Size Coding. Tiny distortion shader. http://sizecoding.blogspot.com/2007/10/tiny-distortion-shader.html, October 2007.

[70] J. Canny. A computational approach to edge detection. Readings in Computer Vision: Issues, Problems, Principles, and Paradigms, page 184, 1987.

[71] I. Sobel and G. Feldman. A 3x3 isotropic gradient operator for image processing. Presentation for the Stanford Artificial Intelligence Project, 1968.
