UNIVERSITÀ DEGLI STUDI DI CATANIA
Faculty of Engineering
Master's degree course in Computer Engineering

Filippo Bannò

STEREOSCOPIC AUGMENTED REALITY
TO ASSIST ROBOT TELEOPERATION

Master's thesis
Academic year 2008/2009

Supervisor: Prof. Ing. G. Muscato
Co-supervisor: Dr. S. Livatino
Contents

1 Introduction
2 Background
  2.1 Augmented reality
  2.2 Pinhole camera model
  2.3 Stereoscopic visualization
3 Augmented reality visual interfaces in robot teleoperation
  3.1 A sensor fusion based user interface for vehicle teleoperation
  3.2 Fusion of laser and visual data for robot motion planning and collision avoidance
  3.3 Using augmented reality to interact with an autonomous mobile platform
  3.4 Improved interfaces for human-robot interaction in urban search and rescue
  3.5 Ecological interfaces for improving mobile robot teleoperation
  3.6 Egocentric and exocentric teleoperation interface using real-time, 3D video projection
  3.7 Summary and analysis
4 Previous work on 3MORDUC teleoperation
  4.1 The 3MORDUC platform
  4.2 Mobile robotic teleguide based on video images
  4.3 Depth-enhanced mobile robot teleguide based on laser images
  4.4 Augmented reality stereoscopic visualization for intuitive robot teleguide
  4.5 Summary and analysis
5 Proposed method: AR stereoscopic visualization
  5.1 Core idea and motivation
  5.2 Research development strategy
6 Effective multi-sensor visual representation
  6.1 Visualization of laser data through AR features
  6.2 Detection of discontinuities
  6.3 Testing
7 Laser-camera alignment and calibration
  7.1 Laser-camera model
  7.2 Feedback-based calibration procedure
  7.3 Comparison with automatic calibration
  7.4 Testing
8 Integrating 3D graphics with image processing
  8.1 Edge detection algorithm
  8.2 Nearest edges discovery
  8.3 Improving alignment with edges
  8.4 Improving reliability with edges
  8.5 Testing
9 Stereoscopic augmented reality
  9.1 Stereo AR alignment
  9.2 NEP correspondence and suppression
  9.3 Testing
10 Conclusions
References
1 Introduction

Robot teleoperation is a solution for many problems that can be solved neither by a robot alone nor by human intervention alone. Teleoperation of a robotic manipulator is widely used for tasks that require high precision of movement, or where the scale of the task forbids direct human intervention, as in robotic surgery [1, 2]. In addition, robots can be teleoperated to execute exploration or manipulation tasks in unknown, inaccessible, or dangerous environments where human beings could not operate safely, e.g. in deep waters, in planetary or volcano exploration, in USAR (Urban Search And Rescue) applications, or for bomb detection and deactivation [3–7].

Figure 1: Telerobotics applications: robotic surgery, exploration of volcanoes, deep waters, planets.
On the other hand, as techniques for managing telerobotic systems grow ever more sophisticated, it remains clear to those familiar with control technologies that complex robotic tasks are unlikely to be achievable by fully autonomous robotic systems, especially in highly unstructured and dynamically varying environments. In these cases, human cognition is irreplaceable, because of the high operational accuracy required as well as the deep environment understanding and fast decision-making involved [8].
When piloting a mobile robot, accurate navigation is essential. Errors and collisions must be minimized, since the robot could suffer unpredictable damage, and in most cases repairs would be difficult if not impossible (a representative example is space/planetary exploration). The same holds for tasks where the robot has to interact physically with people, since careless teleoperation may cause them harm.

The accuracy and reactivity of a robot teleoperator can be improved by enhancing the operator's sense of presence in the remote environment. A crucial aspect of a telerobotic system is therefore the user interface, which must be designed to be as immersive as possible.
Since vision is the dominant human sensory modality, considerable attention has been paid in the literature to the visualization aspect. The video sensor is an essential part of most telerobotic systems, since it provides a considerable amount of high-contrast information in a form that is easy for the user to assimilate. However, a number of other sensors can complement visual sensor output well, e.g. range sensors (laser-based, sonar-based), odometric sensors, bumpers, etc. Numerous works (see for example [9–13]) study interface design and propose methods to effectively display visual and sensor data in a teleoperation interface.
This work proposes a novel approach to the visualization of video and sensor data in a teleguide interface. The proposed approach exploits augmented reality and stereoscopic visualization to assist the tele-navigation of a mobile robot.
Augmented reality consists of enhancing a real-world representation with virtual graphical additions. It makes it possible to display sensor data together with visual data in an intuitive and quickly comprehensible way. To date, AR has found application in several fields. It can be used in the medical and manufacturing fields for intuitive training and for assistance during precision tasks, or to display annotations over the real workspace in collaborative applications. It is frequently used in military applications (e.g. Head-Up Displays for aircraft and helicopter pilots) and commercial ones, e.g. to enhance sporting events on television [14, 15]. Numerous applications of augmented reality in robotics are found in the literature. It has frequently been used to introduce visual aids into telemanipulation tasks [16–18], to facilitate robot programming [19–21], or to assist mobile robot teleguide [9, 22–26].
Stereoscopic visualization is well known today thanks to the spread of “3D movies”. Stereoscopy is a group of technologies that reproduce, on a two-dimensional display, the three-dimensional depth effect given by binocular vision. Several works demonstrate that stereoscopic visualization may provide a teleoperator with a higher sense of presence in remote environments because of improved depth perception [27–32]. This leads to a better comprehension of distance, as well as of aspects related to it, e.g. ambient and obstacle layout.
The proposed visualization approach has been implemented at the 3D Visualization and Robotics Lab at the University of Hertfordshire, United Kingdom. It has been tested by teleoperating the 3MORDUC platform, a wheeled mobile robot located at DIEES (Dipartimento di Ingegneria Elettrica, Elettronica e dei Sistemi) at the University of Catania, Italy, more than 2500 km away from the operator's location.
This thesis is structured as follows. Section 2 introduces some preliminary notions about augmented reality and stereoscopic visualization. Section 3 describes the state of the art in visualization interfaces for mobile robot teleguide and presents the strengths and weaknesses of the proposed approaches. Section 4 describes past work on the teleoperation of the 3MORDUC robotic platform, presenting results and limitations. Section 5 presents the proposed stereoscopic augmented reality approach, outlining the adopted development strategy. Sections 6 to 9 describe the various steps of the implementation in detail and present test results. Section 10 draws conclusions and introduces further developments.
2 Background<br />
2.1 Augmented reality<br />
Augmented reality (AR) is a term for a live, direct or indirect view of<br />
a physical real-world environment whose elements are augmented by<br />
virtual computer-generated imagery [33].<br />
Figure 2: Example of AR: a graphical model is rendered on a real fiducial marker<br />
[34].<br />
Many definitions of AR have been proposed in the literature. Azuma et al. [14] define AR as a variant of Virtual Reality (VR). VR technologies completely immerse a user inside a synthetic environment. In contrast, AR allows the user to see the real world, with virtual objects superimposed upon or composited with it. AR therefore supplements reality, rather than completely replacing it. Ideally, it would appear to the user that the virtual and real objects coexist in the same space. Azuma et al. [14] state that the main requirements for a visualization interface to fall within the AR category are:
• to combine real and virtual;<br />
• to be interactive in real-time;<br />
• to be registered in 3D (that is, virtual overlays are integrated in<br />
3D with real world).<br />
Milgram and Kishino [35] devised the Reality-Virtuality (RV) continuum (figure 3) to draw a coherent definition of VR and AR environments. VR and real environments constitute the two ends of the continuum.

Figure 3: Mixed reality display continuum [35].

The commonly held view of a VR environment is one in which the participant-observer is totally immersed in a completely synthetic world, which may or may not mimic the properties of a real-world environment, but which may also exceed the bounds of physical reality. In contrast, a strictly real-world environment clearly must be constrained by the laws of physics. All the environments between these two extremes are considered forms of Mixed Reality (MR). AR, which consists of the addition of virtual overlays to a real environment, is considered a form of MR near the “real” end. The reverse of AR is augmented virtuality (AV), which consists of the addition of (video or texture-mapped) elements from a real environment to a virtual, totally synthetic environment.
2.1.1 Alignment and registration<br />
Augmented reality does not simply mean the superimposition of a graphic object over a real-world scene; that is technically an easy task. One significant difficulty in augmenting reality is the need to maintain accurate registration of the virtual objects with the real-world image. This often requires detailed knowledge of the relationship between the frames of reference of the real world, the camera, and the user. Correct registration must also be maintained while the user (or the user's viewpoint) moves within the real environment. Discrepancies or changes in the apparent registration range from distracting to physically disturbing for the user, making the system unusable. AR demands much more accurate registration than VR, because humans are much more sensitive to visual differences between virtual and real objects than to inconsistencies between vision and other senses [36].
According to [14], sources of registration errors can be divided into two types: static and dynamic. Static sources are those that cause registration errors even when the user's viewpoint and the objects in the environment remain completely still. Dynamic sources are those that have no effect until either the viewpoint or the objects begin moving.

Static errors are usually caused by distortions in the optics, tracking errors, mechanical misalignments in the employed hardware, and/or incorrect estimation of the viewing parameters. Distortions and inaccurate viewing parameters usually cause systematic errors, which can be estimated and corrected. The other factors can cause errors that are difficult to predict and correct, so it is advisable to take precautions against them during the development phase (for example, through careful design of the tracking system and careful alignment of the hardware devices).
Dynamic errors occur essentially because of system delays in the rendering of the overlays. If the user's viewpoint is in motion and a significant delay exists between the moment when the viewpoint position/orientation is sampled and the moment when the virtual overlay is rendered, the virtual objects will not “move” in sync with the real objects, causing misalignments. Dynamic errors can be reduced by reducing system delay, or by predicting the future position/orientation of the viewpoint and rendering the corresponding part of the virtual overlay in advance. In video-based AR systems (i.e. when the user does not see the real world directly, but through a camera) it is possible to eliminate dynamic errors by synchronizing the video stream with the rendering of the overlay. This is the case in teleoperation systems, where the real world is seen through a camera mounted on the robot.
Vision-based techniques are often used to detect the viewpoint position from the real view, and then correctly register the overlay with the image. Usually, these approaches use fiducials: well-known objects whose position and orientation can be easily recognized within an image [37–39].
2.2 Pinhole camera model<br />
A camera model determines a projection function from scene points<br />
(points of the 3D real world viewed by the camera) to image points<br />
(points within the 2D camera image). Correspondence between scene<br />
points and image points is needed in computer graphics to know where<br />
on the screen virtual 3D objects have to be rendered.<br />
The most popular and simplest camera model is the pinhole model. A pinhole camera is a camera with no lens and a single, very small aperture. Simply put, it is a light-proof box with a small hole in one side (figure 4).

Figure 4: Simple representation of a pinhole camera [40].
Light from a scene passes through this single point and projects an<br />
inverted image on the opposite side of the box. Cameras using small<br />
apertures and the human eye in bright light both act like a pinhole<br />
camera [40].<br />
The pinhole camera model is based on the principle of collinearity,<br />
where each point in the object space is projected by a straight line<br />
through the projection center into the image plane. Figure 5 shows the<br />
geometric model of a pinhole camera.

Figure 5: Geometric model of the pinhole camera. [40]

The camera coordinate system
(O, X1, X2, X3) has its origin at the camera aperture (which is consid-<br />
ered infinitely small, coincident with a point). Axis X3 is pointing in<br />
the viewing direction of the camera and is referred to as the optical<br />
axis. The plane which intersects with axes X1 and X2 is the front side<br />
of the camera, or principal plane.<br />
The image plane is where the 3D world is projected through the<br />
aperture of the camera. It is parallel to axes X1 and X2 and is located<br />
at distance f from the origin O in the negative direction of the optical<br />
axis. f is also referred to as the focal length of the pinhole camera.<br />
The point R at the intersection of the optical axis and the image plane<br />
is referred to as the principal point of the camera, or center of the<br />
image. The 2D image coordinate system (R, Y1, Y2) has the origin at<br />
the principal point and the axes parallel to X1 and X2.<br />
For each point P = (x1, x2, x3) such that x3 > 0 a projection<br />
Q = (y1, y2) is defined on the image plane.

Figure 6: Geometric model of the pinhole camera as seen from the X2 axis. [40]

It is easy to calculate the coordinates of the projection from those of the original point using similar triangles (see figure 6 for clarity):

−y1 : f = x1 : x3  →  y1 = −f · x1/x3
−y2 : f = x2 : x3  →  y2 = −f · x2/x3
Since a coordinate is lost during the projection, it is not possible to retrieve the original 3D coordinates of P from the image coordinates of its projection. In fact, a point in the image corresponds to a line in space (see the green line in figures 5 and 6).
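As an illustration (not part of the thesis software), the projection equations above translate directly into code:

```python
def project_pinhole(point, f):
    """Project a 3D point given in camera coordinates (x1, x2, x3)
    onto the image plane of a pinhole camera with focal length f,
    using y1 = -f*x1/x3 and y2 = -f*x2/x3 (the image is inverted,
    hence the sign)."""
    x1, x2, x3 = point
    if x3 <= 0:
        raise ValueError("the point must lie in front of the camera (x3 > 0)")
    return (-f * x1 / x3, -f * x2 / x3)

# Doubling the distance halves the projected coordinates: the familiar
# perspective foreshortening.
near = project_pinhole((1.0, 2.0, 2.0), f=0.05)
far = project_pinhole((1.0, 2.0, 4.0), f=0.05)
```

Note that any point along the ray through the aperture projects to the same image point, which is the loss of information discussed above.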
When rendering on the screen, the image coordinates of the projection are converted to pixel coordinates by discretizing them and adding an offset (since the origin of the screen coordinates is usually in the upper left corner of the screen, rather than at the center of the image).
The focal length and the size of the image are referred to as the intrinsic parameters of the camera, since they depend only on the specific camera and on nothing else. In the general case, the coordinates of scene points are defined with respect to a world coordinate system, which is different from the camera system. In this case, it is necessary to convert the coordinates to their expression in the camera system before calculating the projection on the image plane. Therefore, a set of extrinsic parameters (position and orientation of the camera with respect to the world system) is needed to obtain image coordinates. Given the extrinsic parameters, it is possible to determine the transformation matrix between the world and camera systems, and thus the function that maps points of the world system to the camera system.
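The full world-to-image pipeline can be sketched as follows, under the assumption that the extrinsics are given as a rotation matrix R and a translation vector t with the convention p_cam = R·p_world + t (conventions vary between formulations):

```python
def world_to_camera(p_world, R, t):
    """Apply the extrinsic parameters: rotate and translate a world
    point into the camera coordinate system (p_cam = R @ p_world + t,
    one common convention)."""
    return tuple(
        sum(R[i][j] * p_world[j] for j in range(3)) + t[i]
        for i in range(3)
    )

def world_to_image(p_world, R, t, f):
    """Chain the extrinsic transformation with the pinhole projection
    y1 = -f*x1/x3, y2 = -f*x2/x3 of the previous paragraphs."""
    x1, x2, x3 = world_to_camera(p_world, R, t)
    return (-f * x1 / x3, -f * x2 / x3)

# With identity rotation and zero translation the world and camera
# systems coincide, so this reduces to the plain pinhole projection.
IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
y = world_to_image((1.0, 0.0, 2.0), IDENTITY, (0.0, 0.0, 0.0), f=0.05)
```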
The pinhole model does not consider camera distortions, so it accurately corresponds to reality only in those cases where distortions are negligible (typically, in good-quality cameras or in the central zone of the image). Several camera models exist that include distortion factors in the projection function. For example, the model used by Heikkila and Silven [41] differs from the pinhole model in two respects:

• the position of the principal point can differ from the center of the image;

• radial and tangential distortion (as defined in [42]) are present, and apply a non-linear transformation to the final projection coordinates.

Other models [43] consider the possibility of the camera axes being non-orthogonal; others [44, 45] introduce a prism distortion due to imperfect camera manufacturing.
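The effect of radial distortion, the dominant term in most of the models cited, can be sketched as a polynomial displacement of the ideal pinhole coordinates (the coefficient values below are purely illustrative, not taken from any of the cited models):

```python
def apply_radial_distortion(y1, y2, k1, k2=0.0):
    """Displace an ideal image-plane point along its radius by the
    factor (1 + k1*r^2 + k2*r^4), the usual polynomial form of radial
    distortion. k1 < 0 gives barrel distortion, k1 > 0 pincushion."""
    r2 = y1 * y1 + y2 * y2
    factor = 1.0 + k1 * r2 + k2 * r2 * r2
    return (y1 * factor, y2 * factor)

# The principal point is unaffected; points towards the image border
# are displaced more strongly, as r^2 grows.
center = apply_radial_distortion(0.0, 0.0, k1=-0.2)
edge = apply_radial_distortion(0.1, 0.0, k1=-0.2)
```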
2.3 Stereoscopic visualization<br />
A stereoscopic image presents the left and right eyes of the viewer with<br />
different perspective viewpoints, just as the viewer sees the real world.<br />
From these two slightly different views, the eye-brain synthesizes an<br />
image of the world with stereoscopic depth [46].<br />
When a human looks at an object in space, the eyes converge on that object, i.e. they rotate until both optical axes cross the object. It is easy to notice that, when the eyes converge on something, objects much nearer or further than the convergence point appear double (figure 7).

Figure 7: (a) Eyes converge on the thumb; the flag, which is further, appears double. (b) Eyes converge on the flag; the thumb, which is nearer, appears double. [46]

This is because the images projected on the left and right retinae are slightly different, since they correspond to two slightly different viewpoints. If the retinal images are overlaid, corresponding points are separated by a horizontal offset, referred to as retinal disparity. Points of the retinal images corresponding to an object on which the eyes converge have zero disparity. Nearer objects have negative disparity, while further objects have positive disparity. Retinal disparity is interpreted by the brain to produce a sense of depth, through a process called stereopsis.
Stereopsis works together with monocular depth cues to produce<br />
depth perception. Monocular cues are elements of a 2D image which can<br />
provide depth information. Some monocular cues are motion parallax,<br />
perspective, occlusion, relative size of objects [47].<br />
Stereoscopic displays obtain a depth effect by displaying a parallax value for each image pixel. Given two views of the same scene from slightly different side-by-side viewpoints, parallax is the horizontal offset, measured on the display, between corresponding pixels in the left and right images. It produces a directly proportional disparity on the retinae.

Figure 8: (a) Zero parallax. (b) Positive parallax. (c) Negative parallax. (d) Divergent parallax. [46]

Pixels having zero parallax (figure 8a) produce zero disparity on the retinae, and are seen as lying on the plane of the display. Pixels having positive parallax (figure 8b) produce positive disparity, and are seen as if they were behind the display. Vice versa, pixels having negative parallax (figure 8c) produce negative disparity and are seen as if they were in front of the screen. Finally, pixels having divergent parallax, i.e. parallax greater than the distance between the viewer's eyes (figure 8d), have no valid corresponding disparity value. Trying to fuse objects having divergent parallax requires an unusual muscular effort, and often results in discomfort.

Only horizontal parallax/disparity produces a sense of depth. Vertical disparity between left and right images is not natural, and has effects analogous to divergent disparity (eye strain, discomfort). Therefore, it should be avoided in the generation of stereoscopic images.
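The relation between perceived depth and screen parallax follows from similar triangles between the eye baseline and the screen plane. A minimal sketch of one common textbook formulation (the 6.5 cm eye separation and the screen distance are assumed values, and distances are in metres):

```python
def screen_parallax(z, eye_sep=0.065, screen_dist=0.6):
    """Horizontal parallax on the display for a point perceived at
    distance z from the viewer, by similar triangles. Zero at the
    screen plane, positive behind it, negative in front of it; it
    approaches eye_sep (the divergence limit) as z grows."""
    if z <= 0:
        raise ValueError("z must be positive")
    return eye_sep * (z - screen_dist) / z

p_screen = screen_parallax(0.6)   # on the display plane: zero parallax
p_behind = screen_parallax(1.2)   # positive: seen behind the display
p_front = screen_parallax(0.3)    # negative: seen in front of the display
```

Because the parallax never exceeds eye_sep for any finite positive depth, divergent parallax can only result from rendering or alignment errors, which is why it signals a malformed stereo pair.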
2.3.1 Visualization devices<br />
Numerous technologies have been developed for the visualization of<br />
parallax on planar displays. Stereo visualization devices are mainly<br />
divided into:<br />
• passive glasses;<br />
• active glasses;<br />
• autostereoscopic displays.<br />
Passive stereo technologies are based on the use of very simple glasses with no electronics. The cheapest kind of passive stereo is anaglyph stereo. It consists of filtering the two images with opposite colors, and viewing them through special glasses with oppositely colored lenses, so that each eye sees only the corresponding image.

The most common color pair is red-cyan. Anaglyph stereo does not require a special display, and anaglyph glasses are very cheap, but the resulting image quality is rather low. Moreover, traditional anaglyph cannot display the full visible color range (although a patented technique has been developed to provide perceived full-color viewing with simple colored glasses [49]).
Figure 9: Paper anaglyph glasses [48].<br />
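The channel filtering behind red-cyan anaglyphs can be sketched in a few lines (an illustrative toy, with images represented as nested lists of RGB tuples):

```python
def anaglyph(left, right):
    """Compose a red-cyan anaglyph: the red channel of every pixel is
    taken from the left image, while green and blue come from the
    right image, so red-cyan lenses deliver each view to the
    corresponding eye. This is the simplest variant; refined versions
    rebalance the channels to reduce retinal rivalry."""
    return [
        [(l[0], r[1], r[2]) for l, r in zip(lrow, rrow)]
        for lrow, rrow in zip(left, right)
    ]

# One-pixel "images": a reddish left view and a teal-ish right view.
out = anaglyph([[(200, 10, 10)]], [[(30, 120, 90)]])
```

The loss of color fidelity mentioned above is visible directly in the code: two of the left view's channels and one of the right view's are simply discarded.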
A more complex and better-performing passive stereo technology is based on differently polarized light. Two projectors display the two images using orthogonally polarized light, and the images are viewed through glasses with orthogonally polarized lenses. Each lens lets through light having the same direction of polarization, while filtering out all light whose polarization is orthogonal. Thus, each eye sees the corresponding image, in full color but at half its brightness. Polarized glasses are relatively cheap and the resulting image has good quality. For these reasons, polarized stereo is commonly used in cinemas for 3D movies.
Active stereo is based on the use of more complex visualization devices. The two most notable examples are shutter glasses and Head-Mounted Displays (HMDs). Shutter glasses work by alternately displaying the left and right images on the same display at a very high frequency, while alternately occluding the left and right eyes in sync with the display. This way, each eye sees only the corresponding image. If the alternating frequency is sufficiently high, the brain fuses the images into two continuous streams.
Stereo-enabled HMDs use a separate display for each eye, so that the eyes actually see two different video streams. Active stereo devices usually provide an image quality superior to passive stereo, although they are much more expensive.

Figure 10: (a) CristalEyes shutter glasses. (b) Emagin Z800 HMD. [50]
Some technologies have been developed to build autostereoscopic displays, which do not require the user to wear glasses in order to view stereo images [51]. However, some of these technologies are still very expensive, while the others provide very low image quality.
3 Augmented reality visual interfaces in robot teleoperation

Examples of the use of augmented reality in telerobotics are numerous in the literature, regarding both telemanipulation and mobile robot teleguide.

This section contains a review of the current state of the art in visualization techniques for teleguide interfaces based on augmented reality and sensor fusion. Major contributions are summarized, and their main points are highlighted and discussed.
3.1 A sensor fusion based user interface for vehicle teleoperation

The work of Meier et al. [24] describes a sensor fusion technique for mobile teleoperation which uses different sensors in a complementary manner, balancing their respective strengths and weaknesses.
A brief analysis of sensor fusion for teleoperation is carried out. Sensor fusion needs to be human-oriented, and the representation of the data has to be accessible and understandable. Fusing data in a single display, rather than representing each sensor in a different display, makes perception quicker and reduces cognitive workload. The most important kind of information that sensor fusion can show to an operator driving a mobile robot is depth information. This work considers color intensity the most efficient way to deliver this kind of information to a human; in particular, the HSV color model [52] is considered the one that best mimics human color perception.
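Such a color coding can be sketched as a mapping from distance to an HSV hue ramp from red (near) to blue (far); the hue range and clipping distance below are illustrative choices, not the parameters used in [24]:

```python
import colorsys

def depth_to_rgb(distance, d_max=5.0):
    """Map a distance in metres to an RGB colour by sweeping the HSV
    hue from red (hue 0, near) to blue (hue 2/3, far) at full
    saturation and value. d_max is an assumed clipping distance."""
    d = max(0.0, min(distance, d_max))
    hue = (d / d_max) * (2.0 / 3.0)
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    return (int(round(r * 255)), int(round(g * 255)), int(round(b * 255)))

near = depth_to_rgb(0.0)   # red: an obstacle right in front of the robot
far = depth_to_rgb(5.0)    # blue: at or beyond the clipping distance
```

Varying the hue while keeping saturation and value constant leaves the underlying video brightness legible, which is one reason hue ramps are popular for depth overlays.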
The teleoperation system described in the paper uses a stereo vision system, a ring of ultrasonic sonars, and odometric sensors; sensor data are processed by a Kalman filter. The teleoperation interface contains a display showing the video stream from the robot cameras and a two-dimensional map of the environment, gradually created as an occupancy grid using Histogramic In-Motion Mapping [53].

Depth information is overlaid on the video display as a layer composed of differently colored pixels. Since stereo has a higher angular resolution than sonar, it is used by default to create the overlay. Sonar is used instead in regions of the image where stereo disparity is not reliable, i.e. regions with scarce texturing or where the sonar detects very close objects. A grid is projected on the region of the image identified as ground, in order to improve distance estimation (figure 11a). The two-dimensional map (figure 11b) is created by combining sonar distance data with stereo disparity; disparity is calculated along a horizontal line taken at a chosen height in the stereo images.

Figure 11: (a) Image display processing. (b) Two-dimensional map of the environment. [24]
Main points:

• Using color as an immediate and efficient means to convey information

• Using geometric overlays to enhance distance estimation

• Sensor fusion balances the weaknesses of individual sensors
3.2 Fusion of laser and visual data for robot motion planning and collision avoidance

The paper by Baltzakis et al. [54] proposes a SLAM (Simultaneous Localization and Mapping) algorithm based on the fusion of 2D laser range data and stereo visual data. The proposed method uses stereo disparity to correct laser measurements where they are evidently wrong.
The algorithm initially creates a 3D model of the environment as a series of vertical walls based on the 2D laser scan. This model necessarily omits all objects that do not intersect the plane of the laser scan, since they cannot be detected by the laser sensor.

Then, the pixels of one of the stereo images are ray-traced to the 3D model, and the 3D coordinates corresponding to each pixel are obtained. Finally, the algorithm re-projects each pixel onto the second image. If the attributes (color, intensity, etc.) of the pixel in the second image are similar to those of the corresponding pixel in the first image, the value measured by the laser for that pixel is assumed to be correct. If instead the attributes of the pixels differ, the pixels are assumed to belong to an object that is nearer or further than the distance measured by the laser. In this case, a distance estimate based on the disparity between the images is computed. Range estimates are accumulated on a 2D occupancy grid (in order to reduce the inaccuracy deriving from image noise or lack of texture).
A simple collision avoidance algorithm, using the mapping method<br />
just exposed, is presented. The algorithm is tested both in artificial and<br />
in real environments, showing to have good results in aiding navigation.<br />
Main points:<br />
• 3D map of the environment based on range sensors and video<br />
• Using stereo to correct and integrate data from range sensors<br />
3.3 Using augmented reality to interact with an<br />
autonomous mobile platform<br />
The work of Giesler et al. [21] presents an AR- and speech-based technique to quickly and intuitively program paths for a mobile robot in a large environment.
The operator who programs the robot needs an HMD and a tool (a “magic wand”), both of which have to be tracked around the environment where the paths will be set up. The operator may define and view paths in the form of nodes, which correspond to points on the ground, and edges, straight lines which connect pairs of nodes (figure 12). Nodes and edges are created by pointing to the ground with the wand and issuing verbal commands (e.g. “Connect this node...”, “...with this node”) and are visualized in the HMD worn by the operator.
Figure 12: The robot follows AR path nodes and redirects when an obstacle is in the way. [21]
The operator may issue commands to the mobile robot in the same manner. It is possible to command the robot to move from one node of the graph to another, or to move autonomously between two given points on the ground. When the robot has to navigate within the graph, it automatically calculates the shortest sequence of edges between the start node and the end node. If it detects an obstacle on the path it has just chosen, it calculates an alternative path through the edges of the graph. Once the robot has chosen a path, the nodes and edges belonging to it are depicted with a different color, so that the operator can see which path the robot is going to take.
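The re-planning behaviour described above can be sketched as a standard shortest-path search over the node graph; the graph structure, function names and the modelling of obstacles as blocked edges are illustrative assumptions, not details of Giesler et al.'s implementation:

```python
import heapq

def shortest_path(graph, start, goal, blocked=frozenset()):
    """Dijkstra over the node graph; edges in `blocked` (frozensets of
    their endpoints) are skipped, modeling re-planning around obstacles.
    `graph` maps node -> {neighbor: edge_length}."""
    queue, seen = [(0.0, start, [start])], set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return path
        if node in seen:
            continue
        seen.add(node)
        for nxt, length in graph[node].items():
            if frozenset((node, nxt)) not in blocked:
                heapq.heappush(queue, (cost + length, nxt, path + [nxt]))
    return None  # no path through the remaining edges
```

When an obstacle is detected on an edge, the same search re-run with that edge blocked yields the alternative path that the interface would then recolor for the operator.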
Main points:<br />
• AR used as an efficient method to convey data about the environment (e.g. node positions)
• AR used as an efficient method to exchange information with the<br />
robot<br />
3.4 Improved interfaces for human-robot interaction<br />
in urban search and rescue<br />
The work of Baker et al. [9] proposes several modifications to the INEEL interface [22, 55] for telerobotics in urban search and rescue (USAR). The modifications are designed to decrease the complexity and increase the usability of the interface for non-experienced users. This work is based on the results of several past works by the same authors [10, 11], which analyse several teleguide interfaces used in international competitions and outline their strengths and weaknesses.
Most of the modifications aim to reduce the cognitive workload imposed by the interface. For example, the pan and tilt angles of the camera are indicated by the position of a light cross overlaid on the video display, rather than by separate meters. Proximity/collision indicators are visualized as colored blocks around the video display, and each of them becomes visible only when an obstacle in the corresponding direction is sufficiently near. Rarely consulted information (e.g. battery charge) is treated as a system alert and visualized only when necessary. The environment map is placed at the same level as the video display, so that shifting attention from the video display to the map and vice versa is not tiring for the operator (figure 13). As future work, the authors indicate the possibility of fusing heat, sound and CO2 sensor data into a color map overlaid on the video display.
Since it has been shown that most collisions happen at the rear of the robot, a rear camera is included and its video stream is displayed above the main video display (like a rear-view mirror in a car).
Figure 13: Modified INEEL interface. [9]
Main points:
• Integration of sensor data in the same window to reduce cognitive workload
• Sensor data representation should be:
– non-invasive;
– quickly comprehensible (e.g. resembling known/conventional symbols).
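The visibility logic of the proximity/collision indicators can be sketched as follows; the direction names and distance thresholds are illustrative assumptions, not values from the paper:

```python
# Hedged sketch of the proximity indicators in Baker et al.: a colored
# block around the video display becomes visible only when an obstacle
# in the corresponding direction is sufficiently near.
WARN_DIST = 1.0    # meters: indicator appears below this distance (assumed)
DANGER_DIST = 0.4  # meters: indicator turns red below this distance (assumed)

def proximity_indicators(distances):
    """Map {direction: distance_m} to {direction: color} for the
    indicators that should currently be drawn; far obstacles yield
    no indicator at all, keeping the display uncluttered."""
    shown = {}
    for direction, dist in distances.items():
        if dist < DANGER_DIST:
            shown[direction] = "red"
        elif dist < WARN_DIST:
            shown[direction] = "yellow"
    return shown
```

Hiding indicators for distant obstacles follows the paper's principle of showing rarely needed information only when it becomes relevant.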
3.5 Ecological interfaces for improving mobile robot<br />
teleoperation<br />
The work of Nielsen et al. [26] describes an interface for mobile robot teleoperation based on ecological interface design [56] and augmented virtuality. Different versions of the same interface are compared, showing that integrating sensor data gives better results for navigation than displaying data separately.
The presented interface displays a map of the environment reconstructed from range sensors (laser, sonar) together with a video image from the remote site. The 2D version of the interface shows video and map side by side; the 3D version instead shows a 3D model of the robot within a 3D representation of the map. The 3D map is created by elevating obstacles to a fixed height. The viewpoint is positioned a little behind the robot, and video data is visualized in a window in front of the robot model (figure 14).
Figure 14: 3D interface presented in [26].<br />
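The fixed-height elevation of obstacles described above can be sketched as a simple extrusion of occupied grid cells into boxes; the cell size and wall height are assumed values, not those of Nielsen et al.'s system:

```python
# Minimal sketch: each occupied cell of a 2D occupancy map is elevated
# to a fixed-height box, producing the walls of the 3D representation.
CELL = 0.25        # meters per grid cell (assumed)
WALL_HEIGHT = 1.0  # meters, fixed elevation of obstacles (assumed)

def extrude_map(grid):
    """grid: list of rows of 0/1 occupancy values. Returns axis-aligned
    boxes as (xmin, ymin, zmin, xmax, ymax, zmax) tuples, one per
    occupied cell, ready to be rendered as walls around the robot model."""
    boxes = []
    for j, row in enumerate(grid):
        for i, occupied in enumerate(row):
            if occupied:
                boxes.append((i * CELL, j * CELL, 0.0,
                              (i + 1) * CELL, (j + 1) * CELL, WALL_HEIGHT))
    return boxes
```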
Tests performed on the different versions prove that the 3D version of the interface consistently yields better results than the 2D version. Moreover, it is shown that operators who use the 2D version do not benefit from having both video and map, since the two displays compete for their attention. The difference in performance is explained by the 3D version complying with three important principles of HRI: 1) presenting a common reference frame; 2) providing visual support for the correlation between action and response; 3) allowing an adjustable perspective.
Main points:<br />
• 3D map of the environment based on range sensors<br />
• Integration of sensor data in the same window reduces cognitive<br />
workload<br />
• More information does not necessarily imply better performance<br />
3.6 Egocentric and exocentric teleoperation interface<br />
using real-time, 3D video projection<br />
The paper of Ferland et al. [25] presents an augmented-virtuality-based interface for mobile robot teleoperation. Like the one described in [26], it displays data from range and video sensors. In addition, it makes use of different projection methods for the video image in order to increase the quality of the information provided.
The sensors used consist of a laser range sensor and a pair of stereo cameras. The laser is used to build a global 2D map of the environment. The operator interface displays the map as a 3D environment, visualizing the obstacles detected by the laser as fixed-height walls. A 3D model of the robot is displayed within the 3D environment. Two viewpoints are available to the operator: an egocentric viewpoint, coincident with the position of the stereo camera (figure 15a), and an exocentric viewpoint, freely positionable in the zone behind and above the robot (figure 15b).
Figure 15: Egocentric (a) and exocentric (b) viewpoints of the interface presented<br />
in [25].<br />
The video image is mapped onto the 3D environment using one of two projection methods. The laser-based method first projects the 3D mesh of the environment onto the left video image frame, then simply maps the single left image onto the resulting vertices. The stereoscopic method uses the disparity values from the stereo camera to project the stereo image into 3D space, then maps it onto the mesh using a set of OpenGL Shading Language [57] fragment shaders.
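The first step of the laser-based method (projecting mesh vertices onto the image to obtain texture coordinates) can be sketched with the pinhole model of section 2.2; the intrinsic parameters below are illustrative values, not the calibrated camera's:

```python
import numpy as np

# Hedged sketch: each 3D vertex of the environment mesh, expressed in
# the camera frame (z pointing forward), is projected onto the left
# image plane; the resulting pixel coordinates serve as that vertex's
# texture coordinates. Intrinsics are assumed values.
FX = FY = 400.0        # focal lengths in pixels (assumed)
CX, CY = 320.0, 240.0  # principal point (assumed)

def texture_coords(vertices_cam):
    """vertices_cam: (N, 3) array of points in the camera frame.
    Returns an (N, 2) array of pixel coordinates (u, v)."""
    v = np.asarray(vertices_cam, dtype=float)
    u = FX * v[:, 0] / v[:, 2] + CX
    w = FY * v[:, 1] / v[:, 2] + CY
    return np.stack([u, w], axis=1)
```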
Testing results show that both the egocentric and the exocentric points of view are considered useful by most users. Most of the time, viewpoints positioned a little behind the robot are used; vertical, bird's-eye-like viewpoints are preferred in tight navigation situations or to obtain a global view of the map. The laser mapping proves to be the most useful source of information for navigation; laser-based projection is also considered useful, unlike stereoscopic projection, which is too sensitive to the quality of disparity data.
Main points:<br />
• 3D map of the environment based on range sensors<br />
• It is necessary to design a reliable method for image projection<br />
in the virtual workspace<br />
3.7 Summary and analysis<br />
The works presented above highlight several benefits provided by sensor fusion. Sensor fusion techniques help to balance the strengths and weaknesses of different types of sensors and to retrieve more reliable information from the robot and the surrounding environment [24, 54]. Besides, fused sensor data can be displayed to the user in a unified form.
Unified sensor representation has many advantages with respect to visualization in separate displays. Presenting data inside a single display, within a common reference frame, avoids competition for the user's attention. Interfaces that visualize different sensor data separately force operators to continuously switch between different displays, reference frames and visualization modalities. A unified representation prevents this switching, thus strongly reducing the user's cognitive workload [9, 26].
Augmented reality is a form of unified representation which presents a further advantage: visualizing complex data (such as positions and paths in [21]) as a graphic overlay on an image of the real world permits a faster and more intuitive interpretation by a human operator.
Several approaches to AR-based representation of visual and range data in telerobotics have been described. Some of them ([9, 24]) use bidimensional augmentations of the video image, exploiting the color of these overlays as a quick and effective way to communicate a distance measure to the user. However, since bidimensional overlays display information only on a single plane, their capacity to communicate a depth value is intrinsically limited.
Other approaches [25, 26] create a bidimensional map of the environment using laser data, and display a 3D representation of the map by elevating virtual 3D walls. This approach has several advantages with respect to using 2D overlays. First, a 3D map usually looks more realistic, and can communicate depth in a more intuitive way because of monocular depth cues. Besides, while 2D overlays display raw range information and leave to the user the responsibility of deducing the shape of the environment, the 3D approach relieves the user of this work by presenting range data in a more quickly understandable form, namely a 3D map.
The drawback of the described approaches is the poor integration between laser and visual data in the user interface. In [26] the video image is visualized on the display, but no correspondence between elements in the image and the laser-generated map is established; the user must therefore manually associate obstacles in the video image with obstacles in the laser map. In [25], instead, the correspondence between laser and video is automatically calculated through projection (see section 3.6). However, the quality of the projection depends strongly on laser-camera calibration, on the correctness of laser measurements and, in the case of stereoscopic projection, on disparity data. Since laser-camera calibration always involves a certain degree of inaccuracy, the laser sensor can miss some objects (e.g. low or transparent objects), and disparity data depends strongly on environment features, we consider these requirements too strict to be enforced in the general case.
4 Previous work on 3MORDUC teleoperation<br />
The 3MORDUC (3rd version of the MObile Robot DIEES University of Catania) is a wheeled mobile robot located at the DIEES (Dipartimento di Ingegneria Elettrica, Elettronica e dei Sistemi) of the University of Catania [58]. It has been used for several years to perform research work in the field of telerobotics and teleguide visual interfaces.
This section gives a brief description of the robotic platform and reviews past work in which it has been involved; the main issues emerging from this work are then outlined.
4.1 The 3MORDUC platform<br />
The 3MORDUC uses two Maxon F2260 motors (40 W DC) for movement. The motors are connected to two rubber wheels through a shaft. A castor wheel is employed to facilitate turning. Two lead batteries (12 V/18 Ah) provide an autonomy of about 30-40 minutes.
Figure 16: The 3MORDUC platform<br />
Several sensors on board monitor the workspace and the robot state.<br />
Here we give a brief description of these sensors.<br />
Laser scanner A Sick LMS200 laser measurement sensor system (figure 17) is mounted on the front part of the 3MORDUC. The LMS operates by emitting a pulsed laser beam in a given direction. The reflected pulse is received and registered, and the distance between the robot and the obstacle which reflected the pulse is estimated by measuring the time of flight of the laser light. The procedure is repeated for several different directions on a plane, generating a scan of the surroundings of the sensor. It is possible to configure the angular resolution (0.25°, 0.5°, 1°) and the maximum scan angle (100°/180°). Each scan is executed in clockwise mode. Measurement data are available in real time for further evaluation via an RS232/RS422 serial interface.
Figure 17: The Sick LMS200 laser sensor.
Laser sensors are usually very accurate (each distance measure has an accuracy of a few millimeters) and reliable. However, they can be deceived by transparent or very dark surfaces, which do not adequately reflect the laser light back to the receiver and thus generate outliers. Besides, laser sensors obviously cannot detect objects which do not intersect their scan plane.
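The time-of-flight principle and the polar interpretation of a scan can be sketched as follows; the scan geometry (start angle, step) is an illustrative configuration, not necessarily the one used on the robot:

```python
import math

# Sketch of how an LMS200 scan can be interpreted: time of flight gives
# the range, and the configured angular resolution gives each beam's
# bearing within the scan.
SPEED_OF_LIGHT = 299_792_458.0  # m/s

def tof_to_range(t_seconds):
    """The pulse travels to the obstacle and back, hence the factor 2."""
    return SPEED_OF_LIGHT * t_seconds / 2.0

def scan_to_points(ranges, start_deg=-90.0, step_deg=1.0):
    """Convert a list of ranges into (x, y) points in the sensor frame,
    assuming a sweep starting at `start_deg` with `step_deg` resolution."""
    pts = []
    for i, r in enumerate(ranges):
        a = math.radians(start_deg + i * step_deg)
        pts.append((r * math.cos(a), r * math.sin(a)))
    return pts
```

A 20 ns round trip, for instance, corresponds to an obstacle about 3 m away, which illustrates the timing precision such sensors require.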
Stereo cameras The STH-MDCS2-VAR-C (figure 18) is a low-power compact digital stereo camera system, which can be connected to a PC via IEEE 1394. Each camera has a resolution of 1.3 megapixels and is equipped with a fixed-focus lens (4.5 mm). The CCD sensors of the cameras provide good noise immunity. Capture parameters (e.g. exposure gain, frame rate, resolution) are adjustable.
The cameras are mounted on a rigid support, which permits setting the camera baseline to any value in the range 5-20 cm. Their optical axes are kept parallel. The cameras are positioned on the top layer of the robot, about 95 cm above the ground. They point in the direction in front of the robot and are slightly tilted towards the ground.
Figure 18: The STH-MDCS2-VAR-C stereo cameras.<br />
Encoders An incremental rotary encoder with a resolution of 500 pulses/turn is placed on each wheel of the robot. Incremental encoders convert movement into a sequence of digital pulses. The movement/rotation of the robot with respect to a given start position/orientation can be calculated by counting the pulses generated by each encoder.
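This pulse-counting computation can be sketched with the standard differential-drive odometry update; the wheel radius and track width below are assumed values, not the 3MORDUC's actual dimensions:

```python
import math

# Hedged sketch of odometry from the incremental encoders: pulse counts
# are converted to wheel travel, then a differential-drive update gives
# the new robot pose. Geometric parameters are illustrative.
PULSES_PER_TURN = 500
WHEEL_RADIUS = 0.10  # meters (assumed)
TRACK = 0.40         # distance between the wheels in meters (assumed)

def update_pose(x, y, theta, pulses_left, pulses_right):
    """Integrate one odometry step; returns the new (x, y, theta)."""
    dl = 2 * math.pi * WHEEL_RADIUS * pulses_left / PULSES_PER_TURN
    dr = 2 * math.pi * WHEEL_RADIUS * pulses_right / PULSES_PER_TURN
    d = (dl + dr) / 2.0          # distance travelled by the robot center
    dtheta = (dr - dl) / TRACK   # rotation around the center
    return (x + d * math.cos(theta + dtheta / 2.0),
            y + d * math.sin(theta + dtheta / 2.0),
            theta + dtheta)
```

With equal pulse counts on both wheels the robot advances in a straight line; unequal counts produce the rotation term.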
Proximity sensors A belt of 8 SRF08 sonar sensors is positioned around the robot. Sonars measure the distance from obstacles by calculating the time of flight of a reflected sonic signal originally produced by the vibration of a piezoelectric transducer. The sonar field of view has a conic shape, so the sensitive area increases with the distance from the robot. For this reason, sonars have a far lower angular resolution than the laser sensor. Furthermore, an inhibition time is necessary between the generation of the sonic signal and its reception, which introduces a lower limit on measurable distances.
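As a small worked example of this lower limit (the inhibition time used below is an illustrative value, not the SRF08's actual figure):

```python
# Sonar ranging: the emitted pulse travels to the obstacle and back, so
# the one-way distance is half the round trip. An echo arriving while
# the receiver is still inhibited is lost, which bounds the minimum range.
SPEED_OF_SOUND = 343.0  # m/s in air at about 20 °C

def sonar_range(t_flight):
    """Round-trip time of flight to one-way distance in meters."""
    return SPEED_OF_SOUND * t_flight / 2.0

def minimum_range(t_inhibit):
    """Shortest measurable distance given the receiver inhibition time."""
    return sonar_range(t_inhibit)
```

With an illustrative inhibition time of 1 ms, obstacles closer than about 17 cm could not be measured.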
A belt of bumpers (16 switches) is mounted around the entire perimeter of the robot base, just above wheel level. These sensors recognize collisions and reduce the resulting damage.
4.2 Mobile robotic teleguide based on video images<br />
The work of Livatino et al. [59] performs a systematic evaluation of the impact of different stereoscopic visualization modes on performance in telerobotics tasks. The paper describes the design of the evaluation experiment and presents and analyses its results.
The experiment involved 12 participants. Each of them executed a simple teleguide task, which consisted in teleoperating the 3MORDUC platform (located at DIEES, Catania, Italy) from Aalborg University (Denmark). The participants were able to visualize the video data from the 3MORDUC cameras. Each participant executed the task using two different visualization setups: a 15” laptop and a 2 × 2 projected wall display. Besides, the task was executed twice for each setup, using monoscopic and stereoscopic visualization respectively. In the laptop setup, stereoscopic visualization used colored anaglyphs, while in the wall display setup it used polarized projection.
A set of both qualitative and quantitative parameters was evaluated during the trials. A two-way ANOVA (ANalysis Of VAriance) was performed to measure the statistical significance of the quantitative results. The results show that stereo visualization introduces a significant reduction of the collision rate. This is because stereo visualization strongly enhances the sense of depth of the visualized scene. Furthermore, realism and the user's sense of presence in the remote environment are higher than with monoscopic visualization.
As regards the comparison between the laptop and wall display setups, it has been shown that in the laptop setup users benefit from a stronger depth perception and obtain a lower number of collisions. The wall display, instead, causes a wider use of peripheral vision; it thus generates a higher sense of presence and confidence, which leads to higher mean speeds.
Main points:<br />
• Stereoscopic visualization enhances collision avoidance<br />
• Stereoscopic visualization increases realism and sense of presence<br />
4.3 Depth-enhanced mobile robot teleguide based<br />
on laser images<br />
The work of Livatino et al. [60] performs a systematic evaluation analogous to the one described in [59]. However, the evaluated teleguide interface displays synthetic images generated from laser scans instead of real camera images. In this telerobotic system, laser data is processed on the robot to construct, in real time, 2D maps of the workspace surrounding the robot. A 3D representation is extrapolated from the 2D maps by elevating wall lines and obstacle posts. A current front view of the robot workspace is then generated and displayed to the user using graphical software (figure 19). The teleoperation task to be executed by the participants and the visualization setups used during the experiment were the same used in [59].
Figure 19: The process of generating 3D graphical environment views from laser range information. The top-left image shows a 2D floor map generated by the laser sensor. The bottom-left image shows a 3D extrapolation of a portion of it. The right image shows a portion of the workspace visible to a user during navigation. [60]
As regards the stereo-mono and laptop-wall comparisons for the laser-based visualization interface, the results obtained are analogous to the ones exposed in [59]. It can be deduced that stereoscopic visualization permits a significant decrease in collision rates regardless of whether the interface is visual-based or laser-based. Besides, it is shown that participants using the laser-based interface perform better in terms of completion time. This is supposed to be due to the real-time performance provided by the laser-based interface. In fact, since visual data requires a significant bandwidth to be transmitted, the average delay between the display of two consecutive video images is about one second; as exposed in [61], this strongly decreases teleoperation performance. Laser data, instead, requires a much smaller bandwidth than visual data, so a teleoperation client can receive and process it in real time, thus increasing the operator's performance in terms of average speed.
Main points:<br />
• Stereoscopic visualization benefits are present also in the case of<br />
laser-generated images<br />
• Laser data can be used in real time to increase performance
4.4 Augmented reality stereoscopic visualization<br />
for intuitive robot teleguide<br />
The paper of Livatino et al. [13] proposes a methodology for fusion of<br />
laser and visual data in a teleoperation interface. This methodology<br />
exploits augmented reality to realize a coherent and intuitive visualiza-<br />
tion of integrated data, and uses stereoscopy to increase teleoperation<br />
efficiency.<br />
The interface presented in this work represents laser data as virtual<br />
overlays on the video images received by the robot cameras. Three<br />
different kinds of virtual overlays are used:<br />
• proximity planes, semi-transparent colored layers superimposed<br />
on the objects within the scene (figure 20a);<br />
• rays, colored lines departing approximately from the camera po-<br />
sition and reaching the closest objects (figure 20b);<br />
• distance values, indications of the absolute distance between the<br />
robot and the objects (figure 20b).<br />
The virtual overlays are colored depending on the distance between the robot and the real objects to which they correspond: red overlays correspond to the nearest objects, yellow overlays to objects at medium distances, and green overlays to the furthest objects.
The laser measures are linearly mapped to image pixels between the left and the right margins of the image. A semi-automatic calibration permits the user to adjust the first and the last mapped angles. Then, edge detection is executed on the image in order to locate the bases of the objects in the image (by taking the first edge pixels from the bottom of the image) and to vertically align the virtual overlays with the real objects.
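Two elements of this mapping can be sketched as follows; the margin angles, image width and distance thresholds are illustrative values, not those of the calibrated system in [13]:

```python
# Hedged sketch of (a) the linear mapping of laser beam angles to image
# columns between the two calibrated margin angles, and (b) the
# distance-to-color coding used for the overlays.

def angle_to_column(angle, first_angle, last_angle, width):
    """Linearly map a laser angle to an image column in [0, width-1]."""
    t = (angle - first_angle) / (last_angle - first_angle)
    return int(round(t * (width - 1)))

def overlay_color(distance, near=0.5, far=2.0):
    """Red for the nearest objects, yellow for medium distances,
    green for the furthest, as in the proximity planes of [13];
    the `near`/`far` thresholds are assumed values."""
    if distance < near:
        return "red"
    if distance < far:
        return "yellow"
    return "green"
```

The semi-automatic calibration described above amounts to letting the user adjust `first_angle` and `last_angle` while watching the overlay move over the live image.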
Figure 20: (a) Proximity planes overlaid on the image. (b) Rays and distance values<br />
overlaid on the image. [13]<br />
A pilot test has been carried out by teleoperating the 3MORDUC from the 3D Visualization and Robotics Lab at the University of Hertfordshire, using both monoscopic and stereoscopic visualization. Although the results have been encouraging as regards the use of augmented reality and the semi-automatic calibration, the system has proven not to be ready for stereoscopic visualization yet. Since edge detection is performed on the left and right images independently, it generates different results (especially if the quality of the images is low and artifacts are present). This often causes the drawing of non-corresponding virtual overlays, which need to be detected and deleted/recomputed before rendering.
Main points:<br />
• effective methodology to integrate laser and visual data in a co-<br />
herent representation<br />
• use of different AR features to highlight distances from objects<br />
• necessity of a technique for excluding non-corresponding measures
4.5 Summary and analysis<br />
In [59] and [60] two different approaches to visualization interfaces for<br />
mobile robot teleoperation are described. The visual-based approach<br />
consists in displaying a (mono or stereo) video stream from the remote<br />
site, while the laser-based approach consists in displaying a synthetic<br />
view of the robot workspace generated from laser range data.<br />
Each approach has its strengths and weaknesses. Visual data is rich in contrast, and provides a large amount of information and a wide field of view to the operator; however, the massive quantity of data to be transmitted implies a delay between the visualization of consecutive frames when the available bandwidth is low. Laser-generated images are much poorer in detail than video, but they can be generated and visualized in real time.
The works described above show that both visualization methods greatly benefit from stereoscopic visualization. Stereo increases the operator's sense of presence in the remote environment and the perceived sense of depth, thus increasing driving accuracy.
Livatino et al. [13] introduce an innovative methodology to join the advantages of these two approaches by using augmented reality. Colored overlays are used to fuse visual and laser data into a unique, coherent and intuitive representation. However, applying stereoscopic visualization to this methodology has proven not to be straightforward: a technique has to be developed to reconcile the results of left and right image processing in order to obtain a correct stereo rendering.
5 Proposed method: AR stereoscopic visualization<br />
5.1 Core idea and motivation<br />
The purpose of this work has been to design and implement a laser-and-vision-based visualization approach for mobile robot teleguide. The proposed approach is meant to fully exploit the benefits provided by augmented reality and stereoscopic visualization described in the previous sections in order to assist robot navigation.
The proposed visualization approach has been developed with the<br />
following aims:
1. the approach should appropriately communicate distance information from a laser sensor to the user; laser data should be represented in a way as intuitive and easy to interpret as possible;
2. the approach should be fully applicable to both monoscopic and stereoscopic visualization setups; it should be flexible and perform well even using a single camera image; at the same time, it should be designed to take advantage of stereo visualization features, where stereo is available;
3. the approach should avoid any useless increase of the operator's cognitive workload; the interface design should be such that the operator does not need to frequently shift attention between different elements, and teleguide should be as little tiring as possible;
4. the approach should be robust with respect to sensor inaccuracy and errors; the visualization method should perform well even when sensor data is noisy or incorrect (e.g. in the case of invalid disparity or laser outliers).
In order to achieve the above aims, techniques described in the literature can be used and improved. Specifically, the developed interface is based on:
Augmented reality As described in section 3, augmented reality is an extremely convenient method for the representation of sensor data. Since it integrates sensor and visual data inside a single display, competition for user attention is avoided. Moreover, if sensor data are represented as immersed in the real workspace, correlating sensor data with real objects becomes easy and intuitive. In the case of a unified laser-video representation, laser distance data can be visualized directly on the corresponding zones of the camera image, thus giving a depth dimension to the image.
3D overlays Differently from bidimensional overlays, 3D objects can be rendered so as to look nearer to or further from the viewpoint. The depth of 3D graphical objects can be represented through stereo visualization or, where a single camera is available, through monocular depth cues (e.g. perspective, occlusion). Therefore, 3D objects are ideal for communicating depth information.
Colors Since colors are a very effective means to convey information to humans, they can be used to make data interpretation faster and more intuitive. As in [9, 13, 24], the proposed visualization approach associates different colors to different distance values.
Image processing As described in [54], image processing can be used to retrieve distance information. This information can be integrated with laser measures to increase range data reliability.
5.2 Research development strategy<br />
The visualization approach introduced in the previous section has been implemented within the MOSTAR (MORDUC teleguide through STereoscopic Augmented Reality) interface for teleoperation of the 3MORDUC platform.
The development of the MOSTAR interface has been divided into four main steps. This section gives a brief overview of these steps and of their main issues. Sections 6 to 9 describe each step in detail.
Figure 21: Diagram of development and implementation steps: design of the AR-based sensor fusion visualization (choice of 3D objects and color range, design of the edge detection and nearest-objects detection algorithms), definition of the laser-camera model and calibration procedure, design of the stereo alignment method and stereo correspondence algorithm, development of the AR-based interface and its extension for stereoscopic visualization, and testing.
5.2.1 Design of an AR-based sensor fusion visualization<br />
The first step in the development of the MOSTAR interface has been the definition of a method to convert laser data into a set of graphical objects to be overlaid onto the camera images.
Colored 3D virtual objects have been chosen to represent laser data on the image. Virtual objects are positioned within the virtual 3D workspace according to the laser measures, and they are rendered as a semi-transparent overlay above the camera image. It has been necessary to select appropriate 3D objects to represent laser data, and colors suitable for mapping distance in an intuitive way (section 6.1).
As shown in section 3, determining the most effective visualization method for complex data like those provided by a mobile robot is not a straightforward issue. It is necessary to take into account numerous factors, among which the specific application context and the user's particular preferences. For example, a 2D bird's-eye-view map of the robot workspace can be a very effective visualization method for the exploration of an environment, but it would usually not be sufficient for obstacle avoidance manoeuvres. Therefore, several representation modes have been designed for the MOSTAR interface, and several brief tests have been performed in order to determine the strengths and weaknesses of each mode.
Furthermore, an algorithm to refine the graphical appearance of the<br />
overlay by detecting potential laser outliers has been developed (section<br />
6.2).<br />
5.2.2 Definition of a laser-camera model and a calibration<br />
procedure<br />
A calibration procedure is clearly needed in order to correctly align<br />
virtual objects defined in section 6 with the real objects in the camera<br />
images.<br />
Several approaches that permit automatic determination of the intrinsic<br />
parameters of a camera, and of its extrinsic parameters with respect to<br />
a world coordinate system, have been proposed in the literature [62, 63].<br />
These approaches calculate the parameters by analysing a set of chosen<br />
calibration images, and guarantee the optimality of the parameters in<br />
terms of accuracy through several statistical techniques. However, they<br />
require rather long and complicated calibration procedures, which have<br />
to be repeated every time the camera and/or the laser sensor is moved,<br />
if the alignment accuracy is to be maintained.<br />
For these reasons, a semi-automatic feedback-based calibration has<br />
been preferred for the MOSTAR interface. This kind of calibration<br />
consists in manually varying a restricted set of parameters while seeing<br />
the results of these variations in real time. In other words, while the<br />
calibration parameters are adjusted, the virtual overlay is redrawn<br />
according to the currently selected values. The user can gradually align<br />
the AR overlay with the objects in the image, before or during the<br />
teleguide, and can reach a good degree of accuracy in a few minutes<br />
of adjustments, without any particular effort. Section 7 describes the<br />
developed feedback-based calibration procedure.<br />
The alignment precision is slightly inferior to what may be obtained<br />
with an automatic procedure, but it has proven to meet the requirements<br />
in most cases, and can be increased by exploiting image processing (see<br />
section 8).<br />
5.2.3 Development of a method for integration of image features<br />
Image processing has been employed within the MOSTAR interface for<br />
two different purposes: to improve the alignment of the overlay with<br />
the camera image, and to increase the reliability of the sensor data by<br />
detecting possible erroneous measurements of the laser sensor.<br />
The image processing technique chosen for the MOSTAR interface<br />
is edge detection. Analysis of the edges inside the images of the robot<br />
workspace has been used to detect walls and potential obstacles. A<br />
technique has been implemented for finding the nearest objects that<br />
the robot is facing, by identifying the borders of the bases of these<br />
objects (section 8.2).<br />
Once the edges of nearest objects are detected, it is necessary to<br />
integrate them with laser data somehow. Two techniques of integration<br />
have been developed and implemented:<br />
• a technique employing edges to improve camera alignment (sec-<br />
tion 8.3);<br />
• a technique employing edges to correct wrong laser measurements<br />
and identify obstacles which are invisible to the laser (section 8.4).<br />
5.2.4 Extension of the AR interface to stereoscopic visualization<br />
As explained in section 4, stereoscopic visualization helps to reduce<br />
collisions during robot teleguide by enhancing the user's depth<br />
estimation. Since stereo can positively influence the visualization of<br />
both real and synthetic images (as proven in [59, 60]), the MOSTAR<br />
interface can definitely benefit from it.<br />
Stereoscopic visualization has been easy to implement in the MOSTAR<br />
interface. The MORDUC cameras already provide a synchronized<br />
stereo pair of images, which can be directly displayed to the user.<br />
On the other hand, 3D virtual objects can be easily rendered from two<br />
different viewpoints, and each view of the objects can be overlaid on<br />
the corresponding real image.<br />
The main issue in the stereo extension of the MOSTAR visualization<br />
interface has been guaranteeing a suitable disparity level between the<br />
left and the right image. Real and synthetic images must be displayed<br />
so that corresponding pixels in the two images have no vertical parallax,<br />
and their horizontal parallax is correct (i.e. non-divergent) and<br />
comfortable for the user. Moreover, the pair of real images and the<br />
pair of synthetic images must be correctly aligned. Section 9.1 explains<br />
how these issues have been managed.<br />
Furthermore, since the left and right camera images differ from each<br />
other, different edges are usually detected within them. Edge detection<br />
being intrinsically imperfect, some edges are detected in one image but<br />
not in the other. Therefore, a method has been implemented within<br />
the MOSTAR interface to deal with non-corresponding edges (section<br />
9.2).<br />
5.2.5 Implementation and testing<br />
The MOSTAR interface has been implemented as a Visual C++ .NET<br />
application for Microsoft Windows. 3D rendering has been realized<br />
by means of OpenGL [64], and the GLUT library [65] has been used<br />
for windows and input handling. Image processing operations have<br />
been performed using functions from the OpenCV library [66]. The<br />
HTTP protocol and the WinHTTP library have been used to exchange<br />
driving commands and sensor data with the server program resident on<br />
the MORDUC platform.<br />
The MOSTAR interface has been subjected to several offline and<br />
online tests. During offline tests, the MOSTAR interface was used to<br />
display visual and laser data collected during previous teleguide ses-<br />
sions. During online tests, the MOSTAR interface was used to actively<br />
teleoperate the MORDUC platform in real-time. Online teleguide tests<br />
were performed from the 3D Visualization and Robotics Lab at the<br />
University of Hertfordshire, United Kingdom.<br />
During all tests, the MORDUC laser sensor was configured to sample<br />
181 distance values in the zone in front of the robot, with an angular<br />
resolution of 1°. The STH-MDCS2-VAR-C stereo cameras were used<br />
as visual sensors during most of the tests, using an image resolution<br />
of 640 × 480 pixels (per single image). During some of the tests, two<br />
Microsoft Lifecam Show webcams were used, mounted in a stereo con-<br />
figuration with a slightly different position and vertical inclination from<br />
the original setup.<br />
The visualization interface was run on a mid-range laptop (Intel<br />
Core 2 Duo T7500 processor, 2 GB RAM, ATI Mobility Radeon<br />
HD2600 graphics card). The timing values reported in the next sections<br />
refer to this configuration.<br />
Several informal tests were conducted during the various implemen-<br />
tation steps by the developers, in order to validate design choices. A<br />
pilot test was conducted on the final version of the interface. There were<br />
four participants, all with moderate knowledge of augmented reality and<br />
stereoscopic visualization, and with no experience in robot teleoperation.<br />
The developers observed the performance of each participant,<br />
collecting impressions and comments. The results of the tests are re-<br />
ported in sections 6 to 9, depending on the aspect of the interface they<br />
are related to.<br />
6 Effective multi-sensor visual representation<br />
We describe here the set of augmented reality features developed for<br />
joint visualization of laser and video data. The features have been<br />
designed to assist the user during navigation and obstacle avoidance.<br />
Visualization methods for the other MORDUC sensors are under<br />
development, but have not been implemented yet.<br />
6.1 Visualization of laser data through AR features<br />
The MORDUC laser sensor provides a precise estimate of the distance<br />
between the robot and the surrounding obstacles. The MOSTAR in-<br />
terface uses AR to visualize this estimate in a way that facilitates im-<br />
mediate comprehension.<br />
Each single set of laser measurements is processed independently by the<br />
laser visualization algorithm. Given each point p detected by the laser<br />
sensor on its plane, the 2D coordinates of p with respect to the laser<br />
origin are calculated:<br />
x_p = d_p cos(α_p)<br />
z_p = d_p sin(α_p)<br />
where d_p and α_p are, respectively, the distance value and the laser<br />
rotation corresponding to the measurement of point p.<br />
Each laser point is assigned a particular color depending on its distance<br />
value. As in [13], nearer points are assigned a color with a higher<br />
red component and a lower green component, while farther points are<br />
assigned a color with a higher green component and a lower red<br />
component. A minimum and a maximum distance, depending on the<br />
application, are set. Points with distance equal to or lower than the minimum<br />
distance will be pure red, points with distance equal to or higher than<br />
the maximum distance will be pure green. Distances between the two<br />
extremes are linearly mapped to the red-green range (figure 22).<br />
Figure 22: Assigning colors to a set of laser points. The red line and the green line<br />
represent, respectively, the minimum and maximum distance limits.<br />
As stated in [24], the human eye is more sensitive to variations in the HSV<br />
color space than in the RGB space; however, colors in the red-yellow-green<br />
range have a stronger impact on the user than other colors, since they are<br />
conventionally associated with danger, caution and safety [67]. Since the<br />
perception of distance through this range of colors is supposed to be more<br />
intuitive, and since we consider immediacy of interpretation more important<br />
than a very high resolution in highlighting distances, our choice has fallen<br />
on this range rather than on the HSV range.<br />
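The coordinate conversion and the red-green mapping just described can be sketched in a few lines. This is an illustrative Python sketch (the actual MOSTAR implementation is in Visual C++); the function names `laser_to_xz` and `distance_color` are our own, and the distance limits in the usage example are arbitrary:

```python
import math

def laser_to_xz(d, alpha_deg):
    """Convert a laser measurement (distance, rotation angle) to 2D
    coordinates on the laser plane: x = d*cos(a), z = d*sin(a)."""
    a = math.radians(alpha_deg)
    return d * math.cos(a), d * math.sin(a)

def distance_color(d, d_min, d_max):
    """Linearly map a distance to the red-green range: distances at or
    below d_min give pure red, at or above d_max pure green."""
    ratio = (d - d_min) / (d_max - d_min)
    ratio = min(max(ratio, 0.0), 1.0)                    # clamp to [0, 1]
    return int(255 * (1 - ratio)), int(255 * ratio), 0   # (R, G, B)
```

For example, with limits of 1000 mm and 7000 mm, a point at 1000 mm or closer maps to (255, 0, 0) and a point at 7000 mm or farther to (0, 255, 0); a point straight ahead (α = 90°) has x ≈ 0 and z equal to its distance.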
Colored laser points are used to create semi-transparent tridimen-<br />
sional objects, which are aligned with the image (as described in section<br />
7) and rendered onto it as an overlay. Two kinds of AR objects have<br />
been implemented:<br />
• virtual walls are created between each pair of consecutive points,<br />
and elevated from the ground level to a fixed height; the color of<br />
each wall is determined by the laser points which delimit it (figure<br />
23c); optionally, vertical lines are drawn in correspondence with<br />
the laser points (figure 23d);<br />
• virtual rays are drawn on the ground from the base of the robot<br />
(coincident with the projection of the laser origin on the ground)<br />
to the laser points, at regular intervals, and each of them takes<br />
the color of the corresponding point (figure 23e).<br />
Virtual walls position themselves over walls and obstacles in the robot<br />
workspace, highlighting object depth with their colors. Virtual rays<br />
point out the bases of the obstacles, and give the user a hint for<br />
estimating the distance between the robot and them. Several concentric<br />
circles, each of which is at a fixed distance from the previous one<br />
(0.5 m), are also drawn at the ground level, and serve as a further hint<br />
for estimating distances (figure 23 c-e).<br />
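As an illustration of how the virtual-wall geometry can be built from the laser points, the sketch below (hypothetical helper, our own naming; the real interface builds equivalent OpenGL primitives) creates one quad per pair of consecutive points, from ground level up to a fixed height:

```python
def build_wall_quads(points_xz, height):
    """Given consecutive laser points (x, z) on the ground plane, return
    one quad (four 3D vertices) per pair of consecutive points, elevated
    from y = 0 (the ground) to y = height."""
    quads = []
    for (x1, z1), (x2, z2) in zip(points_xz, points_xz[1:]):
        quads.append([(x1, 0.0, z1), (x2, 0.0, z2),
                      (x2, height, z2), (x1, height, z1)])
    return quads
```

Each quad would then be rendered semi-transparently, with per-vertex colors taken from the distance-to-color mapping of its two delimiting laser points.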
The rendering of tridimensional objects by means of the OpenGL<br />
library is much more effective than the methods used in [24] and [13],<br />
which are based solely on the drawing of bidimensional aids. In fact, as<br />
explained in section 5, tridimensional rendering can represent depth in<br />
a more intuitive way than bidimensional overlays.<br />
The methods of representing laser data together with video data<br />
described in [26] and [25] are similar to the one just described. However,<br />
they are based on augmented virtuality rather than on augmented<br />
reality. As described in section 3.7, those approaches present some<br />
limitations which the approach proposed here does not. In fact,<br />
differently from [26], since our approach directly superimposes virtual<br />
objects above the corresponding real objects, correlation is easily and<br />
automatically established by the operator. Besides, while the approach<br />
Figure 23: (a) Plain camera image. (b) Laser map of the environment surrounding<br />
the robot. (c) Virtual walls overlaid onto the camera image. (d) Virtual walls and<br />
laser lines overlaid onto the camera image. (e) Virtual rays overlaid onto the camera<br />
image.<br />
of [25] is sensitive to calibration inaccuracies and to the validity of<br />
disparity data, the method described here is independent of disparity<br />
and relatively robust to calibration inaccuracies.<br />
6.2 Detection of discontinuities<br />
Virtual walls are a valid hint for estimating depth of the objects within<br />
the workspace, but, if rendered as they are, they have a drawback.<br />
Connecting each pair of consecutive laser points with a wall implies the<br />
creation of a unique, large surface surrounding the robot. This means<br />
that separate objects are represented as “melted” with each other (fig-<br />
ure 24a). In order to minimise this problem, a simple discontinuity<br />
Figure 24: (a) The walls and the box are covered by the same virtual wall, giving<br />
the user the wrong clue that they constitute a unique surface. (b) Discontinuity<br />
detection separates different objects.<br />
detection algorithm is executed on the laser points before rendering.<br />
For each wall between a pair of points, a slope coefficient is<br />
calculated:<br />
slope_p = (dist_{p−1} − dist_p) / (dist_{p−1} + dist_p)<br />
where dist_{p−1} and dist_p are the distance values corresponding to the<br />
two points which delimit the wall. It can be observed that the slope<br />
can assume values in the [−1, 1] interval. If the slope of a wall is<br />
different from the slope of the adjacent walls by a quantity higher<br />
than a parameterizable threshold (slopeTh), that wall is marked as a<br />
discontinuity and excluded from the rendering (figure 24b). Virtual<br />
rays are always drawn in correspondence of a discontinuity, in order to<br />
point out the edges of the objects in the workspace.<br />
Discontinuity detection is also used to detect potential laser outliers<br />
(figure 25). Isolated laser outliers often have a distance value strongly<br />
different from that of their immediate neighbors; therefore, the pair<br />
of walls which contain a laser outlier is very likely to be marked as<br />
discontinuities. This fact is exploited by searching for consecutive discontinuities.<br />
As the discontinuity detection algorithm is performed, laser points be-<br />
tween two consecutive discontinuities are labeled as potential outliers.<br />
In section 8 a method will be described to separate almost certain<br />
outliers from correct measurements.<br />
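The discontinuity test and the consecutive-discontinuity outlier labelling can be sketched as follows. This is an illustrative Python sketch; in particular, flagging a wall when its slope differs from the slope of every adjacent wall by more than slopeTh is our reading of the description above:

```python
def find_discontinuities(dist, slope_th=0.05):
    """Wall i connects laser points i-1 and i (i >= 1). A wall is marked
    as a discontinuity when its slope differs from the slope of every
    adjacent wall by more than slope_th."""
    n = len(dist)
    # slope[i] is defined for walls i = 1 .. n-1; index 0 is a dummy.
    slope = [0.0] + [(dist[i - 1] - dist[i]) / (dist[i - 1] + dist[i])
                     for i in range(1, n)]
    disc = set()
    for i in range(1, n):
        neighbors = [j for j in (i - 1, i + 1) if 1 <= j < n]
        if neighbors and all(abs(slope[i] - slope[j]) > slope_th
                             for j in neighbors):
            disc.add(i)
    return disc

def potential_outliers(dist, slope_th=0.05):
    """Laser points lying between two consecutive discontinuity walls."""
    disc = find_discontinuities(dist, slope_th)
    return {i for i in disc if i + 1 in disc}
```

With a single spike of 3000 mm in the middle of a run of 1000 mm readings, the two walls containing the spike get slopes of −0.5 and +0.5, both walls are flagged, and the point between them is labelled a potential outlier.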
6.3 Testing<br />
Participants in the pilot test confirmed virtual objects as a valuable<br />
aid for navigation. All the participants found it easier to estimate<br />
distances and to understand the conformation of the workspace when<br />
the virtual overlay was enabled.<br />
Virtual walls were judged the most useful kind of virtual objects, since<br />
they clearly highlighted walls and obstacles within the environment.<br />
Thus, they made it easier for users to detect the position and size of<br />
obstacles. By contrast, virtual rays were considered a little confusing,<br />
especially when they were not reinforced by virtual walls and when the<br />
alignment with real objects was not precise. One of the participants<br />
also found that they were not clearly visible, especially in comparison<br />
with the virtual walls.<br />
Virtual circles on the ground were found useful for determining<br />
distances. Still, participants felt the need for some hint indicating<br />
absolute distances.<br />
Figure 25: (a) Visual artifacts created by some outliers (caused by the black squares<br />
of the chessboard, which do not reflect laser beams). (b) The artifacts are partially<br />
eliminated by the discontinuity detection.<br />
The discontinuity detection algorithm gave excellent results, even<br />
when objects with an irregular shape (e.g. people) were in the range<br />
of the laser sensor. A fixed value of 0.05 for slopeTh proved to work<br />
well in most cases.<br />
Calculation and rendering of the overlays showed excellent timing<br />
performance. Processing of laser data and rendering of the virtual<br />
objects always took less than 10 ms, which is a perfectly acceptable<br />
delay for AR applications according to [36].<br />
7 Laser-camera alignment and calibration<br />
The 3D virtual objects described in the previous section have<br />
coordinates defined with respect to the laser sensor origin. Laser-camera<br />
alignment makes it possible to determine where these coordinates are<br />
to be drawn within the camera image.<br />
The alignment is performed by adjusting a set of intrinsic and extrinsic<br />
parameters in order to replicate those of the real camera, then using<br />
these parameters to define a virtual viewpoint on the virtual scene<br />
overlaid onto the image. This way, the virtual camera looks at the<br />
virtual workspace with approximately the same position and orientation<br />
as the real camera with respect to the real workspace, so that the<br />
rendered scene coincides with the camera scene.<br />
7.1 Laser-camera model<br />
In order to keep the calibration procedure simple, the alignment<br />
algorithm is based on the undistorted pinhole camera model, and poses<br />
several constraints on the orientation of the camera.<br />
Since the dimensions of the camera image are fixed (they depend on the<br />
camera output resolution), and since the pinhole camera model does not<br />
consider distortions, the only variable intrinsic parameter for camera<br />
calibration is the focal length. The focal length of a camera in millimeters<br />
is often available among the manufacturer's specifications, therefore it<br />
is usually easy to retrieve. However, for calibration it is necessary to<br />
convert its value into pixels. It is possible to perform the conversion<br />
from the value in millimeters, but this requires knowing the dimensions<br />
of the CCD/CMOS camera sensor, which are not usually published by<br />
manufacturers.<br />
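When the sensor width is known, the conversion follows directly from the pinhole model: the focal length expressed in pixels and in millimeters measure the same physical length in two different units. A sketch (our own symbols; the numbers in the usage example are merely plausible, e.g. a 6 mm lens on a sensor 4.8 mm wide):

```python
def focal_mm_to_px(focal_mm, image_width_px, sensor_width_mm):
    """Convert a focal length from millimeters to pixels:
    f[px] = f[mm] * (image width in pixels) / (sensor width in mm)."""
    return focal_mm * image_width_px / sensor_width_mm
```

For instance, a 6 mm lens with a 640-pixel-wide image on a 4.8 mm-wide sensor gives a focal length of 800 pixels.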
As regards the orientation of the camera in respect to the laser<br />
system of coordinates, three assumptions are made:<br />
• the x axis of the camera system of coordinates is parallel to the<br />
x axis of the laser system of coordinates; that is, neither panning<br />
movements nor roll rotations of the camera in respect to the laser<br />
are permitted;<br />
• the x axis of the camera system and the x axis of the laser system<br />
point towards the same direction;<br />
• tilt movements of the camera (i.e. rotations around the x axis)<br />
with respect to the laser system of coordinates are confined between<br />
−90° and 90°; that is, the camera optical axis and the laser z axis<br />
"look" approximately in the same direction.<br />
These assumptions are satisfied by the MORDUC platform (the stereo<br />
cameras are parallel and oriented in the direction of the laser z axis,<br />
and they have a slight tilt angle towards the ground), and are generally<br />
reasonable.<br />
The variable calibration parameters left by the approximations described<br />
above are only four: the position coordinates of the camera center with<br />
respect to the laser origin (x, y and z) and the tilt angle of the camera<br />
(figure 26). In the case of a panning-enabled camera it is possible to<br />
add a fifth parameter, namely the pan angle of the camera.<br />
The resulting laser-camera model therefore presents five (six, if panning<br />
is possible) parameters to configure. They are sufficient for the<br />
definition of the OpenGL viewing transforms for the rendering of the<br />
overlay on the image: camera position and tilt are used to position and<br />
orient the point of view, while focal length and image aspect ratio (which<br />
is known a priori) are used to determine the perspective projection<br />
matrix.<br />
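Under this model, projecting a laser point into the image reduces to a translation by the camera position, a rotation by the tilt angle, and a pinhole projection. The following sketch is our own explicit formulation (y axis up, camera looking along +z, tilt sign a convention of ours); the real interface delegates these transforms to the OpenGL pipeline:

```python
import math

def project_point(p_laser, cam_pos, tilt_deg, f_px, cx, cy):
    """Project a 3D point given in laser coordinates onto the image of a
    camera offset by cam_pos and tilted by tilt_deg about the x axis.
    f_px is the focal length in pixels; (cx, cy) is the image center."""
    # Translate into a frame centered on the camera.
    x = p_laser[0] - cam_pos[0]
    y = p_laser[1] - cam_pos[1]
    z = p_laser[2] - cam_pos[2]
    # Apply the inverse of the camera tilt (rotation about the x axis).
    t = math.radians(tilt_deg)
    yc = y * math.cos(t) + z * math.sin(t)
    zc = -y * math.sin(t) + z * math.cos(t)
    # Pinhole projection; the image v axis points downwards.
    u = cx + f_px * x / zc
    v = cy - f_px * yc / zc
    return u, v
```

With zero offsets and zero tilt, a point straight ahead projects to the image center, and a point 1 m to the right at 2 m depth with f = 500 px lands 250 px to the right of it.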
The next section describes a simple procedure for the manual cali-<br />
bration of the parameters.<br />
Figure 26: Graphical representation of the laser-camera system and of calibration<br />
parameters.<br />
7.2 Feedback-based calibration procedure<br />
Given the model and the configuration parameters described in the<br />
previous section, it is possible to define a sequence of steps to obtain<br />
a satisfying set of values for those parameters through feedback-based<br />
calibration.<br />
1. Adjust focal length of the virtual camera until the level of zoom<br />
on the overlay is equal to the level of zoom on the real image. Take<br />
as a reference the furthest object within the scene, and modify<br />
the parameter until the horizontal dimension of the corresponding<br />
virtual wall matches (figure 27b).<br />
2. Adjust the y camera coordinate and tilt angle until the "virtual<br />
floor" is aligned with the real floor (figure 27c). Adjusting the<br />
y coordinate moves the virtual camera up and down in the 3D<br />
space (that is, it moves the overlay down/up with respect to the<br />
real camera image). Virtual rays and circles can be used as aids<br />
to obtain a good alignment.<br />
3. Adjust z camera coordinate until the horizontal dimensions of<br />
both far and near objects match with the correspondent virtual<br />
walls (figure 27d). Adjusting the z coordinate moves the vir-<br />
tual camera forward and backward (that is, it moves the overlay<br />
backward/forward with respect to the real camera image). While<br />
modifying the focal length controls the dimension of near and far<br />
parts of the overlay in the same way, modifying the z coordinate<br />
of the virtual camera has a strong effect on near virtual objects<br />
and almost no influence on far ones. Therefore, it is suggested to<br />
set the focal length first, using a very far object as a reference (as<br />
in point 1), then adjust the z coordinate using a very near object<br />
as a reference.<br />
4. Adjust x camera coordinate to eliminate the horizontal offset<br />
between the overlay and the real image (figure 27e). Adjusting<br />
the x coordinate moves the virtual camera to the left and to the<br />
right (that is, it moves the overlay to the right/left with respect<br />
to the image).<br />
This specific order of calibration minimizes the interference of the<br />
adjustment of one calibration parameter with the others, so it avoids<br />
the need for the user to retrace his steps and adjust the same parameters<br />
again and again. However, if the final result is not satisfying, the user is<br />
free to modify any parameter at any moment, even during the teleguide.<br />
7.3 Comparison with automatic calibration<br />
The ease of calibration of the MOSTAR interface is due to the reduced<br />
number of parameters the user has to deal with. Therefore, it is<br />
Figure 27: (a) Overlay at the beginning of the calibration, (b) after the adjustment<br />
of the focal length, (c) after the adjustment of the y coordinate and the tilt angle,<br />
(d) after the adjustment of the z coordinate, (e) after the adjustment of the x<br />
coordinate.<br />
directly dependent on the approximations and constraints on the laser-<br />
camera model described in section 7.1. The feedback-based calibration<br />
procedure presented here could be also applied to more general cases,<br />
but at the expense of its simplicity. Automatic calibration may be<br />
preferable in the most general cases, or when high accuracy is needed.<br />
On the other hand, the strength of the feedback-based calibration is<br />
that it achieves an accuracy amply sufficient in most cases, without<br />
requiring significant time or effort from the user.<br />
The feedback calibration procedure described in this section has been<br />
implemented in the MOSTAR interface, but the sensor representation<br />
logic described in section 6 is independent of the specific calibration<br />
procedure. In fact, since (as described at the beginning of this section)<br />
the alignment between the virtual overlay and the image is performed<br />
simply by setting the OpenGL viewpoint parameters (specifically the<br />
frustum shape and size, and the position and orientation of the<br />
viewpoint), there is no constraint on the method used to calculate these<br />
parameters. The OpenGL camera model does not model lens distortion;<br />
however, several methods exist to simulate distortion by texture<br />
mapping [68] or by the OpenGL Shading Language [57, 69].<br />
Consequently, the AR features of the MOSTAR interface are also<br />
applicable to the case of a general laser-camera model, and can be used<br />
together with any calibration method.<br />
7.4 Testing<br />
The feedback-based calibration met expectations, showing good<br />
performance with both camera setups (STH-MDCS2-VAR-C stereo<br />
cameras and Microsoft Lifecams). With a few minutes of calibration<br />
it was possible to obtain a sufficient degree of alignment. Subtle<br />
misalignments could not be completely eliminated, but usually they did<br />
not bother users.<br />
Participants who knew in advance the meaning of the calibration<br />
parameters found the calibration procedure easy and effective. However,<br />
participants who did not have any advance knowledge about the camera<br />
model and the meaning of the calibration parameters found it slightly<br />
counterintuitive. In fact, it was not easy to deduce the nature of a<br />
parameter only by watching the effect of its adjustments. Therefore, the<br />
possibility of adding some (perhaps graphical) hints to the interface has<br />
been taken into consideration, in order to make the nature and effect of<br />
each calibration parameter clearer to users.<br />
Finally, participants discovered that virtual rays and lines on virtual<br />
walls may assist calibration. In fact, their ends mark the precise<br />
position of each laser point. Participants found it easier to position the<br />
overlay over the camera image when they knew precisely the point hit<br />
by each laser beam.<br />
66
8 Integrating 3D Graphics with image<br />
processing<br />
This section describes the technique used within the MOSTAR interface<br />
to integrate information from the edges in the camera image with laser<br />
data. Briefly, the edges of the objects occupying the robot's field of view<br />
are located inside the image; then, edge pixels are unprojected to the 3D<br />
world coordinate system and their position is calculated. The distance<br />
between the robot and each of these points is calculated and compared<br />
with the corresponding laser measurements, so that the correctness of<br />
each laser measurement can be double-checked.<br />
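For the simplest configuration, unprojecting an edge pixel that lies on the ground plane reduces to intersecting the pixel's viewing ray with the floor. The sketch below is our own simplification, assuming an untilted camera (the actual interface must also account for the camera tilt of its model); it is the inverse of the pinhole projection:

```python
def unproject_ground_pixel(u, v, f_px, cx, cy, cam_height):
    """Intersect the viewing ray of pixel (u, v) with the ground plane,
    for an untilted camera at cam_height above the floor (image v axis
    pointing down). Returns the ground point (x, z), or None for pixels
    at or above the horizon, whose rays never reach the ground."""
    dv = v - cy
    if dv <= 0:                      # ray parallel to or above the floor
        return None
    z = f_px * cam_height / dv       # depth along the optical axis
    x = (u - cx) * cam_height / dv   # lateral offset on the ground
    return x, z
```

For example, with f = 500 px and a camera 0.5 m above the floor, a pixel 125 rows below the image center unprojects to a ground point 2 m ahead; the distance to that point can then be compared with the laser reading at the same bearing.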
The technique used by the MOSTAR interface to test laser<br />
measurements is analogous to the one described in [54], which uses<br />
disparity information retrieved from a stereo pair to identify points of<br />
the images whose depth does not correspond to the value detected by<br />
the laser. However, the technique described here is based on edge<br />
detection rather than on stereoscopic disparity calculation. As a<br />
consequence, the image processing technique used in the MOSTAR<br />
interface is applicable even where only one camera is available. In<br />
addition, edge detection algorithms are usually faster in terms of<br />
performance than stereo disparity calculation algorithms.<br />
8.1 Edge detection algorithm<br />
The process used for edge detection in camera images is divided into<br />
two steps.<br />
First, the image is converted to grayscale and preprocessed by a<br />
contrast stretching function. Two different gray value thresholds<br />
(lowTh < highTh) are applied to the image: the intensity of pixels<br />
whose original value is lower than lowTh is set to the minimum value<br />
(black), while the intensity of pixels whose original value is higher than<br />
highTh is set to the maximum value (white). Intensity values of all the<br />
other pixels are linearly mapped to the range between the minimum and<br />
maximum gray values (figure 28). This preprocessing step has two benefits. First,<br />
Figure 28: Contrast stretching function used before actual edge detection.<br />
it improves the quality of the image for edge detection, by suppressing<br />
gradients in very bright or very dark areas and increasing contrast in<br />
the rest of the image. Secondly, if the floor of the working environment<br />
is much brighter (or darker) than the rest of the image, the function can<br />
be used to neatly separate the intensity range of the floor from the<br />
intensity range of the workspace objects, simplifying the identification<br />
of edges on the ground, which correspond to the bases of obstacles.<br />
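The contrast stretching function of figure 28 can be sketched in a few lines (an illustrative NumPy version; the interface applies the equivalent operation to OpenCV images):

```python
import numpy as np

def contrast_stretch(img, low_th, high_th):
    """Map pixel intensities: values <= low_th become 0 (black), values
    >= high_th become 255 (white), and values in between are linearly
    rescaled to the full [0, 255] range."""
    out = (img.astype(np.float64) - low_th) / (high_th - low_th) * 255.0
    return np.clip(out, 0, 255).astype(np.uint8)
```

For instance, with lowTh = 50 and highTh = 150, intensities 10, 100 and 200 map to 0, 127 and 255 respectively.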
The second step is processing the contrast-stretched grayscale image<br />
with the Canny edge-detection algorithm [70]. The Canny algorithm<br />
is a simple and very popular algorithm for edge-detection, based on an<br />
intensity hysteresis process. First, the Sobel operator [71] is applied<br />
to the image along the horizontal and vertical directions. The Sobel<br />
operator has two purposes: averaging intensity values of the pixels along<br />
one direction - so reducing image noise by blurring - and calculating<br />
the gradient of the pixels along the perpendicular direction. This way,<br />
a gradient value and an edge direction are retrieved for each pixel.<br />
Then, a non-maximum suppression is executed, by checking whether<br />
each pixel has the maximum gradient among its neighbors taken along<br />
its edge direction. Non-maximum pixels are excluded from the edge<br />
detection. Finally, gradient values of remaining pixels are compared<br />
with a pair of thresholds (th1 < th2):<br />
• pixels with a gradient value higher than th2 are immediately<br />
marked as edge pixels;<br />
• pixels with a gradient value lower than th2 but higher than th1<br />
are marked as edge pixels only if they are encountered along an<br />
edge which contains pixels whose gradient value is higher than<br />
th2; otherwise, they are excluded from the edge detection;<br />
• pixels with a gradient value lower than th1 are always excluded<br />
from the edge detection.<br />
The advantage of the hysteresis process with respect to a single-<br />
threshold approach is that it retains only reliable edges (pixels<br />
with a high gradient, and pixels with a low gradient which are likely<br />
to belong to a real edge because they are connected to a strong pixel).<br />
This avoids a typical problem of single-threshold approaches, namely the<br />
creation of discontinuous edges; this happens when some pixels along<br />
an edge have a gradient value slightly higher than the threshold, while<br />
others have a gradient slightly lower than it.<br />
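The hysteresis step can be illustrated on a one-dimensional chain of gradient values (a deliberate<br />
simplification: the real Canny algorithm tracks connectivity in two dimensions). This sketch is<br />
illustrative only, not the OpenCV implementation.<br />

```cpp
#include <cstddef>
#include <vector>

// Double-threshold hysteresis along a 1-D chain of pixels (sketch of the
// Canny linking step). Returns a mask that is true where a pixel is kept.
std::vector<bool> hysteresis(const std::vector<double>& grad,
                             double th1, double th2) {
    const std::size_t n = grad.size();
    std::vector<bool> edge(n, false);
    // Pass 1: pixels with gradient above th2 are immediately marked.
    for (std::size_t i = 0; i < n; ++i)
        if (grad[i] > th2) edge[i] = true;
    // Pass 2: weak pixels (gradient above th1) are kept only if connected
    // to an already-marked pixel; repeat until no pixel changes.
    bool changed = true;
    while (changed) {
        changed = false;
        for (std::size_t i = 0; i < n; ++i) {
            if (edge[i] || grad[i] <= th1) continue;
            if ((i > 0 && edge[i - 1]) || (i + 1 < n && edge[i + 1])) {
                edge[i] = true;
                changed = true;
            }
        }
    }
    return edge;
}
```

For example, with th1 = 3 and th2 = 8, the gradient chain {1, 5, 9, 5, 1} keeps the three central<br />
pixels (the weak 5s are connected to the strong 9), while {5, 1, 5} keeps nothing, since no pixel<br />
exceeds th2.<br />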
Optimal threshold values for contrast stretching and the Canny<br />
algorithm are strongly dependent on the features of the captured images<br />
(illumination, object and floor textures, etc.). No tried-and-tested<br />
approach to their determination exists yet. In the MOSTAR interface, it<br />
is possible to choose a value for each threshold by another<br />
feedback-based procedure. The edge detection calibration feature allows<br />
the user to vary each parameter while visualizing the resulting<br />
contrast-stretched grayscale image and edge image (figure 29).<br />
Figure 29: (a) Original image. (b) Contrast-stretched grayscale and edge image,<br />
visualized during edge detection parameters calibration.<br />
The edge detection algorithm has been implemented using the<br />
OpenCV library functions.<br />
8.2 Nearest edges discovery<br />
After an edge image is extracted by the method described in the previ-<br />
ous section, the NED (Nearest Edges Discovery) algorithm is executed<br />
on it. The aim of the NED algorithm is to detect the nearest objects<br />
within the area viewed by the robot camera through the analysis of<br />
edges present in the camera image.<br />
The NED algorithm begins by processing each of the laser measurements.<br />
For each laser point, the corresponding virtual ray (the line lying on<br />
the ground between the laser origin and the point itself) is projected<br />
onto the camera image. Each virtual ray corresponds to a two-dimensional<br />
line on the image plane, though usually only a part (or none) of it will<br />
lie inside the actual border of the image (figure 30).<br />
For each ray, the corresponding segment on the image is located (if it<br />
exists) with the help of the gluProject function of the OpenGL Utility<br />
Figure 30: Projection of a virtual ray onto the camera image.<br />
library (GLU). The input arguments required by the gluProject function<br />
are a point in 3D space and the OpenGL camera transformation<br />
parameters. The function calculates the coordinate transformation<br />
of the point, and returns its pixel coordinates and depth value. It is<br />
subsequently possible to invert the projection and recover the original<br />
3D coordinates of the point, since the pixel depth value eliminates<br />
the ambiguity.<br />
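The transformation performed by gluProject can be reproduced with a few lines of matrix arithmetic:<br />
the point is transformed by the modelview and projection matrices (column-major, as OpenGL stores<br />
them), divided by its w component, and mapped to the viewport. The sketch below is a self-contained<br />
reimplementation of this math, not a call into GLU; gluUnProject applies the inverse of the same<br />
chain, which is why the returned depth value makes the inversion unambiguous.<br />

```cpp
#include <array>

using Mat4 = std::array<double, 16>;  // column-major, as in OpenGL
using Vec4 = std::array<double, 4>;

const Mat4 kIdentity{1, 0, 0, 0,  0, 1, 0, 0,  0, 0, 1, 0,  0, 0, 0, 1};

// Multiply a column-major 4x4 matrix by a 4-vector.
Vec4 mul(const Mat4& m, const Vec4& v) {
    Vec4 r{0, 0, 0, 0};
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            r[row] += m[col * 4 + row] * v[col];
    return r;
}

// What gluProject computes: object coordinates -> eye -> clip -> NDC ->
// window coordinates {winX, winY} plus a depth value winZ in [0, 1].
// viewport is {x, y, width, height}.
std::array<double, 3> project(const std::array<double, 3>& obj,
                              const Mat4& modelview, const Mat4& projection,
                              const std::array<int, 4>& viewport) {
    const Vec4 eye  = mul(modelview, {obj[0], obj[1], obj[2], 1.0});
    const Vec4 clip = mul(projection, eye);
    const double x = clip[0] / clip[3];  // perspective divide -> NDC
    const double y = clip[1] / clip[3];
    const double z = clip[2] / clip[3];
    return {viewport[0] + (x * 0.5 + 0.5) * viewport[2],
            viewport[1] + (y * 0.5 + 0.5) * viewport[3],
            z * 0.5 + 0.5};  // depth value, kept for later unprojection
}
```

With identity modelview and projection matrices and a 640×480 viewport, the origin projects to the<br />
viewport center (320, 240) with depth 0.5.<br />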
Each of the pixels composing the retrieved segment corresponds<br />
to a 3D point on the virtual ray between the robot and a<br />
given laser point. Points further than the laser point along the<br />
same direction are also included, by extending the segment in the<br />
image. Then, the segment is scanned along this direction until a pixel<br />
corresponding to an edge is found.<br />
Once the nearest edge pixel along the virtual ray projection is found,<br />
its image coordinates are used to retrieve the 3D coordinates of the<br />
corresponding workspace point, by means of the gluUnProject<br />
function (figure 31). The retrieved workspace point will be a point on the<br />
ground, usually corresponding to the base of an obstacle, at a<br />
certain distance from the robot.<br />
Figure 31: Unprojection of the edge pixel to the 3D space, through pixel coordinates<br />
and depth value.<br />
At the end of the NED algorithm, a set of nearest edge points<br />
(NEPs) will have been calculated, each of which has a corresponding<br />
laser point. The distance between each NEP and its corresponding<br />
laser point is calculated and compared to a parameterizable<br />
threshold edgeTh:<br />
• if the distance between the NEP and the laser point is less<br />
than edgeTh, it is assumed that the NEP and the laser point<br />
correspond to the same real object;<br />
• if the distance between the NEP and the laser point is greater<br />
than edgeTh, it is assumed that the NEP and the laser point<br />
correspond to different objects.<br />
NEPs which fall into the first category are used for overlay align-<br />
ment. NEPs which fall into the second category are further divided into<br />
those which are nearer to the robot than the corresponding laser point<br />
and those which are further from the robot, and are used to detect<br />
obstacles missed by the laser and possible laser outliers.<br />
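The comparison against edgeTh and the resulting classification can be condensed into a few lines.<br />
This is a sketch of the rule described above; the type and function names are assumptions, not the<br />
MOSTAR code.<br />

```cpp
#include <cmath>

// Outcome of comparing a NEP with its corresponding laser point.
enum class NepClass {
    Aligned,         // |nepDist - laserDist| < edgeTh: same object,
                     // the NEP is used for overlay alignment
    NearerObstacle,  // NEP well before the laser point: possible obstacle
                     // missed by the laser (or a false edge)
    FartherIgnored   // NEP well beyond the laser point: possible laser
                     // outlier or missed edge
};

// Distances are measured from the robot along the virtual ray.
NepClass classifyNep(double nepDist, double laserDist, double edgeTh) {
    if (std::fabs(nepDist - laserDist) < edgeTh) return NepClass::Aligned;
    return nepDist < laserDist ? NepClass::NearerObstacle
                               : NepClass::FartherIgnored;
}
```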
8.3 Improving alignment with edges<br />
In an ideal case, assuming perfect calibration and a completely<br />
distortion-free camera, virtual object borders would be perfectly aligned<br />
with the corresponding real object edges. However, a slightly imprecise<br />
calibration and/or small differences between the ideal camera model<br />
and the real camera often cause small misalignments.<br />
NEPs which are located near their corresponding laser points can<br />
be used to correct these misalignments. For each of these NEPs, the<br />
coordinates of the corresponding laser point are corrected so as to<br />
coincide with the NEP coordinates. This way, when the rendering of<br />
the overlay is performed, the virtual object based on that laser point<br />
will be precisely aligned with the edge of the real object underneath<br />
(figure 32).<br />
8.4 Improving reliability with edges<br />
If an edge is detected in a point much nearer to the robot than the laser<br />
point, this can mean that:<br />
• an object is present in the workspace which has not been detected<br />
by the laser, and its base contains the NEP, or<br />
• a false edge has been detected in a point between the robot and<br />
the laser point.<br />
There is no trivial way to determine whether the NEP is a true edge<br />
or not, and whether it indicates an object which could be an obsta-<br />
cle for the robot or not (it could be, for example, a drawing or some<br />
pattern on the floor). Although the safest decision for the teleopera-<br />
tion would be to assume that an obstacle is present on the NEP, it is<br />
not actually convenient to consider each NEP an obstacle, since false<br />
edges are rather common even in structured environments, even<br />
Figure 32: (a) Alignment using feedback-based calibration only: the margin of the<br />
overlay is slightly detached from the base of the box. (b) Alignment using feedback-<br />
based calibration and NEPs: position of the laser points is corrected with NEPs in<br />
order to coincide with the base of the box in the image.<br />
after a careful tuning of edge detection parameters. Indeed, informa-<br />
tion retrieved from edge detection is far less reliable than information<br />
obtained by the laser sensor. Therefore, NEPs interpreted as potential<br />
obstacles are still indicated by an overlay, but in a different way from<br />
laser data. Specifically, a single colored point is rendered above each<br />
NEP which could indicate an obstacle (figure 33).<br />
Figure 33: NEPs nearer to the robot than the corresponding laser points are highlighted<br />
with colored dots.<br />
If an edge is detected in a point much further from the robot than<br />
the laser point, this can mean that:<br />
• an object is present in the workspace which has been detected by<br />
the laser but it is high above the ground (therefore, since its base<br />
edge is higher than it is expected to be, it is interpreted as further<br />
than it actually is), or<br />
• the edge detection algorithm has missed the real edge of the ob-<br />
ject, or<br />
• the laser measure is wrong and lower than it should be.<br />
Since the last case is rather unlikely, the best (and safest)<br />
decision is to trust the laser measure; therefore the NEP is simply<br />
ignored.<br />
In both cases, the presence of a NEP which disagrees with the<br />
corresponding laser point casts doubt on the validity of the corresponding<br />
laser measurement. Therefore, if the laser point had previously been<br />
marked as a potential outlier (see section 6.2), it will be considered<br />
a probable outlier and excluded from the rendering (figure 34).<br />
8.5 Testing<br />
The integration of image processing in the AR system yielded good results<br />
in terms of alignment improvement and laser data correction.<br />
The edge detection algorithm described in section 8.1 correctly<br />
identifies object bases in cases where the floor has a plain texture.<br />
Parameter tuning is necessary in cases where the floor presents a faint<br />
pattern, in order to suppress the range of gray values of the floor and<br />
highlight object borders. The algorithm is not expected to perform well<br />
in cases where the floor presents strongly-contrasted patterns<br />
(e.g. black-and-white tiles). In fact, in such cases contrast stretching<br />
would not be able to suppress floor edges, which would interfere with<br />
object base detection. For those cases, a more sophisticated contrast<br />
stretching and intensity suppression function should be used.<br />
Edge data integration for overlay alignment performed well overall.<br />
However, it proved to be fairly sensitive to the quality of<br />
the edge detection. In cases where false edges were detected near the<br />
border with which the 3D overlay should have been aligned, the<br />
overlay tended to follow the false edges, producing unpleasant artifacts.<br />
On the other hand, NEPs not corresponding to laser data proved<br />
to be a very effective visual aid. In fact, objects invisible to the laser<br />
tend to generate many highlighted NEPs of the same color along a line<br />
Figure 34: (a) Outliers are detected as discontinuities, but they are still rendered.<br />
(b) Outliers are confirmed and excluded from the rendering.<br />
(see the box on the left of figure 33). On the contrary, false edges<br />
tend to generate isolated NEPs (see the NEPs in front of the box on the<br />
right of figure 33). The attention of the user is usually drawn by clus-<br />
ters of similar dots rather than by isolated dots; therefore, while real<br />
objects are strongly enhanced by NEPs, false edges remain relatively<br />
inconspicuous. This helps operators to focus their attention on real<br />
obstacles without distracting them with striking false edges.<br />
Several values have been tested for the edgeTh parameter. The<br />
higher the value of edgeTh, the more NEPs are considered consistent<br />
with the corresponding laser measures. Therefore, when a high value is<br />
used, more NEPs will be used for overlay alignment, and the 3D overlay<br />
will closely follow image edges. Instead, when a low value is used, more<br />
NEPs will be used for laser correction, so the overlay will follow laser<br />
values and more NEPs will be highlighted as inconsistent with the laser<br />
data. Values around 10 cm for edgeTh generally perform well.<br />
Timing performance was good. Contrast stretching and edge detec-<br />
tion on a single frame had an average duration of 45 ms. Since camera<br />
image and virtual overlay are displayed together at the same time, this<br />
does not cause dynamic registration errors, but it limits the maximum<br />
framerate to 25 fps. This value is sufficient for the teleguide of the<br />
MORDUC, which does not require very quick manoeuvres. Besides,<br />
during the tests the framerate was already limited to about 2 fps by<br />
network delay, therefore the delay introduced by image processing was<br />
negligible.<br />
9 Stereoscopic augmented reality<br />
The techniques presented in the previous sections perform well<br />
when used on a single video image, but their efficacy can<br />
be increased by combining them with stereoscopic capturing<br />
and visualization. However, using a stereo pair of images raises<br />
several issues, deriving from potential inconsistencies between the left<br />
and right images (see, for example, [13]). This section describes these<br />
issues and the methods used in the MOSTAR interface to solve them.<br />
Furthermore, it shows how stereo information can be used to improve<br />
the results achieved by the algorithms described above.<br />
9.1 Stereo AR alignment<br />
The AR alignment problem in the case of a stereo pair of cameras<br />
is substantially analogous to the single-camera case. The difference is<br />
that in the stereo case the aims of the alignment procedure are three<br />
instead of one:<br />
• align the left camera image with the right camera image, so that<br />
the disparity of corresponding pixels is correct and comfortable;<br />
• align the AR overlay on the left image with the one on the right<br />
image, so that they are visualized with the correct disparity and<br />
they are seen by the human operator as a single 3D overlay;<br />
• align the stereoscopic pair of overlays with the stereoscopic pair<br />
of images, so that virtual objects are correctly positioned over<br />
real objects (which is the same aim as in monoscopic AR).<br />
In ideal conditions (left and right cameras are perfectly identical,<br />
parallel and at the same height) the first aim would be automatically<br />
satisfied. Unfortunately, things are different in the real case: cameras<br />
belonging to the same model may have slightly different intrinsic<br />
parameters, and they may not be perfectly positioned. Therefore, the<br />
MOSTAR interface makes it possible to introduce a certain horizontal/<br />
vertical offset between the images, in order to correct inaccuracies<br />
in the camera positions or differences in the camera principal points.<br />
Depending on the values of the offset, the images are shifted in opposite<br />
directions with respect to each other, and the parts which lack a<br />
counterpart in the other image are cropped (figure 35).<br />
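The vertical component of this correction can be sketched as a pair of complementary row crops:<br />
both images lose the same number of rows, taken from opposite sides. This is an illustration under<br />
the assumption that a positive offset means the left image content sits higher than the right; the<br />
names are not taken from the MOSTAR code.<br />

```cpp
#include <cstdlib>

// Half-open row ranges [first, last) to keep in each image so that a
// vertical offset of dy pixels between the two images is removed.
// Row 0 is the top row; both cropped images end up height - |dy| tall.
struct CropRows { int leftFirst, leftLast, rightFirst, rightLast; };

CropRows verticalCrop(int height, int dy) {
    const int d = std::abs(dy);
    if (dy >= 0)  // left image higher: drop its top rows, right's bottom rows
        return {d, height, 0, height - d};
    return {0, height - d, d, height};  // symmetric case
}
```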
The second aim is automatically achieved by exploiting the features<br />
of the OpenGL library. In fact, it is sufficient to render the left and right<br />
overlays as two identical viewpoints on the same virtual scene, parallel<br />
to each other and one horizontally shifted with respect to the other,<br />
to obtain a stereoscopic pair of virtual viewpoints.<br />
The third aim is achieved through the same feedback-based<br />
calibration procedure used in the monoscopic case. The user can adjust<br />
the parameters of the virtual cameras while the stereoscopic overlay is<br />
rendered on the images, and correct their values depending on what he<br />
sees. Although each virtual camera should have a separate set of intrin-<br />
sic and extrinsic parameters, most of them (specifically, focal length, y<br />
and z coordinates and rotation around the x axis) are kept equal for<br />
both, in order to preserve the correctness of the stereoscopic virtual<br />
couple. Only the x coordinates of the cameras are independent. As a<br />
general rule, the offset between the two final x values should be equal<br />
to the baseline between the real stereo cameras.<br />
9.2 NEP correspondence and suppression<br />
The NED algorithm described in section 8 is designed to be used on<br />
a single image. Executing it on both images independently is likely to<br />
give results conflicting with each other: in fact, weak edges can easily<br />
be recognized in one image and missed in the other, while one image<br />
Figure 35: (a) Original images; the left image is a few pixels higher with respect to<br />
the right one. (b) The left image is shifted downward (and its upper part is cropped),<br />
while the right image is shifted upward (and its lower part is cropped), so that they<br />
are vertically aligned.<br />
could contain artifacts which the other does not (especially if the<br />
quality of the captured images is low). Therefore, it has been necessary<br />
to implement a method to reconcile the NEPs retrieved from the left and<br />
the right images.<br />
A NESC (Nearest Edges Stereo Correspondence) algorithm is run<br />
by the MOSTAR interface after the NED algorithm has been performed<br />
on both images, and before the NEPs are rendered. The NESC algo-<br />
rithm is based on a simple assumption: the real edges of the objects<br />
within the robot workspace are likely to be rather strong features and<br />
appear in both images, while false edges, like the ones produced by<br />
image artifacts, are likely to appear in only one image. Therefore, the<br />
algorithm searches for NEPs which appear in both images, and are<br />
likely to represent the same point in the space.<br />
The algorithm iterates over the laser points which have a<br />
corresponding NEP in one or both images. For each laser point, there<br />
are three possible alternatives:<br />
1. if the laser point has a corresponding NEP in only one of the<br />
images, that NEP is counted as an unreliable NEP;<br />
2. if the laser point has a corresponding NEP in both images, and<br />
the distance between the left and right NEPs is less than a<br />
parameterizable threshold (stereoTh), the NEPs are considered as<br />
corresponding to the same point in space: therefore, a reliable<br />
NEP is counted, and its coordinates are set to the middle point<br />
between the left and the right NEP;<br />
3. if the laser point has a corresponding NEP in both images, but<br />
the distance between the left and right NEPs is greater than stereoTh,<br />
the NEPs are considered as corresponding to two different points<br />
in space, and are both counted as unreliable NEPs.<br />
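The per-laser-point decision among the three cases above can be sketched as follows. Cases 1 and 3<br />
both produce unreliable NEPs, which are kept individually; only case 2 merges the two NEPs into one<br />
reliable point. The names and structures here are assumptions, not the MOSTAR code.<br />

```cpp
#include <cmath>
#include <optional>

struct Point3 { double x, y, z; };

enum class NepStatus { Reliable, Unreliable };
struct NescOut { NepStatus status; Point3 nep; };  // nep valid when Reliable

double nepDistance(const Point3& a, const Point3& b) {
    return std::sqrt((a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y) +
                     (a.z - b.z) * (a.z - b.z));
}

// NESC decision for one laser point, given its NEP in each image (if any).
NescOut nescMerge(const std::optional<Point3>& left,
                  const std::optional<Point3>& right, double stereoTh) {
    if (left && right && nepDistance(*left, *right) < stereoTh) {
        // Case 2: same point in space -> reliable NEP at the midpoint.
        return {NepStatus::Reliable,
                {(left->x + right->x) / 2, (left->y + right->y) / 2,
                 (left->z + right->z) / 2}};
    }
    // Case 1 (seen in one image only) or case 3 (NEPs too far apart):
    // the NEP(s) are kept, but marked unreliable.
    return {NepStatus::Unreliable, {}};
}
```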
At the end of the iteration, laser points which fall into the first and<br />
second categories above will have one corresponding NEP<br />
(reliable or unreliable), while those belonging to the third will have<br />
two NEPs. In the latter case, only the NEP which corresponds to the laser<br />
measure (that is, whose distance from the laser point is less than edgeTh),<br />
if any, is considered. In the unlikely case that a laser point has two different<br />
unreliable NEPs, both corresponding to the laser measure (this can happen<br />
if the chosen threshold values are such that stereoTh < 2edgeTh),<br />
only the NEP closer to the robot is considered, as a safety measure.<br />
After the reliability of the NEPs has been assessed through the NESC<br />
algorithm, reliable NEPs are used to refine alignment and double-check<br />
laser measures as described in sections 8.3 and 8.4. Instead, NEPs<br />
which have proved to be unreliable are used only if they agree with the<br />
corresponding laser measure, but they are disregarded if they contradict<br />
the laser sensor. Therefore, while reliable NEPs are used for both<br />
overlay alignment and laser correction, unreliable NEPs are used only<br />
for alignment. The idea is that if a NEP is unreliable (i.e. it appears<br />
in only one of the images), but it is corroborated by some other factor<br />
(in our case, the laser measure), it is likely to correspond to a real<br />
edge, so it can be used for alignment; instead, if it disagrees with other<br />
measures, it is likely to be a false edge, so it should be neglected.<br />
9.3 Testing<br />
Evaluation of stereoscopic visualization went as expected. Participants<br />
found the stereoscopic modality of MOSTAR interface more realistic<br />
than the monoscopic modality, and felt an increased sense of awareness<br />
of the remote environment. No quantitative results were collected,<br />
though a systematic evaluation has been planned for the future.<br />
The stereo alignment technique has proven helpful for correcting<br />
small misalignments between camera images. It has been especially<br />
useful for eliminating vertical disparity between images (for the<br />
STH-MDCS2-VAR-C cameras, vertical disparity was due to a slight<br />
difference in the position of the principal point; for the Microsoft<br />
Lifecams, it was due to inaccurate positioning).<br />
The rendering of virtual objects also benefited from stereo.<br />
Participants observed that virtual walls appeared much more realistic when<br />
observed with stereoscopic visualization. On the other hand, virtual<br />
rays and lines on walls were found confusing and tiresome to look at.<br />
This is probably due to the fact that rays and lines were not clearly<br />
visible. However, as stated in section 7, it has been observed that<br />
virtual rays and lines were useful during laser-camera calibration, so<br />
participants usually preferred to visualize them during calibration.<br />
The application of the NESC algorithm was successful. As can<br />
be seen in figure 36, after applying the NESC algorithm several<br />
stray NEPs are eliminated, while reliable NEPs (those coincident<br />
with the borders of real objects) are left relatively untouched.<br />
The NESC algorithm's results were influenced by the value of the<br />
stereoTh parameter. Given a laser point having a corresponding NEP in<br />
the left image and another in the right image, the value of stereoTh<br />
determines how close (in 3D space) the two NEPs have to be in order to be<br />
considered reliable (i.e. coincident). Higher values of stereoTh force<br />
the NESC algorithm to “trust” the edge detection and to output more<br />
reliable edges, which means that more NEPs will ultimately be used<br />
during the rendering of the overlay. Therefore, high values of stereoTh<br />
should be used when edge detection gives reliable results. During tests,<br />
a value of edgeTh/2 was used for stereoTh, with excellent results.<br />
Since in stereoscopic mode the image processing algorithms<br />
operated on two different images, the delay introduced was doubled (about<br />
90 ms). However, as stated in section 8, it was negligible with respect<br />
to the delay introduced by the communication over the network.<br />
Figure 36: (a) Highlighted NEPs in the left image. (b) Highlighted NEPs in the<br />
left image after applying the NESC algorithm.<br />
10 Conclusions<br />
This work has presented a new approach to visualization of video and<br />
sensor data in a teleguide interface. The approach is based on aug-<br />
mented reality and further enhanced by stereoscopic visualization. The<br />
approach has been implemented within the MOSTAR interface, and<br />
tested by teleoperating the MORDUC mobile robot from a distance of<br />
over 2500 km.<br />
The proposed approach displays visual and range data from a laser<br />
scanner in a unified, AR-based representation. The aim is to assist<br />
mobile robot navigation by providing depth information to the<br />
operator, in an intuitive and effective way. Colored three-dimensional<br />
virtual objects, built using laser data, are overlaid on the video image to<br />
highlight obstacles. Virtual objects are registered with real objects thanks<br />
to a simple and effective semi-automatic calibration procedure. Edge<br />
detection is used to identify nearest edge points (NEPs), which in<br />
turn are used to refine the AR registration and to point out obstacles<br />
which the laser is not aware of. The approach can be used in both<br />
monoscopic and stereoscopic display solutions. If stereo cameras are<br />
available, stereo information is used to verify reliability of edge detec-<br />
tion.<br />
The proposed approach has been implemented and a pilot test has<br />
been performed to assess its validity. The test had excellent results.<br />
Virtual objects have proven to be a valuable aid for distance estimation<br />
and for acquiring awareness of the remote environment. Semi-automatic<br />
calibration was sufficient to obtain good alignment in the vast majority<br />
of cases. Edge detection highlighted obstacles invisible to the laser,<br />
generating only a few negligible false positives; however, the alignment<br />
correction feature proved to be too sensitive to noisy edges. The<br />
approach performed well in both monoscopic and stereoscopic modes.<br />
Tests showed that it is possible to significantly reduce the<br />
number of highlighted false edges by using stereo information.<br />
Planned further developments include the refinement of features<br />
which performed poorly. Specifically, we intend to investigate<br />
computer vision methods to reliably detect object bases even in the<br />
presence of strong patterns on the floor. Besides, a method is being<br />
designed to make feedback-based calibration more intuitive. Finally, a<br />
systematic evaluation of the approach as in [59, 60] has been planned,<br />
in order to quantify the performance increase introduced by the<br />
approach.<br />
References<br />
[1] B. Davies. A review of robotics in surgery. Proceedings of the In-<br />
stitution of Mechanical Engineers, Part H: Journal of Engineering<br />
in Medicine, 214(1):129–140, 2000.<br />
[2] A.R. Lanfranco, A.E. Castellanos, J.P. Desai, and W.C. Meyers.<br />
Robotic surgery: a current perspective. Annals of Surgery, 239(1):<br />
14, 2004.<br />
[3] P. Arena, P. Di Giamberardino, L. Fortuna, F. La Gala, S. Monaco,<br />
G. Muscato, A. Rizzo, and R. Ronchini. Toward a mobile au-<br />
tonomous robotic system for Mars exploration. Planetary and<br />
Space Science, 52(1-3):23–30, 2004.<br />
[4] G. Astuti, G. Giudice, D. Longo, C.D. Melita, G. Muscato, and<br />
A. Orlando. An Overview of the “Volcan Project”: An UAS for<br />
Exploration of Volcanic Environments. Journal of Intelligent and<br />
Robotic Systems, 54(1):471–494, 2009.<br />
[5] RR Murphy. Human-robot interaction in rescue robotics. IEEE<br />
Transactions on Systems, Man, and Cybernetics, Part C: Applica-<br />
tions and Reviews, 34(2):138–153, 2004.<br />
[6] G. Muscato, D. Caltabiano, S. Guccione, D. Longo, M. Coltelli,<br />
A. Cristaldi, E. Pecora, V. Sacco, P. Sim, GS Virk, et al. ROBO-<br />
VOLC: a robot for volcano exploration result of first test campaign.<br />
Industrial Robot: An International Journal, 30(3):231–242, 2003.<br />
[7] Z. Zhang, S. Ma, Z. Lu, and B. Cao. Communication Mechanism<br />
Study of a Multi-Robot Planetary Exploration System. In IEEE<br />
International Conference on Robotics and Biomimetics (ROBIO),<br />
pages 49–54, 2006.<br />
[8] P. Milgram, S. Yin, and J.J. Grodski. An augmented reality based<br />
teleoperation interface for unstructured environments. In Proc.<br />
American Nuclear Society 7th Topical Meeting on Robotics and<br />
Remote Systems, 1997.<br />
[9] M. Baker, R. Casey, B. Keyes, and H.A. Yanco. Improved inter-<br />
faces for human-robot interaction in urban search and rescue. In<br />
Proceedings of the IEEE Conference on Systems, Man and Cyber-<br />
netics, volume 3, pages 2960–2965, 2004.<br />
[10] J. Scholtz, J. Young, J. Drury, and H. Yanco. Evaluation of human-<br />
robot interaction awareness in search and rescue. In IEEE Inter-<br />
national Conference on Robotics and Automation, volume 3, pages<br />
2327–2332, 2004.<br />
[11] H.A. Yanco and J. Drury. ‘Where am I?’ Acquiring situation aware-<br />
ness using a remote robot platform. In IEEE Conference on Sys-<br />
tems, Man and Cybernetics, pages 2835–2840, 2004.<br />
[12] M.W. Kadous, R.K.M. Sheh, and C. Sammut. Effective user in-<br />
terface design for rescue robotics. In Proceedings of the 1st ACM<br />
SIGCHI/SIGART conference on Human-robot interaction, page<br />
257. ACM, 2006.<br />
[13] S. Livatino, G. Muscato, D. De Tommaso, and M. Macaluso. Aug-<br />
mented reality stereoscopic visualization for intuitive robot tele-<br />
guide. In IEEE International Symposium on Industrial Electronics<br />
(ISIE), 2010.<br />
[14] R.T. Azuma et al. A survey of augmented reality. Presence-<br />
Teleoperators and Virtual Environments, 6(4):355–385, 1997.<br />
[15] R. Azuma, Y. Baillot, R. Behringer, S. Feiner, S. Julier, and<br />
B. MacIntyre. Recent advances in augmented reality. IEEE Com-<br />
puter Graphics and Applications, pages 34–47, 2001.<br />
[16] WS Kim, PS Schenker, AK Bejczy, and S. Hayati. Advanced<br />
graphics interfaces for telerobotic servicing and inspection. In<br />
Proc. IEEE/RSJ Int’l Conf. on Intelligent Robots and Systems,<br />
Yokohama, pages 303–309, 1993.<br />
[17] P. Milgram, S. Zhai, D. Drascic, and J. Grodski. Applications of<br />
augmented reality for human-robot communication. In Proceedings<br />
of the IEEE/RSJ International Conference on Intelligent Robots<br />
and Systems (IROS)., volume 3, 1993.<br />
[18] S. Otmane, M. Mallem, A. Kheddar, and F. Chavand. Active vir-<br />
tual guides as an apparatus for augmented reality based telemanip-<br />
ulation system on the Internet. In Annual Simulation Symposium,<br />
volume 33, pages 185–191, 2000.<br />
[19] JWS Chong, SK Ong, AYC Nee, and K. Youcef-Youmi. Robot<br />
programming using augmented reality: An interactive method for<br />
planning collision-free paths. Robotics and Computer Integrated<br />
Manufacturing, 25(3):689–701, 2009.<br />
[20] T.H.J. Collett and B.A. MacDonald. Developer oriented vis-<br />
ualisation of a robot program. In Proceedings of the 1st<br />
ACM SIGCHI/SIGART conference on Human-robot interaction,<br />
page 56. ACM, 2006.<br />
[21] B. Giesler, T. Salb, P. Steinhaus, and R. Dillmann. Using aug-<br />
mented reality to interact with an autonomous mobile platform.<br />
In Proceedings of IEEE International Conference on Robotics and<br />
Automation (ICRA), volume 1, 2004.<br />
[22] D.J. Bruemmer, D.D. Dudenhoeffer, and J. Marble. Dynamic au-<br />
tonomy for urban search and rescue. In Proceedings of the AAAI<br />
Mobile Robot Workshop, 2002.<br />
[23] V. Brujic-Okretic, J.Y. Guillemaut, LJ Hitchin, M. Michielen, and<br />
GA Parker. Remote vehicle manoeuvring using augmented reality.<br />
In International Conference on Visual Information Engineering<br />
(VIE), pages 186–189, 2003.<br />
[24] R. Meier, T. Fong, C. Thorpe, and C. Baur. A sensor fusion<br />
based user interface for vehicle teleoperation. In Proceedings of<br />
the IEEE International Conference on Field and Service Robotics<br />
(FSR), 1999.<br />
[25] F. Ferland, F. Pomerleau, C.T. Le Dinh, and F. Michaud. Ego-<br />
centric and exocentric teleoperation interface using real-time, 3D<br />
video projection. In Proceedings of the 4th ACM/IEEE interna-<br />
tional conference on Human robot interaction, pages 37–44. ACM,<br />
2009.<br />
[26] C.W. Nielsen, M.A. Goodrich, and R.W. Ricks. Ecological inter-<br />
faces for improving mobile robot teleoperation. IEEE Transactions<br />
on Robotics, 23(5):927, 2007.<br />
[27] C. Demiralp, CD Jackson, DB Karelitz, S. Zhang, and DH Laid-<br />
law. Cave and fishtank virtual-reality displays: A qualitative and<br />
quantitative comparison. IEEE Transactions on Visualization and<br />
Computer Graphics, 12(3):323–330, 2006.<br />
[28] D. Drascic. Skill acquisition and task performance in teleoperation<br />
using monoscopic and stereoscopic video remote viewing. In Hu-<br />
man Factors and Ergonomics Society Annual Meeting Proceedings,<br />
volume 35, pages 1367<strong>–</strong>1371, 1991.<br />
[29] M. Ferre, R. Aracil, and M.A. Sanchez-Uran. Stereoscopic human interfaces. IEEE Robotics & Automation Magazine, 15(4):50–57, 2008.

[30] G.S. Hubona, G.W. Shirah, and D.G. Fout. The effects of motion and stereopsis on three-dimensional visualization. International Journal of Human-Computer Studies, 47(5):609–627, 1997.

[31] G. Jones, D. Lee, N. Holliman, and D. Ezra. Controlling perceived depth in stereoscopic images. In Stereoscopic Displays and Virtual Reality Systems VIII, Proceedings of SPIE, volume 4297, pages 42–53, 2001.

[32] I. Sexton and P. Surman. Stereoscopic and autostereoscopic display systems. IEEE Signal Processing Magazine, 16(3):85–99, 1999.

[33] Wikipedia. Augmented reality. http://en.wikipedia.org/wiki/Augmented_reality, 2010.

[34] M. Billinghurst, I. Poupyrev, H. Kato, and R. May. Mixing realities in shared space: An augmented reality interface for collaborative computing. In ICME 2000, pages 1641–1644, 2000.

[35] P. Milgram and F. Kishino. A taxonomy of mixed reality visual displays. IEICE Transactions on Information and Systems, E77-D(12):1321–1329, 1994.

[36] R. Azuma. Tracking requirements for augmented reality. Communications of the ACM, 36(7):51, 1993.

[37] A.J. Davison, I.D. Reid, N.D. Molton, and O. Stasse. MonoSLAM: Real-time single camera SLAM. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):1052–1067, 2007.
[38] W.A. Hoff, K. Nguyen, and T. Lyon. Computer vision-based registration techniques for augmented reality. Proceedings of Intelligent Robots and Computer Vision XV (SPIE), 2904:538–548, 1996.

[39] H. Kato and M. Billinghurst. Marker tracking and HMD calibration for a video-based augmented reality conferencing system. In Proceedings of the 2nd IEEE and ACM International Workshop on Augmented Reality, pages 85–94, 1999.

[40] Wikipedia. Pinhole camera. http://en.wikipedia.org/wiki/Pinhole_camera, 2010.

[41] J. Heikkila and O. Silven. A four-step camera calibration procedure with implicit image correction. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1106–1112, 1997.

[42] C.C. Slama, C. Theurer, and S.W. Henriksen. Manual of photogrammetry. American Society of Photogrammetry, Falls Church, Virginia, 1980.

[43] T. Melen. Geometrical modelling and calibration of video cameras for underwater navigation. PhD thesis, Institutt for teknisk kybernetikk, Universitetet i Trondheim, 1994.

[44] W. Faig. Calibration of close-range photogrammetric systems: Mathematical formulation. Photogrammetric Engineering and Remote Sensing, 41(12):1479–1486, 1975.

[45] J. Weng, P. Cohen, and M. Herniou. Camera calibration with distortion models and accuracy evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(10):965–980, 1992.
[46] L. Lipton. StereoGraphics Developers Handbook. StereoGraphics Corporation, 1991.

[47] L. Lipton. Foundations of the stereoscopic cinema: a study in depth. Van Nostrand Reinhold, 1982.

[48] Wikipedia. Anaglyph image. http://en.wikipedia.org/wiki/Anaglyph_image, 2010.

[49] S.E.B. Sorensen, P.S. Hansen, and N.L. Sorensen. Method for recording and viewing stereoscopic images in color using multichrome filters. US Patent 6,687,003, February 3, 2004.

[50] Wikipedia. Stereoscopy. http://en.wikipedia.org/wiki/Stereoscopy, 2010.

[51] M. Halle. Autostereoscopic displays and computer graphics. In ACM SIGGRAPH Courses, page 104. ACM, 2005.

[52] Wikipedia. HSL and HSV color spaces. http://en.wikipedia.org/wiki/HSL_and_HSV, 2010.

[53] J. Borenstein and Y. Koren. Histogramic in-motion mapping for mobile robot obstacle avoidance. IEEE Transactions on Robotics and Automation, 7(4):535–539, 1991.

[54] H. Baltzakis, A. Argyros, and P. Trahanias. Fusion of laser and visual data for robot motion planning and collision avoidance. Machine Vision and Applications, 15(2):92–100, 2003.

[55] D.J. Bruemmer, R.L. Boring, D.A. Few, J. Marble, and M.C. Walton. “I call shotgun!”: An evaluation of mixed-initiative control for novice users of a search and rescue robot. In Proceedings of the IEEE Conference on Systems, Man & Cybernetics, 2004.
[56] J.J. Gibson. The ecological approach to visual perception. Houghton Mifflin, Boston, 1979.

[57] R.J. Rost. OpenGL® Shading Language. Addison-Wesley Longman Publishing Co., Inc., Redwood City, CA, USA, 2004.

[58] DIEES, University of Catania. 3MORDUC. http://www.robotic.diees.unict.it/robots/morduc/morduc.htm, 2010.

[59] S. Livatino, G. Muscato, S. Sessa, C. Koffel, C. Arena, A. Pennisi, D. Di Mauro, and E. Malkondu. Mobile robotic teleguide based on video images. IEEE Robotics & Automation Magazine, 15(4):58–67, 2008.

[60] S. Livatino, G. Muscato, S. Sessa, and V. Neri. Depth-enhanced mobile robot teleguide based on laser images. Mechatronics, in press, 2010.

[61] J. Corde Lane, R. Carignan, B.R. Sullivan, D.L. Akin, T. Hunt, and R. Cohen. Effects of time delay on telerobotic control of neutral buoyancy vehicles. In Proceedings of the IEEE International Conference on Robotics and Automation, volume 3, pages 2874–2879, 2002.

[62] Y. Bok, Y. Hwang, and I.S. Kweon. Accurate motion estimation and high-precision 3D reconstruction by sensor fusion. In IEEE International Conference on Robotics and Automation, pages 4721–4726, 2007.

[63] Q. Zhang and R. Pless. Extrinsic calibration of a camera and laser range finder (improves camera calibration). In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), volume 3, 2004.
[64] OpenGL website. http://www.opengl.org, 2010.

[65] GLUT website. http://www.opengl.org/resources/libraries/glut/, 2010.

[66] OpenCV website. http://sourceforge.net/projects/opencvlibrary/, 2010.

[67] R. Williams and B. Andrews. The non-designer's design book. Peachpit Press, Berkeley, 1994.

[68] Paul Bourke. Nonlinear Lens Distortion. http://local.wasp.uwa.edu.au/~pbourke/miscellaneous/lenscorrection/#opengl, August 2000.

[69] Graphics Size Coding. Tiny distortion shader. http://sizecoding.blogspot.com/2007/10/tiny-distortion-shader.html, October 2007.

[70] J. Canny. A computational approach to edge detection. In Readings in Computer Vision: Issues, Problems, Principles, and Paradigms, page 184, 1987.

[71] I. Sobel and G. Feldman. A 3×3 isotropic gradient operator for image processing. Presentation at the Stanford Artificial Intelligence Project, 1968.