Laser-based detection and tracking of multiple people in crowds

Computer Vision and Image Understanding 106 (2007) 300–312

www.elsevier.com/locate/cviu

Laser-based detection and tracking of multiple people in crowds

Jinshi Cui a,*, Hongbin Zha a, Huijing Zhao b, Ryosuke Shibasaki b

a National Laboratory on Machine Perception, Peking University, Beijing, China

b Centre for Spatial Information Science, University of Tokyo, Tokyo, Japan

Received 5 December 2005; accepted 24 July 2006

Available online 16 December 2006

Communicated by James Davis and Riad Hammoud

Abstract

Laser-based people tracking systems have been developed for mobile robotics and surveillance applications. Most of them rely on a laser-point clustering method to extract object locations. However, for dense crowd tracking, laser points of different objects are often interlaced and indistinguishable due to measurement noise and cannot provide reliable features. This makes current systems quite fragile and unreliable. This paper presents a novel and robust laser-based dense crowd tracking method. Firstly, we propose a stable feature extraction method based on the accumulated distribution of successive laser frames. With this method, the noise that generates split and merged measurements is smoothed away, and the pattern of rhythmically swinging legs is utilized to extract each leg of a person. Then, a region coherency property is introduced to construct an efficient measurement likelihood model. The final tracker is based on the combination of independent Kalman filters and a Rao-Blackwellized Monte Carlo data association filter (RBMC-DAF). In real experiments, we obtain raw data from multiple registered laser scanners, which measure the two legs of each person at a height of 16 cm above the horizontal ground. Evaluation with real data shows that the proposed method is robust and effective. It achieves a significant improvement over existing laser-based trackers. In addition, the proposed method is much faster than previous works and can overcome tracking errors resulting from mixed data.

© 2006 Elsevier Inc. All rights reserved.

Keywords: People detection; People tracking

1. Introduction

The detection and tracking of multiple people is a fundamental problem that arises in a variety of different contexts. Examples include intelligent surveillance for security purposes, scene analysis for service robots, crowds' behavior analysis for human behavior study, traffic flow analysis, and many others.

Over the last several years, an increasing number of laser-based people tracking systems have been developed on both mobile robotics platforms [1–6] and fixed platforms

This work was supported in part by the NKBRPC (No. 2006CB303100), NSFC Grant (No. 60333010) and NSFC Grant (No. 60605001).

* Corresponding author.

E-mail address: cjs@cis.pku.edu.cn (J. Cui).

[7–9] using one or multiple laser scanners. It has been proved that processing laser scanner data makes a tracker much faster and more robust than vision-based ones in complex situations with varied weather or lighting conditions.

However, all these systems are based on a basic assumption that laser points belonging to the same person can easily be clustered or grouped as one feature point. Data association is then used for multiple people tracking. In real experiments with an unknown number of people, especially when dealing with crowded environments, such systems suffer greatly from the poor features provided by a laser scene. Laser points of different objects are often interlaced and indistinguishable and cannot provide reliable features.

The same problem arose in our previous work [8]. The experimental results showed that in some cases, a simple

1077-3142/$ - see front matter © 2006 Elsevier Inc. All rights reserved.

doi:10.1016/j.cviu.2006.07.015


clustering method will fail to detect a person due to clutter from other objects that move with people, such as nearby people or luggage. And when two people walk across each other or their legs are too close together, tracking errors or broken trajectories are likely. To ease the understanding of laser scan data, we show a registered laser scan image (see Fig. 1) with four laser scanners scanning at a height of 16 cm above the horizontal ground. In Fig. 1, nearly 30 people are separately distributed in an open area. Red circles denote laser scanners. White points are foreground laser points, mainly humans' legs. Green points are background laser points, including walls, chairs, etc. Fig. 2 is one sample of the fusion of an image frame and a laser scan frame. This image is just used to illustrate what the laser points mean. Each cluster of laser points that belongs to a human leg has been manually circled.

In this paper, we propose a robust tracker to detect and track multiple people in a crowded and open area. We first obtain raw data that measures the two legs of each person at a height of 16 cm above the horizontal ground with multiple registered laser scanners. Then, a kind of stable feature is extracted using the accumulated distribution of successive laser frames. In this way, the noise that generates split and

Fig. 1. A laser scan image. Red circles denote laser scanners. White points are foreground laser points, mainly humans' legs. Green points are background laser points.

Fig. 2. One sample of the fusion of an image frame and a laser scan frame. Each cluster of laser points that belongs to a human leg has been manually circled.

merged measurements is smoothed away very well. A region coherency property is utilized to construct an efficient measurement likelihood model. Then, a tracker based on the combination of independent Kalman filters and a Rao-Blackwellized Monte Carlo data association filter (RBMC-DAF) is introduced. Evaluation with real data shows that the proposed method is robust and effective and deals very well with the most well-known difficulties encountered by conventional laser-based trackers, e.g. measurement split/merge and temporal occlusion.

The remainder of this paper is organized as follows. After a discussion of related work in the following section, the system architecture and data collection are briefly introduced in Section 3. Then, the feature extraction approach and its evaluation are described, respectively, in Sections 4 and 5. In Section 6, we present the tracking framework. Finally, tracking results are presented in Section 7.

2. Related work

Research on laser-based people tracking originated from Prassler's work [1]. In recent years, laser scanners have become much cheaper and scan rates have become higher than before (from 3 fps used in [1–4] to 30 fps used in [8]). In the context of robotic technology, a laser-based people tracker [2–6] has been a fundamental part of a mobile robotic system. These trackers mainly focus on how to correctly detect moving people and distinguish people from static objects with a mobile platform, and then track one or a few moving people surrounding the mobile robot using the successive laser scan images. On the other hand, in the context of intelligent monitoring and surveillance, multiple laser scanners [7,8] are deployed to cover a wide area. In this case, the task is to effectively extract individual persons from cluttered and scattered laser points, and then simultaneously track a large number of people robustly and reliably. Thus, we can see that, for both kinds of laser trackers, whether on a static or a mobile platform, there are two fundamental aspects: people extraction, i.e. people detection, and data association.

Clustering or grouping in each scan image is the most commonly used, and almost the only, people extraction strategy for existing laser-based trackers [1–8].

In [1], a grid-map representation is used for the detection of moving cells. One group of nearby cells is considered as one person. Then, trajectories of moving targets are obtained by a nearest-neighbor criterion between groups of cells marked as moving in consecutive scan samples.

In [2], at each time step, the laser scan image is segmented and further split into point sets representing objects. First, the scan is segmented into densely sampled parts. In the second step, these parts are split into subsequences describing ''almost convex'' objects. They used the assumption that there are distance gaps between distinct objects, and a threshold value is used to find distinct objects. For


tracking, they represent the motion of objects between successive scan images as flows in bipartite graphs. Then, using network optimization techniques from graph theory, they obtain plausible assignments of objects across successive scans.

In [3], violation points (which correspond to moving objects) are first found in each range scan. All detected violations are then viewed as Gaussian impulses living on the two-dimensional world map. New hypotheses are created at violation points that have a function value over a certain threshold. In fact, this is a continuous version of the point clustering process instead of a discrete one. Previous hypotheses of moving objects are propagated by a gradient ascent method on the function. In essence, this is a local searching method, similar to nearest-neighbor searching.

In [4], a mobile platform is equipped with two laser-range scanners mounted at a height of 40 cm. Each local minimum in the range profile of the laser range scan is considered as a feature that represents an object. Moving objects such as persons are distinguished from static objects by computing local occupancy grid maps. In [5,6], a simple clustering method is used for object extraction, similar to the first step described in [3]. For tracking, novel data association and tracking algorithms are proposed that incorporate particle filters and the JPDA filter. This sampling-based approach has the advantage that there are no restrictions on the analytic form of the model, although the required number of particles for a given accuracy can be very high. And due to the weakness of the measurement likelihood model, the tracking performance is greatly dependent on the performance of the filter.

The works above mainly focus on tracking one or a few moving persons surrounding the mobile robot, and up to now, only a few works [7,8] aim at tracking a large number of people with fixed laser scanners. In [7], the authors used multiple laser scanners at waist height. By subtraction from a background model, the foreground was obtained. They define a blob as a grouping of adjacent foreground readings that appear to be on a continuous surface, and assume that measurements that are spatially separated by less than 10 cm belong to the same blob. Scanning at waist height suffers greatly from occlusions by nearby people and from unpredictable range reflections from swinging arms, hand bags, coats, etc., which are difficult to model for accurate tracking. For tracking, they associate a Kalman filter with each object to alleviate the consequences of occlusions and to reduce the impact of occlusions and model inaccuracies.

Compared with other works, the system described in [8] gives the most promising results for tracking a large number of people simultaneously. The laser scanners are on the ground at a height of 20 cm. At this height, one person generates two point clusters, one for each foot. Simple clustering is also used to extract the moving feet. Then a given distance range is used to group two nearby feet as one step. The following conditions are used for data association. Firstly, two step candidates in successive frames overlap at the position of at least one foot candidate. Secondly, the motion vector decided by the other pair of non-overlapping foot candidates changes smoothly along the frame sequence. The experimental results show that in some cases, the simple clustering fails to detect a person due to clutter from other objects that move with people, such as nearby people or luggage. And when two people walk across each other or their feet are too close together, tracking errors or broken trajectories are likely.

3. System architecture and data collection

Multiple single-row laser-range scanners are exploited in our experiments. For each laser scan, one laser scanner profiles 360 range distances equally over 180° on the scanning plane at a frequency of 30 fps. The range data can easily be converted into rectangular coordinates (laser points) in the sensor's local coordinate system [8]. The scanners are set to scan horizontally at ground level, so that cross-sections at the same horizontal level of about 16 cm, containing the data of moving objects (e.g., human legs) as well as still objects (e.g., building walls, desks, chairs, and so on), are obtained in a rectangular coordinate system of real dimensions. Laser points of moving objects are obtained from background image subtraction. Then, moving points from multiple laser scanners are temporally and spatially integrated into a global coordinate system.

For registration, the laser scans keep a degree of overlap with each other. Relative transformations between the local coordinate systems of neighboring laser scanners are calculated by pairwise matching of their background images using the measurements of common objects. In the case that common features in the overlapping area are too few for automated registration, an initial value is first assigned through manual operation, followed by automated fine-tuning. Assigning an initial value to the laser scanners' relative pose is not a tough task here: as the two-dimensional laser scans are assumed to coincide in the same horizontal plane, operators can shift and rotate one laser scan over the other to find the best matching between them. Specifying one local coordinate system as the global one, the transformations from each local coordinate system to the global one are calculated by sequentially aligning the relative transformations, followed by a least-squares-based adjustment to solve the error accumulation problem. A detailed treatment of registering multiple laser scanners can be found in [15].
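The sequential alignment of relative transformations described above can be sketched with homogeneous 2-D rigid transforms. This is an illustrative reconstruction, not the authors' code; the scanner poses below are made-up values.

```python
import numpy as np

def rigid2d(theta, tx, ty):
    """Homogeneous 2-D rigid transform (rotation by theta + translation)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0,  0,  1]])

# Chain pairwise scanner-to-scanner transforms toward a chosen global
# frame (scanner 0), as in the sequential alignment step:
T_1_to_0 = rigid2d(np.pi / 2, 5.0, 0.0)   # hypothetical scanner 1 -> scanner 0
T_2_to_1 = rigid2d(0.0, 0.0, 3.0)         # hypothetical scanner 2 -> scanner 1
T_2_to_0 = T_1_to_0 @ T_2_to_1            # composed transform to global frame

point_local = np.array([1.0, 0.0, 1.0])   # a laser point in scanner 2's frame
point_global = T_2_to_0 @ point_local
```

In practice the composed transforms would then be refined jointly (the least-squares adjustment mentioned above) so that errors do not accumulate along the chain.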

One of the major differences between our system and other research efforts is that we put the laser scanners at ground level (about 16 cm above the ground surface), scan pedestrians' feet, and track the pattern of rhythmically swinging feet. There are two reasons for targeting pedestrians' feet. First, the swinging feet of a normal pedestrian, whether a child or an adult, tall or short, can be scanned at ground level with the least occlusion. In addition, the data of swinging feet can be modeled as the same pattern and tracked simply and uniformly.


Fig. 3. (a) A picture of the demonstration site in (b) with two laser scanners located in the middle. (b) A map of the sensors' locations and measurement coverage in the experiment.

Four laser scanners are located on the floor. The laser scans cover an area of about 30 × 30 m² around our demonstration corner, as shown in Fig. 3(a). In addition, a video camera is set on the top of a booth, about 3.0 m above the floor, monitoring visitors at a slant angle and covering a floor area of about 5 × 6 m². A map of the sensors' locations and measurement coverage is shown in Fig. 3(b). An illustration of the system architecture is shown in Fig. 4. The laser scanners used in the experiment are SICK LMS200 units. Each sensor is controlled by an IBM ThinkPad X30 or X31. They are connected through a 10/100 Base LAN to a server PC.

Here, the camera has two functions. The first is for demonstration, augmented with laser points and estimated trajectories. The second is as a data resource that is fused with the laser points to improve the tracking performance, as presented in our previous work [9].

4. Feature extraction and people detection

4.1. Single-frame clustering

People detection from a single laser scan frame suffers greatly from the poor features provided by the laser points. Existing clustering-based people extraction methods use the assumption that there are distance gaps between distinct objects. A threshold value is then used for clustering to find distinct objects. However, in real scenes with crowded people, the clustering-based detection maps do not reflect the real positions of legs. Occlusions make the points that belong to one leg split into multiple point clusters, or there may even be no points belonging to the leg. In addition, mutual interactions make the points belonging to two different legs merge into one cluster.
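The distance-gap clustering that these methods rely on can be sketched as follows. This is a minimal illustration of the general strategy, not the authors' implementation; the 0.1 m default mirrors the 10 cm blob rule cited from [7].

```python
import numpy as np

def gap_cluster(points, gap=0.1):
    """Cluster scan points by splitting whenever the gap between
    consecutive points exceeds a threshold (in meters).

    `points` is an (N, 2) array ordered by scan angle; returns a
    list of point-cluster arrays."""
    clusters, current = [], [points[0]]
    for p in points[1:]:
        # a large jump between neighbors marks an object boundary
        if np.linalg.norm(p - current[-1]) > gap:
            clusters.append(np.array(current))
            current = []
        current.append(p)
    clusters.append(np.array(current))
    return clusters
```

As the text explains, this simple rule breaks down in crowds: occlusion splits one leg into several clusters, and close interaction merges two legs into one.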

In Figs. 5 and 6, the raw data of one single laser scan frame and the clustering result from these data are shown, respectively. In the raw image, the laser points of each person are manually circled. There are four persons in total in the scene. In the result, only one person is correctly detected, i.e. both legs are extracted correctly with no noise. Only one leg is extracted for each of two persons, and there is a noisy detection for the remaining person. It is thus rather difficult to track multiple people in crowds with such a clustering result.

4.2. Accumulated distribution and leg detection

Fig. 4. An illustration of the system architecture. Only two laser scanners are depicted; in this paper, we use four laser scanners.

In this section, we propose a novel people extraction method called accumulated distribution. Profiting from the high data-sampling rate, it is reasonable to obtain successive range scan images with only subtle changes. Time-accumulation means accumulating the count of laser points at the


Fig. 5. Raw data of one single frame (in part).

Fig. 6. Single-frame clustering result.

same pixel over multiple successive frames as the intensity of that pixel (Fig. 7 shows a time-accumulation image). If an object stops at a position for a while, the laser points belonging to it will accumulate at nearby positions (considering the measurement noise). Thus, in the final time-accumulation image, the intensity of the pixel corresponding to the object position will be much higher than that of other pixels and appears as a maximum in its neighborhood.
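The time-accumulation step can be sketched as a 2-D histogram over a sliding window of frames. This is an illustrative reconstruction; the grid shape, 5 cm cell size, and 30-frame window are assumed parameters, not values from the paper.

```python
import numpy as np

def accumulate(frames, grid_shape=(600, 600), cell=0.05, window=30):
    """Build a time-accumulation image: count how often a laser point
    falls into each grid cell over the last `window` frames.

    `frames` is a list of (N_i, 2) arrays of foreground points in
    meters; the pixel intensity is the accumulated point count."""
    img = np.zeros(grid_shape)
    for pts in frames[-window:]:
        ix = (pts[:, 0] / cell).astype(int)
        iy = (pts[:, 1] / cell).astype(int)
        ok = (ix >= 0) & (ix < grid_shape[0]) & (iy >= 0) & (iy < grid_shape[1])
        # unbuffered accumulation handles repeated indices correctly
        np.add.at(img, (ix[ok], iy[ok]), 1)
    return img
```

A point that stays put across frames (an inactive foot) piles up counts in one cell, which is exactly the bright local maximum the detector looks for.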

According to the usual walking model of humans, when a normal walking person steps forward, a typical appearance is that, at any moment, one foot swings by pivoting on the other, as shown in Fig. 10. The two feet interchange their duty through landing and moving shifts in a rhythmic pattern. It was reported [11] that muscles act only to establish an initial position and velocity of the feet during the beginning half of the swing phase and then remain inactive throughout the other half of the swing phase. Indeed, from Fig. 7 we can see that the brighter points in the accumulated image are directly related to the inactive feet of persons. If we can accurately locate these points, they can provide very direct and stable cues to infer the trajectories of walking people.

The distribution of laser points in the image is discrete, and it is hard to directly locate the points with maximal intensity. Parzen window density estimation [10] is a well-known non-parametric method to estimate a distribution from sample data, and we utilize it to convert the discrete sample points into a continuous density function. The general form of the density is then:

\hat{p}(x) = \frac{1}{n} \sum_{i=1}^{n} \phi(x - x_i; h) \quad (1)

in which {x_1, \ldots, x_n} is a set of d-dimensional samples (d = 2 in this case), \phi(\cdot) is the window function and h is the window width parameter. Parzen showed that \hat{p}(x) converges to the true density if \phi(\cdot) and h are selected properly [10].

The most popular window function is the Gaussian distribution, and we also choose it because of its good properties:

\phi(z; h) = \frac{1}{(2\pi)^{d/2} h^d |\Sigma|^{1/2}} \exp\left(-\frac{z^T \Sigma^{-1} z}{2h^2}\right) \quad (2)

where \Sigma is a covariance matrix; we simply use the identity matrix, considering its property of isotropy in two dimensions. The selection of h here depends on the size of the foot region. One result of the accumulated image after Parzen windowing is shown in Fig. 8.
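Eqs. (1) and (2) with the identity covariance reduce to a standard Gaussian kernel density estimate, which can be sketched as follows. The bandwidth value passed in is illustrative; as the text notes, h should be chosen on the order of the foot-region size.

```python
import numpy as np

def parzen_density(x, samples, h):
    """Parzen-window density estimate (Eq. (1)) with the isotropic
    Gaussian kernel of Eq. (2) (Sigma = identity, d = 2).

    `samples` is an (n, d) array of accumulated laser points and
    `x` is the d-dimensional query point."""
    n, d = samples.shape
    sq = np.sum((samples - x) ** 2, axis=1)   # z^T z, since Sigma = I
    norm = (2 * np.pi) ** (d / 2) * h ** d    # |Sigma|^(1/2) = 1
    return np.mean(np.exp(-sq / (2 * h ** 2)) / norm)
```

Evaluating this on a grid turns the discrete accumulation image into the smooth intensity surface of Fig. 8, whose local maxima can then be searched.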

For the search of local maxima in a kernel-based density, mean-shift is a common method and should be able to give promising results. Considering the real-time requirement of tracking and the large amount of data, we choose a simple local search strategy, which is proven very effective and fast by our experiments.

In summary, we process the laser data with accumulation, Parzen window filtering, and local maximum search at

Fig. 7. Accumulated image at frame 364; the data are represented as a set of discrete points.

Fig. 8. After Parzen windowing of Fig. 7, with a continuous intensity distribution.

every time step. A number of measurements are thereby obtained, each of which represents one foot that remains static for a while in a small region, so that its intensity is a local maximum in the current accumulated image. Since one local maximum might appear in several successive frames, only newly appeared ones are considered as measurements at the current time (Fig. 9).
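The paper does not spell out its "simple local search strategy"; one plausible reading, sketched below under that assumption, is a neighborhood-maximum test with an intensity threshold (both the radius and the threshold are made-up parameters).

```python
import numpy as np

def local_maxima(img, radius=3, min_val=1.0):
    """Detect measurement candidates: a pixel is kept if it attains
    the maximum of its (2r+1) x (2r+1) neighborhood and its intensity
    exceeds a threshold. Returns (row, col) pixel coordinates."""
    peaks = []
    h, w = img.shape
    for r in range(radius, h - radius):
        for c in range(radius, w - radius):
            patch = img[r - radius:r + radius + 1, c - radius:c + radius + 1]
            if img[r, c] >= min_val and img[r, c] == patch.max():
                peaks.append((r, c))
    return peaks
```

Peaks that already appeared in earlier frames would then be suppressed, matching the rule that only newly appeared maxima become measurements.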


5. Evaluation of the detection algorithm

We evaluated our detection algorithm on a sequence of 1000 frames (about 33 s at 30 fps). The detection performance is evaluated by comparing the true count and the estimated count of persons in the scene. Two different situations are considered separately. One considers the whole laser coverage of the sensors (situation I). Laser coverage here means the area that is measured by at least one laser scanner, considering only the occlusions from the physical layout of the environment; this area is about 30 m × 30 m. The other situation (situation II) considers the central area of the laser coverage that is measured by at least two laser scanners and is not too far from the locations of the laser scanners, considering occlusions from moving people as well as the physical layout; this area is about 20 m × 18 m (see Fig. 3(b)).

Because of the difficulty and heavy workload of obtaining a ground-truth count of multiple people, we evaluated our detection algorithm with sampled results at an interval of 50 frames. In Fig. 11, the count of correctly detected persons is compared with the true count at every 50 frames. In Table 1, the detection ratios are listed. The detection ratio in situation I is much lower than in situation II. The reason is that in situation I, persons with few points are also included, as shown in Fig. 12. In some remote corners, persons are almost occluded by the environment, and thus only two or three points of one leg are visible. In situation II, only the central area is considered, and these corner areas have been excluded. Consequently, the detection ratios greatly increase.

The errors come from two sources: occlusion and noise. Most of the detection failures from occlusions can be recovered by the time-accumulation computation within the following 1–10 frames. Noisy measurements mainly come

Fig. 10. Walking model.

[Fig. 11 plot: count of persons (y-axis, 0–50) versus laser scan frame (x-axis, 0–1000).]

Table 1
Detection ratios

               Highest ratio (%)   Lowest ratio (%)   Average ratio (%)
Situation I    97.67               85                 91.41
Situation II   100                 90.48              96.41


Fig. 11. Evaluation of the detection results in the two situations. T1: true count of persons in the whole area (situation I). E1: detection result in the whole area. T2: true count of persons in the central area (situation II). E2: detection result in the central area.

Fig. 12. Difficulty in people detection for situation I: only three points are measured for one leg.

Fig. 9. Leg extraction result on Fig. 8; each circle denotes an inactive leg.

from luggage dragged by the persons or from measurements split by partial occlusion. In Fig. 11 and Table 1, only the detection results at single frames are considered. Another evaluation can be done considering the detection ratio of persons through the whole


sequence of 1000 frames. A total of 96 persons appeared. Only 1 person was missed in the whole sequence, because of severe occlusion. 93 persons were successfully detected within 5 frames, and 95 persons within 10 frames, after their first appearance in the scene. In addition, there were 6 noisy measurements.

6. Bayesian tracking

After the inactive legs are detected, they are stored as measurements at the current time. As there are multiple targets and multiple measurements, the direct estimation of the target states is difficult due to the unknown data associations. So-called data association is the problem of associating these measurements with the trajectories from previous frames. To address the problem, the target states can be augmented with the unknown associations, and the joint distribution of states and associations is estimated sequentially. In the following, the probabilistic tracking model is introduced first, and then the sequential inference process using Bayes' rule is described, which is difficult to compute analytically. Finally, two strategies are presented to simplify the computation, respectively, for the case of independently moving targets and the case of interacting targets.

6.1. Probabilistic tracking model

Here, we describe a probabilistic model for tracking that addresses the problem of multiple measurements and multiple targets. We assume that there are T targets, where T is fixed, and write their joint state as X_k. At each time step we have M measurements Y_k, where M can change at each time step. The data-association set is denoted as \theta_k. In this paper, we assume one target can generate at most one measurement, and one measurement can be generated by at most one target.

First, we specify the joint distribution P(X_{0:k}, Y_{1:k}, \theta_{1:k}) over the actual measurements Y_{1:k}, data associations \theta_{1:k}, and states X_{0:k} of the targets between time steps 0 and k:

P(X_{0:K}, Y_{1:K}, \theta_{1:K}) = P(X_0) \prod_{k=1}^{K} P(X_k \mid X_{k-1}) \, P(Y_k \mid \theta_k, X_k) \, P(\theta_k)

where we have assumed that the target motion is Markov, each measurement set Y_k is conditionally independent given the current state X_k, and X_k depends only on the previous time step. Since the actual state X_k of the targets does not provide any information on the data association, we also assume that the prior over data associations P(\theta_k) does not depend on the target state:

P(Y_k, \theta_k \mid X_k) = P(Y_k \mid \theta_k, X_k) \, P(\theta_k) \quad (3)

It is convenient to write inference in this model recursively via the Bayes filter. The objective is to infer the current position X_k of the targets given all of the measurements Y_{1:k} observed so far. In particular, the posterior distribution P(X_k \mid Y_{1:k}) over the joint state X_k of all present targets given all observations Y_{1:k} = \{Y_1, \ldots, Y_k\} up to and including time k is updated according to the recursive formula

P(X_k \mid Y_{1:k}) = c \sum_{\theta_k} P(X_k, \theta_k \mid Y_{1:k}) = c \sum_{\theta_k} P(Y_k \mid X_k, \theta_k) \, P(\theta_k) \int_{X_{k-1}} P(X_k \mid X_{k-1}) \, P(X_{k-1} \mid Y_{1:k-1}) \quad (4)

where c is a normaliz**in**g constant.

Usually, this sequential update equation cannot be solved analytically. Further assumptions are required to simplify the model, as introduced in Sections 6.2 and 6.3. In the sections below we concentrate on deriving an expression for the posterior P(X_k \mid Y_{1:k}) over both X_k and the data association \theta_k, by providing further details on the motion model P(X_k \mid X_{k-1}) and the measurement model P(Y_k \mid X_k, \theta_k).

6.1.1. State space and observation space

The state space for each target includes both position and velocity, X_{k,i} = [x_{k,i}, y_{k,i}, vx_{k,i}, vy_{k,i}]^T, i = 1, \ldots, T, at time step k. Measurements are simply 2-d positions, Y_{k,j} = [u_{k,j}, v_{k,j}]^T, j = 1, \ldots, M.

6.1.2. The motion model

For the motion model, we assume a standard linear-Gaussian model. That is, we assume that the initial joint state is Gaussian,

P(X_0) = N(X_0; m_0, V_0), \quad m_0 = \{m_{0,1}, \ldots, m_{0,T}\}, \quad V_0 = \{V_{0,1}, \ldots, V_{0,T}\} \quad (5)

where m_0 is the mean and V_0 is the corresponding covariance matrix. In addition, we assume that targets move according to a linear model with additive Gaussian noise,

P(X_k \mid X_{k-1}) = N(X_k; A X_{k-1}, Q_{k-1}) \quad (6)

where Q k 1 is the prediction covariance **and** A is a l**in**ear

prediction matrix. We model the motion **of** each target

**in**dependently with a constant velocity model, i.e.

$$A = \mathrm{diag}\{A_1, \ldots, A_T\}, \quad A_i = \begin{bmatrix} I_{2\times 2} & I_{2\times 2} \\ 0 & I_{2\times 2} \end{bmatrix}, \quad i = 1, \ldots, T \qquad (7)$$

where I_{2×2} denotes the 2-by-2 identity matrix.
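As a concrete illustration (ours, not part of the paper), the block-diagonal constant-velocity model of Eqs. (6) and (7) can be sketched in a few lines of Python; the target count and all numeric values below are invented for the example, and the unit time step is assumed.

```python
import numpy as np

def cv_transition(T):
    """Block-diagonal constant-velocity transition matrix for T targets,
    each with state [x, y, vx, vy] (a sketch of Eq. (7), unit time step)."""
    I2 = np.eye(2)
    A_i = np.block([[I2, I2], [np.zeros((2, 2)), I2]])
    return np.kron(np.eye(T), A_i)  # all A_i are identical, so kron builds diag{A_1,...,A_T}

# One noise-free prediction step of Eq. (6): the mean moves as X_k = A X_{k-1}
T = 2
A = cv_transition(T)
X_prev = np.array([0.0, 0.0, 1.0, 0.5,    # target 1: position (0,0), velocity (1, 0.5)
                   5.0, 5.0, -1.0, 0.0])  # target 2: position (5,5), velocity (-1, 0)
X_pred = A @ X_prev                       # positions advance by one velocity step
```

Each target's position advances by its velocity while the velocity itself is unchanged, which is exactly the constant-velocity assumption.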

6.1.3. The measurement model

We represent a data association set h_k as

$$h_k = \{(i, j) \mid h_{k,j} = i\}, \quad (i, j) \in \{0, \ldots, T\} \times \{1, \ldots, M\} \qquad (8)$$

where h_{k,j} = i denotes that the jth measurement is generated by the ith target, and i = 0 implies that the measurement is clutter.


J. Cui et al. / Computer Vision and Image Understanding 106 (2007) 300–312

Given the data association h_k, we can divide the measurements into clutter and observations, respectively [13],

$$P(Y_k \mid X_k, h_k) = P(Y_{c,k} \mid h_k)\, P(Y_{o,k} \mid X_k, h_k) \qquad (9)$$

Then, we assume that each clutter measurement, i.e., an unassigned measurement, is independently and uniformly generated over the field of view. Consequently, the clutter model is a constant C proportional to the number of clutter measurements |Y_{c,k}|:

$$P(Y_{c,k} \mid h_k) = |Y_{c,k}| / C \qquad (10)$$

The constant C is related to the size of the field of view: in a 720 × 480 image, C = 720 × 480.

To model the observations, we map the data association into a Gaussian observation model

$$P(Y_{o,k} \mid h_k, X_k) = N(Y_{o,k};\, H X_k, R_k) \qquad (11)$$

where R_k is the measurement covariance.

We assume that each measurement is generated independently, and we once again obtain a block-diagonal structure:

$$H = \mathrm{diag}\{H_1, \ldots, H_M\}, \quad H_j = [\,I_{2\times 2} \;\; 0_{2\times 2}\,], \quad j = 1, \ldots, M \qquad (12)$$
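For illustration (our sketch, not the authors' code), the position-extracting projection H of Eq. (12) and the Gaussian observation likelihood of Eq. (11) can be written as follows; the covariance value and the sample state are assumed placeholders.

```python
import numpy as np

def measurement_matrix(M):
    """H_j = [I_2 0_2] extracts the 2-D position from each 4-D state
    [x, y, vx, vy] (a sketch of Eq. (12)), stacked block-diagonally."""
    H_j = np.hstack([np.eye(2), np.zeros((2, 2))])
    return np.kron(np.eye(M), H_j)

def gaussian_likelihood(y, mean, cov):
    """Evaluate the Gaussian density N(y; mean, cov), as in Eq. (11)."""
    d = y - mean
    k = len(y)
    return np.exp(-0.5 * d @ np.linalg.solve(cov, d)) / \
        np.sqrt((2 * np.pi) ** k * np.linalg.det(cov))

H = measurement_matrix(1)
x = np.array([2.0, 3.0, 0.5, 0.0])   # one target state: position (2,3), velocity (0.5,0)
y = np.array([2.1, 2.9])             # one 2-D laser measurement near that position
lik = gaussian_likelihood(y, H @ x, 0.1 * np.eye(2))
```

A measurement near the projected position H x receives a much larger likelihood than a distant one, which is what drives the data association scores below.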

6.2. Independent tracking using Kalman filters

Once the models are specified, the joint distribution of data association and state can be estimated recursively using Eq. (4). In general, this sequential update equation cannot be solved analytically, and further assumptions are required to simplify it.

In this section, we introduce the first simplification: we assume that all targets move independently, so that their states are uncorrelated and mutually independent.

The possible measurement-to-target data associations are determined by a simple gating strategy [4]: targets are only assigned to measurements within a fixed number of standard deviations of the predicted position of the target.

Then we can construct the measurement likelihood of each target independently and compute the MAP (maximum a posteriori probability) estimate for each target, respectively:

$$P(x_{j,k}, h_{j,k}) = P(h_{j,k})\, P(Y_k \mid x_{j,k}, h_{j,k}) \int P(x_{j,k} \mid x_{j,k-1})\, P(x_{j,k-1} \mid Y_{1:k-1})\, \mathrm{d}x_{j,k-1} \qquad (13)$$

The assignment that yields the maximum a posteriori probability is chosen as the potential association for each target.
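The gated per-target assignment can be sketched as follows (our illustration, not the paper's implementation); the gate width `n_sigma` and all positions are assumed values.

```python
import numpy as np

def gate_and_assign(pred_pos, pred_cov, measurements, n_sigma=3.0):
    """Per-target assignment with Mahalanobis gating: return the index of
    the closest measurement inside the gate, or None if the gate is empty.
    (A sketch of the independent-filter association; n_sigma is assumed.)"""
    best, best_d2 = None, np.inf
    for j, y in enumerate(measurements):
        d = y - pred_pos
        d2 = d @ np.linalg.solve(pred_cov, d)   # squared Mahalanobis distance
        if d2 < n_sigma ** 2 and d2 < best_d2:  # inside the gate and closest so far
            best, best_d2 = j, d2
    return best

pred = np.array([1.0, 1.0])                         # Kalman-predicted position
cov = 0.04 * np.eye(2)                              # predicted covariance (0.2 std per axis)
meas = [np.array([3.0, 3.0]), np.array([1.1, 0.9])]
j = gate_and_assign(pred, cov, meas)                # only the second measurement gates in
```

Measurements outside the gate are never scored, which keeps the per-target association cheap even with many measurements.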

The dynamic model for each target is one component of Eq. (6), i.e. linear with constant velocity:

$$P(x_{j,k} \mid x_{j,k-1}) = N(x_{j,k};\, A x_{j,k-1}, Q_{j,k-1})$$

$$P(x_{j,k-1} \mid Y_{1:k-1}) = \sum_{h_{j,k-1}} P(x_{j,k-1}, h_{j,k-1} \mid Y_{1:k-1})$$

We assume the initial prior distributions of the target states are Gaussian:

$$P(x_{j,0}) = N(x_{j,0};\, m_{j,0}, V_{j,0}).$$

For the likelihood of target-originated measurements, we use an additional coherency cue. Then, under the independence assumption, the measurement likelihood is:

$$P(Y_k \mid x_{j,1:k}, h_{j,k}) = Z^{\text{position}}_{j,k} \cdot Z^{\text{coherency}}_{j,k} \qquad (14)$$

Z^{position}_{j,k} is the cue derived from the distance between the measured position and the predicted position, as shown in Eq. (15):

$$Z^{\text{position}}_{j,k} = P\big(y^{\text{position}}_{s,k} \mid X_k, h_{j,k} = s\big) = N\big(y_{s,k};\, H x_{j,k}, R_{j,k}\big) \qquad (15)$$

where h_{j,k} = s means that the sth measurement y_{s,k} is from target j, and y^{position}_{s,k} = H x_{j,k} + r_{j,k} with r_{j,k} ~ N(0, R_{j,k}) and H = [I_{2×2}  0_{2×2}].

Z^{coherency}_{j,k} is the cue derived from the region membership of the measured position belonging to a target trajectory (see Fig. 9). It is formulated based on the observation that two successive measurements of the same person belong to the same coherent region in the accumulated image. Advanced region segmentation and analysis methods would be effective, but also slow. We choose a quite practical and effective approach that measures the intensity of the points on the line linking the measured position and the last position of the target trajectory:

$$Z^{\text{coherency}}_{j,k} = \sqrt[|E|]{\prod_{p \in E} \text{intensity}(p)} = \exp\left(\frac{1}{|E|} \sum_{p \in E} \ln\big(\text{intensity}(p)\big)\right) = \exp\left(-\sum_{i \in \text{hist}(E)} \text{hist}(i)\, \ln(1/i)\right) \qquad (16)$$

where E is the line linking the measured position and the last trajectory position, p is a pixel on the line, hist(E) is the histogram of the line pixels, and hist(i) ∈ [0,1] is the histogram value of a specific intensity value i ∈ (0,1].
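Eq. (16) is simply the geometric mean of the accumulated-image intensities along the line E. As a minimal illustration (ours, with invented intensity values in (0, 1]):

```python
import math

def coherency(intensities):
    """Geometric mean of accumulated-image intensities along the line E
    linking a measurement to the last trajectory point (sketch of Eq. (16)).
    Computed in log space, exactly as the middle form of the equation."""
    n = len(intensities)
    return math.exp(sum(math.log(v) for v in intensities) / n)

# A line staying inside one coherent (bright) region scores near 1; a line
# crossing a dark gap between two regions is pulled down sharply.
same_region = coherency([0.9, 0.8, 0.95, 0.85])
across_gap = coherency([0.9, 0.01, 0.9, 0.9])
```

The geometric mean punishes even a single near-zero pixel, which is why a measurement linked across a gap between two people's regions scores far lower than one linked within a single region.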

This coherency likelihood demonstrated great robustness and effectiveness in our experiments, and could uniquely make correct data associations in most cases. Even when a trajectory is newly built and its position is therefore difficult to predict, or when a person changes walking direction and the predicted position is wrong, the coherency likelihood can still assign the measurement to the correct trajectory (see Fig. 13).

6.3. Joint tracking

While using independent filters is computationally tractable, the result is prone to failures. In a typical failure mode, illustrated in Fig. 14, two targets walk close together, and the measurement of one target ''hijacks'' the filter of the other nearby target with a high likelihood score.

308 J. Cui et al. / Computer Vision **and** Image Underst**and****in**g 106 (2007) 300–312

Fig. 13. Illustration of the region coherency likelihood on the accumulated image. The top point is a newly detected measurement. The points linked with this new measurement are the locations of two trajectories in the previous frame. The lines denote possible data associations.

Fig. 14. (Left) A failure. (Right) Correct trajectories. Frame 1546.

The Joint Probabilistic Data Association Filter (JPDAF) [2,3] can address these situations. However, the JPDAF represents the belief over the state of the targets as a Gaussian, and may not accurately capture the multi-modal distribution over the target states.

On the other hand, in [13,14], the Rao-Blackwellized Monte Carlo data association filter (RBMC-DAF) algorithm is introduced to estimate data associations with a SIR filter and the remaining parts with a Kalman filter. This idea originates from Rao-Blackwellized particle filtering (RBPF) [12]: sometimes it is possible to evaluate part of the filtering equations analytically and the rest by Monte Carlo sampling, instead of computing everything by pure sampling. In this way, the multi-modal distribution of the target state can be handled with limited computation.

In [14], only one measurement is processed at each time step. In this paper, we extend the RBMC-DAF to the case of data associations for multiple measurements.

In the following, we first provide a Monte Carlo strategy sampling on the data association, and then the Rao-Blackwellized data association algorithm is presented as a practical strategy to reduce the computation.

6.3.1. Monte Carlo sampling on data association
A Monte Carlo sampling method approximates a probability distribution by a set of samples drawn from the distribution.

In a typical Monte Carlo sampling method, one starts by inductively assuming that the posterior distribution over the joint state of the targets at the previous time step is approximated by a set of S samples

$$P(X_{k-1} \mid Y_{1:k-1}) \approx \left\{ X^{(s)}_{k-1} \right\}_{s=1}^{S}$$

Given this representation, we obtain the following Monte Carlo approximation of the Bayes filter

$$P(X_k \mid Y_{1:k}) \approx c \sum_{h_k} P(Y_k \mid X_k, h_k)\, P(h_k) \sum_{s=1}^{S} P(X_k \mid X^{(s)}_{k-1}) \qquad (17)$$

A straightforward implementation of this equation is intractable due to the large summation over the space of data associations h_k combined with the summation over the indicator s. To address this problem, a second Monte Carlo approximation can be introduced:

$$P(X_k \mid Y_{1:k}) \approx c \sum_{w=1}^{W} P(Y_k \mid X_k, h^{(w)}_k)\, P(h^{(w)}_k) \sum_{s=1}^{S} P(X_k \mid X^{(s)}_{k-1}) \qquad (18)$$

The evaluation of this equation can then be achieved under a Gaussian assumption on the state distribution, in a Rao-Blackwellized Monte Carlo data association framework.

6.3.2. Rao-Blackwellized Monte Carlo data association
At each time step, we run the tracking process as follows.

Initialization: We assume that we can approximate the posterior P(X_{k-1}|Y_{1:k-1}) by the following mixture of Gaussians:

$$P(X_{k-1} \mid Y_{1:k-1}) \approx \frac{1}{S} \sum_{s=1}^{S} N\big(X_{k-1};\, m^{(s)}_{k-1}, V^{(s)}_{k-1}\big)$$

Prediction: Because the target motion model is linear-Gaussian, the predictive density over X_k for each value of the mixture indicator s can be calculated analytically:

$$\int_{X_{k-1}} P(X_k \mid X_{k-1})\, N\big(X_{k-1};\, m^{(s)}_{k-1}, V^{(s)}_{k-1}\big) = N\big(X_k;\, A m^{(s)}_{k-1}, Q^{(s)}_{k-1}\big) \qquad (19)$$

Hence, the predictive prior P(X_k|Y_{1:k-1}) on the current state is also a mixture of Gaussians

$$P(X_k \mid Y_{1:k-1}) \approx \frac{1}{S} \sum_{s=1}^{S} N\big(X_k;\, A m^{(s)}_{k-1}, Q^{(s)}_{k-1}\big) \qquad (20)$$

Evaluation: The sequential Monte Carlo approximation (17) to the target posterior using the Bayes filter becomes

$$P(X_k \mid Y_{1:k}) \approx c \sum_{h_k} P(Y_k \mid X_k, h_k)\, P(h_k)\, \frac{1}{S} \sum_{s=1}^{S} N\big(X_k;\, A m^{(s)}_{k-1}, Q^{(s)}_{k-1}\big) \approx c \sum_{w=1}^{W} P\big(Y_k \mid X^{(w)}_k, h^{(w)}_k\big)\, P\big(h^{(w)}_k\big)\, N\big(X^{(w)}_k;\, A m^{(s')}_{k-1}, Q^{(s')}_{k-1}\big) \qquad (21)$$


using a set of sampled states, data associations, and mixture indicators {X^{(w)}_k, h^{(w)}_k, s^{(w)}}_{w=1}^{W}, where s' = s^{(w)} is the wth sampled mixture indicator drawn from the following target density

$$\tilde{p}(X_k, h_k, s) = P(Y_k \mid X_k, h_k)\, P(h_k)\, N\big(X_k;\, A m^{(s)}_{k-1}, Q^{(s)}_{k-1}\big) \qquad (22)$$

Now, we can analytically marginalize out the current state X_k based on Eq. (22), and obtain a Rao-Blackwellized target density

$$p(h_k, s) = P(Y_{c,k} \mid h_k)\, P(h_k) \int_{X_k} N\big(Y_{o,k};\, H X_k, R_k\big)\, N\big(X_k;\, A m^{(s)}_{k-1}, Q^{(s)}_{k-1}\big) \qquad (23)$$

The key observation here is that the product of the likelihood and the predictive prior, N(Y_{o,k}; H X_k, R_k) N(X_k; A m^{(s)}_{k-1}, Q^{(s)}_{k-1}), is proportional to a Gaussian. P(h_k) is assumed to be uniformly distributed. As a result, the integral over X_k is analytically tractable and is also Gaussian.
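Concretely, the tractable integral has the standard closed form ∫ N(y; Hx, R) N(x; Am, Q) dx = N(y; HAm, HQHᵀ + R), i.e. a Gaussian in the measurement. A minimal Python sketch of this marginal (our illustration; all matrix and state values are assumed placeholders):

```python
import numpy as np

def marginal_likelihood(y, m_prev, A, Q, H, R):
    """Closed-form value of the integral over x of N(y; Hx, R) N(x; A m_prev, Q):
    the Gaussian N(y; H A m_prev, H Q H^T + R). This is the kind of quantity a
    Rao-Blackwellized sampler can weight association hypotheses by."""
    mean = H @ (A @ m_prev)
    S = H @ Q @ H.T + R                        # innovation covariance
    d = y - mean
    k = len(y)
    return np.exp(-0.5 * d @ np.linalg.solve(S, d)) / \
        np.sqrt((2 * np.pi) ** k * np.linalg.det(S))

I2 = np.eye(2)
A = np.block([[I2, I2], [np.zeros((2, 2)), I2]])  # constant-velocity transition
H = np.hstack([I2, np.zeros((2, 2))])             # position-extracting projection
Q = 0.05 * np.eye(4)                              # assumed prediction covariance
R = 0.1 * I2                                      # assumed measurement covariance
m_prev = np.array([0.0, 0.0, 1.0, 0.0])           # previous mean: moving along +x
w = marginal_likelihood(np.array([1.05, 0.02]), m_prev, A, Q, H, R)
```

Because the state is integrated out analytically, each sampled association pays only for this small Gaussian evaluation rather than for sampling the state itself.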

Sampling: Finally, samples {h^{(w)}_k, s^{(w)}}_{w=1}^{W} drawn from the Rao-Blackwellized target density p(h_k, s) based on Eq. (23) are used to construct a new mixture of Gaussians over the current state

$$P(X_k \mid Y_{1:k}) = \frac{1}{W} \sum_{w=1}^{W} N\big(X_k;\, m^{(w)}_k, V^{(w)}_k\big) \qquad (24)$$

where m^{(w)}_k is the mean and V^{(w)}_k is the covariance of the target state at the current time step.

Practical heuristics: We apply two heuristics to obtain additional gains in efficiency. First, we gate the measurements based on a covariance ellipse around each target: targets are only assigned to measurements within a fixed number of standard deviations of the predicted position of the target. Second, the components of the association set are sampled sequentially, conditional on the components sampled earlier in the sequence. We make use of this property to ensure that measurements associated with targets earlier in the sequence are not considered as candidates for association with the current target. In this way, the algorithm is guaranteed to generate only valid association hypotheses. In addition, sampling is done using roulette-wheel selection, so that hypotheses with high density are selected with high probability.
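Roulette-wheel (fitness-proportional) selection can be sketched as follows; the hypothesis names and density values are invented for the example (ours, not the paper's code).

```python
import random

def roulette_wheel(hypotheses, densities, rng=random):
    """Sample one association hypothesis with probability proportional to
    its (unnormalized) density: spin a pointer on a wheel whose slot
    widths equal the densities."""
    total = sum(densities)
    r = rng.uniform(0.0, total)
    acc = 0.0
    for h, d in zip(hypotheses, densities):
        acc += d
        if r <= acc:
            return h
    return hypotheses[-1]   # guard against floating-point round-off

hyps = ["theta_a", "theta_b", "theta_c"]
dens = [0.7, 0.2, 0.1]                      # unnormalized hypothesis densities
rng = random.Random(0)                      # seeded for reproducibility
samples = [roulette_wheel(hyps, dens, rng) for _ in range(5000)]
```

Over many draws, the sample frequencies approach the normalized densities, so high-density associations dominate the particle set without excluding low-density ones entirely.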

6.3.3. Mutual correlation detection and modeling
To detect mutual correlation between multiple targets, a graph is used, with nodes representing the targets and edges representing a correlation or interaction between the corresponding nodes. Targets within a certain distance (e.g. 20 cm, 15 pixels) of one another are linked by an edge. The absence of an edge between two targets reflects the intuition that targets far away from each other do not influence each other's motion. At each time step, the correlation graph is updated. Targets with no edges are tracked with independent filters, and targets with edges are tracked with the RBMC-DAF. In this work, up to two targets are considered and jointly tracked using the RBMC-DAF.
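The distance-thresholded correlation graph can be sketched as follows (our illustration; the target IDs and positions are invented, and the 0.20 m radius matches the 20 cm threshold mentioned in the text).

```python
import math

def correlation_pairs(positions, radius=0.20):
    """Build the correlation-graph edges: link every pair of targets whose
    positions (in meters) are closer than `radius`."""
    ids = list(positions)
    edges = []
    for a in range(len(ids)):
        for b in range(a + 1, len(ids)):
            i, j = ids[a], ids[b]
            if math.dist(positions[i], positions[j]) < radius:
                edges.append((i, j))
    return edges

# Targets 1 and 2 are 15 cm apart, so they form an edge and would be
# tracked jointly by the RBMC-DAF; target 3 is isolated and keeps its
# independent Kalman filter.
pos = {1: (0.00, 0.00), 2: (0.15, 0.00), 3: (2.00, 2.00)}
edges = correlation_pairs(pos)   # [(1, 2)]
```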

7. Evaluations on tracking results

Four single-row laser-range scanners are used in our experiments, covering a corner of an exhibition hall as shown in Fig. 3(b). Each laser scanner profiles 360 range distances equally spaced over 180° on the scanning plane at a frequency of 30 fps. The scanners perform horizontal scanning at a height of 16 cm above the ground. We use a sequence of 9035 frames as the experimental data.

Fig. 15 is a screenshot of the generated trajectories, where red circles denote the locations of the laser scanners, green points represent the background image, and white points represent moving legs. Coloured lines are trajectories; the extreme points of the trajectories are the locations of the inactive leg of each person at the current time. The laser points of one person are manually circled for ease of observation. Two close persons (rectangle on the right of the figure) are tracked with the RBMC-DAF, and the other persons are tracked with independent KFs.

Conventional laser-based tracking methods often fail in three kinds of situations: (1) people walk too close together; (2) people cross paths and their feet are too close together at the intersection point; (3) there is temporary occlusion. Our method handles all these cases well.

Case I: people walk closely together

This case occurs very often, especially in a crowded environment. Due to the closely distributed laser points, clustering based on just one frame cannot correctly extract the targets. In the experiment, we found that for most of these cases, independent Kalman filtering is sufficient for correct tracking, as shown in Fig. 16.

Fig. 15. A screenshot of the tracking result.

Fig. 16. Results when there are several closely walking people. Simple clustering will fail in these cases. (Left) Frame 2845; (Right) frame 3425. Green points are expected positions obtained by Kalman prediction.

Fig. 17. Tracking of two correlated targets. From left to right: frames 1590, 1594, 1603 and 1650.

Fig. 18. Tracking of two targets crossing paths. (Left) Before crossing, frame 3716. (Right) After crossing, frame 3753.

However, there were still a few occasions when the independent filters failed, because the regions of two close targets are strongly correlated. Once such a correlation is detected, an RBMC-DAF with 100 particles is used to track the two targets jointly (Fig. 17). In this experiment, we only considered the case of two mutually correlated targets. This is equivalent to retaining the 100 most probable data associations at each time step, while the space of all data associations grows exponentially with time.

Case II: two persons cross paths, generating mixed data

When visitors cross paths and their feet are too close together at the intersection point, their data are mixed and one foot is lost in extraction. A conventional tracking method first fails in the clustering process to obtain correct observations, which increases the difficulty of data association and tracking. In Fig. 18, we show an example of tracking two crossing persons whose trajectories are correctly obtained.

Case III: occlusions

In conventional laser-based tracking systems, occlusions make object extraction difficult at particular time steps, and complex filters were used for inference. In this paper, instead of a complex filter, we use the accumulated distribution to overcome the missing-data problem caused by temporary occlusion. Our feature extraction is based on statistical computation over accumulated data, so temporary data loss within a reasonable interval does not affect the correctness of the feature extraction. Some additional images of tracking results are shown in Fig. 19.

To make a quantitative analysis of the tracking performance, we compared the count of correctly tracked trajectories with the true number of trajectories in Fig. 20. Again, we sampled the results at intervals of 50 frames. There are in total 96 trajectories in the whole area, and 14 of them have one or several failures through 200 time steps (1000 frames). Thus, the success ratio of the tracker is 85.42%. The reasons for the failures and the corresponding counts are listed in Table 2.

Fig. 19. Tracking results. (a) Frame 700. (b) Frame 715. (c) Frame 730. (d) Frame 745.

Fig. 20. Evaluation of the tracking results (x-axis: laser scan frame, 0–1000; y-axis: count of trajectories, 0–50). T1: true count of trajectories in the whole area (situation I). E1: correctly tracked trajectories in the whole area. T2: true count of trajectories in the central area (situation II). E2: correctly tracked trajectories in the central area.

Table 2
Reasons for tracking failures (frames 0–1000; count of failures in parentheses)

Non-detection results in no trajectory (1)
Few laser points, with a low detection ratio of one person through the sequence, result in a broken trajectory (3)
Walking too fast makes data association within the nearby range fail (3)
A noisy measurement generates a nonexistent trajectory (2)
A noisy measurement disturbs a nearby existing trajectory, resulting in a broken trajectory (1)
A noisy measurement attracts an existing trajectory, resulting in a tracking error from which the tracker then recovers (2)
When a new object is detected, an existing trajectory is attracted by the new measurement, resulting in errors in the trackers of both persons (2)
Tracking error caused by mixed data of two closely situated persons (0)

To sum up, in this experiment, tens of people are tracked simultaneously in real time. At the peak time, about fifty people are tracked simultaneously with near real-time performance. The experiments demonstrate the stability of the feature extraction method and the effectiveness of the measurement likelihood and Bayesian data association methods, achieving very promising tracking results.


8. Conclusions and discussions

There are two main issues for laser-based multiple-people tracking. One is the difficulty of extracting stable features from noisy laser measurements. The other is the joint estimation of target states and data associations. In this paper, a novel method is proposed for tracking multiple people in crowded environments, such as a shopping mall or an exhibition hall, by scanning the feet of pedestrians using a number of single-row laser-range scanners.

In our experiment, four laser scanners were set up in an exhibition hall, monitoring the flow of visitors during a whole exhibition day. About 50 visitors are tracked simultaneously during the peak hour with near real-time performance, which is much faster than our previous work [8]. Compared with existing laser-based trackers, our method has two significant advantages: the extracted feature is very stable and copes well with measurement noise, and the measurement likelihood is strong enough to make correct data associations uniquely in most cases. In addition, the RBMC-DAF is used for tracking two correlated targets. The experimental results show that the proposed method is effective and robust.

Several problems are not yet solved well. If a person moves very fast (jogs, for example), the accumulated image might not provide a significant local maximum for some static foot positions; we may then miss that position and sometimes get a broken trajectory. This could be improved by a finer search strategy for local maxima, or by using a sliding window to consider several successive scan images simultaneously.

For people carrying luggage, we can correctly track the person in most cases. Sometimes, however, the person and the luggage together generate two trajectories that mutually cross, because we do not use a specific model for luggage. This problem could be tackled by learning separate patterns for humans and luggage in future work.

In addition, a tracking algorithm will be developed for monitoring not only pedestrians but also shopping carts, strollers, bicycles, motor cars, and so on. Fusion of laser data and vision data will be another powerful approach for high-level tracking.

References

[1] E. Prassler, J. Scholz, M. Schuster, D. Schwammkrug, Tracking a large number of moving objects in a crowded environment, in: IEEE Workshop on Perception for Mobile Agents, Santa Barbara, June 1998.

[2] B. Kluge, C. Koehler, E. Prassler, Fast and robust tracking of multiple moving objects with a laser range finder, in: Proc. of the IEEE International Conference on Robotics & Automation (ICRA), 2001, pp. 1683–1688.

[3] M. Lindström, J.-O. Eklundh, Detecting and tracking moving objects from a mobile platform using a laser range scanner, in: Proc. IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), 2001, pp. 1364–1369.

[4] D. Schulz, W. Burgard, D. Fox, A. Cremers, Tracking multiple moving targets with a mobile robot, in: Proc. of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Kauai, Hawaii, 2001.

[5] M. Montemerlo, S. Thrun, W. Whittaker, Conditional particle filters for simultaneous mobile robot localization and people-tracking, in: Proc. of the IEEE International Conference on Robotics & Automation (ICRA), 2002.

[6] O. Frank, J. Nieto, J. Guivant, S. Scheding, Multiple target tracking using sequential Monte Carlo methods and statistical data association, in: Proc. IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), 2003.

[7] A. Fod, A. Howard, M.J. Matarić, A laser-based people tracker, in: Proc. of the IEEE International Conference on Robotics & Automation (ICRA), 2002, pp. 3024–3029.

[8] H. Zhao, R. Shibasaki, A novel system for tracking pedestrians using multiple single-row laser-range scanners, IEEE Trans. SMC, Part A: Systems and Humans 35 (2) (2005) 283–291.

[9] J. Cui, H. Zha, H. Zhao, R. Shibasaki, Tracking multiple people using laser and vision, in: Proc. IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), Edmonton, Alberta, Canada, August 2–6, 2005, pp. 1301–1306.

[10] E. Parzen, On estimation of a probability density function and mode, Ann. Math. Stat. 33 (1962) 1065–1076.

[11] S. Mochon, T.A. McMahon, Ballistic walking, J. Biomech. 13 (1980) 49–57.

[12] A. Doucet, N. de Freitas, N. Gordon (Eds.), Sequential Monte Carlo Methods in Practice, Springer, 2001.

[13] Z. Khan, T. Balch, F. Dellaert, Multitarget tracking with split and merged measurements, in: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2005.

[14] S. Särkkä, A. Vehtari, J. Lampinen, Rao-Blackwellized Monte Carlo data association for multiple target tracking, in: International Conference on Information Fusion, Stockholm, June 2004.

[15] H. Zhao, R. Shibasaki, A robust method for registering ground-based laser range images of urban outdoor environments, Photogrammetric Eng. Remote Sens. 67 (10) (2001) 1143–1153.