
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 17, NO. 6, JUNE 2008

Estimation of Multiple, Time-Varying Motions Using Time-Frequency Representations and Moving-Objects Segmentation

Dimitrios S. Alexiadis, Student Member, IEEE, and George D. Sergiadis, Member, IEEE

Abstract—We extend existing spatiotemporal approaches to handle the estimation of multiple, time-varying object motions. It is shown that multiple, time-varying motion estimation is equivalent to estimating the instantaneous frequencies of superimposed FM sinusoids. We therefore apply established signal-processing tools, such as time-frequency representations, to show that, at each time instant, the energy is concentrated along planes in the 3-D space of spatial frequencies versus instantaneous frequency. Using fuzzy C-planes clustering, we estimate the instantaneous velocities indirectly. Furthermore, adapting existing approaches to our problem, we identify the moving objects. The experimental results verify the effectiveness of our methodology.

Index Terms—Computer vision, motion estimation, time-frequency representations.

I. INTRODUCTION

ESTIMATING the motion characteristics within a time-varying scene and identifying the moving objects is an important task in many video applications [1]. Most of the proposed motion-analysis techniques are carried out directly in the pixel-intensity domain, typically applying the optical-flow equation or correlation matching to short sequences of frames, combined with spatial motion-smoothness constraints in order to obtain the actual motion field [1, Ch. 6]–[3]. However, this has the unwanted effect of over-smoothing the motion discontinuities across object edges. More recently, this problem has been partially resolved within a robust estimation framework [4].

The technique presented in this paper belongs to the class of spatiotemporal (3-D) spectral approaches [6]–[12], which exploit the information contained in long image sequences and directly estimate the true motion characteristics, regardless of the moving objects' shape or dense motion discontinuities. Neurophysiological evidence suggests that the human visual system probably works in a similar manner [13], [14]. However, the limitation of most spatiotemporal approaches is the assumption of time-constant motion [6]–[9]. Thus, they can be applied only over short time intervals. A few attempts have been made to overcome this limitation. The authors in [10] address

Manuscript received June 24, 2006; revised February 4, 2008. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Onur G. Guleryuz.

The authors are with the Telecommunications Laboratory, Department of Electrical and Computer Engineering, Aristotle University of Thessaloniki, Thessaloniki GR 54124, Greece (e-mail: dalexiad@mri.ee.auth.gr; sergiadi@auth.gr).

Digital Object Identifier 10.1109/TIP.2008.922437

1057-7149/$25.00 © 2008 IEEE

only the problem of a single, global, time-varying motion, using the high-order ambiguity function to estimate the parameters of chirp signals. The work in [11] handles multiple accelerated motions, exploiting the chirp-Fourier transform and a clustering algorithm. In another recent work [12], the instantaneous velocities are estimated using time-frequency representations (TFRs), with the restriction, however, of a stationary background and small moving objects.

Combining and mainly extending previous relevant ideas [6], [7], [11], [12], this paper presents a "motion first" spatiotemporal methodology for the estimation of multiple, occluded or transparent, time-varying motions and the extraction of the moving objects in the frequency domain. In Section II, we formulate the motion estimation problem in the frequency domain and give the theoretical background for instantaneous frequency estimation. In Section III, we present the proposed methodology for motion estimation by TFRs, while in Section IV, we address the motion segmentation/object extraction problem. Finally, in Section V, we present experimental results on synthetic and natural sequences, before concluding.

II. PROBLEM FORMULATION AND THEORETICAL BACKGROUND

A. Modeling Multiple Time-Varying Motions

We use a 2.5-D layered representation for image sequences [5], which is described in Fig. 1. The layers are ordered in depth. For each layer k, we have a) the intensity map s_k(x̄); b) the alpha map a_k(x̄), which defines the spatial support/opacity of the layer and always remains registered with the intensity map; and c) the time-varying velocity v̄_k(t), which describes how the maps move over time. The layered representation can be summarized in the equation

    s(x̄, t) = Σ_k w_k(x̄, t) s_k(x̄ − d̄_k(t))    (1)

where d̄_k(t) = ∫₀ᵗ v̄_k(τ) dτ is the displacement of the kth layer at time t and

    w_k(x̄, t) = a_k(x̄ − d̄_k(t)) Π_{l<k} [1 − a_l(x̄ − d̄_l(t))]    (2)

is a window function which carries the transparency/occlusion information over time. Although the alpha map a_k remains registered with the intensity map s_k, this does not hold for w_k. The window w_k is deformed over time and has to be decomposed



Fig. 1. Block diagram for our layered model, similar to the one in [5]; "cmp" represents cmp(a) = 1 − a.

into a constant portion and a deformable, error-introducing portion. After some manipulations [6],

    s(x̄, t) = Σ_k s̃_k(x̄ − d̄_k(t)) + e(x̄, t)    (3)

where s̃_k corresponds to the virtual, constant part of w_k s_k. The distortion term e(x̄, t) accounts for the deformation of w_k due to occlusions. Any other source of distortion (sensor noise, deviation from the motion model, etc.) can also be included. An explicit analysis in the frequency domain of the occlusion distortion term, in the case of time-constant velocities and two moving objects, can be found in [15]. Notice that our model can intrinsically handle general transparent motions, with occluded motion being only a special case, in which w_k is simply zero or unity.

B. Velocity Estimation as Instantaneous Frequency Estimation

Consider the spatial (2-D) Fourier transform of each frame. From (3) and with ω̄ = (ω_x, ω_y), we have

    S(ω̄, t) = Σ_k S̃_k(ω̄) e^{jφ_k(ω̄, t)} + E(ω̄, t)    (4)

where the instantaneous phase of the kth component is

    φ_k(ω̄, t) = −ω̄ · d̄_k(t) = −[ω_x d_{k,x}(t) + ω_y d_{k,y}(t)].    (5)

Consequently, the corresponding instantaneous frequency is

    ω_{t,k}(ω̄, t) = dφ_k(ω̄, t)/dt = −ω̄ · v̄_k(t) = −[ω_x v_{k,x}(t) + ω_y v_{k,y}(t)]    (6)

where v̄_k(t) = d d̄_k(t)/dt stands for the instantaneous velocity of the kth object, which is considered to be a continuous and smooth function of t, as expected from the laws of physics. From (4)–(6), one can conclude the following.

1) For a fixed spatial frequency ω̄, the signal S(ω̄, t) is a superposition of frequency-modulated (FM) complex sinusoids with instantaneous frequencies ω_{t,k}(ω̄, t).

2) For a given time instant t, the instantaneous frequency ω_{t,k} varies linearly with ω_x and ω_y. The triplets (ω_x, ω_y, ω_{t,k}) lie on a plane through the origin, whose direction depends on the instantaneous velocity at the instant t.

According to the first conclusion, a key component of our approach is the estimation of the instantaneous frequencies ω_{t,k}(ω̄, t). For this purpose, one can exploit established signal processing tools, such as the time-frequency representations [16]–[18].
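As a quick numerical check of conclusion 1) — this sketch is ours, not part of the paper's code, and all names and parameter values are illustrative — the following pure-Python snippet builds a 1-D sinusoidal pattern translating at a constant velocity v and verifies that the phase of its spatial DFT coefficient advances by −ωv per frame, i.e., the temporal frequency of S(ω, t) is −ωv:

```python
import cmath
import math

def spatial_dft_coeff(frame, k):
    """DFT coefficient of a 1-D frame at integer spatial-frequency index k."""
    N = len(frame)
    return sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))

N, k, v = 64, 3, 1.5  # frame length, spatial-frequency index, velocity (pixels/frame)
frames = []
for t in range(4):
    # a single spatial sinusoid translating with constant velocity v
    frames.append([math.cos(2 * math.pi * k * (n - v * t) / N) for n in range(N)])

# the phase of the spatial-DFT coefficient advances by -omega*v per frame,
# where omega = 2*pi*k/N is the spatial frequency in rad/pixel
phases = [cmath.phase(spatial_dft_coeff(f, k)) for f in frames]
dphi = (phases[1] - phases[0] + math.pi) % (2 * math.pi) - math.pi
print(abs(dphi - (-2 * math.pi * k * v / N)) < 1e-9)  # → True
```

For a time-varying velocity v(t), the same phase argument makes the coefficient an FM sinusoid with instantaneous frequency −ωv(t), which is exactly what the TFR stage estimates.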

C. Smoothed Pseudo Wigner–Ville (SPWV) Distribution

In this subsection, we briefly review the smoothed pseudo Wigner–Ville (SPWV) distribution, which is used in our experiments. For a detailed analysis of TFRs, the reader is referred to [16]–[18]. We drop the variable ω̄ for notational simplicity. The SPWV distribution of S(t) is given by

    SPWV_S(t, ω_t) = ∫∫ g(t − t′) H(ω_t − ω_t′) WV_S(t′, ω_t′) dt′ dω_t′    (7)

namely, from the 2-D convolution of the well-known Wigner–Ville (WV) distribution

    WV_S(t, ω_t) = ∫ S(t + τ/2) S*(t − τ/2) e^{−jω_t τ} dτ    (8)

with a separable 2-D smoothing kernel Φ(t, ω_t) = g(t)H(ω_t), where H(ω_t) is the Fourier transform of a 1-D smoothing window h(τ).

The WV distribution presents perfect localization along the instantaneous frequency for a single linear FM sinusoid. However, it presents emphatic interference terms in the case of multicomponent signals, which may be destructive in many applications. Fortunately, the interference terms have an oscillatory nature and can, therefore, be attenuated by the smoothing operation with the kernel Φ(t, ω_t). Introducing separable smoothing allows us to control the smoothing in time and in frequency freely and independently of each other. The wider the time (frequency) smoothing window, the more time (frequency) smoothing is achieved. Also, it can be shown that the well-known spectrogram, namely the squared modulus of the windowed (short-time) FT, is related to the WV distribution through an appropriate (nonseparable) kernel [16].
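The localization property can be illustrated with a minimal discrete pseudo-WV sketch of our own (a rectangular lag window stands in for the paper's Hamming windows, and no extra time smoothing is applied): for a linear chirp, the peak of the distribution at each time index tracks the instantaneous frequency.

```python
import cmath
import math

def pwv_peak_freq(s, n, L, n_freqs=256):
    """Peak frequency (rad/sample) of a discrete pseudo-Wigner-Ville
    distribution of signal s at time index n, with lag-window half-length L."""
    best_w, best_val = 0.0, -1.0
    for i in range(n_freqs):
        w = math.pi * i / n_freqs
        acc = 0j
        for tau in range(-L, L + 1):
            # WV kernel s[n+tau] * conj(s[n-tau]) evaluated on the lag window
            acc += s[n + tau] * s[n - tau].conjugate() * cmath.exp(-2j * w * tau)
        if acc.real > best_val:
            best_val, best_w = acc.real, w
    return best_w

N, w0, c = 128, 0.3, 0.004  # chirp with instantaneous frequency w0 + c*n
s = [cmath.exp(1j * (w0 * n + 0.5 * c * n * n)) for n in range(N)]
L = 20
for n in (30, 60, 90):
    print(round(pwv_peak_freq(s, n, L), 2), round(w0 + c * n, 2))  # pairs match
```

The estimate matches the true instantaneous frequency up to the frequency-grid resolution, which is the behavior the Hough-based line detection of Section II-D builds upon.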

D. Extracting Information From Time-Frequency Images

The simplest approach to extracting the instantaneous frequencies is to find the peaks along ω_t in the TFR image, for each t. However, in order to obtain robustness against noise and interference terms, we use the following methodology. Considering that the instantaneous velocities and, consequently, the instantaneous frequencies ω_{t,k}(ω̄, t) in (6) vary as smooth functions of t, we expect smooth curves in the TF plane, which can be approximated as piecewise-linear polynomials. Namely, one can approximate higher-order FM signals by piecewise-linear FM signals. Let the length of the image sequence be N. We break the interval [1, N] into overlapping intervals T_i of length L_t

    T_i = [t_i − L_t/2, t_i + L_t/2]    (9)

where t_i is the center time instant of the ith interval T_i. With a ∈ [0, 1) being the overlapping factor between two successive intervals, we have

    t_{i+1} = t_i + (1 − a) L_t    (10)

and i = 1, 2, .... We want to estimate the mean frequencies and the chirp rates for each time interval. To its center t_i, we assign the instantaneous frequency estimates ω̂_{t,k}(t_i). The instantaneous frequencies for each time instant t can then be interpolated from ω̂_{t,k}(t_i). For the desired estimations, we cut the TFR image into the overlapping time slices TFR_i(t, ω_t), corresponding to the intervals T_i. The linear FM estimation within each time interval is performed as described below.
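The slicing can be sketched with a small helper of our own, under the assumption that successive interval centers advance by (1 − a)·L_t (the exact form of the paper's interval equations is not fully recoverable from this scan):

```python
def interval_centers(N, Lt, a):
    """Centers t_i of overlapping time slices of length Lt inside [0, N),
    assuming successive centers advance by (1 - a) * Lt (illustrative)."""
    step = max(1, round((1 - a) * Lt))
    return list(range(Lt // 2, N - Lt // 2 + 1, step))

# 64 frames, slices of length 16, overlap factor a = 0.5
print(interval_centers(64, 16, 0.5))  # → [8, 16, 24, 32, 40, 48, 56]
```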

The WV distribution ideally maps a linear FM sinusoid into a line in the TF plane. Consequently, the problem of the detection-estimation of such a signal is reduced to the easy-to-solve problem of detecting a line, which can be accomplished using the Hough transform. Applying the Hough transform to the WV distribution, one obtains the Wigner–Hough transform [18]. It can be shown that this is the asymptotically efficient estimator (it reaches the Cramer–Rao bound) of linear FM sinusoids in white Gaussian noise. In practice, the TFR is calculated for discrete values of t and ω_t. Therefore, we use a discrete-space version of the Hough transform, similarly to [9]. The distance between a point (n_t, n_ω) in a discrete image and a line with slope μ and bias κ is calculated by

    D(n_t, n_ω; μ, κ) = |n_ω − μ n_t − κ| / √(1 + μ²).    (11)

A "digital" line L(μ, κ) of width W is defined as the set of all (n_t, n_ω) satisfying D(n_t, n_ω; μ, κ) ≤ W/2. Given the total height of a TFR image, a good value for W, used in our experiments, is a small fraction of this height. The detection-estimation of lines is performed by searching for the peaks of the 2-D function (line matched filter)

    Z(μ, κ) = (1 / N_L(μ, κ)) Σ_{(n_t, n_ω) ∈ L(μ, κ)} TFR(n_t, n_ω)    (12)

where N_L(μ, κ) is the number of pixels belonging to the line L(μ, κ). Notice that, thanks to the integration along lines, even when strong interference terms are present in the TFR images, they are attenuated, because they have an oscillatory nature. Therefore, the detector is relatively efficient for multiple FM components.
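A brute-force version of the line matched filter on a synthetic TFR slice can be sketched as follows (our own illustrative code; the function and variable names are not from the paper):

```python
import math

def line_matched_filter(img, slopes, biases, width):
    """Exhaustive search for the (slope, bias) maximizing the mean intensity
    over the 'digital' line of the given width, following (11)-(12)."""
    H, T = len(img), len(img[0])  # rows = frequency bins, cols = time bins
    best = (None, None, -1.0)
    for mu in slopes:
        norm = math.sqrt(1.0 + mu * mu)
        for kappa in biases:
            total, count = 0.0, 0
            for nt in range(T):
                for nw in range(H):
                    # point-to-line distance of (11), thresholded at width/2
                    if abs(nw - mu * nt - kappa) / norm <= width / 2:
                        total += img[nw][nt]
                        count += 1
            if count and total / count > best[2]:
                best = (mu, kappa, total / count)
    return best[0], best[1]

# synthetic 32x32 TFR slice: all energy lies along the line n_w = 0.5*n_t + 4
H = T = 32
img = [[0.0] * T for _ in range(H)]
for nt in range(T):
    nw = round(0.5 * nt + 4)
    if 0 <= nw < H:
        img[nw][nt] = 1.0

mu, kappa = line_matched_filter(img, [i / 10 for i in range(-10, 11)], range(16), width=1.0)
print(mu, kappa)  # → 0.5 4
```

The normalization by the pixel count N_L keeps short, steep lines comparable to long, shallow ones, which is why the detector in (12) divides by N_L(μ, κ).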

III. PROPOSED METHODOLOGY

A. General Approach

According to the motion analysis of Section II and the theory of TFRs, a methodology for the estimation of smoothly time-varying motions can be summarized in the following steps.

1) For each spatial frequency ω̄ = (ω_x, ω_y), we map the signal S(ω̄, t) into the TF plane using a TF representation

    S(ω̄, t) → TFR(t, ω_t; ω̄).    (13)

2) For each time interval T_i (and each spatial frequency), the frequencies ω̂_{t,k} of the FM components are calculated using the methodology of Section II-D.

3) For each time interval T_i, we arrange the data as 3-D points

    p̄_m = (ω_{x,m}, ω_{y,m}, ω̂_{t,m}), m = 1, ..., M    (14)

where M is the number of data points. These triplets are expected to be concentrated along planes through the origin. Using the fuzzy C-planes (FCP) clustering algorithm, as in [11], we group them into planes, simultaneously calculating the planes' parameters and, consequently, the velocities v̂_k(t_i), according to (6). The velocities for each time instant are interpolated from v̂_k(t_i).

The efficiency of the above approach depends strongly on the accuracy of the instantaneous frequency estimates ω̂_{t,k} given at the second step. Exploiting the estimates for many spatial frequencies using clustering renders the methodology more efficient.
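For a single motion, the FCP step reduces to fitting one plane through the origin; the least-squares sketch below (our own simplified stand-in, not the FCP algorithm, which additionally handles multiple planes via fuzzy memberships) recovers the velocity from noiseless triplets obeying (6):

```python
def velocity_from_triplets(points):
    """Least-squares fit of a plane through the origin,
    w_t = -(w_x*vx + w_y*vy), to (w_x, w_y, w_t) triplets; returns (vx, vy).
    Single-plane stand-in for the fuzzy C-planes clustering step."""
    sxx = sxy = syy = sxt = syt = 0.0
    for wx, wy, wt in points:
        sxx += wx * wx; sxy += wx * wy; syy += wy * wy
        sxt += wx * wt; syt += wy * wt
    det = sxx * syy - sxy * sxy      # normal-equations determinant
    vx = -(syy * sxt - sxy * syt) / det
    vy = -(sxx * syt - sxy * sxt) / det
    return vx, vy

# synthetic triplets generated by the velocity (1.5, -0.75)
vx_true, vy_true = 1.5, -0.75
pts = [(wx / 10, wy / 10, -(wx / 10 * vx_true + wy / 10 * vy_true))
       for wx in range(-3, 4) for wy in range(-3, 4) if (wx, wy) != (0, 0)]
vx_est, vy_est = velocity_from_triplets(pts)
print(vx_est, vy_est)  # → (1.5, -0.75) up to floating-point error
```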

B. Some Remarks and Implementation Details

Static Background: The effect of a static background is the addition of a DC term in S(ω̄, t), for each spatial frequency [see (4)–(6)]. Therefore, for each ω̄, one can subtract from S(ω̄, t) its temporal mean value. This cancels out the strong effects due to the static background in the TF plane and reveals the weaker frequency components.
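The DC-removal step is a one-liner in practice; the sketch below (illustrative values of ours, not from the paper) confirms that subtracting the temporal mean cancels the background's constant phasor while leaving the FM component of the moving object essentially intact:

```python
import cmath

# S(omega0, t): a strong DC term from the static background plus one FM
# component from a moving object (illustrative amplitudes and chirp rate)
N = 64
chirp = [cmath.exp(1j * (0.4 * t + 0.01 * t * t)) for t in range(N)]
signal = [5.0 + c for c in chirp]   # static background adds the constant 5.0

mean = sum(signal) / N              # temporal mean for this spatial frequency
detrended = [s - mean for s in signal]

# what remains is the chirp minus its own (small) mean: the DC line is gone
residual = max(abs(d - (c - sum(chirp) / N)) for d, c in zip(detrended, chirp))
print(residual)  # numerically zero
```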

Temporal Aliasing: The minimum required sampling rate for an alias-free (smoothed pseudo-) WV distribution is twice the Nyquist rate. Therefore, aliasing introduces wrapping of the motion planes around ω_t = ±π/2, along the temporal-frequency axis. As a consequence, it is preferable to feed the FCP clustering algorithm only with triplets corresponding to low spatial frequencies ω̄, such that |ω̄ · v̄_k| < π/2. Also, for small |ω̄|, the SNR remains high, since natural images have their energy concentrated in the low spatial frequencies, in contrast to the noise term.

Spatially Local Application: The assumptions of the motion model may be relaxed, letting the objects (the intensity and alpha maps) deform slowly with time, because of the time-windowing introduced by the TF representations. This is also verified by the presented experimental results. Additionally, by applying the motion model within spatial blocks (overlapping or not), more complicated motions (for example, affine) are well approximated. Compared to pixel-domain approaches, frequency-domain approaches suffer from requiring large local windows, within which different motions can occur. This problem is partly resolved given that our approach can handle discontinuous motions.

TFR Window Selection: The choice of the type and width of the windows used in the TFRs affects the accuracy of the instantaneous frequency estimates. However, the optimal width depends on the instantaneous frequencies, which are the unknowns to be estimated, as analyzed in [12]. An explicit study of the influence of the used windows on the accuracy of the proposed method is beyond the scope of this paper, since the final velocity estimates depend on the effectiveness of the FCP. Notice, however, that for all the experiments presented, the methodology worked effectively using a discrete Hamming window as the time-smoothing window g(t) and a Hamming window as the frequency-smoothing window h(τ), with lengths proportional to the length N of the sequence.

IV. MOVING OBJECTS' EXTRACTION AND MOTION SEGMENTATION

In order to complete our motion estimation task by additionally identifying the moving objects, we modify two known approaches [19]–[21], adapting them to our problem, as follows.

A. Object Extraction in the Frequency Domain

Ignoring the error term in (4), define S̃′_k(ω̄, t) = S̃_k(ω̄) e^{jφ_k(ω̄, t)}, which, according to the analysis in Section II-A, corresponds to the constant shape part of the kth object, shifted by d̄_k(t). Discretizing the time variable, for each spatial frequency ω̄ we have a linear system b = A x near a reference frame (ω̄ is dropped for notational simplicity)

    S(t_n) = Σ_{k=1}^{K} e^{jφ_k(t_n)} S̃_k, n = 1, ..., F    (15)

i.e., b_n = S(t_n), A_{n,k} = e^{jφ_k(t_n)} and x_k = S̃_k, where φ_k(t_n) = −ω̄ · d̄_k(t_n) and d̄_k(t_n) is the displacement of the kth layer from the reference frame to frame t_n. Since the time-varying velocities and, consequently, the corresponding displacements have been estimated, the phase matrix A is known and the objective is to calculate the objects vector x. When F = K, we have a K × K system of linear equations, which is solved analytically: x = A⁻¹ b. However, one can exploit the information contained in a longer frame sequence by using F > K and solving the system in the least-squares sense, i.e., minimizing ‖A x − b‖².

Calculating x for each ω̄ and applying the inverse Fourier transform to S̃_k, one obtains the image of the kth object. In practice, we find it convenient to solve the system only when all the singular values of A are lower- and upper-bounded (for instance, by 0.1 and 10, respectively), in order to avoid singularities and obtain robustness against noise. The missing values of S̃_k for some spatial frequencies can be estimated using trigonometric or simpler, less computationally expensive interpolation schemes. Finally, when only low spatial-frequency components are used, noise filtering is achieved directly in the frequency domain. Obviously, an advantage of the described approach is that it can be directly applied to separating transparent objects.
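A toy 1-D version of this separation can be sketched as follows (our own code and notation; integer circular shifts stand in for the estimated displacements, and the determinant threshold plays the role of the singular-value bounds above):

```python
import cmath
import math

def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)).real / N
            for n in range(N)]

def shift(x, d):
    # circular shift by d samples (motion with an integer displacement)
    N = len(x)
    return [x[(n - d) % N] for n in range(N)]

N = 16
obj1 = [1.0 if 2 <= n <= 5 else 0.0 for n in range(N)]    # box-shaped object 1
obj2 = [2.0 if 9 <= n <= 11 else 0.0 for n in range(N)]   # box-shaped object 2
d1, d2 = 3, -2          # known displacements between frame 0 and frame 1

# two frames of the transparent superposition of the two moving objects
F0 = dft([a + b for a, b in zip(obj1, obj2)])
F1 = dft([a + b for a, b in zip(shift(obj1, d1), shift(obj2, d2))])

# per-frequency 2x2 solve of b = A x with A = [[1, 1], [e^{-jw d1}, e^{-jw d2}]];
# the DC bin is singular (both columns equal) and is skipped, mirroring the
# singular-value bounds and interpolation of missing bins described above
rec1 = [0j] * N
for k in range(N):
    w = 2 * math.pi * k / N
    a21, a22 = cmath.exp(-1j * w * d1), cmath.exp(-1j * w * d2)
    det = a22 - a21
    if abs(det) > 0.1:
        rec1[k] = (a22 * F0[k] - F1[k]) / det   # Cramer's rule for object 1

img1 = idft(rec1)   # object 1, up to a constant offset from the missing DC bin
```

With exact displacements, the reconstruction equals obj1 up to the constant offset caused by the skipped DC bin, illustrating why the paper interpolates the missing spatial frequencies.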

B. Segmentation by Energy Minimization

The goal here is to assign to each pixel p one of the estimated velocities, i.e., a motion label f_p, introducing the

Fig. 2. α-expansion graph, given for a 1-D image for simplicity. The appropriate link weights are given in Table I.

TABLE I
WEIGHTS USED IN THE α-EXPANSION ALGORITHM

labeling f : P → L, where P is the image plane. This is performed by searching for the labeling f which minimizes the objective function (we drop t for simplicity)

    E(f) = E_D(f) + E_S(f) + E_T(f).    (16)

The data term E_D(f), given in (17), shown at the bottom of the next page, enforces the classification of the pixels based on the intensity residuals after motion compensation. This, however, becomes unreliable in the presence of noise or lack of texture. Therefore, the spatial motion-smoothness term E_S(f) of (18), shown at the bottom of the next page, is also used, with "q ~ p" meaning that q is in the neighborhood of p. The penalty term is more tolerant of motion-labeling discontinuities when real intensity discontinuities are present. Also, the temporal smoothness term E_T(f) of (19), shown at the bottom of the next page, takes into account the labeling of previous frames.

The optimal solution for the energy function of (16) can be obtained using the α-expansion algorithm [20], [21], which is well suited to our problem, giving a very fast solution. Traditional optimization techniques (such as gradient- or Hessian-based ones) cannot be applied here, because E(f) is multimodal and nonconvex. Also, the application of genetic algorithms is computationally expensive, according to our experiments. The α-expansion algorithm is summarized as follows.

1) Initialization: Start with the labeling f which simply minimizes E_D(f), i.e., the motion residuals.

2) Cycle: For each label α ∈ L:

• Iteration: Find the optimal labeling f_new within an α-expansion move, which is explained directly below.

3) If E(f_new) < E(f), set f = f_new and begin a new cycle; else, finish.
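The exact forms of (17)–(19) are not recoverable from this scan; the sketch below merely evaluates a representative energy of this shape — a per-pixel data term from motion-compensation residuals plus a Potts-type smoothness penalty — on a 1-D labeling, with symbols and values of our own:

```python
def labeling_energy(residuals, labels, lam):
    """Illustrative energy in the spirit of (16): data term = residual of the
    chosen motion label per pixel, plus lam per label change between neighbors."""
    data = sum(residuals[p][labels[p]] for p in range(len(labels)))
    smooth = lam * sum(1 for p in range(len(labels) - 1)
                       if labels[p] != labels[p + 1])
    return data + smooth

# 4 pixels, 2 motion labels: left half fits motion 0, right half fits motion 1
residuals = [[0.0, 5.0], [0.1, 4.0], [6.0, 0.2], [5.0, 0.0]]
good = labeling_energy(residuals, [0, 0, 1, 1], lam=1.0)   # one label boundary
bad = labeling_energy(residuals, [0, 0, 0, 0], lam=1.0)    # wrong labels, no boundary
print(good, bad)  # the piecewise labeling wins: ≈ 1.3 vs 11.1
```

Minimizing such an energy prefers labelings that fit the residuals while keeping label boundaries few, which is exactly the trade-off the α-expansion moves optimize over the full 2-D graph.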



Fig. 3. "White Tower & Boat"—frames 1, 32, and 64 (first three figures) and the extracted layers, obtained by applying (20) to the objects' images, taken using the proposed frequency-domain method, for the first 32 frames (last two figures).

Fig. 4. "White Tower & Boat"—estimated instantaneous velocities.

Fig. 5. "White Tower & Boat"—motion segmentation results using graph cuts for frames 1, 10, and 20.

Given a label α, we have an α-expansion move if some labels in f have changed to α in f_new. Dynamically, in each iteration, a binary graph is constructed, as in Fig. 2. We have two terminal nodes, α and ᾱ, expressing whether a pixel will change its label to α or not. The nonterminal nodes represent the image pixels. Two neighboring pixels p, q having different labels are connected through an intermediate auxiliary node with appropriate weights. Neighboring pixels p, q with the same label are directly connected to each other. The nonterminal nodes are connected to α and ᾱ with t-links of appropriate weights, as in Fig. 2. With D_p(·) being the "data term" (see [20] for details), the weights used are

Fig. 6. "Walker"—the walker's positions in frames 1, 11, 21, ..., 181 of the natural sequence.

Fig. 7. "Walker"—(left) the SPWV for a given spatial frequency and (right) the estimated instantaneous velocity.

summarized in Table I. In order to find the optimal α-expansion move, we seek the "minimum-cost" cut of the graph. A cut is a partitioning of the nodes into two subsets, S_α and S_ᾱ, such that α ∈ S_α and ᾱ ∈ S_ᾱ, and it includes exactly one t-link for any pixel p. Its cost is the sum of the weights of the links it includes. After finding the minimum-cost cut, a pixel p is

(19)



Fig. 8. "Transparent Lena-Cameraman"—frames 1, 16, and 32 of the sequence (first three figures) and the extracted layers using the proposed frequency-domain method (last two figures).

Fig. 9. "Transparent Lena-Cameraman"—(left) the motion-plane points for a given time interval and (right) the estimated instantaneous velocities.

assigned the label α if the cut separates p from the terminal α, or it keeps its old label otherwise.

C. Layer Synthesis, Background Mosaicing

Given that the instantaneous velocity estimation was adequately precise and the moving objects were efficiently separated using one of the presented methodologies, the objects will appear stationary in the motion-compensated sequence. Variations due to occlusions, noise, and illumination changes are expected. Moreover, some regions of a moving background appear only after some frames. The extraction of a complete representative image for each layer is performed by a temporal median operation [5]

    L_k(x̄) = median_t { s_k^c(x̄, t) }    (20)

where s_k^c(x̄, t) is the motion-compensated image of the kth object. This procedure, together with the other layers, synthesizes the background mosaic (a panoramic image), or the "sprite" in MPEG-4 terminology.
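The per-pixel temporal median is straightforward to sketch (our own illustrative code); its robustness is what rejects the minority of samples corrupted by occlusions or noise:

```python
def temporal_median(stack):
    """Per-pixel temporal median over a list of motion-compensated frames,
    in the spirit of (20)."""
    T = len(stack)
    H, W = len(stack[0]), len(stack[0][0])
    out = [[0.0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            vals = sorted(stack[t][y][x] for t in range(T))
            out[y][x] = (vals[T // 2] if T % 2
                         else 0.5 * (vals[T // 2 - 1] + vals[T // 2]))
    return out

# a 2x2 layer observed over 5 frames; occlusions corrupt a minority of samples
frames = [[[10.0, 20.0], [30.0, 40.0]] for _ in range(5)]
frames[1][0][0] = 255.0   # occluded sample
frames[3][1][1] = 0.0     # another outlier
print(temporal_median(frames))  # → [[10.0, 20.0], [30.0, 40.0]]
```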

V. EXPERIMENTAL RESULTS

The overall motion-estimation algorithm is described in Algorithm 1 (shown at the top of the next page). The Matlab code that produces the simulation results described in this section can be found at: http://esopos.ee.auth.gr/TIP-VaryingMotion/code.zip. In the first experiment, the spectrogram is used because of its simplicity. In the next experiments, the SPWV distribution is used, with a Hamming window as the time-smoothing window g(t) and a Hamming window as the frequency-smoothing window h(τ).

A. "White Tower & Boat" Synthetic Sequence

The synthetically generated sequence of Fig. 3 contains a moving object (boat) on a moving background (White Tower). The frame size is 296 × 296 and the sequence's length is 64 frames. The background moves with a constant velocity and the boat decelerates over time. The simulated sequence was generated according to the presented motion model of (1) and (2) and Fig. 1. We used two 8-bit grayscale still images, a background "White Tower" image of size 502 × 296 and an image containing the "Boat" object. These still images correspond to the intensity maps s_k, k = 1, 2, of the model. The motion of the background was obtained by selecting the appropriate 296 × 296 region of the corresponding still image for each time instant. For the "Boat," we manually created an auxiliary "mask" image, which corresponds to the window of (1). We cropped the "Boat" object from the still image and superimposed it on the background frame, at the appropriate instantaneous location. In order to handle noninteger instantaneous displacements, we applied bilinear interpolation. Finally, to obtain a more natural visual appearance, we applied a weak smoothing operation near the border of the "Boat" object.

For this experiment, white Gaussian noise was added. We use the spectrogram because of its simplicity. In the TFR images, we search for two peaks along ω_t for each time instant t. This is equivalent to applying the proposed Hough-transform methodology on TFR slices of length one. Although the spectrogram presents interference terms and bad localization, the instantaneous velocities estimated using the FCP are very close to the real ones, as can be seen from Fig. 4. In Fig. 3, we also present the layers obtained by applying (20) to the object images obtained using the proposed frequency-domain approach of Section IV-A. Finally, in Fig. 5, we give the results of the motion segmentation approach, using graph cuts.

B. "Walker"

The natural sequence used in [12], depicted in Fig. 6, contains a single walker, whose size is small compared to the static background. The sequence contains video frames of size 320 × 240. The SPWV of S(ω̄, t) for a given spatial frequency is depicted in Fig. 7. The TFR images are clean, in the sense that almost no interference terms exist and the TF localization is good. Applying our methodology by cutting the TFR images into overlapping time slices, we obtain the estimated instantaneous velocity given in Fig. 7. Our results are similar to those in [12]. However, our methodology is more general, in the sense that it can also handle sequences with large moving objects and a moving background, at the cost of increased computational effort.



C. Transparent “Lena -Cameraman” Synthetic Sequence<br />

In order to verify the effectiveness of the approach for transparent motions, we created the synthetic sequence shown in Fig. 8. The final sequence is of size 256 × 256. To generate it, we used 8-bit grayscale, 512 × 512 versions of the well-known “Lena” and “Cameraman” still images. The velocities of the sequence are nonzero only along the horizontal direction; therefore, along the vertical direction, we used only the corresponding region of the original images. The instantaneous horizontal velocities vary as quadratic polynomials of time, as depicted in Fig. 9. The motions of the layers were obtained by selecting the appropriate regions of the corresponding still images (“Lena” and “Cameraman”). The translating foreground “Lena” layer is opaque, except for a together-translating rectangular region, where it is transparent. According to the presented motion model, the corresponding “alpha” map is unity everywhere, except in the rectangular image region (for the first frame), where it is equal to 0.5. To handle noninteger instantaneous displacements, we applied bilinear interpolation.
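Since the motion here is purely horizontal, the interpolation reduces to linear interpolation along the motion axis. The following Python sketch illustrates the idea; the function name and the wrap-around border handling are our own illustrative choices, not the paper's.

```python
import numpy as np

def shift_linear(img, dx):
    """Translate `img` horizontally by a (possibly non-integer) displacement
    dx, interpolating linearly between the two nearest integer shifts.
    Border pixels wrap around (a toy choice for the illustration)."""
    i = int(np.floor(dx))
    frac = dx - i
    a = np.roll(img, i, axis=1).astype(float)      # shift by floor(dx)
    b = np.roll(img, i + 1, axis=1).astype(float)  # shift by floor(dx) + 1
    return (1.0 - frac) * a + frac * b

img = np.arange(16.0).reshape(4, 4)
out = shift_linear(img, 1.5)   # halfway between shifts of 1 and 2 pixels
```

For an integer dx the fractional weight vanishes and the function degenerates to an exact pixel shift, as expected.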

White Gaussian noise (WGN) was added. The estimated instantaneous velocities are given in Fig. 9, together with the actual ones. Using the SPWV, the estimated motion-plane points are also given in Fig. 9. Finally, using the presented frequency-domain approach, the synthesized layers are presented in Fig. 8.

D. Modified “Coast Guard”

In this experiment, the well-known “Coast Guard” sequence is used. In order to have a time-varying motion for the background, we created a modified sequence (depicted in Fig. 10). Frames 138–265 of the original sequence are used with an increasing rate, according to a formula relating the frame numbers in the new and the original sequence. The SPWV is given in Fig. 11(a). Notice that strong interference terms are present, because the applied time/frequency smoothing is not sufficient. However, applying the presented Hough-transform methodology, the estimation of the instantaneous frequencies becomes reliable. Two peaks are clearly visible in the output of the line-matched filter [Fig. 11(c)], applied to the SPWV slice of Fig. 11(b). The TFR images were again cut into overlapping slices of fixed length and overlap factor. The estimated instantaneous velocities and displacements, given in Fig. 11, are close to the real ones, found by manual feature-point tracking. In Fig. 10, the extracted layers are also shown, obtained by applying (20) to the results of the frequency-domain approach of Section IV-A.
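The frame-resampling scheme described above can be sketched as follows. The paper's exact rate formula was lost in extraction, so the quadratic index map below is a hypothetical stand-in that merely reproduces the stated behavior: frames 138–265 consumed at an increasing rate.

```python
def resample_indices(n_frames, first, last):
    """Map new-sequence frame numbers to original frame numbers with a
    quadratically increasing rate (hypothetical mapping; the original
    paper's exact formula is not reproduced here)."""
    span = last - first
    return [first + round(span * (m / (n_frames - 1)) ** 2)
            for m in range(n_frames)]

idx = resample_indices(41, 138, 265)
# idx[0] == 138, idx[-1] == 265, and the step between consecutive
# indices grows over time, i.e., the playback rate increases
```

Any monotone, convex index map would produce the same qualitative effect: a background whose apparent velocity varies with time.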

E. CAVIAR Sequence—Man Walking

The proposed methodology was also successfully applied to sequences captured for the purposes of the CAVIAR project [22]. We present the results for frames 310–437 of the original “Walk2” sequence, captured at the INRIA lab, with a time-downsampling factor equal to 2. The final sequence, depicted in Fig. 12, has frames of size 384 × 288 and contains a man walking along a parabolic trajectory. For the piecewise-linear approximation of the instantaneous-frequency curves, the produced TFR images were cut into overlapping slices of fixed length and overlap factor. The final results are given in Fig. 13; they are reasonably accurate, compared to the ground-truth values.
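The piecewise-linear approximation amounts to fitting a straight line to the instantaneous-frequency curve within each overlapping time-slice. The Python sketch below illustrates this with an ordinary least-squares fit per slice; it is a simplified stand-in for, not a reproduction of, the paper's Hough-transform and FCP machinery, and all names are our own.

```python
import numpy as np

def piecewise_linear_if(times, freqs, slice_len, overlap):
    """Fit a line to the IF curve inside each overlapping time-slice and
    return the slice centres together with the fitted slopes."""
    step = max(1, int(slice_len * (1.0 - overlap)))
    centres, slopes = [], []
    for s in range(0, len(times) - slice_len + 1, step):
        t = times[s:s + slice_len]
        f = freqs[s:s + slice_len]
        a, b = np.polyfit(t, f, 1)   # f ≈ a*t + b within the slice
        centres.append(t.mean())
        slopes.append(a)
    return np.array(centres), np.array(slopes)

t = np.arange(40.0)
f = 0.25 * t   # a perfectly linear IF curve: every slice gives slope 0.25
c, sl = piecewise_linear_if(t, f, slice_len=10, overlap=0.5)
```

On a genuinely time-varying IF curve, the per-slice slopes trace out the curve's derivative, which is what links the slice length to the temporal resolution of the velocity estimate.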



Fig. 10. Modified “Coast Guard” sequence—frames 1, 21, and 41 (first three figures) and the extracted layers using the first 30 frames (last two figures).

Fig. 11. Modified “Coast Guard”—from left to right: SPWV for (ω1, ω2) = (13π/176, 0), slice of the SPWV image for n ∈ [29, 44], and corresponding output of the line-matched filter. The estimated instantaneous velocities and displacements, compared to the real ones.

Fig. 12. “CAVIAR”—frames 1, 31, and 61 of the natural sequence (first three figures) and (right) the extracted stationary-background and moving-man objects for the 16th frame.

Fig. 13. “CAVIAR”—estimated instantaneous velocity and displacement, compared to the real ones.

Finally, in Fig. 12, we present the results of the proposed frequency-domain object-extraction method.

VI. CONCLUSION

Combining and extending previously published ideas [6], [7], [11], [12], we presented a new spatiotemporal approach for the estimation of the motions of multiple objects that move with time-varying velocities. For this task, time-frequency representations (TFRs) were the appropriate signal-processing tool. We applied the Hough transform in overlapping time-slices of the TFR images, to increase the robustness against noise and the interference terms present in the TFRs. Moreover, by exploiting the instantaneous-frequency estimates obtained for many spatial-frequency pairs using FCP, we increased the robustness and accuracy of our method. Having estimated the objects' velocities accurately, the proposed object-segmentation approaches produced correct and meaningful results. The experiments on various synthetic and natural sequences verified the effectiveness of the proposed method.

The presented motion model considers time-varying but space-invariant motions, and its effectiveness was demonstrated for dominant and relatively large moving objects. However, by applying the proposed approach space-locally, namely using local (windowed) 2-D Fourier transforms, the above limitations no longer exist. This comes with an increased computational effort, compared to constant-motion approaches, but also with the benefit of increased accuracy. Our future research will focus on achieving real-time motion-tracking characteristics, for both monochrome and color image sequences.
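The space-local variant mentioned in the conclusion, computing spectra over windowed image patches rather than whole frames, can be sketched as follows. This is an illustrative fragment only; the Hann window, the patch size, and the function names are our own assumptions, not choices made in the paper.

```python
import numpy as np

def local_spectrum(frame, cx, cy, win=32):
    """2-D spectrum of a windowed patch centred at (cx, cy): the
    space-local (windowed 2-D Fourier transform) variant, sketched."""
    half = win // 2
    patch = frame[cy - half:cy + half, cx - half:cx + half].astype(float)
    w = np.hanning(win)
    patch = patch * np.outer(w, w)   # separable Hann window to reduce leakage
    return np.fft.fftshift(np.fft.fft2(patch))

frame = np.random.rand(128, 128)
spec = local_spectrum(frame, 64, 64, win=32)
# spec.shape == (32, 32); each patch yields its own spatial spectrum
```

Running the motion-plane analysis on each such local spectrum relaxes the space-invariance assumption, at the computational cost noted above.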

REFERENCES

[1] Y. Wang, J. Ostermann, and Y.-Q. Zhang, Video Processing and Communications. Englewood Cliffs, NJ: Prentice-Hall, 2002.
[2] B. K. P. Horn and B. G. Schunck, “Determining optical flow,” Artif. Intell., vol. 17, pp. 185–203, 1981.
[3] H. H. Nagel and W. Enkelmann, “An investigation of smoothness constraints for the estimation of displacement vector fields from image sequences,” IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-8, no. 5, pp. 565–593, Sep. 1986.
[4] M. J. Black and P. Anandan, “The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields,” Comput. Vis. Image Understand., vol. 63, no. 1, pp. 75–104, Jan. 1996.
[5] J. Y. A. Wang and E. H. Adelson, “Representing moving images with layers,” IEEE Trans. Image Process., vol. 3, no. 5, pp. 625–638, Sep. 1994.
[6] W.-G. Chen, G. B. Giannakis, and N. Nandhakumar, “A harmonic retrieval framework for discontinuous motion estimation,” IEEE Trans. Image Process., vol. 7, no. 9, pp. 1242–1257, Sep. 1998.
[7] C. E. Erdem, G. Z. Karabulut, E. Y. Yanmaz, and E. Anarim, “Motion estimation in the frequency domain using fuzzy c-planes clustering,” IEEE Trans. Image Process., vol. 10, no. 12, pp. 1873–1879, Dec. 2001.
[8] B. Porat and B. Friedlander, “A frequency domain algorithm for multiframe detection of dim targets,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 12, no. 4, pp. 398–401, Apr. 1990.
[9] P. Milanfar, “Two-dimensional matched filtering for motion estimation,” IEEE Trans. Image Process., vol. 8, no. 3, pp. 438–444, Mar. 1999.
[10] W.-G. Chen, G. B. Giannakis, and N. Nandhakumar, “Spatiotemporal approach for time-varying global image motion estimation,” IEEE Trans. Image Process., vol. 5, no. 10, pp. 1448–1461, Oct. 1996.
[11] D. S. Alexiadis and G. D. Sergiadis, “Estimation of multiple accelerated motions using chirp-Fourier transform and clustering,” IEEE Trans. Image Process., vol. 16, no. 1, pp. 142–152, Jan. 2007.
[12] I. Djurovic and S. Stankovic, “Estimation of time-varying velocities of moving objects by time-frequency representations,” IEEE Trans. Image Process., vol. 12, no. 5, pp. 550–562, May 2003.
[13] E. H. Adelson and J. R. Bergen, “Spatiotemporal energy models for the perception of motion,” J. Opt. Soc. Amer. A, vol. 2, pp. 284–299, 1985.
[14] A. B. Watson and A. J. Ahumada, Jr., “Model of human visual-motion sensing,” J. Opt. Soc. Amer. A, vol. 2, pp. 322–342, 1985.
[15] S. S. Beauchemin and J. L. Barron, “The frequency structure of one-dimensional occluding image signals,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 2, pp. 200–206, Feb. 2000.
[16] L. Cohen, “Time-frequency distributions: A review,” Proc. IEEE, vol. 77, no. 7, pp. 941–981, Jul. 1989.
[17] L. Stankovic, “The auto-term representation by the reduced interference distributions: A procedure for kernel design,” IEEE Trans. Signal Process., vol. 44, no. 6, pp. 1557–1564, Jun. 1996.
[18] S. Barbarossa, “Analysis of multicomponent LFM signals by a combined Wigner-Hough transform,” IEEE Trans. Signal Process., vol. 43, no. 6, pp. 1511–1515, Jun. 1995.
[19] C. Mota, I. Stuke, T. Aach, and E. Barth, “Divide-and-conquer strategies for estimating multiple transparent motions,” in Lecture Notes in Computer Science. Berlin, Germany: Springer, 2005, vol. 3417.
[20] Y. Boykov, O. Veksler, and R. Zabih, “Fast approximate energy minimization via graph cuts,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 11, pp. 1222–1239, Nov. 2001.
[21] V. Kolmogorov and R. Zabih, “What energy functions can be minimized via graph cuts?,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 2, pp. 147–159, Feb. 2004.
[22] The CAVIAR project website, 2005. [Online]. Available: http://homepages.inf.ed.ac.uk/rbf/CAVIAR/

Dimitrios S. Alexiadis (S’03) was born in Kozani, Greece, in 1978. He received the Diploma degree in electrical and computer engineering from the Aristotle University of Thessaloniki, Thessaloniki, Greece, in 2002. He is currently pursuing the Ph.D. degree at the Telecommunications Laboratory, Department of Electrical and Computer Engineering, Aristotle University of Thessaloniki.

Since 2002, he has been involved in several research projects, while also serving as a Research and Teaching Assistant at the Aristotle University. His main research interests include image/video processing and speech synthesis/recognition.

Mr. Alexiadis is a member of the Technical Chamber of Greece.

George D. Sergiadis (M’88) was born in Thessaloniki, Greece, in 1955. He received the Diploma degree in electrical engineering from the Aristotle University of Thessaloniki in 1978 and the Ph.D. degree from the École Nationale Supérieure des Télécommunications, Paris, France, in 1982.

He was with Thomson-CSF, France, until 1985, participating in the development of the French magnetic resonance scanner. Since 1985, he has been with the Aristotle University of Thessaloniki, teaching telecommunications and biomedical engineering, now as a Full Professor. For three years, he also served as the Director of the Telecommunications Department. He has developed the Hellenic TTS engine “Esopos” and designed the mobile communications for the Athens Olympic Games in 2004. His current research interests include fuzzy image processing and wireless communications. During the academic year 2004–2005, he was a Visiting Researcher at the Media Lab, Massachusetts Institute of Technology, Cambridge.

Dr. Sergiadis is the President of ELEVIT, the Hellenic Society for Biomedical Engineering, and a member of TEE, ATLAS, SMRM, ESMRM, EMBS, and SCIP.
