982 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 17, NO. 6, JUNE 2008
Estimation of Multiple, Time-Varying Motions Using Time-Frequency Representations and Moving-Objects Segmentation

Dimitrios S. Alexiadis, Student Member, IEEE, and George D. Sergiadis, Member, IEEE
Abstract—We extend existing spatiotemporal approaches to handle the estimation of multiple, time-varying object motions. It is shown that the estimation of multiple, time-varying motions is equivalent to the instantaneous frequency estimation of superimposed FM sinusoids. We therefore apply established signal-processing tools, such as time-frequency representations, to show that at each time instant the energy is concentrated along planes in the 3-D space of spatial frequencies versus instantaneous frequency. Using fuzzy C-planes clustering, we estimate the instantaneous velocities indirectly. Furthermore, by adapting existing approaches to our problem, we identify the moving objects. The experimental results verify the effectiveness of our methodology.

Index Terms—Computer vision, motion estimation, time-frequency representations.
I. INTRODUCTION

ESTIMATING the motion characteristics within a time-varying scene and identifying the moving objects is an important task in many video applications [1]. Most of the proposed motion-analysis techniques are carried out directly in the pixel-intensity domain, typically applying the optical-flow equation or correlation matching to short sequences of frames, combined with spatial motion-smoothness constraints in order to obtain the actual motion field [1, Ch. 6]–[3]. However, this has the unwanted effect of over-smoothing the motion discontinuities across object edges. More recently, this problem has been partially resolved within a robust estimation framework [4].
The technique presented in this paper belongs to the class of spatiotemporal (3-D) spectral approaches [6]–[12], which exploit the information contained in long image sequences and directly estimate the true motion characteristics, regardless of the moving objects' shape or dense motion discontinuities. Neurophysiological evidence suggests that the human visual system probably works in a similar manner [13], [14]. However, the limitation of most spatiotemporal approaches is the assumption of time-constant motion [6]–[9]; thus, they can be applied only over short time intervals. A few attempts have been made to overcome this limitation. The authors in [10] address only the problem of a single, global, time-varying motion, using the high-order ambiguity function to estimate the parameters of chirp signals. The work in [11] handles multiple accelerated motions, exploiting the chirp-Fourier transform and a clustering algorithm. In another recent work [12], the instantaneous velocities are estimated using time-frequency representations (TFRs), with the restriction, however, of a stationary background and small moving objects.

Manuscript received June 24, 2006; revised February 4, 2008. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Onur G. Guleryuz. The authors are with the Telecommunications Laboratory, Department of Electrical and Computer Engineering, Aristotle University of Thessaloniki, Thessaloniki GR 54124, Greece (e-mail: dalexiad@mri.ee.auth.gr; sergiadi@auth.gr). Digital Object Identifier 10.1109/TIP.2008.922437. 1057-7149/$25.00 © 2008 IEEE
Combining and, mainly, extending previous relevant ideas [6], [7], [11], [12], this paper presents a "motion first" spatiotemporal methodology for the estimation of multiple, occluded or transparent, time-varying motions and for the extraction of the moving objects in the frequency domain. In Section II, we formulate the motion estimation problem in the frequency domain and give the theoretical background for instantaneous frequency estimation. In Section III, we present the proposed methodology for motion estimation by TFRs, while in Section IV we address the motion segmentation/object extraction problem. Finally, in Section V, we present experimental results on synthetic and natural sequences, before concluding.
II. PROBLEM FORMULATION AND THEORETICAL BACKGROUND

A. Modeling Multiple Time-Varying Motions

We use a 2.5-D layered representation for image sequences [5], which is described in Fig. 1. The layers are ordered in depth. For each layer $i$, we have a) the intensity map $s_i(\bar{x})$; b) the alpha map $a_i(\bar{x})$, which defines the spatial support/opacity of the layer and always remains registered with the intensity map; and c) the time-varying velocity $\bar{v}_i(t)$, which describes how the maps move over time. The layered representation can be summarized in the equation

$$s(\bar{x}, t) = \sum_i s_i\big(\bar{x} - \bar{d}_i(t)\big)\, w_i(\bar{x}, t) \quad (1)$$

where

$$\bar{d}_i(t) = \int_0^t \bar{v}_i(\tau)\, d\tau \quad (2)$$

is the displacement of the $i$th layer at time $t$, and $w_i(\bar{x}, t)$ is a window function which carries the transparency/occlusion information over time. Although the alpha map $a_i$ remains registered with the intensity map $s_i$, this does not hold for $w_i$. The window $w_i$ is deformed over time and has to be decomposed
Fig. 1. Block diagram for our layered model, similar to the one in [5].
into a constant portion and a deformable, error-introducing portion. After some manipulations [6],

$$s(\bar{x}, t) = \sum_i g_i\big(\bar{x} - \bar{d}_i(t)\big) + \varepsilon(\bar{x}, t) \quad (3)$$

where $g_i$ corresponds to the virtual, constant part of $s_i w_i$. The distortion term $\varepsilon(\bar{x}, t)$ accounts for the deformation of $w_i$ due to occlusions. Any other source of distortion (sensor noise, deviation from the motion model, etc.) can also be included. An explicit analysis in the frequency domain of the occlusion distortion term, for the case of time-constant velocities and two moving objects, can be found in [15]. Notice that our model can intrinsically handle general transparent motions, with occluded motion being only a special case, in which $w_i$ is simply zero or unity.
B. Velocity Estimation as Instantaneous Frequency Estimation

Consider the spatial (2-D) Fourier transform of each frame. From (3), and with $G_i(\bar{\omega})$ denoting the spatial Fourier transform of $g_i(\bar{x})$, we have

$$S(\bar{\omega}, t) = \sum_i G_i(\bar{\omega})\, e^{-j \bar{\omega}^T \bar{d}_i(t)} + E(\bar{\omega}, t) \quad (4)$$

where the instantaneous phase of the $i$th component is

$$\varphi_i(\bar{\omega}, t) = \angle G_i(\bar{\omega}) - \bar{\omega}^T \bar{d}_i(t). \quad (5)$$

Consequently, the corresponding instantaneous frequency is

$$\omega_{t,i}(\bar{\omega}, t) = \frac{\partial \varphi_i(\bar{\omega}, t)}{\partial t} = -\bar{\omega}^T \bar{v}_i(t) = -\omega_x v_{i,x}(t) - \omega_y v_{i,y}(t) \quad (6)$$

where $\bar{v}_i(t) = \big(v_{i,x}(t), v_{i,y}(t)\big)$ stands for the instantaneous velocity of the $i$th object, which is considered a continuous and smooth function of $t$, as expected from the laws of physics. From (4)–(6), one can conclude the following.

1) For a fixed spatial frequency $\bar{\omega}$, the signal $S(\bar{\omega}, t)$ is a superposition of frequency-modulated (FM) complex sinusoids with instantaneous frequencies $\omega_{t,i}(\bar{\omega}, t)$.

2) For a given time instant $t$, the instantaneous frequency varies linearly with $\omega_x$ and $\omega_y$. The triplets $(\omega_x, \omega_y, \omega_{t,i})$ lie on a plane through the origin, whose direction depends on the instantaneous velocity at the instant $t$.
According to the first conclusion, a key component of our approach is the estimation of the instantaneous frequencies. For this purpose, one can exploit established signal-processing tools, such as the time-frequency representations [16]–[18].
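The first conclusion can be checked numerically with a short sketch (variable names and parameters are ours, not the paper's code): translate an image at a known velocity, take the 2-D FFT of every frame, and verify that the phase of a fixed spatial-frequency bin advances at the rate predicted by (6).

```python
import numpy as np

# synthesize a sequence translating at v = (vy, vx) pixels/frame
rng = np.random.default_rng(0)
img = rng.random((64, 64))
vy, vx = 2, 1
frames = [np.roll(img, (t * vy, t * vx), axis=(0, 1)) for t in range(16)]

# spatial FT of each frame; track a single spatial-frequency bin (ky, kx)
ky, kx = 3, 5
S = np.array([np.fft.fft2(f)[ky, kx] for f in frames])

# per-frame phase step of this FM "sinusoid"
dphase = np.angle(S[1:] * np.conj(S[:-1]))

# cf. (6): the phase should advance by -(wy*vy + wx*vx) per frame
wy, wx = 2 * np.pi * ky / 64, 2 * np.pi * kx / 64
expected = np.angle(np.exp(-1j * (wy * vy + wx * vx)))  # wrapped to (-pi, pi]
```

Here the velocity is constant, so the phase step is constant; for a time-varying velocity the same bin becomes a genuine FM signal whose instantaneous frequency tracks $-\bar{\omega}^T \bar{v}(t)$.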
C. Smoothed Pseudo-Wigner–Ville (SPWV) Distribution

In this subsection, we briefly review the smoothed pseudo-Wigner–Ville (SPWV) distribution, which is used in our experiments. For a detailed analysis of TFRs, the reader is referred to [16]–[18]. We drop the variable $\bar{\omega}$ for notational simplicity. The SPWV distribution of $S(t)$ is given by

$$\mathrm{SPWV}_S(t, \omega) = \int h(\tau) \left[ \int g(s - t)\, S\!\left(s + \tfrac{\tau}{2}\right) S^*\!\left(s - \tfrac{\tau}{2}\right) ds \right] e^{-j\omega\tau}\, d\tau \quad (7)$$

namely from the 2-D convolution of the well-known Wigner–Ville (WV) distribution

$$\mathrm{WV}_S(t, \omega) = \int S\!\left(t + \tfrac{\tau}{2}\right) S^*\!\left(t - \tfrac{\tau}{2}\right) e^{-j\omega\tau}\, d\tau \quad (8)$$

with a separable 2-D smoothing kernel $g(t) H(\omega)$, where $H(\omega)$ is the Fourier transform of a 1-D smoothing window $h(\tau)$.

The WV distribution presents perfect localization along the instantaneous frequency for a single linear FM sinusoid. However, it presents emphatic interference terms in the case of multicomponent signals, which may be destructive in many applications. Fortunately, the interference terms have an oscillatory nature and can, therefore, be attenuated by the smoothing operation with the kernel $g(t) H(\omega)$. Introducing a separable smoothing kernel allows us to control the smoothing in time and in frequency freely and independently of each other: the wider the time (frequency) smoothing window, the more time (frequency) smoothing is achieved. Also, it can be shown that the well-known spectrogram, namely the squared modulus of the windowed (short-time) FT, is related to the WV distribution through an appropriate (nonseparable) kernel [16].
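A direct, unoptimized discretization of the SPWV can be sketched as follows (a hedged illustration; the window lengths, normalization, and boundary handling are our own choices, not the paper's):

```python
import numpy as np

def spwv(x, g, h):
    """Discrete smoothed pseudo-Wigner-Ville distribution (sketch).

    x : complex 1-D signal; g : time-smoothing window (odd length);
    h : lag (i.e., frequency-smoothing) window (odd length).
    Returns a real (len(h), len(x)) array indexed [frequency bin, time].
    """
    x = np.asarray(x, dtype=complex)
    N, M = len(x), len(h)
    Lg = (len(g) - 1) // 2            # half-length of the time window
    Lh = (M - 1) // 2                 # half-length of the lag window
    ker = np.zeros((M, N), dtype=complex)
    for n in range(N):
        for m in range(-Lh, Lh + 1):  # lag variable
            acc = 0.0 + 0.0j
            for p in range(-Lg, Lg + 1):  # local average in time
                i1, i2 = n + p + m, n + p - m
                if 0 <= i1 < N and 0 <= i2 < N:
                    acc += g[p + Lg] * x[i1] * np.conj(x[i2])
            ker[m % M, n] = h[m + Lh] * acc
    # Fourier transform over the lag variable yields the TFR
    return np.real(np.fft.fft(ker, axis=0))
```

For a complex sinusoid of normalized frequency $f_0$, the instantaneous-autocorrelation kernel oscillates at $2 f_0$, so the peak appears near bin $2 f_0 M$; interior columns show the expected sharp ridge.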
D. Extracting Information From Time-Frequency Images

The simplest approach to extracting the instantaneous frequencies is to find, for each $t$, the peaks along $\omega$ in the TFR image. However, in order to obtain robustness against noise and interference terms, we use the following methodology. Considering that the instantaneous velocities and, consequently, the instantaneous frequencies in (6) vary as smooth functions of $t$, we expect smooth curves in the TF plane, which can be approximated as piecewise-linear polynomials; namely, one can approximate higher-order FM signals by piecewise-linear FM signals. Let the length of the image sequence be $N$. We break the interval $[0, N-1]$ into overlapping intervals of length $N_T$

$$T_q = \big[t_q - N_T/2,\; t_q + N_T/2\big] \quad (9)$$
where $t_q$ is the center time instant of the $q$th interval $T_q$. With $\beta$ being the overlapping factor between two successive intervals, we have

$$t_{q+1} - t_q = (1 - \beta)\, N_T. \quad (10)$$

We want to estimate the mean frequencies and the chirp rates for each time interval. To its center $t_q$, we assign the instantaneous frequency estimates $\hat{\omega}_{t,i}(t_q)$. The instantaneous frequencies for each time instant can then be interpolated from the $\hat{\omega}_{t,i}(t_q)$. For the desired estimates, we cut the TFR image into overlapping time-slices corresponding to the intervals $T_q$. The linear FM estimation within each time interval is performed as described below.
The WV distribution ideally maps a linear FM sinusoid into a line in the TF plane. Consequently, the problem of detecting and estimating such a signal reduces to the easy-to-solve problem of detecting a line, which can be accomplished using the Hough transform. Applying the Hough transform to the WV distribution, one obtains the Wigner–Hough transform [18]. It can be shown that this is the asymptotically efficient estimator (it reaches the Cramér–Rao bound) of linear FM sinusoids in white Gaussian noise. In practice, the TFR is calculated for discrete values of $t$ and $\omega$. Therefore, we use a discrete-space version of the Hough transform, similarly to [9]. The distance between a point $(t, \omega)$ in a discrete image and a line $\omega = \alpha t + b$ with slope $\alpha$ and bias $b$ is calculated by

$$d(t, \omega; \alpha, b) = \frac{|\omega - \alpha t - b|}{\sqrt{1 + \alpha^2}}. \quad (11)$$

A "digital" line $L(\alpha, b)$ of width $\Delta W$ is defined as the set of all points $(t, \omega)$ satisfying $d(t, \omega; \alpha, b) \le \Delta W / 2$. Given the total height of a TFR equal to $H$, a good value for $\Delta W$, used in our experiments, is a small fixed fraction of $H$. The detection-estimation of lines is performed by searching for the peaks of the 2-D function (line matched filter)

$$\Lambda(\alpha, b) = \frac{1}{N_{\alpha,b}} \sum_{(t,\omega) \in L(\alpha,b)} \mathrm{TFR}(t, \omega) \quad (12)$$

where $N_{\alpha,b}$ is the number of pixels belonging to the line $L(\alpha, b)$. Notice that, thanks to the integration along lines, even when strong interference terms are present in the TFR images, they are attenuated because of their oscillatory nature. Therefore, the detector is relatively efficient for multiple FM components.
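The digital-line matched filter of (11)-(12) can be sketched directly (an illustrative brute-force search over a candidate grid; the parameter grids and the width are our own choices):

```python
import numpy as np

def line_matched_filter(tfr, alphas, biases, width=1.0):
    """Score every candidate line omega = alpha*t + b by the average TFR
    value over the 'digital line' band of half-width width/2, cf. (11)-(12)."""
    H, T = tfr.shape                    # tfr is indexed [omega, t]
    t = np.arange(T)
    omega = np.arange(H)[:, None]       # column vector for broadcasting
    score = np.zeros((len(alphas), len(biases)))
    for i, a in enumerate(alphas):
        for j, b in enumerate(biases):
            # point-to-line distance, eq. (11)
            d = np.abs(omega - a * t - b) / np.sqrt(1 + a ** 2)
            mask = d <= width / 2.0     # the digital line L(alpha, b)
            n = mask.sum()
            if n:
                score[i, j] = tfr[mask].sum() / n   # eq. (12)
    return score
```

Peaks of the returned score array give the (slope, bias) estimates, i.e., the chirp rate and mean frequency of each linear FM component within the slice.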
III. PROPOSED METHODOLOGY

A. General Approach

According to the motion analysis of Section II and the theory of TFRs, a methodology for the estimation of smoothly time-varying motions can be summarized in the following steps.

1) For each spatial frequency $\bar{\omega} = (\omega_x, \omega_y)$, we map the signal $S(\bar{\omega}, t)$ into the TF plane using a TF representation

$$S(\bar{\omega}, t) \longmapsto \mathrm{TFR}_S(\bar{\omega}; t, \omega). \quad (13)$$

2) For each time interval $T_q$ (and each spatial frequency), the frequencies of the FM components are calculated using the methodology of Section II-D.

3) For each time interval $T_q$, we arrange the data as 3-D points

$$P_k = \big(\omega_x^{(k)}, \omega_y^{(k)}, \hat{\omega}_t^{(k)}\big), \quad k = 1, \dots, N_p \quad (14)$$

where $N_p$ is the number of data points. These triplets are expected to be concentrated along planes through the origin. Using the fuzzy C-planes (FCP) clustering algorithm as in [11], we group them into planes, simultaneously calculating the planes' parameters and, consequently, the velocities $\bar{v}_i(t_q)$, according to (6). The velocities for each time instant are interpolated from the $\bar{v}_i(t_q)$.

The efficiency of the above approach depends strongly on the accuracy of the instantaneous frequency estimates given in the second step. Exploiting the estimates for many spatial frequencies through clustering renders the methodology more efficient.
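The plane-clustering step can be sketched as follows (a simplified illustration of the fuzzy C-planes idea, not the exact algorithm of [11]; the initialization, iteration count, and fuzzifier are our own choices). Each cluster is a plane through the origin represented by a unit normal; by (6) the normal is proportional to $(v_x, v_y, 1)$, so the velocity is read off by dividing by the last component.

```python
import numpy as np

def fuzzy_c_planes(points, c=2, m=2.0, n_iter=50, seed=0):
    """Fuzzy clustering of 3-D points onto c planes through the origin.
    Point-to-plane distance is |n . p|; memberships follow the usual
    fuzzy-c-means update; each normal is refit as the smallest
    eigenvector of the membership-weighted scatter matrix."""
    rng = np.random.default_rng(seed)
    P = np.asarray(points, dtype=float)            # (N, 3) triplets
    normals = rng.normal(size=(c, 3))
    normals /= np.linalg.norm(normals, axis=1, keepdims=True)
    for _ in range(n_iter):
        d = np.abs(P @ normals.T) + 1e-9           # (N, c) distances
        u = 1.0 / d ** (2.0 / (m - 1.0))           # fuzzy memberships
        u /= u.sum(axis=1, keepdims=True)
        for i in range(c):
            w = u[:, i] ** m
            S = (P * w[:, None]).T @ P             # weighted scatter
            _, vecs = np.linalg.eigh(S)
            normals[i] = vecs[:, 0]                # best-fit plane normal
    return normals
```

With triplets generated from two velocities, each recovered normal, normalized by its third component, yields one velocity estimate.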
B. Some Remarks and Implementation Details

Static Background: The effect of a static background is the addition of a DC term in $S(\bar{\omega}, t)$, for each spatial frequency [see (4)–(6)]. Therefore, for each $\bar{\omega}$, one can subtract from $S(\bar{\omega}, t)$ its temporal mean value. This cancels out the strong effects of the static background in the TF plane and reveals the weaker frequency components.

Temporal Aliasing: The minimum required sampling rate for an alias-free (smoothed pseudo-) WV distribution is twice the Nyquist rate. Therefore, aliasing introduces wrapping of the motion planes along the temporal-frequency axis. As a consequence, it is preferable to feed the FCP clustering algorithm only with triplets corresponding to low spatial frequencies, for which the instantaneous temporal frequencies remain inside the alias-free band. Also, for small $|\bar{\omega}|$, the SNR remains high, since natural images have their energy concentrated in the low spatial frequencies, in contrast to the noise term.
Spatially Local Application: The assumptions of the motion model may be relaxed, letting the objects (the intensity and alpha maps) deform slowly with time, because of the time-windowing introduced by the TF representations. This is also verified by the presented experimental results. Additionally, by applying the motion model within spatial blocks (overlapping or not), more complicated motions (for example, affine) are well approximated. Compared to pixel-domain approaches, frequency-domain approaches suffer from requiring large local windows, within which different motions can occur. This problem is partly resolved given that our approach can handle discontinuous motions.
TFR Window Selection: The choice of the windows' type and width used in the TFRs affects the accuracy of the instantaneous frequency estimates. However, the optimal width depends on the instantaneous frequencies, which are the unknowns to be estimated, as analyzed in [12]. An explicit study of the influence of the used windows on the accuracy of the proposed method is beyond the scope of this paper, since the final velocity estimates depend on the effectiveness of the FCP. Notice, however, that for all the experiments presented, the methodology worked effectively using discrete Hamming windows for both the time-smoothing window $g(t)$ and the frequency-smoothing window $h(\tau)$, with lengths chosen relative to the sequence length $N$.
IV. MOVING OBJECTS' EXTRACTION AND MOTION SEGMENTATION

In order to complete our motion estimation task by additionally identifying the moving objects, we modify two known approaches [19]–[21], adapting them to our problem, as follows.
A. Objects' Extraction in the Frequency Domain

Ignoring the error term in (4), consider the components $G_i(\bar{\omega})$, which, according to the analysis in Section II-A, correspond to the constant shape part of the $i$th object. Discretizing the time variable, for each spatial frequency we have a linear system $\mathbf{s} = \mathbf{W}\mathbf{a}$ ($\bar{\omega}$ is dropped from $S$ and $G_i$ for notational simplicity)

$$\begin{bmatrix} S(t_1) \\ S(t_2) \\ \vdots \\ S(t_M) \end{bmatrix} = \begin{bmatrix} e^{-j\bar{\omega}^T \bar{d}_1(t_1)} & \cdots & e^{-j\bar{\omega}^T \bar{d}_K(t_1)} \\ \vdots & \ddots & \vdots \\ e^{-j\bar{\omega}^T \bar{d}_1(t_M)} & \cdots & e^{-j\bar{\omega}^T \bar{d}_K(t_M)} \end{bmatrix} \begin{bmatrix} G_1 \\ \vdots \\ G_K \end{bmatrix} \quad (15)$$

where $K$ is the number of objects, $M$ is the number of frames, and $\bar{d}_i(t_m)$ is the displacement of the $i$th object from frame 1 to frame $m$. Since the time-varying velocities and, consequently, the corresponding displacements have been estimated, the phase matrix $\mathbf{W}$ is known, and the objective is to calculate the objects vector $\mathbf{a}$. When $M = K$, we have a $K \times K$ system of linear equations, which is solved analytically: $\mathbf{a} = \mathbf{W}^{-1}\mathbf{s}$. However, one can exploit the information contained in a longer frame sequence by using $M > K$ and solving the system in the least-squares sense, i.e., minimizing $\|\mathbf{W}\mathbf{a} - \mathbf{s}\|^2$.
Calculating $\mathbf{a}$ for each $\bar{\omega}$ and applying the inverse Fourier transform to each $G_i$, one obtains the image of the $i$th object. In practice, we find it convenient to solve the system only when all the singular values of $\mathbf{W}$ are upper- and lower-bounded (for instance, by 10 and 0.1, respectively), in order to avoid singularities and obtain robustness against noise. The missing values of $G_i(\bar{\omega})$ for some spatial frequencies can then be estimated using trigonometric or simpler, less computationally expensive, interpolation schemes. Finally, when only low spatial-frequency components are used, noise filtering is achieved directly in the frequency domain. Obviously, an advantage of the described approach is that it can be applied directly to the separation of transparent objects.
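A toy implementation of this least-squares separation (our own sketch, assuming purely translational layers with known displacements and periodic boundaries) might look like:

```python
import numpy as np

def separate_layers(frames, displacements):
    """Frequency-domain layer separation, cf. (15).

    frames        : (M, H, W) real image sequence
    displacements : (K, M, 2) displacement (dy, dx) of layer i at frame m
    Returns (K, H, W) estimated layer images, referenced to frame 1.
    """
    M, H, Wd = frames.shape
    K = displacements.shape[0]
    S = np.fft.fft2(frames)                       # per-frame spectra
    wy = np.fft.fftfreq(H)[:, None] * 2 * np.pi   # spatial frequencies
    wx = np.fft.fftfreq(Wd)[None, :] * 2 * np.pi
    # phase factors e^{-j w . d_i(t_m)} for every frequency, (K, M, H, W)
    P = np.exp(-1j * (wy[None, None] * displacements[:, :, 0, None, None]
                      + wx[None, None] * displacements[:, :, 1, None, None]))
    G = np.zeros((K, H, Wd), dtype=complex)
    for u in range(H):
        for v in range(Wd):
            Wm = P[:, :, u, v].T                  # (M, K) phase matrix
            G[:, u, v] = np.linalg.lstsq(Wm, S[:, u, v], rcond=None)[0]
    return np.real(np.fft.ifft2(G))
```

At frequencies where the phase matrix is (nearly) rank-deficient (e.g., DC), `lstsq` returns the minimum-norm split; the paper instead skips such frequencies via the singular-value bounds and interpolates the missing spectrum values.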
B. Segmentation by Energy Minimization

The goal here is to assign to each pixel one of the estimated velocities, i.e., a motion label $f(\bar{x})$, introducing a labeling $f$ over the image plane. This is performed by searching for the labeling $f$ which minimizes the objective function (we drop $t$ for simplicity)

$$E(f) = E_{\mathrm{res}}(f) + E_{\mathrm{smooth}}(f) + E_{\mathrm{temp}}(f). \quad (16)$$

Fig. 2. $\alpha$-expansion graph, given for a 1-D image for simplicity. The appropriate link weights are given in Table I.

TABLE I
WEIGHTS USED IN THE $\alpha$-EXPANSION ALGORITHM

The term $E_{\mathrm{res}}(f)$, given in (17), shown at the bottom of the next page, enforces the classification of the pixels based on the intensity residuals after motion compensation. This, however, becomes unreliable in the presence of noise or lack of texture. Therefore, the spatial motion-smoothness term of (18), shown at the bottom of the next page, is also used, with "$q \sim p$" meaning that $q$ is in the neighborhood of $p$. The penalty is more tolerant of motion-label discontinuities when real intensity discontinuities are present. Also, the temporal smoothness term of (19), shown at the bottom of the next page, takes into account the labeling of previous frames.

The optimal solution for the energy function of (16) can be obtained using the $\alpha$-expansion algorithm [20], [21], which is well suited to our problem, giving a very fast solution. Traditional optimization techniques (such as gradient- or Hessian-based methods) cannot be applied here, because $E(f)$ is multimodal and nonconvex. Also, the application of a genetic algorithm is computationally expensive, according to our experiments. The $\alpha$-expansion algorithm is summarized as follows.

1) Initialization: Start with the labeling $f$ which simply minimizes $E_{\mathrm{res}}(f)$, i.e., the motion residuals.
2) Cycle: For each label $\alpha$:
   • Iteration: Find the optimal labeling $f'$ within an $\alpha$-expansion move, which is explained directly below.
3) If $E(f') < E(f)$, set $f = f'$ and begin a new cycle; else, finish.
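The structure of the objective (16) can be illustrated with a small evaluator (a hedged sketch with our own simplifications: integer displacements, a plain Potts spatial penalty with a constant weight rather than the intensity-adaptive one, and a constant temporal penalty):

```python
import numpy as np

def labeling_energy(frame, prev_frame, labels, prev_labels,
                    displacements, lam=1.0, mu=0.5):
    """Evaluate a simplified version of the objective (16).

    displacements : dict mapping label -> integer (dy, dx) per-frame motion.
    """
    # data term, cf. (17): intensity residual after motion compensation
    e_res = 0.0
    for lab, (dy, dx) in displacements.items():
        mask = labels == lab
        comp = np.roll(prev_frame, (dy, dx), axis=(0, 1))
        e_res += np.abs(frame - comp)[mask].sum()
    # spatial term, cf. (18): Potts penalty on 4-neighbor label changes
    e_sm = lam * ((labels[1:, :] != labels[:-1, :]).sum()
                  + (labels[:, 1:] != labels[:, :-1]).sum())
    # temporal term, cf. (19): disagreement with the previous labeling
    e_tmp = mu * (labels != prev_labels).sum()
    return e_res + e_sm + e_tmp
```

A labeling consistent with the true motion drives the residual term to zero, so any search procedure (here, $\alpha$-expansion) that lowers this energy moves toward the correct segmentation.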
986 <strong>IEEE</strong> TRANSACTIONS ON IMAGE PROCESSING, VOL. 17, NO. 6, JUNE 2008<br />
Fig. 3. "White Tower & Boat"—frames 1, 32, and 64 (first three figures) and the extracted layers, obtained by applying (20) to the objects' images, taken using the proposed frequency-domain method for the first 32 frames (last two figures).

Fig. 4. "White Tower & Boat"—estimated instantaneous velocities.

Fig. 5. "White Tower & Boat"—motion segmentation results using graph cuts for frames 1, 10, and 20.
Given a label $\alpha$, we have an $\alpha$-expansion move if some labels in $f$ have changed to $\alpha$ in $f'$. Dynamically, in each iteration, a binary graph is constructed as in Fig. 2. We have two terminal nodes, $\alpha$ and $\bar{\alpha}$, expressing whether a pixel will change its label to $\alpha$ or not. The nonterminal nodes represent the image pixels. Two neighboring pixels $p$ and $q$ having different labels are connected through an intermediate auxiliary node, with appropriate weights on the two links. Neighboring pixels $p$ and $q$ with the same label are directly connected to each other. The nonterminal nodes are connected to $\alpha$ and $\bar{\alpha}$ with t-links of appropriate weights, as in Fig. 2. With $D_p(\cdot)$ being the "data term" (see [20] for details), the weights used are summarized in Table I. In order to find the optimal $\alpha$-expansion move, we seek the minimum-cost cut of the graph. A cut is a partitioning of the nodes into two subsets $A$ and $B$, such that $\alpha \in A$ and $\bar{\alpha} \in B$, and it includes exactly one t-link for any pixel $p$. Its cost is the sum of the weights of the links it includes. After finding the minimum-cost cut, a pixel $p$ is

Fig. 6. "Walker"—the walker's positions in frames 1, 11, 21, ..., 181 of the natural sequence.

Fig. 7. "Walker"—(left) the SPWV for a low spatial frequency and (right) the estimated instantaneous velocity.
$$E_{\mathrm{res}}(f) = \sum_{\bar{x}} \big| s(\bar{x}, t) - s\big(\bar{x} - \bar{d}_{f(\bar{x})},\, t - 1\big) \big| \quad (17)$$

$$E_{\mathrm{smooth}}(f) = \sum_{\{p,q\}:\, q \sim p} V_{p,q}, \qquad V_{p,q} = \begin{cases} u_{p,q}, & f(p) \neq f(q) \\ 0, & \text{otherwise} \end{cases} \quad (18)$$

$$E_{\mathrm{temp}}(f) = \sum_{\bar{x}} \begin{cases} c_{\mathrm{t}}, & f(\bar{x}) \neq f_{\mathrm{prev}}(\bar{x}) \\ 0, & \text{otherwise} \end{cases} \quad (19)$$

where $\bar{d}_{f(\bar{x})}$ is the estimated displacement for label $f(\bar{x})$ between consecutive frames, $u_{p,q}$ is smaller across intensity discontinuities, and $f_{\mathrm{prev}}$ is the labeling of the previous frame.
Fig. 8. "Transparent Lena-Cameraman"—frames 1, 16, and 32 of the sequence (first three figures) and the extracted layers using the proposed frequency-domain method (last two figures).

Fig. 9. "Transparent Lena-Cameraman"—(left) the motion-plane points for one time interval and (right) the estimated instantaneous velocities.
assigned the label $\alpha$ if the cut separates it from the terminal $\alpha$, or it keeps its old label otherwise.
C. Layers' Synthesis, Background Mosaicing

Given that the instantaneous velocity estimation was adequately precise and the moving objects were efficiently separated using one of the presented methodologies, the objects will appear stationary in the motion-compensated sequence. Variations due to occlusions, noise, and illumination changes are expected. Moreover, some regions of a moving background appear only after some frames. The extraction of a complete representative image for each layer is therefore performed by a temporal median operation [5]

$$\tilde{s}_i(\bar{x}) = \operatorname{median}_t \big\{ s_i^{\mathrm{mc}}(\bar{x}, t) \big\} \quad (20)$$

where $s_i^{\mathrm{mc}}(\bar{x}, t)$ is the motion-compensated image of the $i$th object. This procedure, together with the other layers, synthesizes the background mosaic (a panoramic image), or the "sprite" in MPEG-4 terminology.
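The temporal-median synthesis of (20) can be sketched in a few lines (our own minimal version, assuming integer displacements and periodic boundaries):

```python
import numpy as np

def layer_mosaic(frames, displacements):
    """Motion-compensate each frame by the layer's (integer) displacement,
    then take a per-pixel temporal median, cf. (20)."""
    comp = np.stack([np.roll(f, (-int(dy), -int(dx)), axis=(0, 1))
                     for f, (dy, dx) in zip(frames, displacements)])
    return np.median(comp, axis=0)
```

The median makes the representative image robust to occlusions and transient noise: a pixel corrupted in a minority of frames does not affect the result.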
V. EXPERIMENTAL RESULTS

The overall motion-estimation algorithm is described in Algorithm 1 (shown at the top of the next page). The Matlab code that reproduces the simulation results described in this section can be found at http://esopos.ee.auth.gr/TIP-VaryingMotion/code.zip. In the first experiment, the spectrogram is used because of its simplicity. In the subsequent experiments, the SPWV distribution is used, with a Hamming window as the time-smoothing window $g(t)$ and a Hamming window as the frequency-smoothing window $h(\tau)$.
A. "White Tower & Boat" Synthetic Sequence

The synthetically generated sequence of Fig. 3 contains a moving object (boat) on a moving background (White Tower). The frame size is 296 x 296. The background moves with a constant velocity, while the boat decelerates over time. The simulated sequence was generated according to the presented motion model of (1) and (2) and Fig. 1. We used two 8-bit grayscale still images: a background "White Tower" image of size 502 x 296 and an image containing the "Boat" object. These still images correspond to the intensity maps $s_i$, $i = 1, 2$, of the model. The motion of the background was obtained by selecting the appropriate 296 x 296 region of the corresponding still image for each time instant. For the "Boat," we manually created an auxiliary "mask" image, which corresponds to the window $w_i$ of (1). We cropped the "Boat" object from the still image and superimposed it on the background frame at the appropriate instantaneous location. In order to handle noninteger instantaneous displacements, we applied bilinear interpolation. Finally, to obtain a more natural visual appearance, we applied a weak smoothing operation near the border of the "Boat" object.
For this experiment, white Gaussian noise of known SNR was added. We use the spectrogram because of its simplicity, computed as the squared modulus of the short-time Fourier transform. In the TFR images, we search for two peaks along $\omega$ for each time instant $t$; this is equivalent to applying the proposed Hough-transform methodology on TFR slices of unit length. Although the spectrogram presents interference terms and poor localization, the instantaneous velocities estimated using FCP are very close to the real ones, as can be seen in Fig. 4. In Fig. 3, we also present the layers obtained by applying (20) to the object images obtained using the proposed frequency-domain approach of Section IV-A. Finally, in Fig. 5, we give the results of the motion segmentation approach using graph cuts.
B. "Walker"

The natural sequence, used in [12] and depicted in Fig. 6, contains a single walker, whose size is small compared to the static background. The sequence contains video frames of size 320 x 240. The SPWV of $S(\bar{\omega}, t)$ for a given low spatial frequency is depicted in Fig. 7. The TFR images are clean, in the sense that almost no interference terms exist and the TF localization is good. Applying our methodology, cutting the TFR images into overlapping time-slices, the estimated instantaneous velocity is given in Fig. 7. Our results are similar to those in [12]. However, our methodology is more general, in the sense that it can also handle sequences with large moving objects and a moving background, at the cost of increased computational effort.
C. Transparent "Lena-Cameraman" Synthetic Sequence

In order to verify the effectiveness of the approach for transparent motions, we created the synthetic sequence shown in Fig. 8. The final sequence is of size 256 x 256. To generate it, we used 8-bit grayscale, 512 x 512 versions of the well-known "Lena" and "Cameraman" still images. The velocities of the sequence are nonzero only along the horizontal direction; therefore, along the vertical direction, we used only a fixed region of the original images. The instantaneous horizontal velocities vary as quadratic polynomials of time, as depicted in Fig. 9. The motions of the layers were obtained by selecting the appropriate regions of the corresponding still images ("Lena" and "Cameraman") for each time instant. The translating foreground "Lena" layer is opaque except for a co-translating rectangular region, where it is transparent. According to the presented motion model, the corresponding alpha map is unity everywhere, except in this rectangular image region, where it is equal to 0.5. To handle noninteger instantaneous displacements, we applied bilinear interpolation.
White Gaussian noise (WGN) with dB was added.<br />
The estimated instantaneous velocities are given in Fig. 9, together with the actual ones. Using the SPWV, the estimated motion-plane points for are also given in Fig. 9. Finally, using the presented frequency-domain approach, the synthesized layers are presented in Fig. 8.
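As an illustration of the general idea behind frequency-domain layer separation (a sketch under simplifying assumptions, namely known per-frame displacements and circular, noise-free translation, and not the exact method of Section IV-A; the name `separate_two_layers` is ours): by the Fourier shift theorem, each frame's spectrum is a phase-weighted sum of the two layer spectra, so the layers can be recovered by a small least-squares solve at every spatial frequency.

```python
import numpy as np

def separate_two_layers(frames, disp_a, disp_b):
    """Recover two additively combined translating layers from a short
    sequence, given their per-frame displacements (dy, dx), by solving
    a 2x2 least-squares problem at every spatial frequency."""
    T, H, W = frames.shape
    F = np.fft.fft2(frames, axes=(-2, -1))
    ky = np.fft.fftfreq(H)[:, None]          # cycles/pixel, vertical
    kx = np.fft.fftfreq(W)[None, :]          # cycles/pixel, horizontal

    def phases(disp):                        # (T, H, W) shift-theorem factors
        return np.stack([np.exp(-2j * np.pi * (ky * dy + kx * dx))
                         for dy, dx in disp])

    Pa, Pb = phases(disp_a), phases(disp_b)
    A = np.empty((H, W), complex)
    B = np.empty((H, W), complex)
    for i in range(H):
        for j in range(W):
            M = np.stack([Pa[:, i, j], Pb[:, i, j]], axis=1)   # (T, 2)
            (A[i, j], B[i, j]), *_ = np.linalg.lstsq(M, F[:, i, j],
                                                     rcond=None)
    return np.fft.ifft2(A).real, np.fft.ifft2(B).real
```

The system is singular at frequencies where the two phase histories coincide (e.g., at DC); there, the least-squares solution splits the observed energy between the layers.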
D. Modified “Coast Guard”
In this experiment, the well-known “Coast Guard” sequence is used. In order to obtain a time-varying motion for the background, we created a modified sequence (depicted in Fig. 10) of length . Frames 138–265 of the original sequence are used at an increasing rate, according to the formula , with and being the frame numbers in the new and the original sequence, respectively.
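This temporal resampling can be sketched as follows. The concrete index formula above did not survive extraction, so the accelerating quadratic map used in the example is purely hypothetical; the name `remap_frames` is ours.

```python
def remap_frames(frames, index_map, n_new):
    """Resample a sequence in time: new frame n is original frame
    round(index_map(n)), clipped to the valid range. A monotonically
    accelerating index_map yields an apparently speeding-up sequence."""
    n_orig = len(frames)
    out = []
    for n in range(n_new):
        m = int(round(index_map(n)))
        out.append(frames[min(max(m, 0), n_orig - 1)])
    return out
```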
The SPWV for is given in Fig. 11(a). Notice that strong interference terms are present, because the applied time/frequency smoothing is not sufficient. However, by applying the presented Hough-transform methodology, the estimation of the instantaneous frequencies becomes reliable. Two peaks are clearly visible in the output of the line-matched filter [Fig. 11(c)] applied to the SPWV slice of Fig. 11(b). The length of the SPWV slices is and the overlapping factor between successive pieces is . The estimated instantaneous velocities and displacements are given in Fig. 11; they are close to the real ones, found by manual feature-point tracking. In Fig. 10, the extracted layers are also shown, obtained by applying (20) to the results of the frequency-domain approach of Section IV-A.
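A Hough-style line-matched filter over a TFR slice can be sketched as follows (our own simplified variant, not the paper's exact filter): the TFR energy is integrated along every candidate line f(t) = f0 + s·t, and peaks of the resulting score map indicate the instantaneous-frequency lines of the motion components.

```python
import numpy as np

def line_matched_filter(tfr_slice, slopes):
    """Integrate TFR energy along candidate lines f(t) = f0 + s*t.
    tfr_slice has shape (time, frequency); returns score[slope_index, f0]."""
    T, Nf = tfr_slice.shape
    t = np.arange(T)
    scores = np.zeros((len(slopes), Nf))
    for si, s in enumerate(slopes):
        for f0 in range(Nf):
            f = np.round(f0 + s * t).astype(int)
            valid = (f >= 0) & (f < Nf)          # ignore out-of-band samples
            scores[si, f0] = tfr_slice[t[valid], f[valid]].sum()
    return scores
```

Because the filter accumulates energy along whole lines, isolated interference terms in the TFR contribute little to any single candidate line, which is why the estimation stays reliable even with insufficient smoothing.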
E. CAVIAR Sequence—Man Walking
The proposed methodology was also applied successfully to sequences captured for the CAVIAR project [22]. We present the results for frames 310–437 of the original “Walk2” sequence, captured at the INRIA lab, with a time-downsampling factor equal to 2. The final sequence, depicted in Fig. 12, has frames of size 384×288 and contains a man walking along a parabolic trajectory. For the piecewise-linear approximation of the instantaneous-frequency curves, the produced TFR images were cut into overlapping slices of length , with overlapping factor . The final results are given in Fig. 13 and they are reasonably accurate, compared to the ground-truth values.
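The overlapping-slice partitioning used for the piecewise-linear approximation can be sketched as follows (our own helper with hypothetical parameter values in the example; the paper's actual slice length and overlap are not legible above):

```python
def overlapping_slices(n_total, length, overlap):
    """(start, end) index pairs of fixed-length slices covering [0, n_total),
    where `overlap` is the fraction shared by successive slices."""
    step = max(1, int(round(length * (1.0 - overlap))))
    starts = list(range(0, n_total - length + 1, step))
    if not starts:
        starts = [0]
    if starts[-1] + length < n_total:    # make sure the tail is covered
        starts.append(n_total - length)
    return [(s, s + length) for s in starts]
```

Each slice is then processed independently (e.g., by the line-matched filter), and the per-slice line parameters form the piecewise-linear approximation of the instantaneous-frequency curves.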
Fig. 10. Modified “Coast Guard” sequence—frames 1, 21, and 41 (first three figures) and the extracted layers using the first 30 frames (last two figures).
Fig. 11. Modified “Coast Guard”—from left to right: SPWV for (ωx, ωy) = (13π/176, 0), slice of the SPWV image for n ∈ [29, 44], and the corresponding output of the line-matched filter. The estimated instantaneous velocities and displacements, compared to the real ones.
Fig. 12. “CAVIAR”—frames 1, 31, and 61 of the natural sequence (first three figures) and (right) the extracted stationary-background and moving-man objects for the 16th frame.
Fig. 13. “CAVIAR”—estimated instantaneous velocity and displacement, compared to the real ones.
Finally, in Fig. 12, we present the results of the proposed frequency-domain object-extraction method.
VI. CONCLUSION
Combining and extending previously published ideas [6], [7], [11], [12], we presented a new spatiotemporal approach for the estimation of the motions of multiple objects that move with time-varying velocities. For this task, time-frequency representations (TFRs) were the appropriate signal processing tool. We applied the Hough transform to overlapping time-slices of the TFR images to increase the robustness against noise and the interference terms present in the TFRs. Moreover, by exploiting the instantaneous-frequency estimates obtained for many spatial-frequency pairs using FCP, we increased the robustness and accuracy of our method. Having estimated the objects' velocities accurately, the proposed object-segmentation approaches produced correct and meaningful results. The experiments on various synthetic and natural sequences verified the effectiveness of the proposed method.

The presented motion model considers time-varying but space-invariant motions, and its effectiveness was demonstrated for the dominant and relatively large moving objects. However, by applying the proposed approach space-locally, namely using local (windowed) 2-D Fourier transforms, the above limitations no longer exist. This comes with increased computational effort, compared to constant-motion approaches, but also with the benefit of increased accuracy. Our future research will focus on achieving real-time motion-tracking characteristics, for both monochrome and color image sequences.
REFERENCES
[1] Y. Wang, J. Ostermann, and Y.-Q. Zhang, Video Processing and Communications. Englewood Cliffs, NJ: Prentice-Hall, 2002.
[2] B. K. P. Horn and B. G. Schunck, “Determining optical flow,” Artif. Intell., vol. 17, pp. 185–203, 1981.
[3] H.-H. Nagel and W. Enkelmann, “An investigation of smoothness constraints for the estimation of displacement vector fields from image sequences,” IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-8, no. 5, pp. 565–593, Sep. 1986.
[4] M. J. Black and P. Anandan, “The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields,” Comput. Vis. Image Understand., vol. 63, pp. 75–104, Jan. 1996.
[5] J. Y. A. Wang and E. H. Adelson, “Representing moving images with layers,” IEEE Trans. Image Process., vol. 3, no. 5, pp. 625–638, Sep. 1994.
[6] W.-G. Chen, G. B. Giannakis, and N. Nandhakumar, “A harmonic retrieval framework for discontinuous motion estimation,” IEEE Trans. Image Process., vol. 7, no. 9, pp. 1242–1257, Sep. 1998.
[7] C. E. Erdem, G. Z. Karabulut, E. Y. Yanmaz, and E. Anarim, “Motion estimation in the frequency domain using fuzzy c-planes clustering,” IEEE Trans. Image Process., vol. 10, no. 12, pp. 1873–1879, Dec. 2001.
[8] B. Porat and B. Friedlander, “A frequency domain algorithm for multiframe detection of dim targets,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 12, no. 4, pp. 398–401, Apr. 1990.
[9] P. Milanfar, “Two-dimensional matched filtering for motion estimation,” IEEE Trans. Image Process., vol. 8, no. 3, pp. 438–444, Mar. 1999.
[10] W.-G. Chen, G. B. Giannakis, and N. Nandhakumar, “Spatiotemporal approach for time-varying global image motion estimation,” IEEE Trans. Image Process., vol. 5, no. 10, pp. 1448–1461, Oct. 1996.
[11] D. S. Alexiadis and G. D. Sergiadis, “Estimation of multiple accelerated motions using chirp-Fourier transform and clustering,” IEEE Trans. Image Process., vol. 16, no. 1, pp. 142–152, Jan. 2007.
[12] I. Djurovic and S. Stankovic, “Estimation of time-varying velocities of moving objects by time-frequency representations,” IEEE Trans. Image Process., vol. 12, no. 5, pp. 550–562, May 2003.
[13] E. H. Adelson and J. R. Bergen, “Spatiotemporal energy models for the perception of motion,” J. Opt. Soc. Amer. A, vol. 2, pp. 284–299, 1985.
[14] A. B. Watson and A. J. Ahumada, “Model of human visual-motion sensing,” J. Opt. Soc. Amer. A, vol. 2, pp. 322–342, 1985.
[15] S. S. Beauchemin and J. L. Barron, “The frequency structure of one-dimensional occluding image signals,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 2, pp. 200–206, Feb. 2000.
[16] L. Cohen, “Time-frequency distributions: A review,” Proc. IEEE, vol. 77, no. 7, pp. 941–981, Jul. 1989.
[17] L. Stankovic, “The auto-term representation by the reduced interference distributions: A procedure for kernel design,” IEEE Trans. Signal Process., vol. 44, no. 6, pp. 1557–1564, Jun. 1996.
[18] S. Barbarossa, “Analysis of multicomponent LFM signals by a combined Wigner-Hough transform,” IEEE Trans. Signal Process., vol. 43, no. 6, pp. 1511–1515, Jun. 1995.
[19] C. Mota, I. Stuke, T. Aach, and E. Barth, “Divide-and-conquer strategies for estimating multiple transparent motions,” in Lecture Notes in Computer Science. Berlin, Germany: Springer, 2005, vol. 3417.
[20] Y. Boykov, O. Veksler, and R. Zabih, “Fast approximate energy minimization via graph cuts,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 11, pp. 1222–1239, Nov. 2001.
[21] V. Kolmogorov and R. Zabih, “What energy functions can be minimized via graph cuts?,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 2, pp. 147–159, Feb. 2004.
[22] The CAVIAR website, 2005 [Online]. Available: http://homepages.inf.ed.ac.uk/rbf/CAVIAR/
Dimitrios S. Alexiadis (S'03) was born in Kozani, Greece, in 1978. He received the Diploma degree in electrical and computer engineering from the Aristotle University of Thessaloniki, Thessaloniki, Greece, in 2002. He is currently pursuing the Ph.D. degree at the Telecommunications Laboratory, Department of Electrical and Computer Engineering, Aristotle University of Thessaloniki.

Since 2002, he has been involved in several research projects and, at the same time, has been a Research and Teaching Assistant at the Aristotle University. His main research interests include image/video processing and speech synthesis/recognition.

Mr. Alexiadis is a member of the Technical Chamber of Greece.
George D. Sergiadis (M'88) was born in Thessaloniki, Greece, in 1955. He received the Diploma degree in electrical engineering from the Aristotle University of Thessaloniki in 1978 and the Ph.D. degree from the École Nationale Supérieure des Télécommunications, Paris, France, in 1982.

He was with Thomson-CSF, France, until 1985, participating in the development of the French magnetic resonance scanner. Since 1985, he has been with the Aristotle University of Thessaloniki, teaching telecommunications and biomedical engineering, now as a Full Professor. For three years, he also served as the Director of the Telecommunications Department. He has developed the Hellenic TTS engine “Esopos” and designed the mobile communications for the Athens Olympic Games in 2004. His current research interests include fuzzy image processing and wireless communications. During the academic year 2004–2005, he was a Visiting Researcher at the Media Lab, Massachusetts Institute of Technology, Cambridge.

Dr. Sergiadis is the President of ELEVIT, the Hellenic Society for Biomedical Engineering, and a member of TEE, ATLAS, SMRM, ESMRM, EMBS, and SCIP.