Learning manifolds of dynamical models for activity recognition

are run on the KTH and CMU databases. Some efforts have been recently put into recognition from single images too (e.g. [30], where actions are learnt from static images taken from the web).

Dynamical models in action recognition. However, encoding the dynamics of videos or image sequences by means of some sort of dynamical model can be useful in situations in which the dynamics is critically discriminative. Furthermore, the actions of interest have to be temporally segmented from a video sequence: we need to know when an action/activity starts or stops. Actions of sometimes very different lengths have to be encoded in a homogeneous fashion in order to be compared ("time warping"). Dynamical representations are very effective in coping with time warping or action segmentation [51].

Furthermore, in limit situations in which a significant number of people move or ambulate in the field of view (as is common in surveillance scenarios), the attention has necessarily to move from single objects/bodies to approaches which consider the monitored crowd some sort of fluid, and describe its behavior in a way similar to the physical modeling of fluids []. Dynamical models are well equipped to deal with such scenarios [].

In these scenarios, action (or identity) recognition reduces to classifying dynamical models. Hidden Markov models [23] have indeed been widely employed in action recognition [43, 51] and gait identification [54, 9]. HMM classification can happen either by evaluating the likelihood of a new sequence with respect to the learnt models, or by learning a new model for the test sequence, measuring its distance from the old models, and attributing to it the label of the closest model.
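The first of these two strategies can be sketched as follows (a toy illustration, not the cited systems: the two-state models, their parameters and the "walk"/"run" labels are invented for the example):

```python
# Likelihood-based HMM classification sketch: score a discrete
# observation sequence under each learnt HMM via the forward
# algorithm, then pick the most likely model's label.

def forward_likelihood(obs, pi, A, B):
    """P(obs | model) via the forward algorithm.
    pi: initial state probs, A: state transition matrix,
    B: per-state emission probabilities over discrete symbols."""
    n = len(pi)
    alpha = [pi[s] * B[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[s] * A[s][t] for s in range(n)) * B[t][o]
                 for t in range(n)]
    return sum(alpha)

# Two toy models (hypothetical parameters): "walk" mostly emits
# symbol 0, "run" mostly emits symbol 1.
walk = ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.9, 0.1], [0.8, 0.2]])
run  = ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.1, 0.9], [0.2, 0.8]])
models = {"walk": walk, "run": run}

seq = [0, 0, 1, 0, 0]  # test sequence, dominated by symbol 0
label = max(models, key=lambda k: forward_likelihood(seq, *models[k]))
print(label)  # → walk
```

The alternative strategy mentioned above would instead fit a new HMM to `seq` and compare it to `walk` and `run` under some model-to-model distance.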

Indeed, many researchers have explored the idea of encoding motions via linear [5], nonlinear [25], stochastic [42, 24] or chaotic [1] dynamical systems, and classifying them by measuring distances in their space. Chaudhry et al. [10], for instance, have used nonlinear dynamical systems (NLDS) to model time series of histograms of oriented optical flow, measuring distances between NLDS by means of Cauchy kernels, while Wang and Mori [56] have proposed sophisticated max-margin conditional random fields to address locality by recognizing actions as constellations of local motion patterns.

Sophisticated graphical models can be useful to learn in a bottom-up fashion the temporal structure or plot of footage, or to describe causal relationships in complex activity patterns [38]. Gupta et al. [28] work on determining the plot of a video by discovering causal relationships between actions, represented as an AND/OR graph whose edges are associated with spatio-temporal constraints. Integer Programming is used for storyline extraction on baseball footage.

Figure 1: Datasets.

Distance-based recognition. The use of distances and manifold learning for action recognition is not limited to dynamical models. Lin et al. [36] think of actions as sequences of prototype trees, learned by hierarchical k-means in a joint shape and motion space. Prototype-to-prototype distances are generated as a look-up table. The joint likelihood of location/prototype is maximized to track actors in the Weizmann and KTH datasets, while actions are recognized by prototype sequence matching. In an interesting related work, Li et al. [35] describe activities as discriminative temporal interaction matrices, living in a Discriminative Temporal Interaction Manifold. They set probability densities on this manifold, and use a MAP classifier to recognize new activities. Their data is a collection of NCAA American football footage.

Distance function learning. A number of distance functions between linear systems have been introduced in the past (e.g., [52]), and a vast literature about dissimilarity measures for Markov models also exists [20], mostly concerning variants of the Kullback-Leibler divergence [33]. However, as models (or sequences) can be endowed with different labels (e.g., action, ID) while maintaining the same geometrical structure, no single distance function can possibly outperform all the others in every classification problem. A reasonable approach when possessing some a-priori information is therefore to try to learn in a supervised fashion the "best" distance function for a specific classification problem [3, 4, 48, 55, 61, 22]. A natural optimization criterion consists in maximizing the classification performance achieved by the learnt metric, a problem which has elegant solutions in the case of linear mappings [50, 57]. However, as even the simplest linear dynamical models live in a nonlinear space, the need for a principled way of learning Riemannian metrics from such data naturally arises.
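To make the KL-type dissimilarities mentioned above concrete, one common variant is the KL divergence rate between two Markov chains, obtained by weighting the row-wise divergences of the transition matrices by the stationary distribution of the first chain (a minimal sketch, not the specific measures of [20] or [33]):

```python
# KL divergence rate between two Markov chains with transition
# matrices P and Q, weighted by the stationary distribution of P.
import math

def stationary(P, iters=1000):
    """Stationary distribution of transition matrix P (power iteration)."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return mu

def kl_rate(P, Q):
    """D(P || Q): expected per-step KL divergence under P's dynamics."""
    mu = stationary(P)
    return sum(mu[i] * P[i][j] * math.log(P[i][j] / Q[i][j])
               for i in range(len(P)) for j in range(len(P))
               if P[i][j] > 0)

P = [[0.9, 0.1], [0.2, 0.8]]
print(kl_rate(P, P))  # → 0.0 (a chain is at distance zero from itself)
```

Note that this quantity is asymmetric (D(P||Q) ≠ D(Q||P) in general), which is one reason symmetrized variants appear in the literature; and, as argued above, it is fixed a priori rather than learnt from labelled data.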

Pullback metrics. An interesting tool is provided by the formalism of "pullback metrics". If the models belong to a Riemannian manifold M, any diffeomorphism of M onto itself, or "automorphism", induces such a metric on M. By designing a suitable family of automorphisms depending on a parameter λ, we obtain a family of pullback metrics on M over which we can optimize.
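In standard differential-geometric notation (this formula is the textbook construction, not copied from the excerpt), the metric pulled back through an automorphism F_λ of M reads:

```latex
% Pullback of a Riemannian metric g on M through F_\lambda : M \to M
\tilde{g}^{\lambda}_{p}(u, v)
  = g_{F_\lambda(p)}\!\bigl( dF_\lambda|_p\, u,\; dF_\lambda|_p\, v \bigr),
  \qquad u, v \in T_p M .
```

Since F_λ is a diffeomorphism, its differential dF_λ|_p is invertible at every point p, so each g̃^λ is a valid Riemannian metric; varying λ then yields the family of metrics one can optimize over.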

Pullback metrics [31] have been recently proposed in the context of document retrieval [34], where a proper Fisher metric is available: instead of optimizing classification rates, the inverse volume of the pullback manifold is there maximized. In [13], pullback Fisher metrics for simple scalar autoregressive models of order 2 are learned. Besides considering only a very limited class (AR2) of models, [13] only deals with scalar observations, making the approach impractical for action, activity or identity recognition. As [34, 13] choose to optimize a geometric quantity totally unrelated to classification, the obtained metrics deliver rather modest classification performance. Furthermore, for important classes of dynamical models used in action recogni-
