
Learning manifolds of dynamical models for activity recognition

Dr Fabio Cuzzolin
Department of Computing, Oxford Brookes University
OX33 1HX Oxford, United Kingdom

Abstract

In action, activity or identity recognition it is sometimes useful, rather than extracting spatio-temporal features from the volumes representing motions, to explicitly encode their dynamics by means of dynamical systems, such as for instance hidden Markov models, nonlinear dynamical systems, or hierarchical HMMs. Actions can then be classified by measuring distances in the appropriate space of models. However, using a fixed, arbitrary distance to classify dynamical models does not necessarily produce good classification results. The present proposal is concerned with a general differential-geometric framework for learning Riemannian metrics or distance functions for dynamical models, given a training set which can be either labeled or unlabeled. Given a training set of models, the optimal metric or distance function is selected from a family of pullback metrics induced by a parameterized automorphism of the space of models. The exploitation potential of the proposed methodology for action and activity recognition in human-machine interaction, video indexing and summarization is enormous.

1 Previous research track record

The Proposer: Dr Fabio Cuzzolin graduated in 1997 from the University of Padua (Universitas Studii Paduani; founded in 1222, it is the seventh oldest university in the world) with a laurea magna cum laude in Computer Engineering and a Master's thesis on "Automatic gesture recognition". He received the Ph.D. degree from the same institution in 2001, for a thesis entitled "Visions of a generalized probability theory". He was first a Visiting Scholar at the ESSRL laboratory at Washington University in St. Louis, currently 12th in the US university rankings. He was later appointed fixed-term Assistant Professor at the Politecnico di Milano, Milan, Italy (consistently recognized as the best Italian university), then moved as a Postdoctoral Fellow to the UCLA Vision Lab, University of California at Los Angeles, led by Professor Stefano Soatto. He later received a Marie Curie Fellowship in partnership with INRIA Rhone-Alpes, Grenoble, France. Since September 2008 he has been a Lecturer and Early Career Fellow with the Department of Computing, School of Technology, Oxford Brookes University, Oxford, U.K.

In addition, Dr Cuzzolin recently classified second in the 2007 Senior Researcher national recruitment at INRIA, and has had interviews with, or offers from, Oxford University, EPFL, Universitat Pompeu Fabra, UCSD, GeorgiaTech, U. Houston, Honeywell Labs, and Riya.

http://cms.brookes.ac.uk/staff/FabioCuzzolin/
June 21, 2010

Dr Cuzzolin's research interests span both machine learning applications of computer vision, including gesture and action recognition and identity recognition from gait, and uncertainty modeling via generalized and imprecise probabilities, to which he has contributed by developing a systematic analysis of the geometry of random sets and other uncertainty measures. His scientific activity therefore goes under the heading of interdisciplinarity. His scientific productivity is extremely high, as the thirty papers he has published in the last three years attest. Dr Cuzzolin is the author of 53 peer-reviewed scientific publications, 44 of them as first or single author, including two book chapters and 8 journal papers. Several more journal papers are currently under review or revision. His work recently won a best paper award at the Pacific Rim Conference on AI symposium (PRICAI'08).

Dr Cuzzolin has recently been elected a member of the Board of Directors of the "Belief Functions and Applications Society". He has been Guest Editor for "Information Fusion", and collaborates with several other international journals in both computer vision and probability, such as the IEEE Trans. on Systems, Man, and Cybernetics, the Int. J. on Approximate Reasoning, Computer Vision and Image Understanding, the IEEE Trans. on Fuzzy Systems, Information Sciences, the Journal of Risk and Reliability, the International Journal on Uncertainty, Fuzziness, and Knowledge-Based Systems, and Image and Vision Computing. He has served on the program committee of some 15 international conferences in both imprecise probabilities (e.g. ISIPTA, ECSQARU, BELIEF) and computer vision (e.g. VISAPP). He is a reviewer for BMVC and ECCV. He has co-supervised two Ph.D. and two M.Sc. students and is involved in the supervision of the Oxford Brookes vision group's cohort of students.

Proposer's Related Work. The proposer has a significant record of research in human motion analysis and recognition. After some early work on gesture recognition, he moved on to study marker-less motion capture in 3D. Using a system of 12 synchronized cameras available at the Politecnico di Milano, silhouettes of the moving person in all cameras were extracted to recover its volumetric extension. The final purpose was to recognize actions and behaviors from sequences of volumes using Markov models [18]. He has recently explored the use of bilinear and multilinear models for identity recognition from gait [11, 14], a relatively new but promising branch of biometrics. In this context performance is influenced by factors as diverse as viewpoint, emotional state, illumination, and the presence of clothes/occlusions, which can be modeled through tensor algebra. He has recently published a book chapter [14] with IGI on this topic. In the wider field of articulated motion analysis he has published several papers on spectral motion capture techniques [40], focusing in particular on the crucial issue of how to select and map eigenspaces generated by two different shapes in order to track 3D points on their surfaces or consistently segment body parts along sequences of voxelsets [17], as a preprocessing step to action recognition. In direct relation to the topic of this proposal, he is now exploring manifold learning techniques for dynamical models representing (human) motions, in order to learn the optimal metric of the space they live in and maximize classification performance. Another book chapter [16] collecting his preliminary results on the topic has recently been accepted by Springer.

Proposer's Other Scientific Contributions. Dr Cuzzolin is also recognized as one of the most prominent experts in the field of non-additive probabilities and belief functions. He has recently been elected a member of the Board of Directors of the newly founded "Belief Functions and Applications Society", and is a member of the "Society for Imprecise Probabilities and Their Applications". His most important contribution in the field of uncertainty theory and imprecise probabilities is a general geometric approach to uncertainty measures, in which probabilities, possibilities and belief functions can all be represented as points of a Cartesian space and analyzed there [12]. Evidence aggregation operators (the analogues of Bayes' rule in the Bayesian formalism) can also be seen as geometric operators. The issues of how to approximate a belief function with an additive probability or a possibility measure, or which probability transformation is appropriate for decision making, can all be solved by geometric means.

In his recent award-winning paper [15], Dr Cuzzolin has also investigated alternative combinatorial foundations for the theory of belief functions, and their algebraic properties. He is currently working on the generalization of the total probability theorem to finite random sets, as a key contribution to the field of non-additive probabilities. He is in the process of finalizing a book entitled "The geometry of uncertainty" which will collect all his contributions to the mathematics of uncertainty.

Past Collaborations and Professional Links. Dr Cuzzolin acquired considerable international experience by working in the past for some of the most prominent research laboratories in both the US and Europe. He has given seminars and invited talks at several world-leading institutions such as MIT, EPFL, GeorgiaTech, Microsoft Research Europe, and INRIA. His network of collaborations with groups of researchers in both Europe and the United States is quite large and expanding.

Dr Cuzzolin is currently arranging meetings with several groups of researchers around Europe to set up active collaborations in support of his goal of establishing a fairly large group of five to ten people in a few years' time, in the perspective of reaching a professorial position in the medium term. He is in talks with M. Zaffalon (IDSIA, Switzerland) for a STREP on imprecise Markov chains for gesture recognition. He is setting up with INRIA's Radu Horaud, Alejandro Frangi (Pompeu Fabra) and Technion (R. Kimmel, M. Bronstein) an interdisciplinary Future Emerging Technology (FET) EU proposal on large-scale manifold learning, with applications to scene understanding. He is discussing a collaborative project on uncertainty theory at UK level with J. Lawry (Head of Bristol's Department of Engineering Mathematics) and F. Coolen (Durham's Department of Statistics), and exploring the opportunity of a European Network of Excellence in the same field. He plans to apply for the European Research Council Starting Grant in October 2010.

Dr Cuzzolin enjoys personal links with several world-class companies (many of them with research divisions in the UK) such as Microsoft Research, Honeywell Labs (I. Cohen), Boston's MERL (M. Brand, S. Ramalingam), GE (G. Doretto), Google (A. Bissacco), and Riya.

The host organization: Oxford Brookes University, School of Technology (OBU). The department has around 30 academic staff; these include, in computer graphics, Prof. David Duce (co-chair of the Eurographics 2003 and 2006 conferences) and Bob Hopgood OBE; Prof. M.K. Pidcock, a world leader in Electrical Impedance Tomography; and, in AI and image processing, Prof. William Clocksin. The Computer Science department had the following breakdown in the recent RAE: 4* 15%, 3* 35%, 2* 35% and 1* 15%, which means that 85% of output was deemed internationally leading and that no research output was considered unclassified. The School of Technology has recently established a new doctoral training programme with the theme of intelligent transport systems (http://tech.brookes.ac.uk/research/), which includes many computer vision problems, with a set of courses and associated infrastructure which will be directly beneficial to this project.

Dr Fabio Cuzzolin belongs to the Oxford Brookes vision group founded by Professor Philip Torr (cms.brookes.ac.uk/research/visiongroup/), which comprises some seventeen staff, students and post-docs who will add value to this project. Professor Torr was awarded the Marr Prize, the most prestigious prize in computer vision, in 1998. Members of the group have recently received awards at four other conferences, including best paper at CVPR 08 and an honourable mention at NIPS, the top machine learning conference.

The group was mentioned four times in the UKCRC Submission to the EPSRC International Review, Sep. 2006. It enjoys ongoing collaborations with companies such as 2d3, Vicon Life, Yotta, Microsoft Research Europe, Sharp Laboratories Europe, and Sony Entertainments Europe. The group's work with the Oxford Metrics Group in a Knowledge Transfer Partnership 2005-9 won the National Best Knowledge Transfer Partnership of the year at the 2009 awards, sponsored by the Technology Strategy Board, selected out of several hundred projects.

Oxford Brookes also has close links with Oxford University, with both the Active Vision Group and Prof. Zisserman's Visual Geometry Group, including a joint EPSRC grant and EU collaboration as well as co-supervision. Members of all the groups regularly attend each other's reading groups and seminars.

Dr Cuzzolin holds an Early Career Fellow position with minimal undergraduate teaching duties and hence has sufficient time to conduct the research listed in this proposal.


2 Proposed research and its context

2.1 Background

Topic of research. Recognizing human activities from video is a natural application of computer vision. Since the classic experiment of Johansson [] showing that moving light displays were enough for people to recognize motions or even identities, the matter has been the subject of increasing interest from the vision community. The problem consists in telling, given one or more image sequences capturing one or more people performing an activity, what category (among those previously learned) this activity belongs to. A useful, even though quite fuzzy, distinction is that between "actions", meant as simple (usually stationary) motion patterns, and "activities" as more complex sequences of actions, sometimes overlapping in time as they are performed by different parts of the body or different agents in the scene.

State of the art. The activity recognition problem involves, as we have seen, numerous challenges at different levels of representation. Accordingly, activity recognition frameworks can be effectively described in terms of three layers: 1. feature extraction from single images or videos; 2. action description or modeling; 3. high-level semantic activity modeling.

Critical issues of activity recognition. Though the formulation of the problem is simple and intuitive, activity recognition is a much harder problem than it may look [?]. Motions inherently possess an extremely high degree of variability. Movements quite different from each other can in fact carry the same meaning, or represent the same gesture. Even in perfectly controlled situations (constant illumination, static background) this inherent variability makes recognition hard (e.g., the trajectory followed by a walking person is irrelevant to the classification of the action they perform).

In addition, motions are subject to a large number of nuisance or "covariate" factors [37], such as illumination, background, and viewpoint; the list goes on and on. The combination of inherent variability and nuisance factors explains why experiments have so far been conducted in small, controlled environments to reduce their complexity. Attempts have recently been made to go beyond such controlled environments. Liu, Luo and Shah [37] have taken on the challenge of recognizing actions "in the wild" and coping with the tremendous nuisance variations of unconstrained videos. Motion statistics are used to prune motion and static features, and Adaboost is chosen to integrate all those single-feature classifiers. The algorithm is tested on the KTH dataset and on YouTube videos. Hu et al [59] have also investigated action detection in complex, cluttered scenes, representing candidate regions as bags of instances. In their SMILE-SVM framework human action detectors are learnt from imprecise action locations. They test their approach on both the CMU dataset and videos shot in a shopping mall. Yao and Zhu [58] learn deformable action templates from cluttered real-world videos. Such action templates are sequences of image templates consisting of a set of shape and motion primitives. Templates are deformed to match KTH and CMU videos by space-time warping.

If we remove the assumption that a single motion of interest is present in our video, locality emerges as a critical factor too. For instance, a person can walk and wave at the same time. Different actions performed by different agents can go on in different regions of the image sequence, without necessarily being synchronized or coordinated. Gilbert et al [27], for instance, perform multi-action recognition using very dense spatio-temporal corner features in a hierarchical classification framework, testing it on the multi-KTH dataset and the Real-World Movie dataset. Reddy et al [45] propose to cope with incremental recognition using feature trees that can handle multiple actions without intensive training. Local features are recognized by nearest neighbour, while actions are later labeled using a simple voting strategy. They use the KTH and IXMAS datasets.

If we also remove the assumption of having a single body moving in the field of interest, problems like occlusion assume critical importance []. On the other hand, the presence of one or more (static) objects in the vicinity of the motion can effectively help to disambiguate the activity class (recognition "in context"). Marszalek, Laptev and Schmid [39] apply a joint SVM classifier of action and scene to bags of features, using movie scripts as a means of automatic supervision for training (as proposed in [21]) on the newly proposed "Hollywood movies" database. Han et al [29], for instance, introduce bag-of-detectors scene descriptors to encode such contextual presence, and use Gaussian processes to select and weight multiple features for classification on the KTH and Hollywood datasets. Sun et al [53] use the Hollywood and LSCOM databases to model and test spatio-temporal context in a hierarchical way, using SIFT for point-level context and Markov processes to model trajectory proximity and transition descriptors, on top of which they apply a multi-channel non-linear SVM.

At the opposite extreme, sometimes very few instances of each action category are available. Seo and Milanfar [49] detect actions from single examples by computing dense space-time local regression kernels, which are used as descriptors and passed through PCA to extract salient features. A matrix generalization of cosine similarity is then used to compare videos, on the Irani database.

Bag-of-features methods on spatio-temporal volumes. Recently, recognition methods which neglect action dynamics (typically extracting spatio-temporal features [6] from the 3D volume associated with a video [32]) have proven very effective (http://www.wisdom.weizmann.ac.il/vision/SpaceTimeActions.html). Kim and Cipolla [32] do spatio-temporal pattern matching, representing volumes as tensors. They propose an extension of canonical correlation analysis to tensors to detect actions in a 3D window search with exemplars on the KTH dataset. Bregonzio et al [7] also use a global spatio-temporal distribution of interest points. They extract and select "holistic" features from clouds of interest points over multiple temporal scales. Rapantzikos et al [44] also use dense spatio-temporal features detected using saliency measures, which incorporate color, motion and intensity, in a multi-scale volumetric representation. Yuan et al [60] describe actions as collections of spatio-temporal invariant features, and propose a naive Bayes mutual information maximization method for multi-class classification. A search algorithm is also proposed to locate the optimal 3D subvolume in which the action can be detected. Experiments are run on the KTH and CMU databases.

Some efforts have recently been put into recognition from single images too (e.g. [30], where actions are learnt from static images taken from the web).

Dynamical models in action recognition. However, encoding the dynamics of videos or image sequences by means of some sort of dynamical model can be useful in situations in which the dynamics is critically discriminative. Furthermore, the actions of interest have to be temporally segmented from a video sequence: we need to know when an action/activity starts or stops. Actions of sometimes very different lengths have to be encoded in a homogeneous fashion in order to be compared ("time warping"). Dynamical representations are very effective in coping with time warping or action segmentation [51].
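The time-warping issue mentioned above has a classic dynamic-programming illustration. The following is a generic textbook sketch in Python, not a method from the cited works: it aligns a toy sequence with a time-stretched copy of itself, something impossible with naive point-wise comparison since the lengths differ.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences.

    Classic O(len(a) * len(b)) dynamic programme: cost[i][j] is the
    minimal cumulative cost of aligning a[:i] with b[:j], where each
    step may advance either sequence or both.
    """
    n, m = len(a), len(b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # advance a
                                 cost[i][j - 1],      # advance b
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]

# A short "action" and a time-stretched execution of the same action.
fast = [0.0, 1.0, 2.0, 1.0, 0.0]
slow = [0.0, 0.5, 1.0, 1.5, 2.0, 1.5, 1.0, 0.5, 0.0]
print(dtw_distance(fast, slow))  # 2.0: small alignment cost
```

The warping path absorbs the difference in execution speed, which is exactly the homogeneity of representation the text asks for.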

Furthermore, in limit situations in which a significant number of people move or ambulate in the field of view (as is common in surveillance scenarios), attention necessarily has to move from single objects/bodies to approaches which consider the monitored crowd as some sort of fluid, and describe its behavior in a way similar to the physical modeling of fluids []. Dynamical models are well equipped to deal with such scenarios [].

In these scenarios, action (or identity) recognition reduces to classifying dynamical models. Hidden Markov models [23] have indeed been widely employed in action recognition [43, 51] and gait identification [54, 9]. HMM classification can happen either by evaluating the likelihood of a new sequence with respect to the learnt models, or by learning a new model for the test sequence, measuring its distance from the old models, and attributing to it the label of the closest model.
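The first, likelihood-based strategy can be sketched with the standard scaled forward algorithm for discrete-output HMMs. This is a minimal illustration only: the two models below ("wave" and "still") and their parameters are invented for the example, not learnt from data as they would be in practice.

```python
import math

def forward_log_likelihood(obs, pi, A, B):
    """Log-likelihood log P(obs | model) of a discrete-output HMM via
    the forward algorithm. pi: initial state distribution, A[i][j]:
    transition probability i -> j, B[i][k]: probability of emitting
    symbol k from state i. Rescaling at each step avoids underflow."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    log_lik = 0.0
    s = sum(alpha)
    log_lik += math.log(s)
    alpha = [a / s for a in alpha]
    for t in range(1, len(obs)):
        alpha = [B[j][obs[t]] * sum(alpha[i] * A[i][j] for i in range(n))
                 for j in range(n)]
        s = sum(alpha)
        log_lik += math.log(s)
        alpha = [a / s for a in alpha]
    return log_lik

def classify(obs, models):
    """Label of the learnt model under which the sequence is most likely."""
    return max(models, key=lambda name: forward_log_likelihood(obs, *models[name]))

# Toy hand-set models: "wave" alternates its states, "still" sticks to one.
models = {
    "wave": ([1.0, 0.0],
             [[0.1, 0.9], [0.9, 0.1]],    # strongly alternating states
             [[0.9, 0.1], [0.1, 0.9]]),   # state 0 emits 0, state 1 emits 1
    "still": ([0.5, 0.5],
              [[0.95, 0.05], [0.05, 0.95]],  # sticky states
              [[0.9, 0.1], [0.1, 0.9]]),
}
print(classify([0, 1, 0, 1, 0, 1], models))  # "wave"
print(classify([0, 0, 0, 0, 0, 0], models))  # "still"
```

The alternative, distance-based strategy replaces the likelihood score with a dissimilarity between a model fitted to the test sequence and each stored model, which is where the choice of distance discussed below becomes decisive.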

Indeed, many researchers have explored the idea of encoding motions via linear [5], nonlinear [25], stochastic [42, 24] or chaotic [1] dynamical systems, and classifying them by measuring distances in their space. Chaudhry et al [10], for instance, have used nonlinear dynamical systems (NLDS) to model time series of histograms of oriented optical flow, measuring distances between NLDS by means of Cauchy kernels, while Wang and Mori [56] have proposed sophisticated max-margin conditional random fields to address locality by recognizing actions as constellations of local motion patterns.

Sophisticated graphical models can be useful to learn in a bottom-up fashion the temporal structure or plot of a piece of footage, or to describe causal relationships in complex activity patterns [38]. Gupta et al [28] work on determining the plot of a video by discovering causal relationships between actions, represented as an AND/OR graph whose edges are associated with spatio-temporal constraints. Integer Programming is used for storyline extraction on baseball footage.

Figure 1: Datasets.

Distance-based recognition. The use of distances and manifold learning for action recognition is not limited to dynamical models. Lin et al [36] think of actions as sequences of prototype trees, learned by hierarchical k-means in a joint shape and motion space. Prototype-to-prototype distances are generated as a look-up table. The joint likelihood of location/prototype is maximized to track actors in the Weizmann and KTH datasets, while actions are recognized by prototype sequence matching. In an interesting related work, Li et al [35] describe activities as discriminative temporal interaction matrices, living in a Discriminative Temporal Interaction Manifold. They set probability densities on this manifold, and use a MAP classifier to recognize new activities. Their data is a collection of NCAA American football footage.

Distance function learning. A number of distance functions between linear systems have been introduced in the past (e.g., [52]), and a vast literature on dissimilarity measures for Markov models also exists [20], mostly concerning variants of the Kullback-Leibler divergence [33]. However, as models (or sequences) can be endowed with different labels (e.g., action, ID) while maintaining the same geometrical structure, no single distance function can possibly outperform all the others in every classification problem. A reasonable approach when possessing some a-priori information is therefore to try to learn in a supervised fashion the "best" distance function for a specific classification problem [3, 4, 48, 55, 61, 22]. A natural optimization criterion consists in maximizing the classification performance achieved by the learnt metric, a problem which has elegant solutions in the case of linear mappings [50, 57]. However, as even the simplest linear dynamical models live in a nonlinear space, the need for a principled way of learning Riemannian metrics from such data naturally arises.
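As a concrete instance of such a fixed (non-learned) dissimilarity, one common KL-flavoured quantity is the divergence rate between two ergodic Markov chains, computable in closed form from their transition matrices. A toy sketch, with transition matrices invented for illustration:

```python
import math

def stationary(P, iters=1000):
    """Stationary distribution of an ergodic chain by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def kl_rate(P, Q):
    """Kullback-Leibler divergence rate between two Markov chains with
    transition matrices P and Q: sum_i pi_i sum_j P_ij log(P_ij / Q_ij),
    where pi is the stationary distribution of P. Asymmetric, and
    undefined when Q forbids a transition that P allows."""
    pi = stationary(P)
    n = len(P)
    return sum(pi[i] * P[i][j] * math.log(P[i][j] / Q[i][j])
               for i in range(n) for j in range(n) if P[i][j] > 0)

P = [[0.9, 0.1], [0.1, 0.9]]   # sticky chain
Q = [[0.5, 0.5], [0.5, 0.5]]   # memoryless chain
print(kl_rate(P, Q))  # > 0: the chains are statistically distinguishable
print(kl_rate(P, P))  # 0.0
```

Whatever the merits of such a divergence, it is fixed once and for all; the point of the paragraph above is precisely that the best dissimilarity depends on the labels, and should therefore be learnt.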

Pullback metrics. An interesting tool is provided by the formalism of "pullback metrics". If the models belong to a Riemannian manifold M, any diffeomorphism of M onto itself, or "automorphism", induces such a metric on M. By designing a suitable family of automorphisms depending on a parameter λ, we obtain a family of pullback metrics on M over which we can optimize.

Pullback metrics [31] have recently been proposed in the context of document retrieval [34], where a proper Fisher metric is available: instead of optimizing classification rates, the inverse volume of the pullback manifold is there maximized. In [13] pullback Fisher metrics for simple scalar autoregressive models of order 2 are learned. Besides considering only a very limited class (AR2) of models, [13] only deals with scalar observations, making the approach impractical for action, activity or identity recognition. As [34, 13] choose to optimize a geometric quantity totally unrelated to classification, the obtained metrics deliver rather modest classification performances. Furthermore, for important classes of dynamical models used in action recognition, such as HMMs or variable length Markov models (VLMMs) [26], a proper metric has not yet been identified. In order to learn optimal pullback distances for such important classes of models we necessarily need to relax the constraint of having a proper manifold structure, extending the pullback learning technique to mere distance functions or divergences.

Industrial and societal context. The growing market for action and gesture recognition applications, activity recognition and human-computer interfaces is too big to be described extensively here. It is perhaps worth citing the case of motion-based video game interfaces. Microsoft has recently launched its Project Natal, whose controller-free gaming experience is likely to revolutionize the whole field of interactive video games and consoles (see http://www.xbox.com/en-us/live/projectnatal/ for some demos). The Oxford Brookes vision group enjoys continuing strong links with Microsoft through its founder Professor Torr, and has recently acquired a range camera (http://en.wikipedia.org/wiki/Range_imaging) in order to kickstart cutting-edge research in motion analysis.

Historically, the first intended application of activity recognition was human-machine interaction. Gesturing can be seen as a much more natural way of interacting with a computer endowed with a simple webcam, and people in the 1990s started envisaging the replacement of mouse and keyboard as the main interfaces between computers and their users. This later led to research efforts focused on near-future scenarios in which numerous devices possessing some degree of intelligence would interact with people in so-called "smart rooms".

Nowadays, as videos have become part of everyday life, methods to store, index or summarize video footage are of increasing commercial interest. Content-based video retrieval from repositories such as YouTube or ... has to rely on the extraction and labeling of significant motion patterns in the video.

Another application field of growing importance is security and surveillance, where motion classification techniques can be employed either to detect anomalous behavior in surveillance video (in order to call the attention of a human supervisor), or to recognize people's identity from their walking gait in uncooperative scenarios, one of the most promising approaches to behavioral biometrics. Most biometrics companies focus at the moment on cooperative modalities such as face or iris recognition: investing in behavioral, non-cooperative biometrics before the rest of the market could provide them with a significant competitive advantage. "In both the identity management and security arenas, the use of biometric technology is increasing apace ... the world biometrics market has expanded exponentially. Annual growth is forecast at 33% between the years 2000 and 2010. Europe is expected to have the fastest growing biometrics market by 2010 ... The Intellect Association for Biometrics (IAfB) is the UK body that represents companies developing these technologies ... has fostered close ties with the UK Border Agency and Home Office." (Biometrics, November 2008).

2.2 Research hypotheses and objectives

Research idea. The goal of the present proposal is to present and test a general differential-geometric framework for learning Riemannian metrics or distance functions for dynamical models, given a training set which can be either labeled or unlabeled. Given a training set of models, the optimal metric or distance function is selected among a family of pullback metrics induced by a parameterized automorphism of the space of models. Such a function is arguably the most appropriate for the collection of motions at hand, and can subsequently be used to classify new movements.

The available information (in the form of a training set of recorded actions/activities) is used to learn the "best" way to recognize new actions/activities. The proposed approach can be straightforwardly applied to action recognition, identity recognition from gait, and video content summarization.

Novelty and contributions.
- contribution to distance learning in nonlinear spaces
- classification of complex, structured objects → direct competition with structured learning

Timeliness.
- commercial applications of computer vision are spreading rapidly
- action recognition is one of the hottest topics right now

Goals of the project.

Milestones.

2.3 Programme and methodology

2.3.1 Methodology

Preliminary results on pullback metric learning. The proposer [13] has recently investigated the use of pullback metrics for simple scalar autoregressive models of order 2, and their use for identity recognition from gait. Besides considering only a very limited class (AR2) of models, [13] only deals with scalar observations, making the approach impractical for real-world action or activity recognition. A first extension to multi-dimensional AR models has been proposed by Dr Cuzzolin [16]. More to the point, in previous work [34, 13] a geometric quantity totally unrelated to classification was optimized, leaving the obtained metrics with rather modest classification performances. A framework in which classification rates are directly optimized by the obtained distance function is sorely needed. Furthermore, for important classes of dynamical models used in action recognition, such as HMMs or variable length Markov models (VLMMs) [26], a proper metric has not yet been identified. In order to learn optimal pullback distances for such important classes of models we necessarily need to relax the constraint of having a proper manifold structure, extending the pullback learning technique to mere distance functions or divergences. This is all the more critical for the applicability of distance learning to the kind of sophisticated graphical or dynamical models necessary in activity recognition.

Pullback formalism. Let us suppose a data-set D = {m_1, ..., m_N} of dynamical models is available. Suppose also that such models live on a Riemannian manifold M of some sort, i.e., a Riemannian metric is defined at every point of the manifold. Consider then an automorphism (invertible differentiable map) of M onto itself: F : M → M, m ↦ F(m), m ∈ M. Let us denote by T_m M the tangent space to M at m. Each tangent vector v ∈ T_m M maps any function f on M to the derivative of f along the direction v: v(f) = ∂f/∂v.

Any automorphism F is associated with a push-forward map of tangent vectors F_* : T_m M → T_{F(m)} M, v ∈ T_m M ↦ F_* v ∈ T_{F(m)} M, defined by F_* v(f) = v(f ◦ F) for all smooth functions f on M; in other words, F_* v differentiates f at F(m) along the image of the direction v.
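Numerically, the push-forward is just a Jacobian-vector product. The sketch below is a toy illustration on M = R² (not tied to any model space used in the proposal), approximating F_* v at m by finite differences:

```python
def push_forward(F, m, v, eps=1e-6):
    """Approximate the push-forward F_* v at m by a finite-difference
    Jacobian-vector product: F_* v ≈ (F(m + eps*v) - F(m)) / eps."""
    fm = F(m)
    fmv = F([mi + eps * vi for mi, vi in zip(m, v)])
    return [(a - b) / eps for a, b in zip(fmv, fm)]

# Toy automorphism of R^2 (invertible and differentiable):
# F(x, y) = (x + y**3, y). Its Jacobian at m = (1, 2) is
# [[1, 12], [0, 1]], so F_*(1, 0) = (1, 0) and F_*(0, 1) = (12, 1).
F = lambda m: [m[0] + m[1] ** 3, m[1]]
u = push_forward(F, [1.0, 2.0], [1.0, 0.0])  # ≈ (1, 0)
v = push_forward(F, [1.0, 2.0], [0.0, 1.0])  # ≈ (12, 1)
```

For the analytically known automorphism families discussed later, the Jacobian would of course be computed in closed form rather than by finite differences.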

Consider now a Riemannian metric¹ g : TM × TM → R on M. The automorphism F induces a pullback metric on M: g*_m(u, v) := g_{F(m)}(F_* u, F_* v), such that the scalar product of two tangent vectors u, v at m ∈ M according to the pullback metric g* is the scalar product, with respect to the original metric g, of the push-forward vectors F_* u, F_* v at F(m). The pullback geodesic (shortest path) between two points is the lifting of the geodesic connecting their images with respect to the original metric. A pullback distance between two points of M (in our case, two dynamical models) can be computed along such a pullback geodesic.

By defining a class of such automorphisms {F_λ, λ ∈ Λ}, depending on some parameter λ, we get a corresponding family of pullback metrics {g*_λ, λ ∈ Λ} on M. We can then define an optimization problem over such a family in order to select an "optimal" metric. The nature of the resulting manifold will obviously depend on the objective function we choose to optimize.
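The definition g*_m(u, v) = g_{F(m)}(F_* u, F_* v) can be made concrete with a toy family of automorphisms. The sketch below uses the Euclidean inner product as base metric on M = R² and the made-up linear family F_λ(x, y) = (x, λy), whose push-forward coincides with F_λ itself; none of this is the automorphism family the proposal would actually design.

```python
def pullback_inner(F_star, g, m, u, v):
    """Pullback scalar product g*_m(u, v) = g_{F(m)}(F_* u, F_* v).
    F_star(m, w) returns the push-forward of the tangent vector w at m."""
    return g(F_star(m, u), F_star(m, v))

# Base metric: the Euclidean inner product on R^2.
g = lambda a, b: sum(x * y for x, y in zip(a, b))

# Toy family of automorphisms F_lambda(x, y) = (x, lam * y), lam > 0.
# F_lambda is linear, so its push-forward map is F_lambda itself.
def make_F_star(lam):
    return lambda m, w: [w[0], lam * w[1]]

# Under F_2 the second direction is stretched: a unit tangent vector
# along y acquires squared length 4 in the pullback metric.
F2 = make_F_star(2.0)
ip = pullback_inner(F2, g, [0.0, 0.0], [0.0, 1.0], [0.0, 1.0])  # = 4.0
```

Varying λ thus reshapes distances on M without ever leaving the manifold, which is exactly the search space the optimization scheme exploits.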

Figure 2: The push-forward map associated with an automorphism on a Riemannian manifold M.

Spaces of dynamical models. To apply the pullback metric framework to dynamical models we first need to define a Riemannian manifold structure on them. Even though a Fisher Riemannian metric has been computed for several manifolds of linear MIMO systems [?], and work on pullbacks of Fisher information metrics has recently been conducted [31], for important classes of dynamical models (such as hidden Markov models or variable length Markov models) no manifold structure is analytically known. Standard methods for measuring distances between HMMs, for instance, rely on the Kullback-Leibler divergence [33] (even though several other distance functions have been proposed [20]).
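For plain discrete Markov chains (observed states, no hidden layer) the KL divergence rate actually has a closed form, KL(A‖B) = Σ_i π_i Σ_j A_ij log(A_ij / B_ij), with π the stationary distribution of A; for full HMMs one must fall back on approximations such as [20]. A minimal sketch of the Markov-chain case only:

```python
import math

def stationary(A, iters=1000):
    """Stationary distribution of a row-stochastic matrix A,
    obtained by repeatedly applying the chain (power iteration)."""
    n = len(A)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * A[i][j] for i in range(n)) for j in range(n)]
    return pi

def kl_rate(A, B):
    """KL divergence rate between two Markov chains with transition
    matrices A and B: sum_i pi_i sum_j A_ij * log(A_ij / B_ij)."""
    pi = stationary(A)
    return sum(pi[i] * A[i][j] * math.log(A[i][j] / B[i][j])
               for i in range(len(A)) for j in range(len(A))
               if A[i][j] > 0)

A = [[0.9, 0.1], [0.2, 0.8]]  # a "sticky" two-state chain
B = [[0.5, 0.5], [0.5, 0.5]]  # a memoryless chain
d = kl_rate(A, B)
```

Note that kl_rate(A, B) ≠ kl_rate(B, A): the KL divergence is not symmetric (nor a metric), which is one reason symmetrized variants are common in practice and why this proposal treats such dissimilarities as mere distance functions rather than Riemannian metrics.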

Learning pullback distances for dynamical models. This proposal puts forward the following general framework, outlined in Figure 3, for learning an optimal pullback metric/distance from a training set of dynamical models:

1. given a data-set Y of observation sequences {y_i = [y_i(t), t = 1, ..., L_i], i = 1, ..., N} of variable length L_i, a dynamical model m_i of a certain class C can be estimated by parameter identification, yielding a set of models D = {m_1, ..., m_N};

¹ Informally speaking, g determines how to compute scalar products of tangent vectors v ∈ T_m M.

2. such models of class C belong to a certain domain M_C; to measure distances between pairs of models on M_C we need either a distance function d_M or a proper Riemannian metric g_M;

3. a family {F_λ, λ ∈ Λ} of automorphisms of M_C onto itself (parameterized by a vector λ) is then designed, providing a search space of metrics/distances (the variable in our optimization scheme) from which to select the optimal one;

4. F_λ induces a family of pullback metrics {g*_λ, λ ∈ Λ} or distances {d*_λ, λ ∈ Λ} on M_C, respectively;

5. optimizing over this family of pullback distances/metrics (according to some sensible objective function) yields an optimal pullback metric² ĝ* or distance function d̂*. The learnt optimal distance function can finally be used to cluster or classify new "test" models/sequences.

Objective function. When the data-set of models is labeled, we can determine the optimal metric/distance function by maximizing the classification performance of the metric. As the classification score is hard to describe analytically, in preliminary work [13, ?] we extracted a number of samples from the parameter space and picked the sample with maximal performance.
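Steps 1 to 5, together with this sampling-based objective, can be sketched end to end. Everything below is a toy instantiation under stated assumptions: models are scalar AR(1) coefficients fitted by least squares, the automorphism family is a hypothetical power map F_λ(m) = sign(m)·|m|^λ on the model space (not one the proposal commits to), and λ is selected by sampling, maximizing leave-one-out nearest-neighbour accuracy of the induced pullback distance d*_λ(m, m') = |F_λ(m) − F_λ(m')|.

```python
def fit_ar1(seq):
    """Step 1: least-squares AR(1) coefficient, y(t) ≈ a * y(t-1)."""
    num = sum(seq[t] * seq[t - 1] for t in range(1, len(seq)))
    den = sum(seq[t - 1] ** 2 for t in range(1, len(seq)))
    return num / den

def pullback_dist(m1, m2, lam):
    """Step 4: distance induced by the hypothetical automorphism
    F_lam(m) = sign(m) * |m|**lam of the model space."""
    F = lambda m: (1 if m >= 0 else -1) * abs(m) ** lam
    return abs(F(m1) - F(m2))

def loo_accuracy(models, labels, lam):
    """Leave-one-out 1-NN accuracy under the pullback distance."""
    hits = 0
    for i, mi in enumerate(models):
        j = min((j for j in range(len(models)) if j != i),
                key=lambda j: pullback_dist(mi, models[j], lam))
        hits += (labels[j] == labels[i])
    return hits / len(models)

def learn_lambda(models, labels, samples=(0.25, 0.5, 1.0, 2.0, 4.0)):
    """Step 5: sample the parameter space, keep the best lambda."""
    return max(samples, key=lambda lam: loo_accuracy(models, labels, lam))

# Steps 1-2: identify one AR(1) model per training sequence.
seqs = [[1, 0.9, 0.81, 0.73], [1, 0.8, 0.64, 0.51],
        [1, 0.2, 0.04, 0.01], [1, 0.3, 0.09, 0.03]]
labels = [0, 0, 1, 1]
models = [fit_ar1(s) for s in seqs]
best_lam = learn_lambda(models, labels)
```

The learnt d*_λ can then be applied unchanged to classify newly identified test models, exactly as in step 5 of the framework.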

Image feature representation. Historically, silhouettes have often (though by no means always [41]) been used to encode the shape of the walking person along the sequence, but they are widely criticized for their sensitivity to noise and for requiring a solution to the (inherently ill-defined) background subtraction problem. In the perspective of a real-world deployment of behavioral biometrics it is essential to move beyond silhouette-based representations, as a crucial step to improve the robustness of the recognition process. An interesting feature descriptor called "action snippets" [47], for instance, is based on motion and shape extraction within rectangular bounding boxes which, unlike silhouettes, can be reliably obtained in most scenarios by using person detectors [?] or trackers [19]. Our final goal is to adopt a discriminative feature selection stage, such as the one proposed in [46], where discriminative features are selected from an initial bag of HOG-based descriptors. In this sense the expertise of the Oxford Brookes vision group in this area will be extremely valuable to the final success of the project.

Crucial issues and further developments. In perspective, the proposed methodology can be extended to cope with more complex classes of nonlinear dynamical models [], allowing the classification of complex activities rather than just simple stationary actions.

Similarly, other important tasks in vision such as face and object recognition, as long as they involve the classification of objects living on a manifold endowed with a metric or a distance function, can be treated in the same way.

2.3.2 Programme of work and milestones

2.4 Relevance to academic beneficiaries

Impact on activity recognition.

² In the Riemannian case the geodesic path between any two models has to be known to compute the associated geodesic distance: knowing the geodesics of M we can calculate distances on M based on ĝ*.


Figure 3: A bird's-eye view of our proposed framework for learning pullback metrics for dynamical models.

Impact on other classification problems. In addition, the developed techniques can be applied in a rather straightforward way to any classification problem involving complex objects living in a structured, metric space. They can therefore find natural applications in fields such as face recognition, ...

Impact on manifold learning.

References

[1] S. Ali, A. Basharat, and M. Shah, Chaotic invariants for human action recognition, ICCV'07.
[2] S.-I. Amari, Differential geometric methods in statistics, Springer-Verlag, 1985.
[3] A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall, Learning distance functions using equivalence relations, ICML'03, pp. 11–18.
[4] M. Bilenko, S. Basu, and R. J. Mooney, Integrating constraints and metric learning in semi-supervised clustering, ICML'04.
[5] A. Bissacco, A. Chiuso, and S. Soatto, Classification and recognition of dynamical models: The role of phase, independent components, kernels and optimal transport, IEEE Trans. PAMI 29 (2007), no. 11, 1958–1972.
[6] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, Actions as space-time shapes, ICCV'05, pp. 1395–1402.
[7] M. Bregonzio, S. Gong, and T. Xiang, Recognising action as clouds of space-time interest points, CVPR'09, pp. 1948–1955.
[8] P. Burman, A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods, Biometrika 76 (1989), no. 3, 503–514.
[9] N. L. Carter, D. Young, and J. M. Ferryman, Supplementing Markov chains with additional features for behavioural analysis, 2006, pp. 65–65.
[10] R. Chaudhry, A. Ravichandran, G. Hager, and R. Vidal, Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions, CVPR'09, pp. 1932–1939.
[11] F. Cuzzolin, Using bilinear models for view-invariant action and identity recognition, CVPR'06, vol. 1, pp. 1701–1708.
[12] F. Cuzzolin, A geometric approach to the theory of evidence, IEEE Transactions on Systems, Man, and Cybernetics - Part C 38 (2008), no. 4, 522–534.
[13] F. Cuzzolin, Learning pullback metrics for linear models, Workshop on Machine Learning for Vision-based Motion Analysis (MLVMA), 2008.
[14] F. Cuzzolin, Multilinear modeling for robust identity recognition from gait, Behavioral Biometrics for Human Identification: Intelligent Applications (Liang Wang and Xin Geng, eds.), IGI Publishing, 2009.
[15] F. Cuzzolin, Three alternative combinatorial formulations of the theory of evidence, Intelligent Decision Analysis (2010).


[16] F. Cuzzolin, Manifold learning for multi-dimensional autoregressive dynamical models, Machine Learning for Vision-based Motion Analysis (L. Wang, G. Zhao, L. Cheng, and M. Pietikäinen, eds.), Springer, 2010.
[17] F. Cuzzolin, D. Mateus, D. Knossow, E. Boyer, and R. Horaud, Coherent Laplacian protrusion segmentation, CVPR'08.
[18] F. Cuzzolin, A. Sarti, and S. Tubaro, Action modeling with volumetric data, ICIP'04, vol. 2, pp. 881–884.
[19] N. Dalal and B. Triggs, Histograms of oriented gradients for human detection, CVPR'05, pp. 886–893.
[20] M. N. Do, Fast approximation of Kullback-Leibler distance for dependence trees and hidden Markov models, IEEE Signal Processing Letters 10 (2003), no. 4, 115–118.
[21] O. Duchenne, I. Laptev, J. Sivic, F. Bach, and J. Ponce, Automatic annotation of human actions in video, ICCV'09.
[22] C. F. Eick, A. Rouhana, A. Bagherjeiran, and R. Vilalta, Using clustering to learn distance functions for supervised similarity assessment, Machine Learning and Data Mining (MLDM), 2005.
[23] R. Elliott, L. Aggoun, and J. Moore, Hidden Markov models: estimation and control, Springer-Verlag, 1995.
[24] J. M. Wang et al., Gaussian process dynamical models, NIPS'05.
[25] L. Ralaivola et al., Dynamical modeling with kernels for nonlinear time series prediction, NIPS'04.
[26] A. Galata, N. Johnson, and D. Hogg, Learning variable-length Markov models of behavior, CVIU 81 (2001), no. 3, 398–413.
[27] A. Gilbert, J. Illingworth, and R. Bowden, Scale invariant action recognition using compound features mined from dense spatio-temporal corners, 2008, pp. I: 222–233.
[28] A. Gupta, P. Srinivasan, J. Shi, and L. S. Davis, Understanding videos, constructing plots: learning a visually grounded storyline model from annotated videos, CVPR'09, pp. 2012–2019.
[29] D. Han, L. Bo, and C. Sminchisescu, Selection and context for action recognition, ICCV'09.
[30] N. Ikizler-Cinbis, R. G. Cinbis, and S. Sclaroff, Learning actions from the web, ICCV'09.

[31] M. Itoh and Y. Shishido, Fisher information metric and Poisson kernels, Differential Geometry and its Applications 26 (2008), no. 4, 347–356.
[32] T. K. Kim and R. Cipolla, Canonical correlation analysis of video volume tensors for action categorization and detection, IEEE Trans. PAMI 31 (2009), no. 8, 1415–1428.
[33] S. Kullback and R. A. Leibler, On information and sufficiency, Annals of Math. Stat. 22 (1951), 79–86.
[34] G. Lebanon, Metric learning for text documents, IEEE Trans. PAMI 28 (2006), no. 4, 497–508.
[35] R. N. Li, R. Chellappa, and S. H. K. Zhou, Learning multi-modal densities on discriminative temporal interaction manifold for group activity recognition.
[36] Z. Lin, Z. Jiang, and L. S. Davis, Recognizing actions by shape-motion prototype trees, ICCV'09, pp. 444–451.
[37] J. G. Liu, J. B. Luo, and M. Shah, Recognizing realistic actions from videos "in the wild", CVPR'09, pp. 1996–2003.
[38] C. C. Loy, T. Xiang, and S. Gong, Modelling activity global temporal dependencies using time delayed probabilistic graphical model, ICCV'09.
[39] M. Marszalek, I. Laptev, and C. Schmid, Actions in context, CVPR'09.
[40] D. Mateus, R. Horaud, D. Knossow, F. Cuzzolin, and E. Boyer, Articulated shape matching using Laplacian eigenfunctions and unsupervised point registration, CVPR'08.
[41] C. Nandini and C. N. Ravi Kumar, Comprehensive framework to gait recognition, Int. J. Biometrics 1 (2008), no. 1, 129–137.
[42] B. North, A. Blake, M. Isard, and J. Rittscher, Learning and classification of complex dynamics, IEEE Trans. PAMI 22 (2000), no. 9.
[43] M. Piccardi and O. Perez, Hidden Markov models with kernel density estimation of emission probabilities and their use in activity recognition, VS'07, pp. 1–8.
[44] K. Rapantzikos, Y. Avrithis, and S. Kollias, Dense saliency-based spatiotemporal feature points for action recognition, CVPR'09, pp. 1454–1461.
[45] K. K. Reddy, J. Liu, and M. Shah, Incremental action recognition using feature-tree, ICCV'09.

[46] G. Rogez, J. Rihan, S. Ramalingam, C. Orrite, and P. H. S. Torr, Randomized trees for human pose detection, CVPR'08.
[47] K. Schindler and L. van Gool, Action snippets: How many frames does human action recognition require?, CVPR'08.
[48] M. Schultz and T. Joachims, Learning a distance metric from relative comparisons, NIPS'04.
[49] H. J. Seo and P. Milanfar, Detection of human actions from a single example, ICCV'09.
[50] N. Shental, T. Hertz, D. Weinshall, and M. Pavel, Adjustment learning and relevant component analysis, ECCV'02.
[51] Q. F. Shi, L. Wang, L. Cheng, and A. Smola, Discriminative human action segmentation and recognition using semi-Markov model, CVPR'08.
[52] A. J. Smola and S. V. N. Vishwanathan, Hilbert space embeddings in dynamical systems, IFAC'03, pp. 760–767.
[53] J. Sun, X. Wu, S. C. Yan, L. F. Cheong, T. S. Chua, and J. T. Li, Hierarchical spatio-temporal context modeling for action recognition, CVPR'09, pp. 2004–2011.
[54] A. Sundaresan, A. K. Roy Chowdhury, and R. Chellappa, A hidden Markov model based framework for recognition of humans from gait sequences, ICIP'03, pp. II: 93–96.
[55] I. W. Tsang and J. T. Kwok, Distance metric learning with kernels, ICAI'03.
[56] Y. Wang and G. Mori, Max-margin hidden conditional random fields for human action recognition, CVPR'09, pp. 872–879.
[57] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell, Distance metric learning with applications to clustering with side information, NIPS'03.
[58] B. Yao and S. C. Zhu, Learning deformable action templates from cluttered videos, ICCV'09.
[59] Y. Hu, L. Cao, F. Lv, S. Yan, Y. Gong, and T. S. Huang, Action detection in complex scenes with spatial and temporal ambiguities, ICCV'09.
[60] J. S. Yuan, Z. C. Liu, and Y. Wu, Discriminative subvolume search for efficient action detection, CVPR'09, pp. 2442–2449.
[61] Z. Zhang, Learning metrics via discriminant kernels and multidimensional scaling: Toward expected Euclidean representation, ICML'03.


A Justification of resources

B Diagrammatic work plan


C Academic beneficiaries (4000 characters max)

Please summarise how your proposed research will contribute to knowledge, both within the UK and globally. Please look broadly beyond narrow research fields. We recognise that in generating new knowledge, a cross-disciplinary or single-disciplinary approach may be the most appropriate. Applicants are asked to clearly state their chosen approach and provide justification for that choice.

List of beneficiaries
- How the research will benefit other researchers in the field
- Whether there are any academic beneficiaries in other disciplines and, if so, how they will benefit and what will be done to ensure that they benefit
- What other researchers, both within the UK and elsewhere, are likely to be interested in or to benefit from the proposed research

Relevance of the research to beneficiaries
- Identify the potential academic impact of the proposed work
- Show how the research will benefit other researchers (this might include methodological or theoretical advances)
- Identify whether the research will produce data or materials of benefit to other researchers, and explain how these will be stored, maintained and made available

D Impact Summary (4000 characters max)

The Impact Summary should address three questions:
- Who will benefit from this research?
- How will they benefit from this research?
- What will be done to ensure that they have the opportunity to benefit from this research?

Who will benefit from this research
List any beneficiaries of the research, for example those who are likely to be interested in or to benefit from the proposed research, both directly and indirectly. It may be useful to think of beneficiaries as users of the research outputs, both immediately and in the longer term:
- within the commercial private sector?
- policy-makers, government and government agencies?
- within the public sector, third sector or any others (museums, galleries and charities)?
- within the wider public?

How will they benefit from this research

Describe the relevance of the research to these beneficiaries, identifying the potential for impacts arising from the proposed work. Please consider the following when framing your response:

- Explain how the research has the potential to impact on the nation's health, wealth or culture (e.g. security, surveillance).
- What will these impacts be, and what is their importance?
- What are the realistic timescales for the benefits to be realised?
- What research and professional skills will staff working on the project develop which they could apply in all employment sectors?

Actions taken to ensure this

Please detail how the proposed research project will be managed to engage users and beneficiaries and increase the likelihood of impacts:

- Communication and engagement plans
- Collaboration arrangements
- Plans for exploitation
- Relevant experience and track record

The Oxford Brookes Computer Vision Group has well-established channels for technology transfer from research to product, and a track record of achieving this. The group has well-trodden paths for exploiting IP, with a history of company interactions including Sony, Vicon, 2d3, HMGCC and Sharp. Prof. Torr's links with Yotta, Microsoft and Sony in particular, and throughout the industry in general, will provide fertile ground for exploitation.


E Impact plan

Please detail how the proposed research project will be managed to engage users and beneficiaries and increase the likelihood of impacts:

- Methods for communications and engagement
- Collaboration and exploitation in the most effective and appropriate manner
- Track record in this area and the costs of these activities

E.1 Communications and engagement

Describe engagement with the identified beneficiaries, for example:

- How have beneficiaries been engaged to date, and how will they be engaged moving forward?
- How will the work build on existing links or create new ones?
- Outline plans to work with intermediary organisations or networks.
- What activities will be undertaken to ensure good engagement and communication?
- Secondments of research or user community staff
- Events aimed at a target audience
- Workshops to provide training or information dissemination
- Publications and publicity materials summarising main outcomes in a way that beneficiaries will be able to understand and use
- Websites and interactive media
- Media relations
- Public affairs activities

E.2 Exploitation and application

Identify the mechanisms in place for potential exploitation, both commercially and non-commercially. Do you have any specific partnership, collaborative or exploitation agreements in place?

The Oxford Brookes Computer Vision Group already has an established record of exploiting IP, and of interactions with companies including Sony, Vicon, 2d3, HMGCC and Sharp.

- How will the outputs with potential impact be identified?
- What structure and mechanisms can you put in place to exploit and protect the outputs from the research, during and at the end of the grant lifecycle?

Intellectual Property Rights management and exploitation will be managed by the Research and Business Development Office (RBDO) at Oxford Brookes University, which has access to financial and other resources to enable Intellectual Property and its commercial exploitation to be effectively managed, whilst maximizing the widespread dissemination of the research results. This includes, as appropriate, finance for patenting and proof-of-concept funding; IP, technology and market assessment; and resources for defining and implementing a commercialization strategy through licensing, start-up companies or other routes.

E.3 Capability

Who is likely to be undertaking the impact activities? For example: the Principal Investigator or Co-Investigator; PhD students and post-doctoral researchers who may be involved in activities in addition to research; specialised staff employed to undertake communication and exploitation activities; technical experts to write publications, web pages and user-friendly interfaces.

What previous and relevant experience do they have in achieving successful knowledge exchange and impact? How will they acquire the skills?

E.4 Resource for the activity

If there are any resource implications as a result of implementing the knowledge exchange and/or impact activities, please ensure these are documented in the financial summary and also in the Justification of Resources section of the proposal.

Research results will be fully disseminated and published via international journals and conferences, and on the Oxford Brookes website. The Oxford Brookes vision group to which the proposer belongs currently maintains a full programme of publication at all the major conferences, as well as in the related fields of graphics and machine learning. All published papers and the generated code will be made available on a website established specifically for the project.

Many companies are active in the commercialization of vision-based products, in particular in the fields of automatic surveillance, gait biometrics, human-computer interaction and image-based web retrieval. I expect to exploit my personal links with researchers in several world-class companies (many of them with research divisions in Europe), such as Microsoft Research, Honeywell Labs (I. Cohen), Boston's Mitsubishi Electric Research Lab (M. Brand, S. Ramalingam), GE (G. Doretto) and Google (A. Bissacco).
