Project Proposal (PDF) - Oxford Brookes University

FP7-ICT-2011-9 STREP proposal
18/01/12 v1 [Dynact]

Bag-of-features (BoF) methods on spatio-temporal volumes. These methods, in which dictionaries are built by clustering features extracted from (typically) the spatio-temporal volume associated with a video sequence (Figure 4), have been very successful in recent years. To cite a few, Kim and Cipolla [32] have proposed a spatio-temporal pattern matching approach in which spatio-temporal volumes are seen as tensors, and an extension of canonical correlation analysis to tensor analysis is used to detect actions via a 3D window search. Bregonzio et al [7] have used a global spatio-temporal distribution of interest points, from which they extract and select “holistic” features over multiple temporal scales. Rapantzikos et al [44] have also adopted dense spatio-temporal features detected using saliency measures, in a multi-scale volumetric representation [105]. Yuan et al [60] have described actions, once again, as collections of spatio-temporal invariant features, and proposed a naive Bayes mutual information maximisation method for multi-class recognition.

Figure 4. The “baseline” Bag-of-Features algorithm for action classification [87].
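The baseline BoF pipeline can be sketched in a few lines: local descriptors (in practice, spatio-temporal features such as HOG/HOF around interest points) are clustered into a visual dictionary, and each video is then encoded as a normalised histogram of visual-word assignments. The minimal k-means and encoder below are purely illustrative; the dictionary size and descriptors are toy placeholders, not those of [87].

```python
import numpy as np

def build_dictionary(descriptors, k=4, iters=20, seed=0):
    """Cluster local descriptors (N, d) into k visual words via plain k-means."""
    rng = np.random.default_rng(seed)
    centres = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest centre
        dist = np.linalg.norm(descriptors[:, None] - centres[None], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # recompute each non-empty cluster centre
                centres[j] = descriptors[labels == j].mean(axis=0)
    return centres

def encode(descriptors, centres):
    """Represent one video as a normalised histogram of visual-word counts."""
    dist = np.linalg.norm(descriptors[:, None] - centres[None], axis=2)
    hist = np.bincount(dist.argmin(axis=1), minlength=len(centres)).astype(float)
    return hist / hist.sum()
```

The resulting fixed-length histogram can then be fed to any standard classifier (e.g. an SVM), which is what makes the representation attractive despite discarding spatio-temporal layout.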

Issues with BoF methods. These methods have delivered good results in recent years, at least on datasets of fairly limited size, with few action categories. They have been tested, however, mostly on classification test-beds with pre-segmented clips, while localization using pure volume descriptors is rather difficult. It is interesting to note that on a new human motion dataset with 51 action categories (HMDB51 [104]), bag-of-features models [87] achieve classification results of just over 20%, suggesting that these approaches do not fully capture the complexities of human actions. Kuehne et al have shown that this drop in performance with respect to previous datasets is most likely due to the increased number of action categories. [104] has also highlighted the need for more challenging datasets: while on the UCF Sports dataset [109] human static joint locations alone are sufficient to classify sports actions with an accuracy of over 98%, the same experiment on HMDB51 performs poorly (35%). This indicates that it may be necessary to take action dynamics more into account, in accordance with psychophysical experiments showing that both motion and shape are critical for the visual recognition of human actions [110, 111]. On top of that, entirely discarding the spatio-temporal structure produces paradoxes even in simple thought experiments: by scrambling the frames of a spatio-temporal volume we can obtain videos with absolutely no meaning whatsoever to us, but with roughly the same descriptors as the original one. Admittedly, the temporal factor is implicitly taken into consideration to some extent, as descriptors are extracted at multiple spatial and temporal scales.
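The frame-scrambling paradox is easy to verify: because the histogram pools all local words over the whole volume and discards their ordering, a temporally shuffled video yields exactly the same representation. A toy demonstration, using pre-quantised “visual words” in place of real descriptors:

```python
import numpy as np

rng = np.random.default_rng(1)
# toy video: 30 frames of 4x4 patches, each already quantised to one of 8 words
video = rng.integers(0, 8, size=(30, 4, 4))

def bof_histogram(frames, vocab=8):
    # pool all local words over the entire spatio-temporal volume, discarding order
    return np.bincount(frames.ravel(), minlength=vocab)

# shuffle the frames in time: meaningless as a video, identical as a BoF histogram
scrambled = video[rng.permutation(len(video))]
print(np.array_equal(bof_histogram(video), bof_histogram(scrambled)))  # True
```

With real descriptors computed over multiple temporal scales the two histograms would differ slightly, but the essential insensitivity to temporal structure remains.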

Dynamical generative modelling. Indeed, encoding the dynamics of videos or image sequences can be useful in situations in which it is critically discriminative. Moreover, actions of sometimes very different lengths have to be encoded in a homogeneous fashion in order to be compared (“time warping”). Encoding each video sequence by means of some sort of dynamical generative model has proven effective when coping with time warping or temporal segmentation [51]. In these scenarios, action recognition reduces to classifying dynamical models. Hidden Markov models (HMMs) [23] have indeed been widely employed in action recognition [43, 51] and gait identification [54, 9]. HMM classification can happen, for instance, by evaluating the likelihood of a new sequence with respect to the learnt models, or by learning a new model for the test sequence through the Expectation-Maximisation (EM) algorithm [67], measuring its distance from the old models, and attributing to it the label of the closest one(s).
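The first classification strategy (evaluating the likelihood of a new sequence under each learnt model) can be sketched with the standard scaled forward algorithm for discrete HMMs; the parameters below are toy placeholders, not learnt from data.

```python
import numpy as np

def log_likelihood(obs, pi, A, B):
    """Scaled forward algorithm: log p(obs) under a discrete HMM with
    initial distribution pi, transition matrix A and emission matrix B."""
    alpha = pi * B[:, obs[0]]
    c = alpha.sum()
    logp = np.log(c)
    alpha = alpha / c                 # rescale to avoid numerical underflow
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        c = alpha.sum()
        logp += np.log(c)
        alpha = alpha / c
    return logp

def classify(obs, models):
    """Attribute to a test sequence the label of the most likely learnt model.
    `models` maps each action label to its (pi, A, B) parameters."""
    return max(models, key=lambda name: log_likelihood(obs, *models[name]))
```

In the EM-based alternative described above, one would instead fit a fresh model to the test sequence and compare it to the learnt models under some model distance.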

Researchers have proposed the use of linear [5], nonlinear [25], stochastic [42, 24] or even chaotic [1] dynamical systems. Chaudhry et al [10], for instance, employ nonlinear dynamical systems (NLDS) to model time series of histograms of oriented optical flow, measuring distances between NLDS via Binet-Cauchy kernels. As we recalled above, graphical models allow us to address the issue of localization [56], can be used as a tool to infer the “plots” of videos [28] and, in situations in which crowds are present (as in surveillance scenarios), are suitable for describing crowd behaviour in a way similar to physical fluid modelling.
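As a crude illustration of the linear end of this spectrum (and not the kernel machinery of [10]), one can fit a linear dynamical system to a per-frame feature time series, such as flow histograms, by least squares, and compare two actions through their fitted transition matrices; both the fitting scheme and the Frobenius-norm distance below are simplifying assumptions for exposition.

```python
import numpy as np

def fit_lds(X):
    """Least-squares estimate of A in x_{t+1} ~ A x_t,
    for a feature time series X of shape (T, d)."""
    past, future = X[:-1], X[1:]
    # rows are x_t^T, so lstsq solves past @ M = future with M = A^T
    M, *_ = np.linalg.lstsq(past, future, rcond=None)
    return M.T

def model_distance(X1, X2):
    """Naive distance between two actions: compare their fitted dynamics."""
    return np.linalg.norm(fit_lds(X1) - fit_lds(X2))
```

Principled alternatives compare the systems themselves (e.g. via kernels on their parameters or outputs) rather than raw transition matrices, which are only identifiable up to a change of state basis.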

Proposal Part B: page [9] of [67]
