Learning manifolds of dynamical models for activity recognition

are run on the KTH and CMU databases. Some efforts have been recently put into recognition from single images too (e.g. [30], where actions are learnt from static images taken from the web).

Dynamical models in action recognition. However, encoding the dynamics of videos or image sequences by means of some sort of dynamical model can be useful in situations in which the dynamics is critically discriminative. Furthermore, the actions of interest have to be temporally segmented from a video sequence: we need to know when an action/activity starts or stops. Actions of sometimes very different lengths have to be encoded in a homogeneous fashion in order to be compared ("time warping"). Dynamical representations are very effective in coping with time warping or action segmentation [51].

Furthermore, in limit situations in which a significant number of people move or ambulate in the field of view (as is common in surveillance scenarios), the attention has necessarily to move from single objects/bodies to approaches which consider the monitored crowd some sort of fluid, and describe its behavior in a way similar to the physical modeling of fluids []. Dynamical models are well equipped to deal with such scenarios [].

In these scenarios, action (or identity) recognition reduces to classifying dynamical models. Hidden Markov models [23] have indeed been widely employed in action recognition [43, 51] and gait identification [54, 9]. HMM classification can happen either by evaluating the likelihood of a new sequence with respect to the learnt models, or by learning a new model for the test sequence, measuring its distance from the old models, and attributing to it the label of the closest model.
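The first of these two strategies can be sketched as follows (a toy illustration, not the cited systems: the two-state models, their parameters and the "walk"/"run" labels are invented for the example):

```python
# Likelihood-based HMM classification sketch: score a discrete
# observation sequence under each learnt HMM via the forward
# algorithm, then pick the most likely model's label.

def forward_likelihood(obs, pi, A, B):
    """P(obs | model) via the forward algorithm.
    pi: initial state probs, A: state transition matrix,
    B: per-state emission probabilities over discrete symbols."""
    n = len(pi)
    alpha = [pi[s] * B[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[s] * A[s][t] for s in range(n)) * B[t][o]
                 for t in range(n)]
    return sum(alpha)

# Two toy models (hypothetical parameters): "walk" mostly emits
# symbol 0, "run" mostly emits symbol 1.
walk = ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.9, 0.1], [0.8, 0.2]])
run  = ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.1, 0.9], [0.2, 0.8]])
models = {"walk": walk, "run": run}

seq = [0, 0, 1, 0, 0]  # test sequence, dominated by symbol 0
label = max(models, key=lambda k: forward_likelihood(seq, *models[k]))
print(label)  # → walk
```

The alternative strategy mentioned above would instead fit a new HMM to `seq` and compare it to `walk` and `run` under some model-to-model distance.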

Indeed, many researchers have explored the idea of encoding motions via linear [5], nonlinear [25], stochastic [42, 24] or chaotic [1] dynamical systems, and classifying them by measuring distances in their space. Chaudhry et al. [10], for instance, have used nonlinear dynamical systems (NLDS) to model time series of histograms of oriented optical flow, measuring distances between NLDS by means of Cauchy kernels, while Wang and Mori [56] have proposed sophisticated max-margin conditional random fields to address locality by recognizing actions as constellations of local motion patterns.

Sophisticated graphical models can be useful to learn in a bottom-up fashion the temporal structure or plot of footage, or to describe causal relationships in complex activity patterns [38]. Gupta et al. [28] work on determining the plot of a video by discovering causal relationships between actions, represented as an AND/OR graph whose edges are associated with spatio-temporal constraints. Integer Programming is used for storyline extraction on baseball footage.

Figure 1: Datasets.

Distance-based recognition. The use of distances and manifold learning for action recognition is not limited to dynamical models. Lin et al. [36] think of actions as sequences of prototype trees, learned by hierarchical k-means in a joint shape and motion space. Prototype-to-prototype distances are generated as a look-up table. The joint likelihood of location/prototype is maximized to track actors in the Weizmann and KTH datasets, while actions are recognized by prototype sequence matching. In an interesting related work, Li et al. [35] describe activities as discriminative temporal interaction matrices, living in a Discriminative Temporal Interaction Manifold. They set probability densities on this manifold, and use a MAP classifier to recognize new activities. Their data is a collection of NCAA American football footage.

Distance function learning. A number of distance functions between linear systems have been introduced in the past (e.g., [52]), and a vast literature about dissimilarity measures for Markov models also exists [20], mostly concerning variants of the Kullback-Leibler divergence [33]. However, as models (or sequences) can be endowed with different labels (e.g., action, ID) while maintaining the same geometrical structure, no single distance function can possibly outperform all the others in every classification problem. A reasonable approach when possessing some a-priori information is therefore to try to learn in a supervised fashion the "best" distance function for a specific classification problem [3, 4, 48, 55, 61, 22]. A natural optimization criterion consists in maximizing the classification performance achieved by the learnt metric, a problem which has elegant solutions in the case of linear mappings [50, 57]. However, as even the simplest linear dynamical models live in a nonlinear space, the need for a principled way of learning Riemannian metrics from such data naturally arises.
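To make the KL-type dissimilarities mentioned above concrete, one common variant is the KL divergence rate between two Markov chains, obtained by weighting the row-wise divergences of the transition matrices by the stationary distribution of the first chain (a minimal sketch, not the specific measures of [20] or [33]):

```python
# KL divergence rate between two Markov chains with transition
# matrices P and Q, weighted by the stationary distribution of P.
import math

def stationary(P, iters=1000):
    """Stationary distribution of transition matrix P (power iteration)."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return mu

def kl_rate(P, Q):
    """D(P || Q): expected per-step KL divergence under P's dynamics."""
    mu = stationary(P)
    return sum(mu[i] * P[i][j] * math.log(P[i][j] / Q[i][j])
               for i in range(len(P)) for j in range(len(P))
               if P[i][j] > 0)

P = [[0.9, 0.1], [0.2, 0.8]]
print(kl_rate(P, P))  # → 0.0 (a chain is at distance zero from itself)
```

Note that this quantity is asymmetric (D(P||Q) ≠ D(Q||P) in general), which is one reason symmetrized variants appear in the literature; and, as argued above, it is fixed a priori rather than learnt from labelled data.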

Pullback metrics. An interesting tool is provided by the formalism of "pullback metrics". If the models belong to a Riemannian manifold M, any diffeomorphism of M onto itself, or "automorphism", induces such a metric on M. By designing a suitable family of automorphisms depending on a parameter λ, we obtain a family of pullback metrics on M over which we can optimize.
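In standard differential-geometric notation (this formula is the textbook construction, not copied from the excerpt), the metric pulled back through an automorphism F_λ of M reads:

```latex
% Pullback of a Riemannian metric g on M through F_\lambda : M \to M
\tilde{g}^{\lambda}_{p}(u, v)
  = g_{F_\lambda(p)}\!\bigl( dF_\lambda|_p\, u,\; dF_\lambda|_p\, v \bigr),
  \qquad u, v \in T_p M .
```

Since F_λ is a diffeomorphism, its differential dF_λ|_p is invertible at every point p, so each g̃^λ is a valid Riemannian metric; varying λ then yields the family of metrics one can optimize over.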

Pullback metrics [31] have been recently proposed in the context of document retrieval [34], where a proper Fisher metric is available: instead of optimizing classification rates, the inverse volume of the pullback manifold is there maximized. In [13], pullback Fisher metrics for simple scalar autoregressive models of order 2 are learned. Besides considering only a very limited class (AR2) of models, [13] only deals with scalar observations, making the approach impractical for action, activity or identity recognition. As [34, 13] choose to optimize a geometric quantity totally unrelated to classification, the obtained metrics deliver rather modest classification performance. Furthermore, for important classes of dynamical models used in action recogni-
