Project Proposal (PDF) - Oxford Brookes University

FP7-ICT-2011-9 STREP proposal 18/01/12 v1 [Dynact]
Issues with generative modelling. As we also argued above, classical dynamical models can be too rigid to describe complex activities, or prone to overfitting the available training examples. In probabilistic graphical models such as hidden Markov models (HMMs), a major cause of overfitting is the need to estimate from the training data unique, “precise” probability distributions: for instance, the conditional probabilities in a Markov random field (MRF), or the transition probabilities between the states of a traditional or hierarchical HMM. To remain with the HMM example, there are efficient ways of dealing with this estimation problem, involving the Expectation-Maximization (EM) [67] and Viterbi [65] algorithms. When little training data is available, however, the resulting model will depend quite strongly on the prior assumptions (probabilities) about the behaviour of the dynamical model.
1.2.3 Contributions: Pushing the boundaries in generative and discriminative approaches
From our brief discussion it follows that, on one side, discriminative approaches, though successful in limited experiments in controlled environments, need to be extended to include a description of the spatio-temporal structure of an action if they are to tackle issues such as action segmentation/localization, multi-agent activities, and the classification of more complex activities. On the other side, generative graphical models have attractive features in terms of automatic segmentation, localization and extraction of plots from videos, but suffer from a tendency to overfit the available, limited training data. On top of that, more advanced techniques for the classification of generative models are necessary to cope with inherent variability and the presence of covariates.
With this project we propose to break new ground in all these respects, with significant impact on the real-world deployment of action recognition tools, by designing novel modelling techniques (both generative and discriminative) able to incorporate the spatio-temporal structure of the data while allowing for the flexibility required by a generally limited amount of training information.
Introducing structure in discriminative models. In the field of discriminative modelling, we plan to build on recent progress in the use of part-based discriminative models for the detection of 2D objects in cluttered images. If we think of actions (and, even more so, complex activities) as spatio-temporal “objects” composed of distinct but coordinated “parts” (elementary motions, simple actions), the idea of generalizing part-based models originally designed for 2D object detection to actions becomes natural and appealing. In particular, as is the case for objects, discriminative action parts can be learned in the framework of Multiple Instance Learning (MIL) [126, 127]. Consider a one-versus-all classification problem. In MIL, a discriminative model is learned starting from a bag of negative examples (of the wrong class) and a bag of examples of which some are positive (of the right class) and some are negative, without knowing which are which. In our case, think of all possible spatio-temporal sub-volumes of a video sequence in which we know that a positive example of a certain action category is present, but not where. An initial “positive” model is learned by assuming that all examples in the positive bag are indeed positive (all sub-volumes of the sequence contain the action at hand), while a negative one is learned from the examples in the negative bag (videos labelled with a different action category).
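As a rough illustration of this MIL setup, including a simple iterative re-selection of the positive examples, the sketch below uses a nearest-centroid model in place of a real discriminative classifier. This substitution is an assumption made to keep the example self-contained; the function names, the `keep` parameter and the toy two-dimensional features standing in for sub-volume descriptors are all hypothetical and not taken from the proposal.

```python
from typing import List, Sequence

Vector = Sequence[float]


def centroid(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of feature vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def score(x: Vector, pos_c: Vector, neg_c: Vector) -> float:
    """Higher when x is closer to the positive model than to the negative one."""
    return sq_dist(x, neg_c) - sq_dist(x, pos_c)


def mil_select(positive_bag: List[Vector], negative_bag: List[Vector],
               keep: int = 2, iterations: int = 10) -> List[int]:
    """Return indices of the positive-bag examples retained as positive.

    Hypothetical simplification: the 'model' is just a pair of centroids,
    and 'keep' fixes how many examples survive each round.
    """
    neg_c = centroid(negative_bag)
    # Initial assumption: every example in the positive bag is positive.
    selected = list(range(len(positive_bag)))
    for _ in range(iterations):
        pos_c = centroid([positive_bag[i] for i in selected])
        ranked = sorted(range(len(positive_bag)),
                        key=lambda i: score(positive_bag[i], pos_c, neg_c),
                        reverse=True)
        new_selected = sorted(ranked[:keep])
        if new_selected == selected:  # selection is stable, stop iterating
            break
        selected = new_selected
    return selected


if __name__ == "__main__":
    # Toy features: two sub-volumes contain the action, two are background.
    pos_bag = [(5.0, 5.0), (5.2, 4.8), (0.1, 0.0), (0.0, 0.2)]
    neg_bag = [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.1)]
    print(mil_select(pos_bag, neg_bag))
```

In a realistic system the centroids would be replaced by a trained classifier and the number of retained examples would not be fixed in advance; the sketch only shows the alternation between fitting a model and re-selecting the positives.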
Initial models are then updated in an iterative process: eventually, only the most discriminative examples in the initial positive bag are retained as positive. A flexible constellation of the most discriminative “action parts” can then be built from the training data to take into account the spatio-temporal structure of the activity at hand. Such an approach builds on the already significant results of discriminative models, but at the same time addresses several of the challenges we isolated in our analysis of the state of the art: 1 – complex activities can be learned and discriminated; 2 – localization in both space and time becomes an integral part of the recognition process; 3 – multi-agent action recognition becomes standard practice, as the presence of more than one action is assumed by default.
Move towards imprecise-probabilistic generative models. As for generative modelling, addressing the issues of inherent variability (which causes overfitting) and of the influence of covariate factors (which makes rigid classification techniques inadequate) requires, on one side, moving beyond classical, “precise” graphical models and, on the other, developing a theory of classification for generative models which allows for robustness and flexibility. We have seen above that classical graphical models require a number of probability distributions to be estimated from the training data. If the training data form a small subset of the whole universe of examples (as is always the case), the constraint of having to determine single probabilities necessarily leads to overfitting. In contrast, imprecise-probabilistic models replace such single (“precise”) probability distributions with whole closed convex sets of them, or “credal sets” [81].
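To make the notion of a credal set concrete, the following minimal sketch builds probability intervals for one row of transition probabilities from observed transition counts. It assumes the imprecise Dirichlet model with hyperparameter `s`, which is one standard construction and is not named in the proposal; the function names and toy counts are likewise illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]


def idm_intervals(counts: Sequence[int], s: float = 2.0) -> List[Interval]:
    """Lower and upper probabilities under the imprecise Dirichlet model.

    For count n_i out of N observations:
        lower_i = n_i / (N + s),   upper_i = (n_i + s) / (N + s).
    The hyperparameter s > 0 is an assumed setting, not from the source.
    """
    if s <= 0:
        raise ValueError("s must be positive")
    total = sum(counts)
    return [(n / (total + s), (n + s) / (total + s)) for n in counts]


def in_credal_set(p: Sequence[float], intervals: Sequence[Interval],
                  tol: float = 1e-9) -> bool:
    """Check the linear constraints: p sums to one and respects every interval."""
    if abs(sum(p) - 1.0) > tol:
        return False
    return all(lo - tol <= x <= hi + tol for x, (lo, hi) in zip(p, intervals))


if __name__ == "__main__":
    counts = [3, 1, 0]  # toy transition counts out of one hidden state
    intervals = idm_intervals(counts)
    precise = [n / sum(counts) for n in counts]  # maximum-likelihood estimate
    print(intervals)
    print(in_credal_set(precise, intervals))
```

With only four observations the intervals are wide, so many distributions are compatible with the data; as the counts grow, each interval shrinks with width s / (N + s) and the credal set contracts towards the precise estimate.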
Graphical models which handle credal sets, or “credal networks” [66], are a promising way of solving the overfitting problem, as they allow the actual evidence provided by a necessarily limited training set to determine only a set of linear constraints on the true value of the probability distributions to be estimated. Despite a number of early successes [67], progress in this field has been hampered by the computational complexity of inference algorithms in such networks. This has made it rather difficult to find computationally efficient, or even feasible, counterparts of the classical EM and Viterbi algorithms. Recently, however, significant progress has been made towards efficient exact inference algorithms in credal trees [62,63]. This research, initiated and actively pursued by members of the present consortium (SYSTeMS and IDSIA), has opened the way to the development of efficient algorithms for imprecise hidden Markov models [64], a significant special case of imprecise-probabilistic graphical models. Imprecise-probabilistic graphical models allow a more flexible definition of an action model: an imprecise model is indeed equivalent to infinitely many classical ones. At the same time, they retain the desirable features of classical graphical models in terms of segmentation and concise description of the observed motion. In addition, recognition algorithms based on such models can, unlike “precise” classifiers, return more than a single action label. Empirical comparisons generally show that the imprecise algorithm returns more than one output when the precise algorithm just returns a wrong output, an extremely desirable feature, especially for security applications.
Novel classification techniques for (precise/imprecise) generative models. In the generative approach, action classification reduces to the classification of (precise/imprecise) graphical models. In the “precise” case, a number of distance functions have been introduced in the past (e.g., [52]), and a vast literature on dissimilarity measures for Markov models also exists [22], mostly concerning variants of the Kullback-Leibler divergence [33].
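As one example of such a dissimilarity measure, the sketch below computes the Kullback-Leibler divergence rate between two stationary Markov chains given their transition matrices. It is an illustration only, not the specific measure of the cited works: it assumes ergodic, aperiodic chains (so that power iteration converges to the stationary distribution), and the function names and toy matrices are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def stationary(P: Matrix, iterations: int = 1000) -> List[float]:
    """Approximate the stationary distribution of P by power iteration.

    Assumes the chain is ergodic and aperiodic, so the iteration converges.
    """
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def kl_rate(P: Matrix, Q: Matrix) -> float:
    """KL divergence rate: sum_i pi_i * sum_j P_ij * log(P_ij / Q_ij),

    where pi is the stationary distribution of P. Returns infinity when P
    allows a transition that Q forbids.
    """
    pi = stationary(P)
    total = 0.0
    for i, row in enumerate(P):
        for j, p in enumerate(row):
            if p > 0.0 and pi[i] > 0.0:
                if Q[i][j] == 0.0:
                    return math.inf
                total += pi[i] * p * math.log(p / Q[i][j])
    return total


if __name__ == "__main__":
    P = [[0.9, 0.1], [0.2, 0.8]]  # toy 2-state chains
    Q = [[0.5, 0.5], [0.5, 0.5]]
    print(kl_rate(P, Q), kl_rate(Q, P))
```

The two printed values differ because the divergence is not symmetric, which is one reason why symmetrized or otherwise modified variants are often used when a proper distance between models is needed.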
However, as the same models (or sequences) can be endowed with different labels (e.g., action, ID), no single distance function can possibly outperform all the others in each and every classification problem. On top of that, the variation induced by the many nuisance factors dooms any naive approach to the classification of generative models to failure. A reasonable approach, when some training data are available, is instead to learn in a supervised fashion the “best” distance function between models for a specific classification problem, for instance by employing differential-geometric methods [2,13,34]. Manifolds of generative models have been studied, for instance, in [16], while the idea of supervised learning of distance functions has been successfully proposed and applied in the past in various contexts [3,4,48]. The same holds for imprecise graphical models. Dissimilarity measures for imprecise models can be derived, for instance, from distances between convex sets of distributions (as proposed by IDSIA). As graphical models are complex objects, classification frameworks based on the “structured learning” approach to SVM classification [69] also have the potential to deliver state-of-the-art results. Manifold-learning-based approaches to the classification of generative models encoding complex activities subject to a number of nuisance factors can help tackle the crucial issue of the presence of a large number of covariate factors, pushing towards “recognition in the wild” scenarios [37].
1.3 S/T methodology and associated work plan
1.3.1 Breakthroughs
To summarize: in order to progress towards the actual real-world deployment of action recognition in the different potential scenarios, the following challenges need to be tackled: 1 – localization in space and time (segmentation); 2 – presence of multiple actors; 3 – influence of covariate/nuisance factors, which make unconstrained recognition “in the wild” so difficult; 4 – inherent variability of actions with the same meaning, and corresponding overfitting issues due to limited training sets; 5 – analysis of complex activities rather than elementary actions. The underlying idea we support with this project is that these issues can be tackled only by properly modelling the spatio-temporal structure (which combines localization and dynamics) of the motions to be analyzed. We propose to do so in both the generative and the discriminative approach to recognition, in order to achieve the following breakthroughs, which reflect the main challenges described above.
Breakthrough #1: action localization in space and time. An often neglected step, localizing an action in both space (image region) and time is the first necessary step in any action recognition framework. Modelling the spatio-temporal structure of the action/activity is paramount to this purpose. Generative graphical models have been explicitly designed for this purpose; novel discriminative models which incorporate such a structure will be developed.
Breakthrough #2: recognition in the presence of multiple actors. If more than one person is present in the scene, their individual behaviour might have to be analyzed. Once again, spatio-temporal discriminative models potentially allow us to analyze their behavioural patterns separately. The same holds for