Outline Proposal - Oxford Brookes University

Request 69009 Page 4 of 11

Figure 2: Bag-of-features (BoF) methods build histograms of the frequencies of local video features. Because all spatio-temporal relationships are lost, meaningless videos with almost identical histograms can be incorrectly recognized.

Inspired by the successes of similar approaches in 2D object detection [20], we propose to represent human activities as spatio-temporal “objects” composed of distinct, coordinated “parts” (elementary actions). More specifically, instead of computing action descriptors on whole video clips (Figure 3, left), we compute them for collections of space-time action parts associated with video subvolumes (middle). Multiple instance learning (MIL) is used to learn which subvolumes are particularly discriminative of the action (solid-line green cubes) and which are not (dotted-line cubes). Finally (right), a human action is represented as a “star model” of elementary BoF action parts.

Figure 3: The proposed approach for learning and recognizing human activities as structured constellations of the most discriminative action parts.

Step 1: Prior to modeling actions, video streams have to be processed to extract salient “features”, either frame by frame or from the entire spatio-temporal (S/T) volume which contains the action(s) of interest. A plethora of local video descriptors have been proposed for S/T volumes: Cuboid, 3D-SIFT, HoG-HoF, HOG3D, extended SURF. Dense Trajectory Features, a combination of HoG-HoF with optical flow vectors and motion boundary histograms, have been shown to outperform all the other approaches. An appealing alternative to traditional video is provided by “range” (Time-of-Flight) cameras: feature extraction from range images and fusion of range and video features will be integral parts of this project.

Step 2: From the local features extracted from each video subvolume, a Fisher vector representation is calculated, so that each subvolume is encoded by a single Fisher vector.
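To make Step 2 more concrete, the following is a minimal sketch of how the local descriptors of one subvolume could be pooled into a single vector. It is an illustration under stated assumptions, not the proposers' implementation: the function name `fisher_vector_means`, the toy Gaussian mixture parameters and the toy descriptors are all hypothetical, the mixture is assumed to have diagonal covariances, and only the gradient with respect to the mixture means is computed (a full Fisher vector would normally also include the variance gradients and a normalization step).

```python
import math


def fisher_vector_means(descriptors, weights, means, variances):
    """Simplified Fisher vector: gradient w.r.t. the GMM means only.

    descriptors: list of D-dimensional local features from one subvolume
    weights, means, variances: parameters of a K-component GMM with
    diagonal covariances (assumed to be trained beforehand)
    Returns a list of length K * D.
    """
    if not descriptors:
        raise ValueError("at least one descriptor is required")
    n, k, d = len(descriptors), len(weights), len(means[0])
    fv = [0.0] * (k * d)
    for x in descriptors:
        # Log of the weighted Gaussian density of x under each component.
        log_p = []
        for j in range(k):
            lp = math.log(weights[j])
            for i in range(d):
                var = variances[j][i]
                lp -= 0.5 * math.log(2.0 * math.pi * var)
                lp -= 0.5 * (x[i] - means[j][i]) ** 2 / var
            log_p.append(lp)
        # Posterior (soft assignment) of x to each component, computed stably.
        top = max(log_p)
        norm = sum(math.exp(lp - top) for lp in log_p)
        post = [math.exp(lp - top) / norm for lp in log_p]
        # Accumulate the whitened deviation from each component mean.
        for j in range(k):
            for i in range(d):
                sigma = math.sqrt(variances[j][i])
                fv[j * d + i] += post[j] * (x[i] - means[j][i]) / sigma
    # Scale by 1 / (N * sqrt(w_k)) for each component k.
    for j in range(k):
        scale = 1.0 / (n * math.sqrt(weights[j]))
        for i in range(d):
            fv[j * d + i] *= scale
    return fv


# Toy usage: two 2-D descriptors, a two-component mixture (all values made up).
toy_fv = fisher_vector_means(
    descriptors=[[0.1, 0.2], [0.9, 1.1]],
    weights=[0.5, 0.5],
    means=[[0.0, 0.0], [1.0, 1.0]],
    variances=[[1.0, 1.0], [1.0, 1.0]],
)
print(len(toy_fv))  # one entry per (component, dimension) pair
```

The length of the resulting vector depends only on the mixture size and the descriptor dimension, not on how many local features the subvolume contains, which is what allows every subvolume to be encoded by a single fixed-length vector.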
Step 3: Multiple instance learning of the most discriminative subvolumes (i.e., those that best characterize an activity versus all the others). An initial “positive” model is learned by assuming that all examples in the positive bag (all the subvolumes of the sequence) do contain the action at hand; a “negative” model is learned from the examples in the negative bags (videos labeled with a different action class). After an iterative process, only the most discriminative examples in each positive bag are retained. MIL reduces to a semi-convex optimisation problem, for which efficient heuristics exist [5]. The resulting model allows us to factor out the effect of common, shared context (similar background, common action elements).

Step 4: Once the most discriminative action parts are learnt via MIL, we can construct tree-like ensembles of action parts (Figure 3, right) to use for both localization and classification of actions. Felzenszwalb and Huttenlocher have shown (for the object detection problem) that if the pictorial structure forms a star model, in which each part is connected only to the root node, the best match can be computed very efficiently by dynamic programming. Other approaches to building a constellation of discriminative parts have been proposed by Hoiem and Ramanan. A crucial element will be the introduction of sparsity constraints in the Latent SVM semi-convex optimization problem proposed by Felzenszwalb, in order to automatically identify the optimal number of parts.

(2) Specifics in system configuration (please indicate required camera or system, if anything special is required as a basis of using the proposed algorithm):

The approach is designed to work with both conventional cameras and range cameras, as in both cases a spatio-temporal volume can be constructed, from which the most discriminative parts can be learned and assembled into an overall model. A fusion of both would be pioneering work.

(3) Applicability of the algorithm to versatile human activities (what should be overcome in applying algorithm developed for a specific human activity to any other human activities):

The algorithm is being developed as general purpose: as such, it is designed to discriminate between any activities introduced in a training stage. In particular, it is explicitly designed to represent complex activities formed by a sequence of elementary actions; to cope with the presence of multiple actors/people; to localize the action of interest within a larger video in both space and time; and to factor out the background (static or dynamic) in order to better discriminate between different activities with a common background or common elementary components (i.e., parts in common).

Current Performance (please answer the following questions by showing a specific recognition task you have experienced so far as an example):

(1) Recognition tasks/applications in brief (if proposers have experience in analyzing and recognizing one or some of the following, please indicate those. If not, please briefly describe what kind of human activities proposers have experienced):

The approach has so far been tested on most of the publicly available benchmarks for action recognition.

The KTH dataset contains 6 action classes, each performed by 25 actors in four scenarios. People perform repetitive actions at different speeds and orientations. Sequences are longer than those in the YouTube or HMDB51 datasets, and contain clips in which the actors move in and out of the scene during the same sequence.

The YouTube dataset contains 11 action categories and presents several challenges due to camera motion, object appearance, scale, viewpoint and cluttered backgrounds. The 1600 video sequences are split into 25 groups, and we follow the authors’ evaluation procedure of 25-fold, leave-one-out cross validation.

The Hollywood2 dataset contains 12 action classes collected from 69 different Hollywood movies. There are a total of 1707 action samples containing realistic, unconstrained human and camera motion. The dataset is divided into 823 training and 884 testing sequences, each 5 to 25 seconds long.

The HMDB dataset contains 51 action classes, with a total of 6849 video clips collected from movies, the Prelinger archive, YouTube and Google videos. Each action category contains a minimum of 101 clips.

23611 Chagrin Blvd., Suite 320, Cleveland, OH 44122 • 216-295-4800 • www.ninesigma.com
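The 25-fold evaluation procedure mentioned above for the YouTube dataset can be sketched as follows. This is an illustration only: it assumes that each fold holds out one of the 25 groups for testing and trains on the remaining 24, and the helper name `leave_one_group_out` and the round-robin group assignment are hypothetical (the real dataset ships with its own group labels, which need not be of equal size).

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices), one fold per group.

    groups: list giving the group label of every video sequence.
    """
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test


# Toy stand-in for the dataset: 1600 sequences assigned to 25 groups
# round-robin (an assumption made purely for illustration).
groups = [i % 25 for i in range(1600)]
folds = list(leave_one_group_out(groups))

for held_out, train, test in folds[:2]:
    print(held_out, len(train), len(test))
```

In such a protocol, the reported accuracy would typically be the average over all folds, so that every sequence is used for testing exactly once and never appears in the training set of its own fold.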