Project Proposal (PDF) - Oxford Brookes University
Project Proposal (PDF) - Oxford Brookes University
Project Proposal (PDF) - Oxford Brookes University
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
FP7-ICT-2011-9 STREP proposal<br />
18/01/12 v1 [Dynact]<br />
Section 1: Scientific and/or technical quality, relevant to the topics addressed by the call<br />
1.1 Concept and objectives<br />
1.1.1 Topic of research<br />
Action and activity recognition. Since Johanssen's classic experiment showing that moving light<br />
displays were enough for people to recognise motions or even identities [110], recognising human activities<br />
from videos has been an increasingly important topic of computer vision. The problem consists in telling,<br />
given one or more image sequences capturing one or more people performing various activities, what<br />
categories (among those previously learned) these activities belong to. A significant (though semantically<br />
rather vague) distinction is that between “actions”, meant as simple (usually stationary) motion patterns, and<br />
“activities” considered to be more complex sequences of actions, sometimes overlapping in time as they are<br />
performed by different parts of the body or different agents in the scene.<br />
Why is action recognition important: scenarios for real-world deployment. The potential impact<br />
of a real-world deployment of reliable automatic action recognition techniques is enormous, and involves a<br />
large number of scenarios in different societal, scientific and commercial contexts (Figure 1). Historically,<br />
ergonomic human-machine interfaces able to replace keyboard and mouse with gesture recognition as the<br />
preferred means of interacting with computers may have been envisaged first. It has been stated that 93% of<br />
human communication is non-verbal, that is, with body language, facial expressions, and tone of the voice<br />
imparting most of the information involved. In this scenario humans are allowed to interact with virtual<br />
characters for entertainment, educational purposes, or at automated information points in public spaces.<br />
Smart rooms have also been imagined, where people are assisted in their everyday activities by distributed<br />
intelligence in their own homes (switching lights when they move through the rooms, interpreting their<br />
gestures to replace remote controls and switches, etcetera). In particular, given our rapidly ageing population<br />
(at least in Europe), semi-automatic assistance to non-autonomous elderly people (able to signal the need for<br />
intervention to hospitals or relatives) is becoming an object of increasing interest.<br />
In the surveillance scenario, in contrast, the focus is more on detecting “anomalous” behavior in public<br />
spaces, rather than precisely determining the exact nature of the activity that is taking place. Rather than<br />
spending a lot of time (and money) in front of an array of monitors, security guards can be assisted by semiautomatic<br />
event detection algorithms able to signal anomalous events to their attention. In the related field of<br />
biometrics, popular techniques such as face, iris, or fingerprint recognition cannot be used at a distance, and<br />
require user cooperation. For these reasons, identity recognition from gait is being studied as a novel<br />
“behavioral” biometric, based on people’s distinctive gait pattern.<br />
The newest generation of controllers (such as the movement-sensing Kinect console by Microsoft) have<br />
opened new directions in the gaming industry. Yet, these tools are only designed to track the user’s<br />
movements, without any real elaboration or interpretation of their actions. Action recognition can "spice" up<br />
the console owners’ gaming experience to their great satisfaction.<br />
Last but not least, the availability of enormous quantities of images and videos over the internet is one of the<br />
major features of nowadays' society. People acquire new videos all the time with their own smartphones and<br />
portable cameras, and post them on Facebook or YouTube. However, techniques able to efficiently datamine<br />
them are a scarce resource: the potential of a "drag and drop" style application, similar to that set up by<br />
Google for images, able to retrieve videos with a same "semantic" content is easy to imagine.<br />
Figure 1. Action and activity recognition have manifold applications: virtual reality, human-computer<br />
interaction, surveillance, gaming and entertainment are just some examples.<br />
Crucial issues with action recognition. While the formulation of the problem is simple and<br />
intuitive, action recognition is, in fact, a hard problem (Figure 2). Motions possess an extremely high degree<br />
of inherently variability: quite distinct movements can carry the same meaning or represent the same gesture.<br />
In addition, they are subject to a large number of nuisance or “ covariate”<br />
factors [37], such as illumination,<br />
moving background, camera(s) viewpoint, and many others. This explains why experiments have been often<br />
<strong>Proposal</strong> Part B: page [3] of [67]