28.07.2013 Views

Project Proposal (PDF) - Oxford Brookes University

Project Proposal (PDF) - Oxford Brookes University

Project Proposal (PDF) - Oxford Brookes University

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

FP7-ICT-2011-9 STREP proposal<br />

18/01/12 v1 [Dynact]<br />

Section 1: Scientific and/or technical quality, relevant to the topics addressed by the call<br />

1.1 Concept and objectives<br />

1.1.1 Topic of research<br />

Action and activity recognition. Since Johanssen's classic experiment showing that moving light<br />

displays were enough for people to recognise motions or even identities [110], recognising human activities<br />

from videos has been an increasingly important topic of computer vision. The problem consists in telling,<br />

given one or more image sequences capturing one or more people performing various activities, what<br />

categories (among those previously learned) these activities belong to. A significant (though semantically<br />

rather vague) distinction is that between “actions”, meant as simple (usually stationary) motion patterns, and<br />

“activities” considered to be more complex sequences of actions, sometimes overlapping in time as they are<br />

performed by different parts of the body or different agents in the scene.<br />

Why is action recognition important: scenarios for real-world deployment. The potential impact<br />

of a real-world deployment of reliable automatic action recognition techniques is enormous, and involves a<br />

large number of scenarios in different societal, scientific and commercial contexts (Figure 1). Historically,<br />

ergonomic human-machine interfaces able to replace keyboard and mouse with gesture recognition as the<br />

preferred means of interacting with computers may have been envisaged first. It has been stated that 93% of<br />

human communication is non-verbal, that is, with body language, facial expressions, and tone of the voice<br />

imparting most of the information involved. In this scenario humans are allowed to interact with virtual<br />

characters for entertainment, educational purposes, or at automated information points in public spaces.<br />

Smart rooms have also been imagined, where people are assisted in their everyday activities by distributed<br />

intelligence in their own homes (switching lights when they move through the rooms, interpreting their<br />

gestures to replace remote controls and switches, etcetera). In particular, given our rapidly ageing population<br />

(at least in Europe), semi-automatic assistance to non-autonomous elderly people (able to signal the need for<br />

intervention to hospitals or relatives) is becoming an object of increasing interest.<br />

In the surveillance scenario, in contrast, the focus is more on detecting “anomalous” behavior in public<br />

spaces, rather than precisely determining the exact nature of the activity that is taking place. Rather than<br />

spending a lot of time (and money) in front of an array of monitors, security guards can be assisted by semiautomatic<br />

event detection algorithms able to signal anomalous events to their attention. In the related field of<br />

biometrics, popular techniques such as face, iris, or fingerprint recognition cannot be used at a distance, and<br />

require user cooperation. For these reasons, identity recognition from gait is being studied as a novel<br />

“behavioral” biometric, based on people’s distinctive gait pattern.<br />

The newest generation of controllers (such as the movement-sensing Kinect console by Microsoft) have<br />

opened new directions in the gaming industry. Yet, these tools are only designed to track the user’s<br />

movements, without any real elaboration or interpretation of their actions. Action recognition can "spice" up<br />

the console owners’ gaming experience to their great satisfaction.<br />

Last but not least, the availability of enormous quantities of images and videos over the internet is one of the<br />

major features of nowadays' society. People acquire new videos all the time with their own smartphones and<br />

portable cameras, and post them on Facebook or YouTube. However, techniques able to efficiently datamine<br />

them are a scarce resource: the potential of a "drag and drop" style application, similar to that set up by<br />

Google for images, able to retrieve videos with a same "semantic" content is easy to imagine.<br />

Figure 1. Action and activity recognition have manifold applications: virtual reality, human-computer<br />

interaction, surveillance, gaming and entertainment are just some examples.<br />

Crucial issues with action recognition. While the formulation of the problem is simple and<br />

intuitive, action recognition is, in fact, a hard problem (Figure 2). Motions possess an extremely high degree<br />

of inherently variability: quite distinct movements can carry the same meaning or represent the same gesture.<br />

In addition, they are subject to a large number of nuisance or “ covariate”<br />

factors [37], such as illumination,<br />

moving background, camera(s) viewpoint, and many others. This explains why experiments have been often<br />

<strong>Proposal</strong> Part B: page [3] of [67]

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!