Actions in the Eye

Cristian Sminchisescu, Lund University

joint work with Stefan Mathe, Romanian Academy & U.Toronto

Computer Vision in the Age of Data

• Computer visual recognition nowadays relies heavily on machine learning

• Annotated datasets important for

– Training and algorithm design, model selection

– Performance evaluation

• Typical annotations are subjectively defined, with a task focus

– Image labels, bounding boxes, layout, attributes

– No guidance into intermediate computations to achieve the goal

… Are there any other useful annotations?

• State of the art still lags behind human performance

Human Performance

Task Influence on Visual Attention

Image stimulus: "The Unexpected Visitor"

Viewing tasks (Yarbus, 1967):

• Free examination

• What are the material circumstances of the family?

• What are their ages?

• What were they doing before the visitor's arrival?

• Remember the clothes worn by the people

• Remember object and person positions

• How long has the unexpected visitor been away from the family?


Eye movements provide insight into the working visual system

• The 'ultimate' interest point operator

• May help select among combinatorial sets of features or parts of a model

• May reveal cognitive routines

I do not believe that replicating biology is the right thing to do necessarily, nor was this our plan. But eye movements give access to information that has not been computationally explored within computer vision. We wanted to see where it may lead.

Why Work with Video?

• Video provides natural exposure time

• Very few eye movement datasets

– Mostly collected under free viewing

– Yarbus’ static studies not confirmed quantitatively

– No dynamic studies

– No evaluation of impact on computer visual recognition

Key Quantitative Questions

• Are the eye movement patterns of different subjects statically and dynamically consistent?

• How do human fixations relate to the responses of computer vision interest point operators?

• How do computer vision pipelines based on human fixations compare with ones based on CV operators?

– If better, can we predict fixations accurately?

– Can we close the loop?

Eye Tracking Data Collection

Eye Tracking Data

• Datasets

– Hollywood2

• 1707 videos from 69 Hollywood movies, ~500k frames

• 12 classes (e.g. answer phone, drive car, eat, fight)

– UCF Sports Actions

• 150 videos from sport events

• 9 classes (e.g. dive, kick, golf swing)

• Subject groups

– active (12 subjects): action recognition task

– free viewers (4 subjects): no task, just watch

Data Collection Setup

• High Quality Acquisition Setup

– SMI iView X 1250 tower-mounted eye tracker

• Data Quality Standards

– high temporal frequency (500Hz)

– high spatial accuracy (0.5°)

• Subject Fatigue Management

– 204h of capture

– 816 blocks with recalibration

– Mandatory breaks between blocks

– No more than 4 blocks/subject/day

• Avoid habituation effects: randomization

Video Eye Tracking Setup

Human Recognition Performance

Near perfect (AP ~ 95%)

• Omissions more frequent than additions

• Almost never mislabel

Human Fixation Heat Maps

(16 subjects)

[Heat maps shown for three video sequences, at frames 25/50/90/150, 15/40/60/90, and 4/12/25/31]

Fixated locations tightly clustered
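Heat maps like these are typically built by placing a small Gaussian at each subject's fixated location and normalizing. A minimal sketch (the function name and the choice of sigma are assumptions, not from the deck):

```python
import numpy as np

def fixation_heat_map(fixations, height, width, sigma=25.0):
    """Build a per-frame heat map by placing a Gaussian at each subject's
    fixation point and normalizing the result to a spatial distribution.

    fixations: list of (x, y) pixel coordinates, one per subject.
    sigma: spread in pixels (hypothetical default).
    """
    ys, xs = np.mgrid[0:height, 0:width]
    heat = np.zeros((height, width))
    for (fx, fy) in fixations:
        heat += np.exp(-((xs - fx) ** 2 + (ys - fy) ** 2) / (2 * sigma ** 2))
    total = heat.sum()
    return heat / total if total > 0 else heat
```

With tightly clustered fixations, as observed across the 16 subjects, the resulting map is strongly peaked around a few locations.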

Static and Dynamic Consistency Metrics

Static Inter-Subject Agreement

• For each frame, predict the fixations of each subject from the fixations of the others

– AUC = 1 for perfect consistency, but partly due to center screen or shooter bias

• Control: predict fixations on one frame using data from a different frame

– AUC at chance level (0.5) if no bias
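One way to compute such an agreement score: evaluate the other subjects' heat map at the held-out subject's fixated pixels (positives) and at control locations (negatives, e.g. fixations shuffled in from other frames to discount center/shooter bias), then take the AUC. A minimal sketch with hypothetical names:

```python
import numpy as np

def inter_subject_auc(heat_map, positives, negatives):
    """AUC for predicting a held-out subject's fixations from the other
    subjects' heat map. positives/negatives are lists of (x, y) pixels.
    AUC = probability that a random positive outscores a random negative."""
    pos = np.array([heat_map[y, x] for (x, y) in positives])
    neg = np.array([heat_map[y, x] for (x, y) in negatives])
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties  # ties split evenly, as in a rank-based AUC
```

Using fixations from a *different* frame as positives implements the bias control: any AUC above 0.5 in that setting is attributable to center screen or shooter bias rather than to the frame content.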

Good Static Inter-Subject Agreement


UCF Sports Actions

Scanpath Representation based on Automatic AOI Generation

Assigning fixation to AOIs

Scanpath through AOIs over the entire video sequence

Dynamic Consistency Metrics
















Markov Models

- goal: model transitions

- metric: sequence probability

Sequence Alignment

- goal: handle gaps/insertions

- metric: alignment score
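Both dynamic metrics can be sketched over AOI label sequences. The function names, smoothing, and scoring constants below are illustrative assumptions; the slide only fixes the two ideas (transition modeling with a sequence probability, and gap-tolerant alignment with an alignment score):

```python
import numpy as np

def scanpath_log_prob(seq, trans, init, n_aois, alpha=1.0):
    """Log-probability of an AOI scanpath under a first-order Markov model.
    trans[i, j]: count of i->j transitions pooled over training subjects;
    init[i]: count of starting AOIs; alpha: additive smoothing."""
    lp = np.log((init[seq[0]] + alpha) / (init.sum() + alpha * n_aois))
    for a, b in zip(seq, seq[1:]):
        lp += np.log((trans[a, b] + alpha) / (trans[a].sum() + alpha * n_aois))
    return lp

def alignment_score(s, t, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment of two AOI sequences; gap moves
    absorb the insertions/omissions a Markov model cannot handle."""
    m, n = len(s), len(t)
    D = np.zeros((m + 1, n + 1))
    D[:, 0] = gap * np.arange(m + 1)
    D[0, :] = gap * np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            D[i, j] = max(D[i - 1, j - 1] + sub,
                          D[i - 1, j] + gap,
                          D[i, j - 1] + gap)
    return D[m, n]
```

A subject whose scanpath visits the same AOIs in the same order as the group scores high under both metrics; skipping one AOI costs a single gap penalty in the alignment but can make the Markov probability collapse.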

Static and Dynamic Inter-Subject Agreement Scores


Task / No Task Condition Comparisons

• Distributions not significantly different!

• The film director already emphasizes the actions

Breakdown across subjects


UCF Sports Actions

Breakdown across actions

Consistency Findings

• Subjects remarkably consistent under all metrics

• Consistency stable across action classes

• Significant center screen / shooter bias

• Shooter's bias varies significantly across classes (especially for UCF)

Fixation Vocabularies

N.B. Fixations fall almost always on objects or parts of objects; almost never on unstructured parts of the image

Computer Action Recognition

Processing Pipeline for Recognition

• Same pipeline used for all experiments

– Interest Point Operator

– Descriptor Extraction (HoG, MBH)

– Visual Dictionaries

– Classifiers (nonlinear SVM+MKL)

• Interest Point Operators are the main variable

– Computer vision operators (Harris2D/3D, dense)

– Various biological operators
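Downstream of whichever interest point operator is plugged in, the stages are the same: descriptors at the selected points are quantized against a visual dictionary into a histogram fed to the classifier. A minimal sketch of that shared quantization stage (hypothetical names; the real pipeline uses HoG/MBH descriptors, learned dictionaries, and a nonlinear SVM+MKL classifier):

```python
import numpy as np

def bow_histogram(descriptors, vocabulary):
    """Quantize local descriptors (one per interest point) against a visual
    vocabulary and return an L1-normalized bag-of-words histogram.
    descriptors: (num_points, dim); vocabulary: (num_words, dim)."""
    # assign each descriptor to its nearest visual word (Euclidean distance)
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / max(hist.sum(), 1.0)
```

Keeping this stage fixed is what makes the comparison fair: only the source of interest points (Harris3D, dense sampling, or human fixations) varies between pipelines.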

Human Fixations vs. Harris Corners


Recognition using Fixation-Derived Interest Point Operators

Note: All pipelines are equally sparse. They all generate on average 55 interest points/frame

Saliency Map Prediction

• Baselines

– Uniform Map

– Central Bias Map (CB)

• Static Feature Maps (SF)

– Itti&Koch model (2000)

– Oliva&Torralba model (2001)

– Rosenholtz model (1999)

– Horizon detector (Oliva&Torralba,2001)

– Object Detectors (faces, persons, cars)

• Novel Motion Feature Maps (MF)

– Flow

– Pb with Flow

– Flow Bimodality

– Harris Cornerness

• Our proposed HoG-MBH detector trained on human fixations

HoG-MBH Detector

Designed to fire on semantically meaningful image regions: trained to detect human fixations and focus on semantic structure rather than corners.

• Extract spatio-temporal HoG and MBH descriptors (3 grid configurations, concatenated)

• Train linear SVM (kernel approximation)

• Sliding window approach

• Detector output → predicted saliency map
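The sliding-window step can be sketched as below. The names and the toy `score_fn` are assumptions; in the actual detector the scoring function is the trained linear SVM applied to the window's concatenated HoG/MBH descriptors:

```python
import numpy as np

def sliding_window_saliency(score_fn, frame, win=32, stride=8):
    """Slide a window over the frame, score each location with a trained
    classifier (score_fn: patch -> real-valued score), and normalize the
    shifted score grid into a predicted saliency map."""
    H, W = frame.shape[:2]
    rows = range(0, H - win + 1, stride)
    cols = range(0, W - win + 1, stride)
    scores = np.array([[score_fn(frame[r:r + win, c:c + win]) for c in cols]
                       for r in rows])
    scores = scores - scores.min()  # shift so the map is nonnegative
    total = scores.sum()
    return scores / total if total > 0 else scores
```

The normalized grid (upsampled to frame resolution in practice) is then used like any other saliency map in the recognition pipeline.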

Saliency Map Evaluation

• Standard practice uses AUC to predict ground truth fixations

• However, we use saliency maps as spatial probability distributions

– competition effects (due to normalization)

• Better to use the KL divergence metric

– provides visually intuitive comparisons
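Treating both maps as spatial probability distributions, the evaluation reduces to a KL divergence between them. A minimal sketch (function name and epsilon are assumptions):

```python
import numpy as np

def saliency_kl(predicted, ground_truth, eps=1e-8):
    """KL(ground_truth || predicted) between two saliency maps treated as
    spatial probability distributions. Unlike AUC, this is sensitive to how
    normalization redistributes probability mass across the frame."""
    p = ground_truth / ground_truth.sum()
    q = predicted / predicted.sum()
    return float((p * np.log((p + eps) / (q + eps))).sum())
```

A predictor that spreads mass uniformly is penalized heavily wherever the human map is peaked, which matches the visual impression the maps give.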

Human Saliency Map Prediction

Saliency Maps

Recognition using Saliency-Based Interest Point Operators

Note: All pipelines are equally sparse. They all generate on average 55 interest points/frame

UCF Sports (same trends)


Conclusions

• Subjects remarkably consistent under both static and the new dynamic metrics introduced

• Human fixations are weakly correlated with Harris interest point operators

– Obvious eye-tracking pipelines are not superior

• A detector trained to predict the human saliency map is feasible

• State-of-the-art results can be obtained with saliency maps learnt from human fixations

– Symbiosis of computer vision techniques and human vision data


S. Mathe and C. Sminchisescu: Dynamic Eye Movement Datasets and Learnt Saliency Models for Visual Action Recognition, ECCV 2012

Eye movement data available at:

3D Human pose data (3.6 million) available at:
