Actions in the Eye

Cristian Sminchisescu, Lund University

joint work with Stefan Mathe, Romanian Academy & U.Toronto


Computer Vision in the Age of Data

• Computer visual recognition nowadays relies heavily on machine learning

• Annotated datasets important for

– Training and algorithm design, model selection

– Performance evaluation

• Typical annotations subjectively defined, with task focus

– Image labels, bounding boxes, layout, attributes

– No guidance on the intermediate computations needed to achieve the goal

… Are there any other useful annotations?

• State of the art still lags behind human performance


Human Performance


Task Influence on Visual Attention

Image stimulus: "The Unexpected Visitor"

Viewing instructions given to observers:

– Free examination

– What are the material circumstances of the family?

– What are their ages?

– What were they doing before the visitor's arrival?

– Remember the clothes

– Remember object and person positions

– How long has the unexpected visitor been away?

Yarbus, 1967


Motivation

Eye movements provide insight into the visual system at work

• The 'ultimate' interest point operator

• May help select among combinatorial sets of features or parts of a model

• May reveal cognitive routines

I do not believe that replicating biology is necessarily the right thing to do, nor was this our plan. But eye movements give access to information that has not been computationally explored within computer vision. We wanted to see where it may lead.


Why Work with Video?

• Video provides natural exposure time

• Very few eye movement datasets

– Mostly collected under free viewing

– Yarbus’ static studies not confirmed quantitatively

– No dynamic studies

– No evaluation of impact on computer visual recognition


Key Quantitative Questions

• Are the eye movement patterns of different subjects statically and dynamically consistent?

• How do human fixations relate to the responses of computer vision interest point operators?

• How do computer vision pipelines based on human fixations fare against those based on computer vision operators?

– If better, can we predict fixations accurately?

– Can we close the loop?


Eye Tracking Data Collection


Eye Tracking Data

• Datasets

– Hollywood2

• 1707 videos from 69 Hollywood movies, ~500k frames

• 12 classes (e.g. answer phone, drive car, eat, fight)

– UCF Sports Actions

• 150 videos from sport events

• 9 classes (e.g. dive, kick, golf swing)

• Subject groups

– active (12 subjects): action recognition task

– free viewers (4 subjects): no task, just watch


Data Collection Setup

• High Quality Acquisition Setup

– SMI iView X 1250 tower-mounted eye tracker

• Data Quality Standards

– high temporal frequency (500Hz)

– high spatial accuracy (0.5°)

• Subject Fatigue Management

– 204h of capture

– 816 blocks with recalibration

– Mandatory breaks between blocks

– No more than 4 blocks/subject/day

• Avoid habituation effects: randomization


Video Eye Tracking Setup


Human Recognition Performance

Near perfect (AP ~ 95%)

• Omissions more frequent than additions

• Almost never mislabel


Human Fixation Heat Maps

(16 subjects)

(heat maps shown at successive frames of three example clips: frames 25/50/90/150, 15/40/60/90, and 4/12/25/31)

Fixated locations tightly clustered


Static and Dynamic

Consistency Metrics


Static Inter-Subject Agreement

For each frame, predict the fixations of each subject from the fixations of the other subjects

– AUC = 1 for perfect consistency, but part of the score is due to center-screen or shooter bias

Alternatively

– Predict fixations on one frame using data from a different frame

– AUC at chance level (0.5) if there is no bias
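A minimal numpy/scipy sketch of this leave-one-subject-out agreement measure (the helper names and the Gaussian blur width are illustrative assumptions, not the exact protocol from the paper): the held-out subject's fixations are positives, a blurred fixation histogram of the remaining subjects is the predictor map, and AUC is estimated against uniformly sampled control locations.

import numpy as np
from scipy.ndimage import gaussian_filter

def fixation_map(fixations, shape, sigma=25.0):
    # Accumulate (y, x) fixation points into a 2D histogram and blur it
    m = np.zeros(shape, dtype=float)
    for y, x in fixations:
        m[int(y), int(x)] += 1.0
    return gaussian_filter(m, sigma)

def leave_one_out_auc(test_fix, other_fix, shape, n_neg=1000, seed=0):
    # Predictor map built from the other subjects' fixations on the same frame
    pred = fixation_map(other_fix, shape)
    rng = np.random.default_rng(seed)
    pos = np.array([pred[int(y), int(x)] for y, x in test_fix])
    neg = pred[rng.integers(0, shape[0], n_neg), rng.integers(0, shape[1], n_neg)]
    # AUC = probability that a fixated location outranks a control location
    return float((pos[:, None] > neg[None, :]).mean()
                 + 0.5 * (pos[:, None] == neg[None, :]).mean())

Averaging this score over subjects and frames gives the static agreement; scoring a subject's fixations against a map built from a different frame approximates the bias control at chance level (0.5).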


Good Static Inter-Subject Agreement

Hollywood-2

UCF Sports Actions


Scanpath Representation based on

Automatic AOI Generation

Assigning fixations to AOIs

Scanpath through AOIs over the entire video sequence
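One simple way to realize this representation (an illustrative sketch, not necessarily the clustering used in the paper): group fixation locations into AOIs with k-means and read off the AOI label sequence in temporal order.

import numpy as np
from sklearn.cluster import KMeans

def scanpath_from_fixations(fixations, n_aois=5):
    # fixations: array of (t, y, x) rows, sorted by time
    fixations = np.asarray(fixations, dtype=float)
    xy = fixations[:, 1:3]
    km = KMeans(n_clusters=n_aois, n_init=10).fit(xy)
    labels = km.predict(xy)
    # Collapse consecutive fixations on the same AOI into a single visit
    path = [int(labels[0])]
    for l in labels[1:]:
        if int(l) != path[-1]:
            path.append(int(l))
    return path, km.cluster_centers_

The resulting label sequences are what the dynamic consistency metrics below operate on.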


Dynamic Consistency Metrics

[Figure: two subjects' scanpaths through automatically generated AOIs (driver, mirror, handbag, spoiler), with transition probabilities between 0.3 and 0.7 on the edges]

Markov Models

– goal: model transitions

– metric: sequence probability

Sequence Alignment

– goal: handle gaps/insertions

– metric: alignment score
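Two small sketches of these ideas, assuming scanpaths are already AOI label sequences (simplified relative to the metrics used in the paper): a first-order Markov model scored as an average log transition probability, and a Needleman-Wunsch-style global alignment that tolerates gaps and insertions.

import numpy as np

def markov_score(train_paths, test_path, n_states, eps=1e-3):
    # Estimate transition probabilities from one group of subjects ...
    T = np.full((n_states, n_states), eps)
    for p in train_paths:
        for a, b in zip(p, p[1:]):
            T[a, b] += 1.0
    T /= T.sum(axis=1, keepdims=True)
    # ... and score a held-out scanpath by its average log transition probability
    return float(np.mean([np.log(T[a, b]) for a, b in zip(test_path, test_path[1:])]))

def alignment_score(p, q, match=1.0, mismatch=-1.0, gap=-0.5):
    # Global alignment (Needleman-Wunsch) between two AOI sequences
    D = np.zeros((len(p) + 1, len(q) + 1))
    D[:, 0] = gap * np.arange(len(p) + 1)
    D[0, :] = gap * np.arange(len(q) + 1)
    for i in range(1, len(p) + 1):
        for j in range(1, len(q) + 1):
            s = match if p[i - 1] == q[j - 1] else mismatch
            D[i, j] = max(D[i - 1, j - 1] + s, D[i - 1, j] + gap, D[i, j - 1] + gap)
    return float(D[-1, -1])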


Static And Dynamic Inter-Subject

Agreement Scores

Hollywood-2


Task / No Task Condition Comparisons

• Distributions not significantly different!

• The director already emphasizes the actions

Breakdown across subjects

Hollywood-2

UCF Sports Actions

Breakdown across actions


Consistency Findings

• Subjects remarkably consistent under all metrics

• Consistency stable across action classes

• Significant bias

• Shooter's bias varies significantly across classes (especially for UCF)


Fixation Vocabularies

N.B. Fixations almost always fall on objects or parts of objects; almost never on unstructured parts of the image


Computer Action Recognition


Processing Pipeline for Recognition

• Same pipeline used for all experiments

– Interest Point Operator

– Descriptor Extraction (HoG, MBH)

– Visual Dictionaries

– Classifiers (nonlinear SVM+MKL)

• Interest Point Operators are the main variable

– Computer vision operators (Harris2D/3D, dense)

– Various biological operators
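A compressed sketch of such a bag-of-features pipeline (a linear SVM stands in for the nonlinear SVM+MKL actually used, and per-video descriptor sets are assumed to be precomputed at whichever interest points are being compared):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def encode(descs, codebook):
    # Bag-of-words histogram: assign each descriptor to its nearest visual word
    words = codebook.predict(descs)
    h = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return h / max(h.sum(), 1.0)

def train_action_recognizer(train_descs, train_labels, vocab_size=4000):
    # 1) visual dictionary by k-means over the training descriptors
    codebook = KMeans(n_clusters=vocab_size, n_init=3).fit(np.vstack(train_descs))
    # 2) one normalized histogram per video, 3) SVM on top of the histograms
    X = np.array([encode(d, codebook) for d in train_descs])
    clf = LinearSVC(C=1.0).fit(X, train_labels)
    return codebook, clf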


Human Fixations vs. Harris Corners

Low correlation


Recognition using Fixation-Derived

Interest Point Operators

Note: All pipelines are equally sparse. They all generate on average 55 interest points/frame


Saliency Map Prediction

• Baselines

– Uniform Map

– Central Bias Map (CB)

• Static Feature Maps (SF)

– Itti&Koch model (2000)

– Oliva&Torralba model (2001)

– Rosenholtz model (1999)

– Horizon detector (Oliva&Torralba,2001)

– Object Detectors (faces, persons, cars)

• Novel Motion Feature Maps (MF)

– Flow

– Pb with Flow

– Flow Bimodality

– Harris Cornerness

• Our proposed HoG-MBH detector trained on human fixations
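For illustration, the central-bias baseline and a simple weighted fusion of feature maps into a single saliency distribution (the Gaussian width and the linear, non-negative fusion are assumptions for this sketch, not necessarily the combination used in the paper):

import numpy as np

def central_bias_map(h, w, sigma_frac=0.25):
    # Isotropic Gaussian centered on the screen; a strong baseline on its own
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = (ys - h / 2.0) ** 2 + (xs - w / 2.0) ** 2
    m = np.exp(-d2 / (2.0 * (sigma_frac * min(h, w)) ** 2))
    return m / m.sum()

def fuse_maps(maps, weights):
    # Weighted combination of individually normalized feature maps
    fused = sum(wgt * (m / m.sum()) for m, wgt in zip(maps, weights))
    return fused / fused.sum()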


HoG-MBH Detector

Designed to fire on semantically meaningful image regions: trained to detect human fixations and to focus on semantic structure rather than corners.

• Extract spatio-temporal HoG and MBH descriptors (3 grid configurations, concatenated)

• Train linear SVM (kernel approximation)

• Sliding-window approach

• Detector output → predicted saliency map
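A schematic of the sliding-window scoring step (window_descriptor is a hypothetical placeholder for the concatenated spatio-temporal HoG/MBH features; w and b are the learned linear SVM parameters; window size and stride are illustrative):

import numpy as np

def predict_saliency(volume, w, b, window_descriptor, window=(64, 64), stride=16):
    # volume: a short spatio-temporal block of frames, shape (T, H, W)
    H, W = volume.shape[1], volume.shape[2]
    scores = np.zeros((H, W))
    hits = np.full((H, W), 1e-6)
    for y in range(0, H - window[0] + 1, stride):
        for x in range(0, W - window[1] + 1, stride):
            d = window_descriptor(volume[:, y:y + window[0], x:x + window[1]])
            s = float(np.dot(w, d) + b)        # linear SVM decision value
            scores[y:y + window[0], x:x + window[1]] += s
            hits[y:y + window[0], x:x + window[1]] += 1.0
    sal = scores / hits                        # average overlapping window scores
    sal -= sal.min()                           # shift to non-negative values
    return sal / max(sal.sum(), 1e-12)         # normalize to a spatial distribution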


Saliency Map Evaluation

• Standard practice uses AUC for predicting ground-truth fixations

• However, we use saliency maps as spatial probability distributions

– Competition effects (due to normalization)

• Better to use the KL divergence metric

– provides visually intuitive comparisons
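The comparison then reduces to a KL divergence between the empirical human fixation map and the predicted map, both treated as spatial probability distributions (a small numpy sketch; the smoothing constant is an assumption):

import numpy as np

def kl_divergence(human_map, predicted_map, eps=1e-12):
    # Normalize both non-negative maps to probability distributions
    p = human_map / human_map.sum()
    q = predicted_map / predicted_map.sum()
    # KL(p || q): how poorly the prediction q explains where humans actually looked
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))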


Human Saliency Map Prediction


Saliency Maps


Recognition using Saliency-Based

Interest Point Operators

Note: All pipelines are equally sparse. They all generate on average 55 interest points/frame


UCF Sports (same trends)


Conclusions

• Subjects are remarkably consistent under both the static and the newly introduced dynamic metrics

• Human fixations are weakly correlated with Harris interest point operators

– Obvious eye-tracking pipelines are not superior

• Training a detector to predict the human saliency map is feasible

• State-of-the-art results can be obtained with saliency maps learnt from human fixations

– A symbiosis of computer vision techniques and human vision data


References

S. Mathe and C. Sminchisescu: Dynamic Eye Movement Datasets and Learnt Saliency Models for Visual Action Recognition, ECCV 2012

Eye movement data available at:

http://vision.imar.ro/eyetracking

3D human pose data (3.6 million poses) available at:

http://vision.imar.ro/human3.6m/
