Recognition

Readings: Forsyth & Ponce, Chap. 22-24; Szeliski, Chap. 14.
Some of the material is from Fei-Fei Li, Antonio Torralba, Szeliski & Seitz, and Rob Fergus.

What Is Recognition?
[Figure: a scene with recognized objects labeled: lamp, phone, vase, keyboard, book, table, building.]
Recognition: Simplified story

[Diagram: training image(s) for objects 1..m are turned into an internal model; features from the training image(s) and features from the input (test) image are compared in a common feature space to produce the output.]
Two families of approaches:
• Approaches based on feature matches and geometric relations
• Approaches based on classifying/matching image patches (windows)
Example Datasets for Category Recognition

Collecting datasets (towards 10^6-10^7 examples):
• ESP game (CMU): Luis von Ahn and Laura Dabbish, 2004
• LabelMe (MIT): Russell, Torralba, Freeman, 2005
• StreetScenes (CBCL-MIT): Bileschi, Poggio, 2006
• WhatWhere (Caltech): Perona et al., 2007
• PASCAL challenge: 2006, 2007
• Lotus Hill Institute: Song-Chun Zhu et al., 2007
• Tiny Images (MIT): Torralba, Fergus & Freeman, 2007
• ImageNet: Fei-Fei Li, 2009
The PASCAL Visual Object Classes Challenge 2007

The twenty object classes that have been selected are:
• Person: person
• Animal: bird, cat, cow, dog, horse, sheep
• Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train
• Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor

M. Everingham, L. Van Gool, C. Williams, J. Winn, A. Zisserman, 2007

ImageNet (http://www.image-net.org/)
Lotus Hill Research Institute image corpus
Z.Y. Yao, X. Yang, and S.C. Zhu, 2007

Evaluation
• Terminology:
  – Classification: Is the object in the image?
    • One-vs.-all
    • Forced choice among N classes
  – Detection: Is the object in the image, and where?

Counting correct detections, terminology:
  – True positive: TP = number of examples containing the object, with the object detected correctly
  – True negative: TN = number of examples not containing the object, with no object detected
  – False positive: FP = number of examples not containing the object, with an object incorrectly detected
  – False negative: FN = number of examples containing the object, with the object not detected
Precision = TP/(TP+FP): out of all the detections, how many are correct.

ROC curve: plots the detection rate TP/(TP+FN) against the false detections FP, the false detection rate FP/(TN+FP), or FPPI (false positives per image).
• Most useful in classification tasks
• Requires TN to be known
• EER = equal error rate: the operating point with as many false positives as false negatives
• AUC = area under the curve
Precision-Recall curve

Recall = TP/(TP+FN): same as the detection rate for detection tasks.

Davis and Goadrich. The Relationship Between Precision-Recall and ROC Curves. ICML (Intl. Conf. on Machine Learning), 2006.
Precision-Recall curve
• Invented for retrieval tasks
• Does not need to know TN!
• Summary performance numbers (see the sketch below):
  – AUPRC = area under the precision-recall curve
  – AP = average precision
  – F-measure = 2·precision·recall/(precision+recall) (1 is perfect performance)
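To make the definitions concrete, here is a minimal sketch computing the point metrics from raw detection counts (the counts themselves are made up):

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F-measure from raw detection counts."""
    precision = tp / (tp + fp)  # out of the detections, how many are correct
    recall = tp / (tp + fn)     # out of the true objects, how many are found
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# e.g., 80 correct detections, 20 false alarms, 40 missed objects:
print(detection_metrics(tp=80, fp=20, fn=40))  # (0.80, 0.67, 0.73)
```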
Confusion matrix
• For forced-choice classification tasks
• Accuracy = (correct samples)/(total samples)
• Class imbalance affects the two common summaries differently (see the sketch below):
  – Per-class accuracy (each class counts equally)
  – Total overall accuracy (dominated by the large classes)
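A small sketch with a made-up 3-class confusion matrix shows how the two accuracies can disagree under class imbalance:

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class.
C = np.array([[90,  5,  5],   # class 0: 100 samples
              [10, 30, 10],   # class 1:  50 samples
              [ 5,  5, 10]])  # class 2:  20 samples

overall_acc = np.trace(C) / C.sum()          # (90+30+10)/170 ~ 0.76
per_class_acc = np.diag(C) / C.sum(axis=1)   # [0.90, 0.60, 0.50]
print(overall_acc, per_class_acc.mean())     # 0.76 vs ~0.67: small classes drag it down
```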
Feature Matching & Geometric Relations

[Diagram: features from the training image(s) are matched against features from the input image.]
Feature Matching & Geometric Relations: key difficulties
• The aspect differs between the training and test images: use "invariant" features
• Local feature similarity is not sufficient: use global geometric consistency
• Large number of features: define a distance in feature space + efficient indexing
SIFT
• Image gradients are sampled over a 16x16 array of locations in scale space
• Create an array of orientation histograms
• 8 orientations x 4x4 histogram array = 128 dimensions
Outline (see the sketch below)
• For each SIFT feature, find the closest feature in the reference set
• Keep the match only if the distance to the second-closest neighbor is greater than (1+d) times the distance to the closest one (the ratio test)
• Apply RANSAC to the set of potential correspondences to verify geometric consistency
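A minimal sketch of this pipeline with OpenCV (the file names, the 0.75 ratio, and the choice of a homography as the geometric model are illustrative assumptions):

```python
import cv2
import numpy as np

img1 = cv2.imread("reference.png", cv2.IMREAD_GRAYSCALE)  # training image
img2 = cv2.imread("test.png", cv2.IMREAD_GRAYSCALE)       # input image

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Ratio test: keep a match only if the best distance is clearly
# smaller than the distance to the second-best candidate.
matcher = cv2.BFMatcher()
good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
        if m.distance < 0.75 * n.distance]

# Geometric verification: fit a global model (here a homography) with RANSAC.
src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
print(f"{int(inlier_mask.sum())} geometrically consistent matches out of {len(good)}")
```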
Examples

Model: a limited set of views of the objects.
Run-time input: the object in arbitrary pose and illumination conditions, in uncontrolled environments.
Feature Matching & Geometric Relations
• Positive:
  – Relatively simple implementation
  – Well-defined, efficient operations: indexing, geometric verification (RANSAC), etc.
• Negative:
  – Information is reduced to a relatively small set of discrete features
  – Generalization issues for broad categories: how do I deal with recognizing the class of all chairs?
• Alternative: use the pixels in the image windows directly

Window-based techniques
Window-based techniques: The short story

[Figure: an example training window and the histogram (frequency vs. feature value) computed from it.]

1. Template classification: techniques inspired by template matching: compare directly with the pixels from the training image. Perhaps use blurring to be robust to small shifts, etc.

2. Locally orderless structures: an intermediate solution: use local histograms of features instead of a single global histogram. Potentially combines the advantages of templates (strong spatial information) and histograms (robustness to distortions of the image).

3. Histograms: use histograms of features to compare the windows. Example: Bag of Words techniques. Problem: all local spatial information is lost since the representation is global. Robust to changes, but much less discriminative.

Example from: The Structure of Locally Orderless Images, J.J. Koenderink and A.J. van Doorn, IJCV 1999.
Key approaches: 1. Template classification
• Nearest-neighbors, PCA, dimensionality reduction
• Naïve Bayes
• Combination of simple classifiers (boosting)
• Neural networks
• SVMs
Nearest Neighbors

[Diagram: the features of the test image are compared to the labeled training features in feature space.]

• Does not require recovery of distributions or decision surfaces
• Asymptotic error rate is at most twice the Bayes risk
• The choice of distance metric is critical
• Indexing may be difficult
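A minimal nearest-neighbor classifier over feature vectors (a sketch; the toy features and the Euclidean metric are placeholder choices):

```python
import numpy as np

def nn_classify(test_feat, train_feats, train_labels):
    """Assign the label of the nearest training feature."""
    # Euclidean distance to every training feature; the metric choice is critical.
    dists = np.linalg.norm(train_feats - test_feat, axis=1)
    return train_labels[np.argmin(dists)]

# Toy 2-D feature space with two object classes.
train_feats = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.8]])
train_labels = np.array(["object_1", "object_1", "object_2", "object_2"])
print(nn_classify(np.array([4.9, 5.1]), train_feats, train_labels))  # object_2
```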
Template classification: Naïve Bayes
More Complicated Features

C(x,y,s) = wavelet coefficient at position (x,y) and scale s (e.g., s = 1, 2, 4)

• Feature = set of coefficients S = (C_1, ..., C_N)

Example from Henry Schneiderman
• Given features S_1, ..., S_r computed from a window, threshold the likelihood ratio

  log [ P(S_1, ..., S_r | ω_1) / P(S_1, ..., S_r | ω_2) ]

  where ω_1 is the object class and ω_2 the non-object class.

• Assume independence of the features (Naïve Bayes):

  log [ P(S_1|ω_1) / P(S_1|ω_2) ] + log [ P(S_2|ω_1) / P(S_2|ω_2) ] + ... + log [ P(S_r|ω_1) / P(S_r|ω_2) ]

How can we compute these probabilities?

Example from Henry Schneiderman
Estimating the Probabilities
• Collect the values of the features over the training data in histograms that approximate the class-conditional probabilities P(S_i | ω_1) and P(S_i | ω_2)
• 50-2,000 original images, with ~1,000 synthetic variations per original image: ~10,000,000 examples

Example from Henry Schneiderman
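A sketch of the idea (the scalar feature values and beta distributions below are stand-ins; a real system quantizes vector-valued wavelet features):

```python
import numpy as np

bins = np.linspace(0.0, 1.0, 33)  # quantize the feature value into 32 bins
eps = 1e-9                        # avoids log(0) in empty bins

# Stand-in feature values collected from training windows of each class.
obj_vals = np.random.beta(5, 2, size=100_000)  # "object" windows
bkg_vals = np.random.beta(2, 5, size=100_000)  # "non-object" windows

# Normalized histograms approximate P(S_i | class).
p_obj = np.histogram(obj_vals, bins=bins)[0] / len(obj_vals) + eps
p_bkg = np.histogram(bkg_vals, bins=bins)[0] / len(bkg_vals) + eps
log_ratio = np.log(p_obj / p_bkg)  # lookup table: feature bin -> log-likelihood ratio

def feature_score(value):
    """Log-likelihood ratio contribution of one feature value."""
    b = np.clip(np.digitize(value, bins) - 1, 0, len(log_ratio) - 1)
    return log_ratio[b]
```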
Run-time classification of a single window:
1. Compute the values of all the features in the window
2. For each feature, compute the probabilities of coming from the object or non-object class
3. Aggregate into the likelihood ratio

From Windows to Images
• Move a window to all possible positions and all possible scales (search in position and search in scale)
• At each (position, scale), evaluate the classifier
• Return a detection if the score is above threshold (see the sketch below)

Example from Henry Schneiderman
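A generic sliding-window search sketch (score_window stands for whatever window classifier the previous slides built; the window size, step, scale pyramid, and threshold are illustrative):

```python
import cv2

def sliding_window_detect(image, score_window, win=24, step=4,
                          scales=(1.0, 0.8, 0.64, 0.51), threshold=0.0):
    """Evaluate a window classifier at every position and scale."""
    detections = []
    for s in scales:
        # Shrinking the image by s is equivalent to enlarging the window by 1/s.
        small = cv2.resize(image, None, fx=s, fy=s)
        for y in range(0, small.shape[0] - win + 1, step):
            for x in range(0, small.shape[1] - win + 1, step):
                score = score_window(small[y:y + win, x:x + win])
                if score > threshold:
                    # Map the hit back to original-image coordinates.
                    detections.append((x / s, y / s, win / s, score))
    return detections
```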
Feature Selection Problem
• Each feature is a set of variables (wavelet coefficients) S = {C_1, ..., C_N}
• Problem:
  – If N is large, the feature is very discriminative (S is equivalent to the entire window if N is the total number of variables), but representing the corresponding distribution is very expensive
  – If N is small, the feature is not discriminative, but classification is very fast

Example from Henry Schneiderman
Solution: Classifier Cascade
• Standard problem:
  – We can have either discriminative or efficient features, but not both!
  – Cannot do the classification in one shot
• Standard solution: a classifier cascade (see the sketch below)
  – First apply a classifier with simple features: fast, and it eliminates the most obvious non-object locations
  – Then apply a classifier with more complex features: more expensive, but applied only to the locations that survived the previous stage

Example from Henry Schneiderman
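The control flow of the cascade in a few lines (the stage classifiers and thresholds are placeholders):

```python
def cascade_classify(window, stages):
    """stages: list of (classifier, threshold) pairs, ordered cheap -> expensive.
    Reject as soon as any stage scores below its threshold."""
    for classifier, threshold in stages:
        if classifier(window) < threshold:
            return False  # early exit: most windows are rejected by the first stage
    return True           # survived every stage: report a detection

# stages = [(fast_stage, t1), (medium_stage, t2), (slow_stage, t3)]
```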
Cascade Example
• Stage 1: apply a classifier with very simple (and fast) features; eliminates most of the image
• Stage 2: apply a classifier with more complex features on what is left
• Stage 3: apply a classifier with still more complex features on what is left

Example from Henry Schneiderman
Cascade Example:
[Plot: ROC curves (detection rate vs. false detections) for cascade stages 1-4.]
Examples
Template classification: Combination of simple classifiers (boosting)
Using Weak Features
• Don't try to design strong features from the beginning; just use really stupid but really fast features (and a lot of them)
• Weak learner = very fast (but very inaccurate) classifier
• Example: multiply the input window by a very simple box operator, take the absolute value of the response x, and threshold it: |x| > θ? (see the sketch below)

(Example from Paul Viola; distributed by Intel as part of the OpenCV library)
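A sketch of one such box-feature weak classifier, computed with an integral image so that each box sum costs four lookups (the two-rectangle layout and the threshold are illustrative):

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero row/column prepended."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] in four lookups."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

def weak_classifier(window, theta=5.0):
    """Two-rectangle feature: |sum(top half) - sum(bottom half)| > theta."""
    ii = integral_image(window)
    h, w = window.shape
    top = box_sum(ii, 0, 0, h // 2, w)
    bottom = box_sum(ii, h // 2, 0, h, w)
    return abs(top - bottom) > theta
```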
Feature Selection
• The operators are defined over all possible shapes and positions within the window
• For a 24x24 window: 45,396 combinations!!
• How to select the "useful" features?
• How to combine them into classifiers?

(Example from Paul Viola)
• Input: training examples {x_i} with labels {y_i} ("face" or "non-face" = +/-1), plus weights w_i (initially w_i = 1)
• Repeat T times:
  – Choose the feature (weak classifier h_t) with minimum weighted error:
      e_t = Σ_i w_i |h_t(x_i) - y_i|
  – Update the weights such that w_i is increased if x_i is misclassified and decreased if x_i is correctly classified
  – Compute a weight a_t for classifier h_t, with a_t large if e_t is small
• Final classifier:
      H(x) = sgn( Σ_t a_t h_t(x) )
Notes on the loop above:
• The training examples that are not correctly classified contribute more to the error through their higher weights
• Weak classifiers that yield good classification performance receive higher weights a_t
• This is a general description of a boosting algorithm; well-defined rules for updating w and for computing a guarantee convergence and "good" classification performance (see the sketch below)

[Figure: the 1st and 2nd features selected on face windows.] The automatic selection process selects "natural" features.

(Example from Paul Viola)
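A sketch of the loop with the classical discrete-AdaBoost update rules filled in (the slide leaves the exact rules unspecified; these are standard choices):

```python
import numpy as np

def adaboost_train(X, y, weak_learners, T):
    """X: (n, d) feature responses; y: labels in {-1, +1};
    weak_learners: list of functions mapping X -> predictions in {-1, +1}."""
    n = len(y)
    w = np.ones(n) / n                # example weights, initially uniform
    chosen = []
    for _ in range(T):
        # Choose the weak classifier with minimum weighted error.
        errors = [np.sum(w * (h(X) != y)) for h in weak_learners]
        t = int(np.argmin(errors))
        e_t = max(errors[t], 1e-12)
        a_t = 0.5 * np.log((1 - e_t) / e_t)  # a_t is large when e_t is small
        # Raise the weights of misclassified examples, lower the others.
        w *= np.exp(-a_t * y * weak_learners[t](X))
        w /= w.sum()
        chosen.append((a_t, weak_learners[t]))
    return chosen

def adaboost_predict(X, chosen):
    """Final classifier: H(x) = sgn(sum_t a_t h_t(x))."""
    return np.sign(sum(a * h(X) for a, h in chosen))
```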
Using a Cascade (Again)
• Same problem as before: it is too hard (or impossible) to build a single accurate classifier
• Key reason: an image containing one face may have 10^5 possible window locations but only 1 "correct" location; this is rare-event detection, and a single classifier would require an enormous number of features
• Solution: use a cascade of classifiers; each classifier eliminates more of the non-object locations while retaining the "object" locations

[Plot: ROC curves (detection rate vs. false detections) comparing a cascade of ten 20-feature classifiers with a single 200-feature classifier.]
• In this example, the cascade and the single classifier have similar accuracy, but the cascade is 10 times faster
• Training: ~5,000 face images + ~10,000 non-face windows
• Run-time: apply the cascade of classifiers over all positions and scales, and return the positions/scales that survive the sequence of classifiers

[Plot: ROC curve (detection rate vs. false positive rate); 75,081,800 windows scanned over 130 images (507 faces).]

(Example from Paul Viola)
Window-based techniques: The short story

[Diagram: training images of objects 1..m and a test image mapped into a common feature space.]

What representation should we use?
Neural Networks:
• Train a network f to approximate a target function g, f(x) ~ g(x) (e.g., g(x) = p(o|x))
• Example: f(x) = orientation of the face template, sampled at 10-degree intervals
• Example: f(x) = face detection posterior

Example from Rowley et al.
Convolutional Networks: LeCun et al.
http://yann.lecun.com/exdb/lenet/
Template classification: SVMs
Example: Pedestrian detection
• Compute the gradients of the input image in a detection window
• Compute histograms of the gradients over a small number of directions in overlapping blocks (9 directions in the configuration below)
• Combine all the histograms into a single feature vector f
• Apply a classifier (a linear SVM trained off-line on a large training set) to f

N. Dalal and B. Triggs. Histograms of Oriented Gradients for Human Detection. CVPR, 2005.
HOG Descriptor = Histogram of Oriented Gradients
1. Scan the input image with a detection window
2. Resize the window to 64x128
3. Divide it into overlapping 16x16 blocks
4. In each block, compute a histogram of the gradient over 9 directions
5. Final feature vector: f = 7x15x9 = 945 values
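A sketch of the descriptor-plus-classifier pipeline using scikit-image's HOG and scikit-learn's linear SVM (this standard configuration keeps four 9-bin cell histograms per 16x16 block, giving 7x15x36 = 3780 values rather than the simplified 945 above; the C value is illustrative):

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_descriptor(window):
    """window: 128x64 grayscale array. 8x8 cells, 2x2 cells per block,
    8-pixel block stride, 9 orientation bins -> 3780-D feature vector."""
    return hog(window, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm="L2-Hys")

# Training (offline): pedestrian windows vs. non-pedestrian windows.
# X = np.stack([hog_descriptor(w) for w in windows]); y = labels in {0, 1}
svm = LinearSVC(C=0.01)
# svm.fit(X, y)

# Run-time: the window is classified as pedestrian if w.f + b > 0.
# score = svm.decision_function(hog_descriptor(test_window)[None, :])
```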
HOG + Linear SVM

[Diagram: in HOG space, the SVM weight vector W learned from training data separates pedestrian windows (W.f > 0) from non-pedestrian windows (W.f < 0).]
Histogram of Oriented Gradients (HOG)
• f = vector of gradient histograms
• The input detection window is classified as pedestrian if w.f > 0

[Figure: an example detection window; the average gradients over the training data; the positive components of w (pedestrian class); the negative components of w (non-pedestrian class).]

N. Dalal and B. Triggs. Histograms of Oriented Gradients for Human Detection. CVPR, 2005.
[Figure: the same example, with the HOG descriptor of the window shown next to the positive and negative components of W.]
Finding the actual detections
• The classifier produces a detection confidence image; we need to find its local maxima and suppress the non-local maxima (non-maximum suppression) to obtain the final detections on the input image (see the sketch below)
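A brute-force non-maximum suppression sketch over the confidence image (the neighborhood radius and threshold are illustrative; real systems also merge detections across scales):

```python
import numpy as np

def non_max_suppression(conf, radius=8, threshold=0.5):
    """Keep only pixels that are above threshold and are the maximum
    within their (2*radius+1)^2 neighborhood."""
    peaks = []
    h, w = conf.shape
    for y in range(h):
        for x in range(w):
            if conf[y, x] < threshold:
                continue
            y0, y1 = max(0, y - radius), min(h, y + radius + 1)
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            if conf[y, x] >= conf[y0:y1, x0:x1].max():
                peaks.append((x, y, conf[y, x]))  # a surviving detection
    return peaks
```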
Examples
Other objects

Advanced: How to deal with geometric deformations?
• Combine a coarse whole-object model with higher-resolution part models and a model of the part locations
• A Discriminatively Trained, Multiscale, Deformable Part Model, with Pedro Felzenszwalb and Deva Ramanan, 2008-2010