
Tutorial on Submodularity in Machine Learning and Computer Vision

Stefanie Jegelka
Andreas Krause


Semantic Segmentation

How can we map pixels to objects?

[Figure: object segmentation with higher-order potentials, from Int J Comput Vis (2009) 82: 302–324]

2


Document Summarization

How can we select representative sentences?

3


Network Inference

How can we learn who influences whom?

4


What's common?

! All of these problems can be formalized as optimizing a set function F(S) under constraints
! Generally very hard
! Structure helps: we'll see that if F(S) is submodular,
  ! we can solve maximization and minimization problems with strong guarantees
  ! we can solve learning problems involving submodular functions
! You'll learn about theory and applications

5


Outline

! What is submodularity?
! Properties of submodular functions
! Optimization
  ! Minimization
  ! Maximization
! Learning
! Learning for Optimization

6


Running example: Sensor placement

Want to place sensors to monitor temperature

[Figure: office floorplan with candidate sensor locations]

7


Set functions

! Finite set V = {1,2,…,n}
! Set function F : 2^V → ℝ
! Will always assume F({}) = 0 (w.l.o.g.)
! Assume a black box that can evaluate F for any input A
! Approximate (noisy) evaluation of F is ok

[Figure: office floorplan]

8


Example: Sensor placement

Utility F(A) of having sensors at subset A of all locations

[Figure: two floorplans with sensors X_1,…,X_5]
A = {1,2,3}: very informative → high value F(A)
A = {1,4,5}: redundant information → low value F(A)

9


Marginal gains

! Given a set function F : 2^V → ℝ
! Marginal gain of a new sensor s:  F(s | A) = F({s} ∪ A) − F(A)

[Figure: floorplan with existing sensors X_1, X_2 and new sensor X_s]

10


Submodularity: Decreasing marginal gains

Placement A = {1,2}: adding s will help a lot → large improvement
Placement B = {1,…,5}: adding s doesn't help much → small improvement

[Figure: two floorplans, new sensor s added to each placement]

Submodularity: for A ⊆ B,
F(A ∪ {s}) − F(A) ≥ F(B ∪ {s}) − F(B),   i.e.,   Δ(s | A) ≥ Δ(s | B)

11


Alternative characterizations

! A set function F on V is called submodular if
  ∀ A, B ⊆ V:  F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B)

! Equivalent diminishing-returns characterization:
  ∀ A ⊆ B ⊆ V, s ∉ B:
  F(A ∪ {s}) − F(A)  ≥  F(B ∪ {s}) − F(B)
  (large improvement)     (small improvement)
  Δ(s | A) ≥ Δ(s | B)

12
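For small ground sets, the diminishing-returns condition can be verified by brute force. A minimal sketch (illustrative only; the coverage function at the bottom is made up):

```python
from itertools import combinations

def is_submodular(F, V):
    """Brute-force check: F(A+s) - F(A) >= F(B+s) - F(B) for all A <= B, s not in B."""
    subsets = [frozenset(c) for r in range(len(V) + 1)
               for c in combinations(V, r)]
    for B in subsets:
        for A in subsets:
            if not A <= B:
                continue
            for s in V:
                if s in B:
                    continue
                if F(A | {s}) - F(A) < F(B | {s}) - F(B) - 1e-12:
                    return False
    return True

# Example: coverage function F(A) = |union of regions indexed by A|
regions = {1: {"x"}, 2: {"x", "y"}, 3: {"z"}}
F = lambda A: len(set().union(*(regions[i] for i in A))) if A else 0
print(is_submodular(F, {1, 2, 3}))  # True
```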


Questions

How do I prove my problem is submodular?
Why is submodularity useful?

13


Example: Set cover

Want to cover a floorplan with discs
! Place sensors in the building; V = possible locations
! Each node predicts the values of positions within some radius

For A ⊆ V: F(A) = "area covered by sensors placed at A"

Formally:
W a finite set, with a collection of n subsets S_i ⊆ W
For A ⊆ V define F(A) = |∪_{i∈A} S_i|

[Figure: floorplan with coverage discs]

14


Set cover is submodular

[Figure: two floorplans with coverage discs and a new disc S']

A = {S_1, S_2}:            F(A ∪ {S'}) − F(A)
                               ≥
B = {S_1, S_2, S_3, S_4}:  F(B ∪ {S'}) − F(B)

The new disc S' adds more uncovered area when fewer discs are already placed.

15


More complex model for sensing

[Figure: floorplan with sensor values X_s and temperatures Y_s at each location]

Joint probability distribution:
P(X_1,…,X_n, Y_1,…,Y_n) = P(Y_1,…,Y_n) P(X_1,…,X_n | Y_1,…,Y_n)
                               prior            likelihood

Y_s: temperature at location s
X_s: sensor value at location s
X_s = Y_s + noise

16


Example: Sensor placement

Utility of having sensors at subset A of all locations:
F(A) = H(Y) − H(Y | X_A)
       uncertainty about temperature Y before sensing
       minus uncertainty about temperature Y after sensing

[Figure: two floorplans]
A = {1,2,3}: high value F(A)
A = {1,4,5}: low value F(A)

17


Example: Submodularity of information gain

Y_1,…,Y_m, X_1,…,X_n discrete random variables
F(A) = I(Y; X_A) = H(Y) − H(Y | X_A)

! F(A) is NOT always submodular

Theorem [Krause & Guestrin '05]
If the X_i are all conditionally independent given Y,
then F(A) is submodular!

[Figure: graphical model with Y_1, Y_2, Y_3 and children X_1,…,X_4]

18


Proof: Submodularity of information gain

Y_1,…,Y_m, X_1,…,X_n discrete random variables
F(A) = I(Y; X_A) = H(Y) − H(Y | X_A)

Variables X independent given Y, so
F(A ∪ {s}) − F(A) = H(X_s | X_A) − H(X_s | Y)
                                    constant (independent of A)

Nonincreasing in A:
A ⊆ B ⇒ H(X_s | X_A) ≥ H(X_s | X_B)   ("information never hurts")

Δ(s | A) = F(A ∪ {s}) − F(A) is monotonically nonincreasing
⇔ F is submodular ☺

19


Example: costs

Breakfast: cost = time to reach the shop + price of items

[Figure: two shops at travel times t_1 and t_2; each item costs 1 €; ground set V = items]

20


Example: costs

Breakfast: cost = time to shop + price of items

F(items) = cost(shop 1 items) + cost(shop 2 items)
         = t_1 + 1 + t_2 + 2
         = #shops + #items (with unit costs)

Is this submodular?

21


Shared fixed costs

Δ(b | A) = 1 + 1   (a new shop plus a new item)
Δ(b | B) = 1       (only a new item)

Marginal cost = #new shops + #new items; it diminishes, so the cost is submodular!
• Shops: shared fixed cost
• Economies of scale

22


Another example: Cut functions

[Figure: weighted graph on V = {a,b,c,d,e,f,g,h} with edge weights 1, 2 and 3]

F(A) = Σ_{s∈A, t∉A} w_{s,t}

The cut function is submodular!

23


Why are cut functions submodular?

Consider a single edge {a,b} with weight w:

S        F_ab(S)
{}       0
{a}      w
{b}      w
{a,b}    0

Submodular if w ≥ 0!

For the whole graph,
F(S) = Σ_{(i,j)∈E} F_{i,j}(S ∩ {i,j})
is a sum of cut functions on the subgraphs {i,j}  ⇒ submodular!

24
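The per-edge view is exactly what a direct implementation sums. A small sketch (the example graph and weights are made up):

```python
def cut_value(S, edges):
    """F(S) = total weight of edges with exactly one endpoint in S."""
    S = set(S)
    return sum(w for (u, v, w) in edges if (u in S) != (v in S))

edges = [("a", "b", 2.0), ("b", "c", 2.0), ("a", "d", 1.0)]
print(cut_value({"a"}, edges))       # 3.0
print(cut_value({"a", "b"}, edges))  # 3.0
```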


Closedness properties

F_1,…,F_m submodular functions on V and λ_1,…,λ_m > 0
Then F(A) = Σ_i λ_i F_i(A) is submodular

Submodularity is closed under nonnegative linear combinations!

Extremely useful fact:
! F_θ(A) submodular ⇒ Σ_θ P(θ) F_θ(A) submodular!
! Multicriterion optimization: F_1,…,F_m submodular, λ_i > 0 ⇒ Σ_i λ_i F_i(A) submodular
! A basic proof technique! ☺

25


Other closedness properties

! Restriction: F(S) submodular on V, W a subset of V
  Then F'(S) = F(S ∩ W) is submodular

! Conditioning: F(S) submodular on V, W a subset of V
  Then F'(S) = F(S ∪ W) is submodular

! Reflection: F(S) submodular on V
  Then F'(S) = F(V \ S) is submodular

[Figure: Venn diagrams of V, S, W]

28


Convex aspects

! Submodularity as a discrete analogue of convexity
  ! Convex extension
  ! Efficient minimization
  ! Duality
! However, this is only one half of the story...

[Figure: 3D plot of the extension f(x) over (x_a, x_b) ∈ [0,1]²]

29


Concave aspects

! Marginal gain: F(s | A) = F({s} ∪ A) − F(A)
! Submodular: ∀ A ⊆ B, s ∉ B:  F(A ∪ {s}) − F(A) ≥ F(B ∪ {s}) − F(B)
! Concave: ∀ a ≤ b, s ≥ 0:  f(a + s) − f(a) ≥ f(b + s) − f(b)

[Figure: F(A) "intuitively" grows like a concave function of |A|]

30


Submodularity and Concavity

Suppose g: ℕ → ℝ and F(A) = g(|A|)
Then F(A) is submodular if and only if g is concave.

[Figure: concave g(|A|) as a function of |A|]

31


Maximum of submodular functions

Suppose F_1(A) and F_2(A) are submodular.
Is F(A) = max(F_1(A), F_2(A)) submodular?

[Figure: F_1, F_2 and their pointwise maximum as functions of |A|]

max(F_1, F_2) is not submodular in general!

32


Minimum of submodular functions

Well, maybe F(A) = min(F_1(A), F_2(A)) instead?

A       F_1(A)  F_2(A)  F(A)
{}      0       0       0
{a}     1       0       0
{b}     0       1       0
{a,b}   1       1       1

F({b}) − F({}) = 0   <   F({a,b}) − F({a}) = 1

min(F_1, F_2) is not submodular in general!

33


Two faces of submodular functions

Convex aspects ⇒ minimization!
Concave aspects ⇒ maximization!

34


What to do with submodular functions

Optimization: Minimization | Maximization | Online/adaptive optimization
Learning

35


Optimization

Optimization: Minimization | Maximization | Online/adaptive optimization
Learning

Minimization and maximization are not the same??

36


Submodular Minimization

Optimization: Minimization | Maximization | Online/adaptive optimization
Learning

37


Submodular minimization

Minimization: unconstrained | constrained

38


Submodular function minimization

min_{S ⊆ V} F(S)

Applications: clustering, MAP inference, structured sparsity, corpus / training-set extraction (e.g., selecting utterances such as "oh, yes!", "yes …", "True." from a corpus)

39


Submodular function minimization

min_{S ⊆ V} F(S)

! Polynomial time?
⇒ submodularity and convexity

40


Set functions and energy functions

Any set function F : 2^V → ℝ with |V| = n can be viewed equivalently
as a function on binary vectors, F : {0,1}^n → ℝ:

A = {a, b} ⊆ {a, b, c, d}   ≙   x = e_A = (1, 1, 0, 0)

41


Submodularity and Convexity

! Extension of F : {0,1}^n → ℝ to the continuous cube:

Lovász extension  f : [0,1]^n → ℝ,   f(x) = max_{y ∈ P_F} x · y   (convex)

• The minimum of F is a minimum of f
⇒ submodular minimization reduces to convex minimization

42


The submodular polyhedron P_F

P_F = {x ∈ ℝ^n : x(A) ≤ F(A) for all A ⊆ V},   where x(A) = Σ_{i∈A} x_i

Example: V = {a, b}

A       F(A)
{}      0
{a}     −1
{b}     2
{a,b}   0

[Figure: P_F in the (x_{a}, x_{b}) plane, bounded by x({a}) ≤ F({a}), x({b}) ≤ F({b}), x({a,b}) ≤ F({a,b})]

43


Evaluating the Lovász extension

Linear maximization over P_F:   f(x) = max_{y ∈ P_F} xᵀy

Exponentially many constraints! ☹
But computable in O(n log n) time ☺  [Edmonds '70]

• Subgradient
• Separation oracle
• Central for optimization

[Figure: maximizer y* on the boundary of P_F for a given direction x]

44
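Edmonds' greedy algorithm makes this concrete: sort the coordinates of x in decreasing order and charge each element its marginal gain. A minimal sketch, assuming F({}) = 0 as above (the truncation function used to test it is made up):

```python
def lovasz_extension(F, x):
    """Evaluate the Lovász extension f(x) of a submodular F via Edmonds'
    greedy algorithm: O(n log n) sort plus n evaluations of F."""
    order = sorted(x, key=x.get, reverse=True)   # elements by decreasing x-value
    value, prefix, prev_F = 0.0, set(), 0.0
    for e in order:
        prefix.add(e)
        F_cur = F(prefix)
        value += x[e] * (F_cur - prev_F)         # marginal gain of e, weighted by x[e]
        prev_F = F_cur
    return value

# On 0/1 vectors the extension agrees with F itself:
F = lambda S: min(len(S), 2)                     # truncation, submodular
x = {"a": 1.0, "b": 1.0, "c": 0.0}
print(lovasz_extension(F, x))                    # 2.0 = F({a, b})
```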


Example: Lovász extension

A       F(A)
{}      0
{a}     1
{b}     .8
{a,b}   .2

[Figure: 3D surface of f over (x_a, x_b) ∈ [0,1]², interpolating the corner values F({}), F({a}), F({b}), F({a,b})]

45


Submodular function minimization    min_{S ⊆ V} F(S)

! Polynomial time?
! Yes: ellipsoid algorithm [Grötschel, Lovász, Schrijver '81]
! Combinatorial algorithms? Edmonds, Cunningham, …
  Iwata, Fujishige, Fleischer '01; Schrijver '00
  Iwata '03: O(n^4 T + n^5 log M)
  Orlin '09: O(n^6 + n^5 T)
  (T = time to evaluate F)

"Efficient" …?

46


The submodular polyhedron P_F (recap)

P_F = {x ∈ ℝ^n : x(A) ≤ F(A) for all A ⊆ V},   x(A) = Σ_{i∈A} x_i

Example: V = {a,b} with F({}) = 0, F({a}) = −1, F({b}) = 2, F({a,b}) = 0

47


A more practical alternative? [Fujishige '91, Fujishige et al. '11]

Minimum norm point algorithm
1. Find x* = arg min_{x ∈ B_F} ‖x‖²
   (B_F = base polytope: P_F with x({a,b}) = F({a,b}) tight; here x* = [−1, 1])
2. A* = {i : x*(i) < 0}

Theorem [Fujishige '91]: A* is an optimal solution!

Runtime is finite, but the worst-case complexity is open.

A       F(A)
{}      0
{a}     −1
{b}     2
{a,b}   0

48


Empirical comparison [Fujishige et al. '06]

[Figure: running time (seconds, log scale) vs. problem size 64–1024 (log scale), on cut functions from the DIMACS challenge; lower is better]

Minimum norm algorithm: usually O(n^3) to O(n^4) empirically
⇒ orders of magnitude faster

49


Image segmentation

p(x | y) ∝ exp(−E(x; y))

E(x; y) = Σ_i E_i(x_i) + Σ_{ij} E_{ij}(x_i, x_j)
          color, texture, …   smoothness

MAP inference = energy minimization

50


Set functions and energy functions

Any set function F : 2^V → ℝ with |V| = n can be viewed equivalently
as a function on binary vectors, F : {0,1}^n → ℝ.

Conversely: a function on binary variables is a set function!

A = {a, b}   ≙   x = e_A = (1, 1, 0, 0)

51


Set functions & probabilistic models

! Maximum a posteriori: given observations y, infer binary labels x

maximize p(x | y) ∝ exp(−E(x; y))
⇔ min_{x ∈ {0,1}^n} E(x; y)
⇔ minimize F(A) := E(e_A; y)

If F is submodular: polynomial time ☺

[Figure: binary labeling of pixels; A = set of pixels labeled 1]

52


Example: Sparsity

[Figure: image pixels → few large wavelet coefficients (blue = 0); wideband signal samples (time/frequency) → few large Gabor (TF) coefficients]

Many natural signals are sparse in a suitable basis.
Can exploit this for learning / regularization / compressive sensing...


Sparse reconstruction

min_x ‖y − Mx‖² + Ω(x)

• Explain y with few columns of M, i.e., few nonzero x_i
• Discrete regularization on the support S of x:  Ω(x) = ‖x‖_0 = |S|
  (subset selection, e.g., S = {1,3,4,7})
• Relax to the convex envelope:  Ω(x) = ‖x‖_1

In nature, the sparsity pattern is often not random…

54


Structured sparsity

Incorporate a tree preference into the regularizer?

[Figure: coefficients m_1,…,m_7 arranged in a tree; support of x]

Set function: F(T)
Function F is … submodular! ☺


Sparsity  [Bach '10]

min_x ‖y − Mx‖² + Ω(x)

• Explain y with few columns of M: few nonzero x_i
• Discrete regularization on the support S of x:  Ω(x) = ‖x‖_0 = |S|;  relax to the convex envelope Ω(x) = ‖x‖_1
• Prior knowledge about patterns of nonzeros: submodular function Ω(x) = F(S);
  relax to the convex envelope ⇒ Lovász extension Ω(x) = f(|x|)
• Optimization: submodular minimization

58


Further connections: Dictionary Selection

min_x ‖y − Mx‖² + Ω(x)

Where does the dictionary M come from?
Want to learn it from data {y_1,…,y_n} ⊆ ℝ^d

Selecting a dictionary with near-maximal variance reduction
⇔ maximization of an approximately submodular function
[Krause & Cevher '10; Das & Kempe '11]

59


More applications ...

clustering, MAP inference, corpus / training-set extraction, …

60


Special cases

Minimizing general submodular functions: polynomial time, but not very scalable
Special structure ⇒ faster algorithms
! Symmetric functions
! Graph cuts
! Concave functions
! Sums of functions with bounded support
! ...

61


Graph cuts

Cut(S) = Σ_{u∈S, v∉S} w(u, v)

Given F, can we build a graph so that F(S) = Cut(S)?

[Figure: s-t graph with source side S]

Solving a min-(s,t)-cut is efficient:
general case O(mn) [Orlin '12]; special cases linear time

62


MAP inference as graph cut

E(x; y) = Σ_i E_i(x_i) + Σ_{ij} E_{ij}(x_i, x_j)

minimize:  min_A F(A) ≡ min_{x ∈ {0,1}^n} E(x; y),   with Cut(A) = E(e_A; y)

[Figure: pixel grid connected to terminals s and t; A = pixels on the s side]

Minimum energy = minimum cut! O(n)  [Boykov & Kolmogorov '04]

63


Which functions can be minimized as cuts?

! Functions of order two:
E(x) = Σ_i E_i(x_i) + Σ_{ij} E_{ij}(x_i, x_j) = Σ_i α_i x_i + Σ_{ij} β_{ij} x_i x_j

Need regularity:
E_{ij}(0,0) + E_{ij}(1,1) ≤ E_{ij}(1,0) + E_{ij}(0,1),   equivalently β_{ij} ≤ 0

⇒ Each pairwise term must be submodular
[Queyranne; Picard & Ratliff; ...]

64
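The regularity condition is easy to test mechanically for a given pairwise table. A tiny sketch (the example potential is made up):

```python
def is_regular(E_ij):
    """Pairwise term E_ij as a dict {(x_i, x_j): value} for x in {0,1}.
    Regular (submodular) iff E(0,0) + E(1,1) <= E(0,1) + E(1,0)."""
    return E_ij[(0, 0)] + E_ij[(1, 1)] <= E_ij[(0, 1)] + E_ij[(1, 0)]

# Ising-style smoothness term that penalizes disagreement:
smooth = {(0, 0): 0.0, (1, 1): 0.0, (0, 1): 1.0, (1, 0): 1.0}
print(is_regular(smooth))  # True: minimizable exactly via graph cut
```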


But …

E(x) = Σ_i E_i(x_i) + Σ_{ij} E_{ij}(x_i, x_j) + Σ_c E_c(x_c)
       unary preferences (texture) + pairwise smoothness potentials + higher-order P^n potentials (pixels in one tile/segment)  [Kohli et al. '09]

[Figure: MSRC-21 example (sky, building, grass, tree), from Int J Comput Vis (2009) 82: 302–324: augmenting the pairwise CRF with higher-order potentials defined on unsupervised segments clearly improves the segmentation, e.g., the branches of the tree]

66


Enforcing label consistency

Pixels in a superpixel should have the same label.

The higher-order term in E(x) is a concave function of a cardinality (how many pixels deviate) ⇒ submodular ☺

With more than 2 arguments: graph cut ??

[Figure: higher-order potentials for object segmentation, Int J Comput Vis (2009) 82: 302–324]

67


Higher-order functions as graph cuts?

Σ_i E_i(x_i) + Σ_{ij} E_{ij}(x_i, x_j) + Σ_c E_c(x_c)

General strategy: reduce to the pairwise case by adding auxiliary variables
! Works well for some particular potentials
  [Billionet & Minoux '85, Freedman & Drineas '05, Živný & Jeavons '10, ...]
! But: the necessary conditions are complex, not all submodular functions equal such graph cuts [Živný et al. '09],
  and possibly many extra nodes are added to the graph

68


Fast approximate minimization

! Not all submodular functions are graph cuts
! Avoid adding too many extra nodes
! Parametric maxflow [Fujishige & Iwata '99]
! Approximate by a series of graph cuts [Jegelka, Lin & Bilmes '11]

[Figure: running time (s, log scale) vs. n: minimum norm point algorithm ≈ O(n^4) vs. iterative approximate algorithm with parametric maxflow ≈ O(n^2)]

69


Other special cases

! Symmetric: F(S) = F(V \ S)
  Queyranne's algorithm: O(n^3) [Queyranne 1998]
! Concave of modular: F(S) = Σ_i g_i(Σ_{s∈S} w_i(s))
  [Kolmogorov '12, Stobbe & Krause '10, Kohli et al. '09]
! Sum of submodular functions, each with bounded support [Kolmogorov '12]

70


Submodular minimization

Minimization: unconstrained | constrained

71


Submodular Minimization

! Polynomial time
! Empirically better: minimum-norm point algorithm
! Provably better: special cases (graph cuts, concave, …)

What if we have constraints?
Hint: graph cuts are submodular functions…

Limited cases are doable:
• odd/even cardinality
• inclusion/exclusion of a set
• ring family
• …
General case: NP-hard, with polynomial lower bounds

72


Limitations of graph cuts

E(x; y) = Σ_i E_i(x_i) + Σ_{ij} E_{ij}(x_i, x_j)
          data fit        smoothness prior: cut in the grid

[Figure: graph cut segmentation with a shrunken boundary]

• The true boundary is not "short"
• Local information is scarce

Instead: prefer a congruous boundary ⇒ look at the entire cut at once

73


Graph cut as edge selection

[Figure: s-t grid graph; a cut C is a set of edges]

Standard: minimize the sum of weights  F(C) = Σ_{e∈C} w(e)  s.t. C is a cut

Now: prefer congruous cuts, i.e.,
minimize a submodular function F(C) s.t. C is a cut (not a sum of weights!)

74


Rewarding co-occurrence of edges

Sum of weights: use few edges (one group with 13 edges vs. many groups with 6 edges)

Submodular cost function: use few groups S_i of edges
F(C) = Σ_i F_i(C ∩ S_i)

[Figure: concave group cost as a function of |A|]

Optimization? An efficient iterative algorithm.

75


Results

[Figure: graph cut solution vs. cooperative cut solution on segmentation examples]

76


SFM & Combinatorial Constraints

[Figure: combinatorial structures in a graph: cut, matching, path, spanning tree]

… with minimum cost:
classic:  min_{S ∈ C} Σ_{e∈S} w(e)
now:      min_{S ∈ C} F(S) for submodular F

General case: very hard. [Goel et al. '09, Iwata & Nagano '09, Jegelka & Bilmes '11, ...]
Approximations: relaxation via the Lovász extension, or approximate F

77


Submodular Minimization

! Convex Lovász extension
! (High-order) polynomial time
! Empirically better: minimum-norm point algorithm
! Provably better: special cases
! Constraints: usually hard problems ⇒ approximations
! Applications: MAP inference, combinatorial regularizers, clustering, …

78


Optimization

Optimization: Minimization | Maximization | Online/adaptive optimization
Learning

79


Optimization

Maximization: unconstrained | constrained

80


Submodular function maximization

max_{S ⊆ V} F(S)

Applications: feature selection, covering, summarization, diversity, sensing

81


Two faces of submodular functions

Convex aspects ⇒ minimization!
Concave aspects ⇒ maximization!

82


Reminder: Diminishing returns

[Figure: two floorplans with coverage discs and a new disc S']

A = {S_1, S_2}:            F(A ∪ {S'}) − F(A)
                               ≥
B = {S_1, S_2, S_3, S_4}:  F(B ∪ {S'}) − F(B)

F(A) "intuitively" grows like a concave function of |A|

83


Maximizing submodular functions

Minimizing convex functions: polynomial-time solvable!
Maximizing convex functions: NP-hard!

Minimizing submodular functions: polynomial-time solvable!
Maximizing submodular functions: NP-hard!
But we can get approximation guarantees ☺

84


Maximizing submodular functions

! Suppose we want A* = argmax_{A ⊆ V} F(A) for submodular F
! Example: F(A) = U(A) − C(A), where U(A) is a submodular utility and C(A) a supermodular cost function
! In general: NP-hard. Moreover:
! If F(A) can take negative values:
  as hard to approximate as maximum independent set
  (i.e., NP-hard to get an O(n^{1−ε}) approximation)

[Figure: F(A) vs. |A| with an interior maximum]

85


Maximizing positive submodular functions
[Feige, Mirrokni, Vondrák '09; Buchbinder, Feldman, Naor, Schwartz '12]

Theorem
There is an efficient algorithm that, given a nonnegative submodular function F with F({}) = 0, returns a set A_LS such that
F(A_LS) ≥ 1/2 · max_A F(A)

! Picking a random set gives a 1/4 approximation (1/2 if F is symmetric!)
! We cannot get better than a 1/2 approximation unless P = NP

86


Optimization

Maximization: unconstrained | constrained

87


Scalarization vs. constrained maximization

Given a monotonic utility F(A) and a cost C(A), optimize:

Option 1 ("Scalarization"):          Option 2 ("Constrained maximization"):
max_{A ⊆ V} F(A) − C(A)              max_A F(A)  s.t.  C(A) ≤ B

Can get a 1/2 approximation          What is possible?
if F(A) − C(A) ≥ 0 for all sets A.
Positivity is a strong requirement ☹

88


Monotonicity

[Figure: placements A = {1,2} and B = {1,…,5} on the floorplan]

F is monotonic: adding sensors can only help
∀ A, s:  F(A ∪ {s}) − F(A) ≥ 0,   i.e.,   Δ(s | A) ≥ 0


Constrained maximization

! Given: finite set V of locations
! Want: A* ⊆ V such that A* = argmax_{|A| ≤ k} F(A)

Typically NP-hard!

[Figure: floorplan with selected sensor locations]

90


Exact maximization of SFs

! Mixed integer programming
  ! Series of mixed integer programs [Nemhauser et al. '81]
  ! Constraint generation [Kawahara et al. '09]
! Branch-and-bound: "Data-Correcting Algorithm" [Goldengorin et al. '99]

All algorithms are worst-case exponential!

91


Greedy algorithm

! Given: finite set V of locations
! Want: A* maximizing F(A) over sets of size at most k.  Typically NP-hard!

Greedy algorithm:
  Start with A = {}
  For i = 1 to k:
    s* ← argmax_s F(A ∪ {s})
    A ← A ∪ {s*}

How well can this simple heuristic do?

[Figure: floorplan with sensors]

92
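A minimal sketch of the greedy rule above, using a made-up coverage utility as a stand-in for F (illustrative only):

```python
def greedy_maximize(F, V, k):
    """Greedy maximization of a monotone submodular F under |A| <= k."""
    A = set()
    for _ in range(k):
        gains = {s: F(A | {s}) - F(A) for s in V - A}
        best = max(gains, key=gains.get)
        if gains[best] <= 0:           # no element helps any more
            break
        A.add(best)
    return A

# Toy coverage utility: F(A) = size of the union of covered regions
regions = {1: {1, 2, 3}, 2: {3, 4}, 3: {4, 5, 6, 7}, 4: {1, 2}}
F = lambda A: len(set().union(*(regions[i] for i in A))) if A else 0
print(greedy_maximize(F, set(regions), k=2))   # {1, 3}: covers 7 elements
```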


Performance of greedy

[Figure: information gain of optimal vs. greedy placements, on temperature data from a 54-node sensor network deployment]

Greedy is empirically close to optimal. Why?

93


One reason submodularity is useful

Theorem [Nemhauser, Fisher & Wolsey '78]
For monotonic submodular functions, the greedy algorithm gives a constant-factor approximation:
F(A_greedy) ≥ (1 − 1/e) F(A_opt)   (≈ 63%)

! The greedy algorithm gives a near-optimal solution!
! For information gain: this guarantee is the best possible unless P = NP [Krause & Guestrin '05]

94


Scaling up the greedy algorithm [Minoux '78]

In round i+1:
! have picked A_i = {s_1,…,s_i}
! pick s_{i+1} = argmax_s F(A_i ∪ {s}) − F(A_i),
  i.e., maximize the "marginal benefit" Δ(s | A_i) = F(A_i ∪ {s}) − F(A_i)

Key observation: submodularity implies
i ≤ j  ⇒  Δ(s | A_i) ≥ Δ(s | A_j)

Marginal benefits can never increase!  Δ(s | A_i) ≥ Δ(s | A_{i+1})

95


"Lazy" greedy algorithm [Minoux '78]

Lazy greedy algorithm:
§ First iteration as usual
§ Keep an ordered list of marginal benefits Δ_i from the previous iteration
§ Re-evaluate Δ_i only for the top element
§ If Δ_i stays on top, use it; otherwise re-sort

[Figure: priority list of elements a–e ordered by benefit Δ(s | A)]

Note: very easy to compute online bounds, lazy evaluations, etc.
[Leskovec, Krause et al. '07]

96
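A compact way to implement the lazy rule is a max-heap of stale gains that are only re-checked for the current top element. A sketch under the same assumptions as the greedy example above (F monotone submodular, cardinality budget k):

```python
import heapq

def lazy_greedy(F, V, k):
    """Lazy greedy [Minoux '78]: re-evaluate only the top (possibly stale) gain."""
    A, f_A = set(), 0.0
    heap = [(-(F({s}) - F(set())), s) for s in V]    # (negated gain, element)
    heapq.heapify(heap)
    while len(A) < k and heap:
        _, s = heapq.heappop(heap)
        fresh = F(A | {s}) - f_A                     # re-evaluate for current A
        if not heap or fresh >= -heap[0][0]:         # still on top: take it
            if fresh <= 0:
                break
            A.add(s)
            f_A += fresh
        else:                                        # otherwise push back the updated gain
            heapq.heappush(heap, (-fresh, s))
    return A
```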


Empirical improvements [Leskovec, Krause et al. '06]

[Figure, left: sensor placement, running time (minutes) vs. number of sensors selected (1–10): exhaustive search (all subsets) ≫ naive greedy ≫ fast (lazy) greedy; ~30x speedup]
[Figure, right: blog selection, running time (seconds) vs. number of blogs selected (1–10): exhaustive search ≫ naive greedy ≫ fast greedy; ~700x speedup]

Lower is better.

97


Object detection [Barinova et al. '10]

[Figure: Hough-transform detection. Line detection task / classic Hough transform; pedestrian detection task / Hough forest [5] transform. Voting elements cast votes x_j (x_j = index of the hypothesis explaining element j) for hypotheses y_i (y_i = 1: object i present; y_i = 0: not present). With multiple close objects, identifying peaks in the Hough image is a highly non-trivial problem.]

Illustrations courtesy of Pushmeet Kohli

98


Object detection

Joint MAP inference over hypotheses:
F(S) = Σ_i max_{j∈S} w_{i,j}
where w_{i,j} is the weight of voting element i w.r.t. hypothesis j

⇒ submodular maximization ☺

[Figure: voting elements assigned to hypotheses; x_j = index of the hypothesis explaining element j, y_i = 1 if object i is present]

Illustrations courtesy of Pushmeet Kohli

99
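The objective above has facility-location form, a standard monotone submodular function. A tiny sketch with made-up weights:

```python
import numpy as np

def detection_utility(S, W):
    """F(S) = sum_i max_{j in S} W[i, j]; facility-location form."""
    if not S:
        return 0.0
    return float(W[:, sorted(S)].max(axis=1).sum())

W = np.array([[0.9, 0.1],    # voting element 0 supports hypothesis 0
              [0.2, 0.8],    # element 1 supports hypothesis 1
              [0.5, 0.4]])
print(detection_utility({0}, W), detection_utility({0, 1}, W))  # 1.6 2.2
```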


Inference

Datasets from [Andriluka et al. CVPR 2008] (with strongly occluded pedestrians added)
Using the Hough forest trained in [Gall & Lempitsky CVPR '09]

Illustrations courtesy of Pushmeet Kohli


Results for pedestrian detection

[Figure: precision-recall curves on TUD-campus and TUD-crossing.
Blue = Hough transform + non-maximum suppression; light blue = greedy detection]

Submodularity for detection also in [Blaschko '11]


Network inference

How can we learn who influences whom?

102


Inferring diffusion networks
[Gomez Rodriguez, Leskovec, Krause ACM TKDE 2012]

[Figure: given cascade traces over nodes 1–5, infer the directed network]

Given traces of influence, we wish to infer a sparse directed network G = (V, E)
⇒ formulate as an optimization problem:
E* = argmax_{|E| ≤ k} F(E)

103


Estimation problem

! Many influence trees T are consistent with the data
! For cascade C_i, model P(C_i | T)
! Find a sparse graph that maximizes the likelihood of all observed cascades:

F(E) = Σ_i log max_{tree T ⊆ E} P(C_i | T)

⇒ The log-likelihood is monotonic submodular in the selected edges

104


Evaluation: Synthetic networks

[Figure: results for a 1024-node hierarchical Kronecker network with an exponential transmission model, and a 1000-node Forest Fire network (α = 1.1) with a power-law transmission model]

! Performance does not depend on the network structure:
  ! Synthetic networks: Forest Fire, Kronecker, etc.
  ! Transmission time distributions: exponential, power law
! Break-even point of > 90%

105


Diffusion Network
[Gomez Rodriguez, Leskovec, Krause ACM TKDE 2012]

Actual network inferred from 172 million articles from 1 million news sources

[Figure: inferred diffusion network over blogs and mainstream media]

106


Diffusion Network (small part)

[Figure: close-up of the inferred network; blogs and mainstream media]

107


Document summarization [Lin & Bilmes '11]

! Which sentences should we select that best summarize a document?

108


Marginal gain of a sentence

! Many natural notions of "document coverage" are submodular [Lin & Bilmes '11]

109


Document summarization

F(S) = R(S) + λ D(S)
       relevance   diversity

110


Relevance of a summary

F(S) = R(S) + λ D(S)

R(S) = Σ_i min{C_i(S), α C_i(V)}     how well sentence i is "covered" by S
C_i(S) = Σ_{j∈S} w_{i,j}             w_{i,j} = similarity between sentences i and j

111


Diversity of a summary

D(S) = Σ_{k=1}^{K} sqrt( Σ_{j ∈ P_k ∩ S} r_j )

r_j = (1/N) Σ_i w_{i,j}     relevance of sentence j to the document
P_1,…,P_K                   a clustering of the sentences in the document

Can be made query-specific, multi-resolution, etc.

112
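Putting the two pieces together, a literal transcription of R(S) + λ D(S); the similarity matrix, cluster assignment, α and λ are all placeholders, not values from the paper:

```python
import math

def summary_score(S, w, clusters, alpha=0.5, lam=1.0):
    """F(S) = R(S) + lambda * D(S) as defined above.
    w[i][j]: similarity of sentences i, j; clusters: list of sentence-index sets."""
    n = len(w)
    C = lambda i, T: sum(w[i][j] for j in T)                 # coverage of i by T
    R = sum(min(C(i, S), alpha * C(i, range(n))) for i in range(n))
    r = [sum(w[i][j] for i in range(n)) / n for j in range(n)]
    D = sum(math.sqrt(sum(r[j] for j in P & S)) for P in clusters)
    return R + lam * D
```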


Empirical results [Lin & Bilmes '11]

Best F1 score on the benchmark corpus DUC-07!

113


Submodular Sensing Problems
[with Guestrin, Leskovec, Singh, Sukhatme, …]

! Environmental monitoring [JAIR '08, ICRA '10]
! Experiment design [NIPS '10, '11]
! Water distribution networks [J WRPM '08 Best Paper]; top score in the BWSN competition
! Recommending blogs & news [KDD '07, '10 Best Paper]

All can be reduced to monotonic submodular maximization

114


More complex constraints

! So far: cardinality constraints
! Can one handle more complex constraints?

115


Example: Camera network

Ground set V = {1a, 1b, …, 5a, 5b}
Configuration: a subset of V, with a sensing quality model

A configuration is feasible if no camera is pointed in two directions at once

[Figure: cameras 1–5, each with two possible directions a and b]

116


Matroids

! Abstract notion of feasibility: independence

S is independent if …
  … |S| ≤ k                                            (uniform matroid)
  … S contains at most one element from each group     (partition matroid)
  … S contains no cycles                               (graphic matroid)

• S independent ⇒ any T ⊆ S is also independent
• Exchange property: S, U independent, |S| > |U|
  ⇒ some e ∈ S can be added to U: U ∪ {e} is independent
• All maximal independent sets have the same size

117




Example: Camera network

Ground set V = {1a, 1b, …, 5a, 5b}; a configuration is feasible if no camera points in two directions at once

This is a partition matroid:
P_1 = {1a, 1b}, …, P_5 = {5a, 5b}
Independence: |S ∩ P_i| ≤ 1

[Figure: cameras with two directions each]

119


Greedy algorithm for matroids

! Given: finite set V
! Want: A* = argmax_{A independent} F(A)

Greedy algorithm:
  Start with A = {}
  While ∃ s such that A ∪ {s} is independent:
    s* ← argmax_{s : A ∪ {s} independent} F(A ∪ {s})
    A ← A ∪ {s*}

120
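A sketch of the same greedy loop with feasibility restricted to a partition matroid (one element per part, matching the camera example; the utility and all names are illustrative):

```python
def greedy_partition_matroid(F, parts, capacity=1):
    """Greedy maximization of F over a partition matroid:
    at most `capacity` elements from each part."""
    A, used = set(), {i: 0 for i in range(len(parts))}
    part_of = {s: i for i, P in enumerate(parts) for s in P}
    while True:
        feasible = [s for s in part_of
                    if s not in A and used[part_of[s]] < capacity]
        if not feasible:
            return A
        gains = {s: F(A | {s}) - F(A) for s in feasible}
        s = max(gains, key=gains.get)
        if gains[s] <= 0:
            return A
        A.add(s)
        used[part_of[s]] += 1

# Cameras 1..3, each choosing direction "a" or "b" (hypothetical utilities):
parts = [{"1a", "1b"}, {"2a", "2b"}, {"3a", "3b"}]
quality = {"1a": 3, "1b": 1, "2a": 2, "2b": 2, "3a": 0, "3b": 4}
F = lambda A: sum(quality[s] for s in A)        # modular, hence submodular
print(greedy_partition_matroid(F, parts))        # picks one direction per camera
```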


Maximization over matroids

Theorem [Nemhauser, Fisher & Wolsey '78]
For monotonic submodular functions, the greedy algorithm gives a constant-factor approximation:
F(A_greedy) ≥ ½ F(A_opt)

! Greedy gives 1/(p+1) over the intersection of p matroids
! Can model rankings with p = 2!
! Can also obtain (1 − 1/e) for arbitrary matroids using the continuous greedy algorithm [Vondrák et al. '08]

121


Non-constant cost functions

! For each s ∈ V, let c(s) > 0 be its cost (e.g., length of a sentence; feature acquisition cost, …)
! Cost of a set: C(A) = Σ_{s∈A} c(s)
! Want to solve A* = argmax F(A)  s.t.  C(A) ≤ B

122


Cost-benefit optimization
[Wolsey '82, Sviridenko '04, Krause et al. '05]

Theorem [Krause and Guestrin '05]
Can efficiently find a set A such that
F(A) ≥ (1 − 1/√e) max_{C(A') ≤ B} F(A')    (≈ 39%)

Can still speed this up using lazy evaluations.

Note: one can also get
! a (1 − 1/e) approximation in time O(n^4) [Sviridenko '04]
! a (1 − 1/e) approximation for multiple linear constraints [Kulik '09]
! a 0.38/k approximation for k matroid and m linear constraints [Chekuri et al. '11]

123
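One common rule behind guarantees of this flavor is the benefit/cost greedy: repeatedly add the element with the largest marginal gain per unit cost that still fits the budget, and return the better of that solution and the single best feasible element. A sketch only; this is not necessarily the exact algorithm the theorem above analyzes:

```python
def cost_benefit_greedy(F, cost, V, budget):
    """Benefit/cost greedy under a knapsack constraint C(A) <= budget,
    returning the better of the greedy set and the best single element."""
    A, spent = set(), 0.0
    while True:
        feasible = [s for s in V - A if spent + cost[s] <= budget]
        if not feasible:
            break
        s = max(feasible, key=lambda s: (F(A | {s}) - F(A)) / cost[s])
        if F(A | {s}) - F(A) <= 0:
            break
        A.add(s)
        spent += cost[s]
    singles = [{s} for s in V if cost[s] <= budget]
    best_single = max(singles, key=F, default=set())
    return A if F(A) >= F(best_single) else best_single
```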


Summary: More complex constraints

! Approximate submodular maximization is possible under a variety of constraints:
  ! Matroid
  ! Knapsack
  ! Multiple matroid and knapsack constraints
  ! Path constraints (submodular orienteering)
  ! Connectedness (submodular Steiner)
  ! Robustness (minimax)
! Often the (best) algorithms are non-greedy

124


Key intuition for approximate maximization

For submodular functions, local maxima can't be too bad.

! E.g., all local maxima under cardinality constraints are within a factor 2 of the global maximum
! Key insight for more complex maximization
⇒ greedy, local search, simulated annealing for (non-monotone, constrained, ...) problems

125


Two faces of submodular functions

Convex aspects ⇒ minimization!   (cuts, clustering, similarity; MAP inference; structured sparsity)
Concave aspects ⇒ maximization!  (coverage, diversity; summarization; sensing)

126


Summary: Optimization

                 Maximization                           Minimization
Unconstrained    NP-hard, but well-approximable         Polynomial time! Generally inefficient
                 (if nonnegative)                       (n^6), but can exploit special cases
                                                        (cuts; symmetry; decomposable; ...)
Constrained      NP-hard but well-approximable;         NP-hard; hard to approximate,
                 "greedy-(like)" for cardinality        still useful algorithms
                 and matroid constraints;
                 non-greedy for more complex
                 (e.g., connectivity) constraints

127


What to do with submodular functions

Optimization: Minimization | Maximization | Online/adaptive optimization
Learning

128


What to do with submodular functions

Optimization: Minimization | Maximization | Online/adaptive optimization
Learning

129


Example 1: Valuation Functions

! For combinatorial auctions, show bidders various subsets of items and observe their bids
  (e.g., one bundle = $3, another = $88, and so on…)

Can we learn a bidder's utility function from few bids?

130


Example 2: Graph Evolution

! Want to track changes in a graph
! Instead of storing the entire graph at each time step, store some measurements
! # of measurements ≪ size of the graph


Random Graph Cut #1

! Choose a random partition of the vertices
! Count the total # of edges across the partition

[Figure: a random partition; cut values 13 and 14]

132


Random Graph Cut #2

! Choose another random partition of the vertices
! Count the total # of edges across the partition

[Figure: another random partition; cut values 13 and 12]

133


Symmetric Graph Cut Function

f(A) = sum of weights of edges between A and V \ A

• V = set of vertices
• One-to-one correspondence between graphs and cut functions

Can we learn a graph from the values of a few cuts?
[E.g., graph sketching, computational biology, …]

134


General Problem: Learning Set Functions

Base set V, set function F

! Wish to learn F from:
  ! a small collection of measurement sets
  ! their function values

135


Approximating submodular functions
[Goemans, Harvey, Kleinberg, Mirrokni '08]

! Pick m sets A_1, …, A_m and observe F(A_1), …, F(A_m)
! From this, want to approximate F by F̂ such that
  F̂(A) ≤ F(A) ≤ α F̂(A)   for all A

Theorem: Even if
! F is monotonic
! we may pick polynomially many A_i, chosen adaptively,
we cannot approximate better than α = n^{1/2} / log(n) unless we look at exponentially many sets A_i.
But α = n^{1/2} log(n) can be obtained efficiently.

136


Learning submodular functions
[Balcan, Harvey STOC '11]

! Sample m sets A_1, …, A_m from a distribution D; observe F(A_1), …, F(A_m)
! From this, want to generalize well
! F̂ is (α, ε, δ)-PMAC iff with probability 1 − δ:
  P_{A∼D}[ F̂(A) ≤ F(A) ≤ α F̂(A) ] ≥ 1 − ε

Theorem: cannot approximate better than α = n^{1/3} / log(n)
unless one looks at exponentially many samples A_i.
But α = n^{1/2} can be obtained efficiently.

137


What if we have structure?

! To learn effectively, we need additional assumptions beyond submodularity
! Sparsity in the Fourier domain [Stobbe & Krause '12]
  ! Sparsity: most coefficients ≈ 0
  ! "Submodular" compressive sensing
  ! Cuts and many other functions are sparse in the Fourier domain!
! Can also learn XOS valuations [Balcan et al. '12]

138


Compressive sensing with set functions
[Stobbe & Krause '12]

Theorem: Suppose the number of random measurements is proportional to the sparsity times log factors. Then we can recover f̂ by solving:

    min ||x||_1   s.t.   f_M = M x

Random Hadamard-Walsh basis functions satisfy RIP w.h.p.
Can also handle noisy observations
Many more details [see paper]

139
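The recovery problem min ||x||_1 s.t. f_M = M x is a standard basis-pursuit linear program. Below is a minimal sketch of that generic LP reformulation using scipy; it is not the authors' implementation, and in the paper the rows of M would be Hadamard-Walsh basis evaluations rather than the arbitrary matrix assumed here.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(M, y):
    """Solve  min ||x||_1  s.t.  M x = y  as a linear program.
    M: (m, n) measurement matrix; y: (m,) measured function values f_M."""
    m, n = M.shape
    # variables z = [x, u] with the constraint |x_i| <= u_i
    c = np.concatenate([np.zeros(n), np.ones(n)])     # minimize sum of u
    A_eq = np.hstack([M, np.zeros((m, n))])           # M x = y
    I = np.eye(n)
    A_ub = np.vstack([np.hstack([I, -I]),             #  x - u <= 0
                      np.hstack([-I, -I])])           # -x - u <= 0
    b_ub = np.zeros(2 * n)
    bounds = [(None, None)] * n + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
                  bounds=bounds, method="highs")
    return res.x[:n]                                   # recovered coefficients
```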


Graph Evolution Results

! Autonomous Systems Graph (from SNAP)
• Tracking evolution of a 128-vertex subgraph using random cuts
• Δ = number of differences between graphs

[Figure: reconstruction error (0 to 1) vs. number of measurements (0 to 1400), for week 1 to 3 (Δ=245), week 3 to 5 (Δ=122), week 5 to 7 (Δ=62), week 7 to 9 (Δ=161)]

! For low error, observing random cuts suffices

140


What to do with submodular functions

Optimization
Minimization
Maximization
Online/adaptive optim.
Learning

141


Learning to optimize

! Have seen how to
! optimize submodular functions
! learn submodular functions

What if we only want to learn enough to optimize?

142


Learning to optimize submodular functions

! Structured prediction with submodular functions
! Learn function parameters to achieve a target minimizer
! Application: MAP inference; summarization
! Online submodular optimization
! Learn to pick a sequence of sets to maximize a sequence of (unknown) submodular functions
! Application: Building algorithm portfolios
! Adaptive submodular optimization
! Gradually build up a set, taking into account feedback
! Application: Sequential experimental design

143


Structured Prediction with SFs

! Given training data {(S_1, x_1), ..., (S_n, x_n)} and a parameterized family of SFs F, can we find parameters θ such that

    S_i ≈ argmin_S F(S; θ, x_i)     (with margin!)

! Approaches
! Parameter learning in submodular MRFs [Taskar et al. '04] → Minimization
! Learning mixtures of submodular shells [Lin & Bilmes '12] → Maximization

144


Learning to summarize [Lin & Bilmes '12]

! Input:
! Collection of documents d_k and human summaries S_k
! Goal:
! Learn parameterized submodular relevance & diversity

    F(S; d_k) = Σ_i α_i R_i(S; d_k) + Σ_j β_j D_j(S; d_k)

(relevance & diversity components, instantiated for each document k; the parameters are what we want to learn)

! Want to achieve that for all k:

    S_k ≈ argmax_{S: |S| ≤ B} F(S; d_k)

! Can efficiently find a solution with bounded generalization error!

145


Learning mixtures for summarization

! Training data: documents with human summaries

146


Learning to optimize submodular functions

! Structured prediction with submodular functions
! Learn function parameters to achieve a target minimizer
! Application: MAP inference; summarization
! Online submodular optimization
! Learn to pick a sequence of sets to maximize a sequence of (unknown) submodular functions
! Application: Building algorithm portfolios
! Adaptive submodular optimization
! Gradually build up a set, taking into account feedback
! Application: Sequential experimental design

147


Online maximization of submodular functions
[Streeter, Golovin NIPS '08]

Over time t = 1, …, T:
Pick sets: A_1, A_2, A_3, …, A_T
SFs:       F_1, F_2, F_3, …, F_T
Reward:    r_1 = F_1(A_1), r_2, r_3, …, r_T
Total:     Σ_t r_t → max
(Observe either F_t, or only F_t(A_t))

Theorem
Can efficiently choose A_1, …, A_T s.t., in expectation, for any sequence F_i, as T → ∞ we get 'no-regret' over the 'omniscient' greedy algorithm

148
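The sketch below illustrates the idea behind the online greedy algorithm in its full-information variant: one multiplicative-weights ("Hedge") learner per slot, each rewarded with the marginal gain its slot contributes. This is a simplified illustration under the assumption that all F_t take values in [0, 1]; the names (Hedge, online_greedy) and the learning rate are ours, and the bandit-feedback case treated in the paper needs additional estimation machinery.

```python
import numpy as np

class Hedge:
    """Multiplicative-weights learner over the ground set (full information)."""
    def __init__(self, n, eta=0.1):
        self.w = np.ones(n)
        self.eta = eta

    def sample(self, rng):
        p = self.w / self.w.sum()
        return rng.choice(len(self.w), p=p)

    def update(self, rewards):
        # rewards: per-element rewards, assumed to lie in [0, 1]
        self.w *= np.exp(self.eta * np.asarray(rewards))

def online_greedy(V, k, functions, seed=0):
    """Pick a size-k set in every round; slot j is controlled by its own
    Hedge learner and is rewarded with the marginal gain of each element
    on top of the elements chosen by slots 1..j-1 in that round."""
    rng = np.random.default_rng(seed)
    learners = [Hedge(len(V)) for _ in range(k)]
    total = 0.0
    for F in functions:                              # F_t arrives over time
        A = set()
        for lrn in learners:
            idx = lrn.sample(rng)                    # this slot's proposal
            base = F(A)
            gains = [F(A | {e}) - base for e in V]   # full-information feedback
            lrn.update(np.clip(gains, 0.0, 1.0))
            A.add(V[idx])
        total += F(A)
    return total
```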


Application: Learning to solve SAT quickly
[Streeter, Golovin, Smith]

• Hybridization via task-switching schedules

[Figure: example schedule S interleaving runs of heuristics #1-#4 for 2, 4, 6, 2, and 10 seconds along the time axis]

• Need to learn which schedules work well!


Hybridizing Solvers
[Streeter & Golovin '08]

• V = {run heuristic h for t seconds : h ∈ H, t > 0}
• f_i(S) = Pr[schedule S completes job i within the time limit]
• This is a submodular function! ☺
• Task: select {S_i : i = 1, 2, …, T} online to maximize # instances solved
• This is an online submodular maximization problem! ☺
• The online greedy algorithm achieves no-(1−1/e)-regret
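One way to picture f_i(S) in code: treat Monte-Carlo samples of each heuristic's runtime on instance i as "scenarios" that a schedule either covers (some run finishes within its budget) or not; the coverage view makes submodularity in S apparent. The data layout and names below are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def schedule_success_prob(schedule, runtime_samples):
    """Estimate f_i(S) = Pr[schedule S solves instance i] from Monte-Carlo
    scenarios: sample j assigns each heuristic a runtime on this instance;
    the schedule covers scenario j if any of its runs finishes in time.

    schedule:        list of (heuristic, budget_in_seconds) pairs
    runtime_samples: dict heuristic -> np.array of sampled runtimes (same length)
    """
    n = len(next(iter(runtime_samples.values())))
    covered = np.zeros(n, dtype=bool)
    for h, budget in schedule:
        covered |= runtime_samples[h] <= budget
    return covered.mean()

# toy usage with hypothetical exponential runtimes
rng = np.random.default_rng(0)
samples = {"h1": rng.exponential(5.0, 1000), "h2": rng.exponential(2.0, 1000)}
print(schedule_success_prob([("h1", 2.0), ("h2", 4.0)], samples))
```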


Example Schedule
[Streeter, Golovin, Smith, AAAI '07 & CSP '08]

[Figure: example learned schedule (log scale)]

[Figure: results (2008)]


Other results on online submodular optimization

! Online submodular maximization
! No-(1−1/e)-regret for ranking (partition matroids) [Streeter, Golovin, Krause 2009]
! Distributed implementation [Golovin, Faulkner, Krause 2010]
! Improved bounds for bandits with linear combinations of SFs [Yue, Guestrin NIPS 2011]
! Online submodular coverage
! Min-cost / min-sum submodular cover [Streeter & Golovin NIPS 2008]
! Guillory & Bilmes [NIPS 2011]
! Online submodular minimization
! Unconstrained [Hazan & Kale NIPS 2009]
! Constrained [Jegelka & Bilmes ICML 2011]

153


What to do with submodular functions

Optimization
Minimization
Maximization
Online/adaptive optim.
Learning

154


Learning to optimize submodular functions

! Structured prediction with submodular functions
! Learn function parameters to achieve a target minimizer
! Application: MAP inference; summarization
! Online submodular optimization
! Learn to pick a sequence of sets to maximize a sequence of (unknown) submodular functions
! Application: Building algorithm portfolios
! Adaptive submodular optimization
! Gradually build up a set, taking into account feedback
! Application: Sequential experimental design

155


Adaptive Sensing / Diagnosis

[Figure: adaptive policy tree branching on test outcomes x_1, x_2, x_3, x_4 ∈ {0, 1}]

Want to effectively diagnose while minimizing the cost of testing!

Classical submodularity does not apply ☹
Can we generalize submodularity for sequential decision making?

156


Adaptive selection in diagnosis

! Prior over diseases P(Y)
! Deterministic test outcomes P(X_V | Y)
! Each test eliminates hypotheses y

[Figure: naive Bayes model with Y = "Sick" and tests X_1 = "Fever", X_2 = "Rash", X_3 = "Cough"; observing X_1 = 1, X_2 = 0/1, X_3 = 0 successively eliminates states y]

157


Problem Statement

Given:
! Items (tests, experiments, actions, …) V = {1, …, n}
! Associated with random variables X_1, …, X_n taking values in O
! Objective: f : 2^V × O^V → R

! Policy π maps observations x_A to the next item
Value of policy π:

    F(π) = E_{x_V}[ f(S(π, x_V), x_V) ],  where S(π, x_V) = tests run by π if the world is in state x_V

Want π* = argmax_π F(π)
NP-hard (also hard to approximate!)

158


Adaptive greedy algorithm

! Suppose we've seen X_A = x_A.
! Conditional expected benefit of adding item s:

    Δ(s | x_A) = E[ f(A ∪ {s}, x_V) − f(A, x_V) | x_A ]

(benefit if the world is in state x_V, conditional on the observations x_A)

Adaptive Greedy algorithm:
Start with A = ∅
For i = 1:k
! Pick s* ∈ argmax_s Δ(s | x_A)
! Observe x_{s*}
! Set A ← A ∪ {s*}

When does this adaptive greedy algorithm work??

159
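A minimal sketch of the adaptive greedy loop above, estimating the conditional expected benefit Δ(s | x_A) by Monte-Carlo over realizations consistent with the observations so far. The interfaces sample_xV, observe, and f are placeholders we introduce for illustration.

```python
import numpy as np

def adaptive_greedy(V, f, sample_xV, observe, k, n_samples=200):
    """Adaptive greedy: repeatedly pick the item with the largest estimated
    conditional expected benefit Delta(s | x_A), then observe its outcome.

    V:         list of items (tests)
    f:         f(A, xV) -> float, value of chosen set A under full realization xV
    sample_xV: callable(x_A) -> one full realization drawn from P(X_V | x_A)
    observe:   callable(item) -> the item's true outcome (actually run the test)
    """
    A, x_A = set(), {}
    for _ in range(k):
        # Monte-Carlo estimate of Delta(s | x_A) for each remaining item
        realizations = [sample_xV(x_A) for _ in range(n_samples)]
        def benefit(s):
            return np.mean([f(A | {s}, xV) - f(A, xV) for xV in realizations])
        s_star = max((s for s in V if s not in A), key=benefit)
        x_A[s_star] = observe(s_star)     # run the chosen test, record its outcome
        A.add(s_star)
    return A, x_A
```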


Adaptive submodularity
[Golovin & Krause, JAIR 2011]

Adaptive monotonicity:   Δ(s | x_A) ≥ 0
Adaptive submodularity:  Δ(s | x_A) ≥ Δ(s | x_B)  whenever x_B observes more than x_A

Theorem: If f is adaptive submodular and adaptive monotone w.r.t. the distribution P, then

    F(π_greedy) ≥ (1 − 1/e) F(π_opt)

Many other results about submodular set functions can also be "lifted" to the adaptive setting!

160


From sets to policies

Submodularity → Adaptive submodularity:
• Applies to: set functions → policies, value functions
• Marginal gain: F(s | A) = F(A ∪ {s}) − F(A)  →  Δ(s | x_A) = E[ f(A ∪ {s}, x_V) − f(A, x_V) | x_A ]
• Monotonicity: F(s | A) ≥ 0  →  Δ(s | x_A) ≥ 0
• Submodularity: A ⊆ B ⇒ F(s | A) ≥ F(s | B)  →  x_A ≼ x_B ⇒ Δ(s | x_A) ≥ Δ(s | x_B)
• Objective: max_A F(A)  →  max_π F(π)

Greedy algorithm / greedy policy provides:
• (1 − 1/e) for max. with cardinality constraint
• 1/(p+1) for p-independent systems
• log Q for min-cost-cover
• 4 for min-sum-cover

161


Optimal Diagnosis

! Prior over diseases P(Y)
! Deterministic test outcomes P(X_V | Y)
! How should we test to eliminate all incorrect hypotheses?

"Generalized binary search"
Equivalent to max. infogain

[Figure: naive Bayes model with Y = "Sick" and tests X_1 = "Fever", X_2 = "Rash", X_3 = "Cough"; outcomes X_1 = 1, X_2 = 0/1, X_3 = 0 eliminate hypotheses]

162


OD is Adaptive Submodular

Objective = probability mass of hypotheses you have ruled out.

[Figure: policy tree; after observing tests v and w (outcomes 0 and 1), consider adding test s]

    Δ(s | {}) = 2·g_0·b_0 / (g_0 + b_0)
    Δ(s | x_{v,w}) = 2·g_1·b_1 / (g_1 + b_1)

(g, b = probability mass of the remaining hypotheses for which test s comes out 1 / 0)

Not hard to show that Δ(s | {}) ≥ Δ(s | x_{v,w})

163
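For generalized binary search, the expected mass ruled out by a test has exactly the 2gb/(g+b) form above. A small sketch, assuming deterministic test outcomes per hypothesis; the dictionary-based representation and the name gbs_benefit are our own choices.

```python
def gbs_benefit(prior, predictions, s, observed):
    """Expected probability mass ruled out by running test s, given the
    outcomes observed so far (generalized binary search objective).

    prior:       dict hypothesis -> prior probability
    predictions: dict (hypothesis, test) -> deterministic outcome in {0, 1}
    observed:    dict test -> outcome already seen
    """
    # hypotheses still consistent with the observations
    alive = [y for y in prior
             if all(predictions[(y, t)] == o for t, o in observed.items())]
    g = sum(prior[y] for y in alive if predictions[(y, s)] == 1)  # mass predicting 1
    b = sum(prior[y] for y in alive if predictions[(y, s)] == 0)  # mass predicting 0
    if g + b == 0:
        return 0.0
    # outcome 1 rules out mass b, outcome 0 rules out mass g:
    return (g * b + b * g) / (g + b)          # = 2*g*b/(g+b), as on the slide
```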


Theoretical guarantees

[Figure: adaptive policy tree over test outcomes x_1, x_2, x_3 ∈ {0, 1}]

Garey & Graham, 1974; Loveland, 1985; Arkin et al., 1993; Kosaraju et al., 1999; Dasgupta, 2004; Guillory & Bilmes, 2009; Nowak, 2009; Gupta et al., 2010

With adaptive submodular analysis!

Result requires that tests are exact (no noise)!


What if there is noise?
[with Daniel Golovin, Deb Ray, NIPS '10]

! Prior over diseases P(Y)
! Noisy test outcomes P(X_V | Y)
! How should we test to learn about y (infer the MAP)?

[Figure: naive Bayes model with Y = "Sick" and noisy tests X_1 = "Fever", X_2 = "Rash", X_3 = "Cough"]

! Existing approaches:
! Generalized binary search?
! Maximize information gain?
! Maximize value of information?
Not adaptive submodular!

Theorem: All these approaches can have cost more than n/log n times the optimal cost!

→ Is there an adaptive submodular criterion??

165


Theoretical guarantees
[with Daniel Golovin, Deb Ray, NIPS '10]

Theorem: Equivalence class edge-cutting (EC²) is adaptive monotone and adaptive submodular.

Suppose P(x_V, h) ∈ {0} ∪ [δ, 1] for all x_V, h.
Then it holds that

    Cost(π_Greedy) ≤ O(log(1/δ)) · Cost(π*)

First approximation guarantees for nonmyopic VOI in general graphical models!

166


Example: The Iowa Gambling Task
[with Colin Camerer, Deb Ray]

What would you prefer?
[Figure: two gambles A) and B), each a distribution over outcomes −10$, 0$, +10$ with probabilities .7 / .3]

Various competing theories on how people make decisions under uncertainty:
! Maximize expected utility? [von Neumann & Morgenstern '47]
! Constant relative risk aversion? [Pratt '64]
! Portfolio optimization? [Hanoch & Levy '70]
! (Normalized) Prospect theory? [Kahneman & Tversky '79]

How should we design tests to distinguish theories?

167


Iowa Gambling as BED

Every possible test X_s = (g_{s,1}, g_{s,2}) is a pair of gambles
Theories are parameterized by θ
Each theory predicts a utility U(g, y, θ) for every gamble

    P(X_s = 1 | y, θ) = 1 / (1 + exp(U(g_{s,1}, y, θ) − U(g_{s,2}, y, θ)))

[Figure: graphical model with theory Y, parameters θ, and tests X_1 = (g_{1,1}, g_{1,2}), X_2 = (g_{2,1}, g_{2,2}), …, X_n = (g_{n,1}, g_{n,2}); plot of P(X_i = 1 | Δ_U) as a logistic function of the difference in utility Δ_U]

168
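The likelihood above is just a logistic function of the utility difference between the two gambles. A one-line sketch; which gamble the outcome X_s = 1 refers to follows the slide's formula, and U stands for any theory's (assumed) utility function.

```python
import math

def response_prob(U, g1, g2, y, theta):
    """P(X_s = 1 | y, theta) for the test X_s = (g1, g2), as on the slide:
    a logistic function of the difference in predicted utilities."""
    return 1.0 / (1.0 + math.exp(U(g1, y, theta) - U(g2, y, theta)))
```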


Simulation Results

[Figure: accuracy (0.2 to 1.0) vs. number of tests (0 to 30) for Adaptive Submodular BED (EC²), VOI, InfoGain, Uncertainty Sampling, Generalized binary search, and Random]

Adaptive submodular criterion (EC²) outperforms existing approaches

169


Experimental Study
[with Colin Camerer, Deb Ray]

Study with 57 naïve subjects
32,000 designs
40s per test ☹
Exploiting submodularity:

[Figure: number of subjects classified (0 to 30) per theory: expected value; mean, var., skewness; prospect theory; constant relative risk aversion]

! Strongest support for PT, with some heterogeneity
! Unexpectedly, no support for CRRA


Interactive submodular coverage
[Guillory & Bilmes, ICML '10]

! Alternative formalization of adaptive optimization
! Addresses the worst-case setting
! Applications to (noisy) active learning, viral marketing [Guillory & Bilmes, ICML '11]

171


Other directions

! Game theory
! Equilibria in cooperative (supermodular) games / fair allocations
! Price of anarchy in non-cooperative games
! Incentive-compatible submodular optimization
! New algorithms for submodular maximization
! Robust submodular optimization
! Generalizations of submodular functions
! L♮-convex functions / discrete convex analysis
! XOS / subadditive functions
! Efficient minimization of subclasses

172


Further resources

! submodularity.org
! References
! Matlab Toolbox for Submodular Optimization
! discml.cc
! NIPS Workshops on Discrete Optimization in Machine Learning
! Videos of invited talks on videolectures.net
! Submit to DISCML 2012! ☺
! ...

173


Conclusions

! Discrete optimization is abundant in applications
! Fortunately, some of those problems have structure: submodularity
! Submodularity can be exploited to develop efficient, scalable algorithms with strong guarantees
! Can handle complex constraints
! Can learn to optimize (online, adaptive, …)

174
