
Tutorial on Submodularity in Machine Learning and Computer Vision

Stefanie Jegelka
Andreas Krause


Semantic Segmentation

How can we map pixels to objects?

[Figure: object segmentation with higher-order potentials, from Int J Comput Vis (2009) 82: 302–324]

2


Document Summarization

How can we select representative sentences?

3


Network Inference

How can we learn who influences whom?

4


What's common?

! All of these problems can be formalized as optimizing a set function F(S) under constraints
! Generally very hard
! Structure helps: we'll see that if F(S) is submodular,
  ! we can solve maximization and minimization problems with strong guarantees
  ! we can solve learning problems involving submodular functions
! You'll learn about theory and applications

5


Outline

! What is submodularity?
! Properties of submodular functions
! Optimization
  ! Minimization
  ! Maximization
! Learning
! Learning for Optimization

6


Running example: Sensor placement

Want to place sensors to monitor temperature

[Figure: office floorplan with candidate sensor locations]

7


Set functions

! Finite set V = {1,2,…,n}
! Set function F : 2^V → ℝ
! Will always assume F({}) = 0 (w.l.o.g.)
! Assume a black box that can evaluate F for any input A
! Approximate (noisy) evaluation of F is ok

[Figure: office floorplan]

8


Example: Sensor placement

Utility F(A) of having sensors at subset A of all locations

[Figure: two floorplans with sensors X_1,…,X_5]
A = {1,2,3}: very informative → high value F(A)
A = {1,4,5}: redundant information → low value F(A)

9


Marginal gains

! Given a set function F : 2^V → ℝ
! Marginal gain of a new sensor s:  F(s | A) = F({s} ∪ A) − F(A)

[Figure: floorplan with existing sensors X_1, X_2 and new sensor X_s]

10


Submodularity: Decreasing marginal gains

Placement A = {1,2}: adding s will help a lot → large improvement
Placement B = {1,…,5}: adding s doesn't help much → small improvement

[Figure: two floorplans, new sensor s added to each placement]

Submodularity: for A ⊆ B,
F(A ∪ {s}) − F(A) ≥ F(B ∪ {s}) − F(B),   i.e.,   Δ(s | A) ≥ Δ(s | B)

11


Alternative characterizations

! A set function F on V is called submodular if
  ∀ A, B ⊆ V:  F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B)

! Equivalent diminishing-returns characterization:
  ∀ A ⊆ B ⊆ V, s ∉ B:
  F(A ∪ {s}) − F(A)  ≥  F(B ∪ {s}) − F(B)
  (large improvement)     (small improvement)
  Δ(s | A) ≥ Δ(s | B)

12
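For small ground sets, the diminishing-returns condition can be verified by brute force. A minimal sketch (illustrative only; the coverage function at the bottom is made up):

```python
from itertools import combinations

def is_submodular(F, V):
    """Brute-force check: F(A+s) - F(A) >= F(B+s) - F(B) for all A <= B, s not in B."""
    subsets = [frozenset(c) for r in range(len(V) + 1)
               for c in combinations(V, r)]
    for B in subsets:
        for A in subsets:
            if not A <= B:
                continue
            for s in V:
                if s in B:
                    continue
                if F(A | {s}) - F(A) < F(B | {s}) - F(B) - 1e-12:
                    return False
    return True

# Example: coverage function F(A) = |union of regions indexed by A|
regions = {1: {"x"}, 2: {"x", "y"}, 3: {"z"}}
F = lambda A: len(set().union(*(regions[i] for i in A))) if A else 0
print(is_submodular(F, {1, 2, 3}))  # True
```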


Questions

How do I prove my problem is submodular?
Why is submodularity useful?

13


Example: Set cover

Want to cover a floorplan with discs
! Place sensors in the building; V = possible locations
! Each node predicts the values of positions within some radius

For A ⊆ V: F(A) = "area covered by sensors placed at A"

Formally:
W a finite set, with a collection of n subsets S_i ⊆ W
For A ⊆ V define F(A) = |∪_{i∈A} S_i|

[Figure: floorplan with coverage discs]

14


Set cover is submodular

[Figure: two floorplans with coverage discs and a new disc S']

A = {S_1, S_2}:            F(A ∪ {S'}) − F(A)
                               ≥
B = {S_1, S_2, S_3, S_4}:  F(B ∪ {S'}) − F(B)

The new disc S' adds more uncovered area when fewer discs are already placed.

15


More complex model for sensing

[Figure: floorplan with sensor values X_s and temperatures Y_s at each location]

Joint probability distribution:
P(X_1,…,X_n, Y_1,…,Y_n) = P(Y_1,…,Y_n) P(X_1,…,X_n | Y_1,…,Y_n)
                               prior            likelihood

Y_s: temperature at location s
X_s: sensor value at location s
X_s = Y_s + noise

16


Example: Sensor placement

Utility of having sensors at subset A of all locations:
F(A) = H(Y) − H(Y | X_A)
       uncertainty about temperature Y before sensing
       minus uncertainty about temperature Y after sensing

[Figure: two floorplans]
A = {1,2,3}: high value F(A)
A = {1,4,5}: low value F(A)

17


Example: Submodularity of information gain

Y_1,…,Y_m, X_1,…,X_n discrete random variables
F(A) = I(Y; X_A) = H(Y) − H(Y | X_A)

! F(A) is NOT always submodular

Theorem [Krause & Guestrin '05]
If the X_i are all conditionally independent given Y,
then F(A) is submodular!

[Figure: graphical model with Y_1, Y_2, Y_3 and children X_1,…,X_4]

18


Proof: Submodularity of information gain

Y_1,…,Y_m, X_1,…,X_n discrete random variables
F(A) = I(Y; X_A) = H(Y) − H(Y | X_A)

Variables X independent given Y, so
F(A ∪ {s}) − F(A) = H(X_s | X_A) − H(X_s | Y)
                                    constant (independent of A)

Nonincreasing in A:
A ⊆ B ⇒ H(X_s | X_A) ≥ H(X_s | X_B)   ("information never hurts")

Δ(s | A) = F(A ∪ {s}) − F(A) is monotonically nonincreasing
⇔ F is submodular ☺

19


Example: costs

Breakfast: cost = time to reach the shop + price of items

[Figure: two shops at travel times t_1 and t_2; each item costs 1 €; ground set V = items]

20


Example: costs

Breakfast: cost = time to shop + price of items

F(items) = cost(shop 1 items) + cost(shop 2 items)
         = t_1 + 1 + t_2 + 2
         = #shops + #items (with unit costs)

Is this submodular?

21


Shared fixed costs

Δ(b | A) = 1 + 1   (a new shop plus a new item)
Δ(b | B) = 1       (only a new item)

Marginal cost = #new shops + #new items; it diminishes, so the cost is submodular!
• Shops: shared fixed cost
• Economies of scale

22


Another example: Cut functions

[Figure: weighted graph on V = {a,b,c,d,e,f,g,h} with edge weights 1, 2 and 3]

F(A) = Σ_{s∈A, t∉A} w_{s,t}

The cut function is submodular!

23


Why are cut functions submodular?

Consider a single edge {a,b} with weight w:

S        F_ab(S)
{}       0
{a}      w
{b}      w
{a,b}    0

Submodular if w ≥ 0!

For the whole graph,
F(S) = Σ_{(i,j)∈E} F_{i,j}(S ∩ {i,j})
is a sum of cut functions on the subgraphs {i,j}  ⇒ submodular!

24
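The per-edge view is exactly what a direct implementation sums. A small sketch (the example graph and weights are made up):

```python
def cut_value(S, edges):
    """F(S) = total weight of edges with exactly one endpoint in S."""
    S = set(S)
    return sum(w for (u, v, w) in edges if (u in S) != (v in S))

edges = [("a", "b", 2.0), ("b", "c", 2.0), ("a", "d", 1.0)]
print(cut_value({"a"}, edges))       # 3.0
print(cut_value({"a", "b"}, edges))  # 3.0
```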


Closedness properties

F_1,…,F_m submodular functions on V and λ_1,…,λ_m > 0
Then F(A) = Σ_i λ_i F_i(A) is submodular

Submodularity is closed under nonnegative linear combinations!

Extremely useful fact:
! F_θ(A) submodular ⇒ Σ_θ P(θ) F_θ(A) submodular!
! Multicriterion optimization: F_1,…,F_m submodular, λ_i > 0 ⇒ Σ_i λ_i F_i(A) submodular
! A basic proof technique! ☺

25


Other closedness properties

! Restriction: F(S) submodular on V, W a subset of V
  Then F'(S) = F(S ∩ W) is submodular

! Conditioning: F(S) submodular on V, W a subset of V
  Then F'(S) = F(S ∪ W) is submodular

! Reflection: F(S) submodular on V
  Then F'(S) = F(V \ S) is submodular

[Figure: Venn diagrams of V, S, W]

28


Convex aspects

! Submodularity as a discrete analogue of convexity
  ! Convex extension
  ! Efficient minimization
  ! Duality
! However, this is only one half of the story...

[Figure: 3D plot of the extension f(x) over (x_a, x_b) ∈ [0,1]²]

29


Concave aspects

! Marginal gain: F(s | A) = F({s} ∪ A) − F(A)
! Submodular: ∀ A ⊆ B, s ∉ B:  F(A ∪ {s}) − F(A) ≥ F(B ∪ {s}) − F(B)
! Concave: ∀ a ≤ b, s ≥ 0:  f(a + s) − f(a) ≥ f(b + s) − f(b)

[Figure: F(A) "intuitively" grows like a concave function of |A|]

30


Submodularity and Concavity

Suppose g: ℕ → ℝ and F(A) = g(|A|)
Then F(A) is submodular if and only if g is concave.

[Figure: concave g(|A|) as a function of |A|]

31


Maximum of submodular functions

Suppose F_1(A) and F_2(A) are submodular.
Is F(A) = max(F_1(A), F_2(A)) submodular?

[Figure: F_1, F_2 and their pointwise maximum as functions of |A|]

max(F_1, F_2) is not submodular in general!

32


Minimum of submodular functions

Well, maybe F(A) = min(F_1(A), F_2(A)) instead?

A       F_1(A)  F_2(A)  F(A)
{}      0       0       0
{a}     1       0       0
{b}     0       1       0
{a,b}   1       1       1

F({b}) − F({}) = 0   <   F({a,b}) − F({a}) = 1

min(F_1, F_2) is not submodular in general!

33


Two faces of submodular functions

Convex aspects ⇒ minimization!
Concave aspects ⇒ maximization!

34


What to do with submodular functions

Optimization: Minimization | Maximization | Online/adaptive optimization
Learning

35


Optimization

Optimization: Minimization | Maximization | Online/adaptive optimization
Learning

Minimization and maximization are not the same??

36


Submodular Minimization

Optimization: Minimization | Maximization | Online/adaptive optimization
Learning

37


Submodular minimization

Minimization: unconstrained | constrained

38


Submodular function minimization

min_{S ⊆ V} F(S)

Applications: clustering, MAP inference, structured sparsity, corpus / training-set extraction (e.g., selecting utterances such as "oh, yes!", "yes …", "True." from a corpus)

39


Submodular function minimization

min_{S ⊆ V} F(S)

! Polynomial time?
⇒ submodularity and convexity

40


Set functions and energy functions

Any set function F : 2^V → ℝ with |V| = n can be viewed equivalently
as a function on binary vectors, F : {0,1}^n → ℝ:

A = {a, b} ⊆ {a, b, c, d}   ≙   x = e_A = (1, 1, 0, 0)

41


Submodularity and Convexity

! Extension of F : {0,1}^n → ℝ to the continuous cube:

Lovász extension  f : [0,1]^n → ℝ,   f(x) = max_{y ∈ P_F} x · y   (convex)

• The minimum of F is a minimum of f
⇒ submodular minimization reduces to convex minimization

42


The submodular polyhedron P_F

P_F = {x ∈ ℝ^n : x(A) ≤ F(A) for all A ⊆ V},   where x(A) = Σ_{i∈A} x_i

Example: V = {a, b}

A       F(A)
{}      0
{a}     −1
{b}     2
{a,b}   0

[Figure: P_F in the (x_{a}, x_{b}) plane, bounded by x({a}) ≤ F({a}), x({b}) ≤ F({b}), x({a,b}) ≤ F({a,b})]

43


Evaluating the Lovász extension

Linear maximization over P_F:   f(x) = max_{y ∈ P_F} xᵀy

Exponentially many constraints! ☹
But computable in O(n log n) time ☺  [Edmonds '70]

• Subgradient
• Separation oracle
• Central for optimization

[Figure: maximizer y* on the boundary of P_F for a given direction x]

44
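Edmonds' greedy algorithm makes this concrete: sort the coordinates of x in decreasing order and charge each element its marginal gain. A minimal sketch, assuming F({}) = 0 as above (the truncation function used to test it is made up):

```python
def lovasz_extension(F, x):
    """Evaluate the Lovász extension f(x) of a submodular F via Edmonds'
    greedy algorithm: O(n log n) sort plus n evaluations of F."""
    order = sorted(x, key=x.get, reverse=True)   # elements by decreasing x-value
    value, prefix, prev_F = 0.0, set(), 0.0
    for e in order:
        prefix.add(e)
        F_cur = F(prefix)
        value += x[e] * (F_cur - prev_F)         # marginal gain of e, weighted by x[e]
        prev_F = F_cur
    return value

# On 0/1 vectors the extension agrees with F itself:
F = lambda S: min(len(S), 2)                     # truncation, submodular
x = {"a": 1.0, "b": 1.0, "c": 0.0}
print(lovasz_extension(F, x))                    # 2.0 = F({a, b})
```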


Example: Lovász extension

A       F(A)
{}      0
{a}     1
{b}     .8
{a,b}   .2

[Figure: 3D surface of f over (x_a, x_b) ∈ [0,1]², interpolating the corner values F({}), F({a}), F({b}), F({a,b})]

45


Submodular function minimization    min_{S ⊆ V} F(S)

! Polynomial time?
! Yes: ellipsoid algorithm [Grötschel, Lovász, Schrijver '81]
! Combinatorial algorithms? Edmonds, Cunningham, …
  Iwata, Fujishige, Fleischer '01; Schrijver '00
  Iwata '03: O(n^4 T + n^5 log M)
  Orlin '09: O(n^6 + n^5 T)
  (T = time to evaluate F)

"Efficient" …?

46


The submodular polyhedron P_F (recap)

P_F = {x ∈ ℝ^n : x(A) ≤ F(A) for all A ⊆ V},   x(A) = Σ_{i∈A} x_i

Example: V = {a,b} with F({}) = 0, F({a}) = −1, F({b}) = 2, F({a,b}) = 0

47


A more practical alternative? [Fujishige '91, Fujishige et al. '11]

Minimum norm point algorithm
1. Find x* = arg min_{x ∈ B_F} ‖x‖²
   (B_F = base polytope: P_F with x({a,b}) = F({a,b}) tight; here x* = [−1, 1])
2. A* = {i : x*(i) < 0}

Theorem [Fujishige '91]: A* is an optimal solution!

Runtime is finite, but the worst-case complexity is open.

A       F(A)
{}      0
{a}     −1
{b}     2
{a,b}   0

48


Empirical comparison [Fujishige et al. '06]

[Figure: running time (seconds, log scale) vs. problem size 64–1024 (log scale), on cut functions from the DIMACS challenge; lower is better]

Minimum norm algorithm: usually O(n^3) to O(n^4) empirically
⇒ orders of magnitude faster

49


Image segmentation

p(x | y) ∝ exp(−E(x; y))

E(x; y) = Σ_i E_i(x_i) + Σ_{ij} E_{ij}(x_i, x_j)
          color, texture, …   smoothness

MAP inference = energy minimization

50


Set functions and energy functions

Any set function F : 2^V → ℝ with |V| = n can be viewed equivalently
as a function on binary vectors, F : {0,1}^n → ℝ.

Conversely: a function on binary variables is a set function!

A = {a, b}   ≙   x = e_A = (1, 1, 0, 0)

51


Set functions & probabilistic models

! Maximum a posteriori: given observations y, infer binary labels x

maximize p(x | y) ∝ exp(−E(x; y))
⇔ min_{x ∈ {0,1}^n} E(x; y)
⇔ minimize F(A) := E(e_A; y)

If F is submodular: polynomial time ☺

[Figure: binary labeling of pixels; A = set of pixels labeled 1]

52


Example: Sparsity

[Figure: image pixels → few large wavelet coefficients (blue = 0); wideband signal samples (time/frequency) → few large Gabor (TF) coefficients]

Many natural signals are sparse in a suitable basis.
Can exploit this for learning / regularization / compressive sensing...


Sparse reconstruction

min_x ‖y − Mx‖² + Ω(x)

• Explain y with few columns of M, i.e., few nonzero x_i
• Discrete regularization on the support S of x:  Ω(x) = ‖x‖_0 = |S|
  (subset selection, e.g., S = {1,3,4,7})
• Relax to the convex envelope:  Ω(x) = ‖x‖_1

In nature, the sparsity pattern is often not random…

54


Structured sparsity

Incorporate a tree preference into the regularizer?

[Figure: coefficients m_1,…,m_7 arranged in a tree; support of x]

Set function: F(T)
Function F is … submodular! ☺


Sparsity  [Bach '10]

min_x ‖y − Mx‖² + Ω(x)

• Explain y with few columns of M: few nonzero x_i
• Discrete regularization on the support S of x:  Ω(x) = ‖x‖_0 = |S|;  relax to the convex envelope Ω(x) = ‖x‖_1
• Prior knowledge about patterns of nonzeros: submodular function Ω(x) = F(S);
  relax to the convex envelope ⇒ Lovász extension Ω(x) = f(|x|)
• Optimization: submodular minimization

58


Further connections: Dictionary Selection

min_x ‖y − Mx‖² + Ω(x)

Where does the dictionary M come from?
Want to learn it from data {y_1,…,y_n} ⊆ ℝ^d

Selecting a dictionary with near-maximal variance reduction
⇔ maximization of an approximately submodular function
[Krause & Cevher '10; Das & Kempe '11]

59


More applications ...

clustering, MAP inference, corpus / training-set extraction, …

60


Special cases

Minimizing general submodular functions: polynomial time, but not very scalable
Special structure ⇒ faster algorithms
! Symmetric functions
! Graph cuts
! Concave functions
! Sums of functions with bounded support
! ...

61


Graph cuts

Cut(S) = Σ_{u∈S, v∉S} w(u, v)

Given F, can we build a graph so that F(S) = Cut(S)?

[Figure: s-t graph with source side S]

Solving a min-(s,t)-cut is efficient:
general case O(mn) [Orlin '12]; special cases linear time

62


MAP inference as graph cut

E(x; y) = Σ_i E_i(x_i) + Σ_{ij} E_{ij}(x_i, x_j)

minimize:  min_A F(A) ≡ min_{x ∈ {0,1}^n} E(x; y),   with Cut(A) = E(e_A; y)

[Figure: pixel grid connected to terminals s and t; A = pixels on the s side]

Minimum energy = minimum cut! O(n)  [Boykov & Kolmogorov '04]

63


Which functions can be minimized as cuts?

! Functions of order two:
E(x) = Σ_i E_i(x_i) + Σ_{ij} E_{ij}(x_i, x_j) = Σ_i α_i x_i + Σ_{ij} β_{ij} x_i x_j

Need regularity:
E_{ij}(0,0) + E_{ij}(1,1) ≤ E_{ij}(1,0) + E_{ij}(0,1),   equivalently β_{ij} ≤ 0

⇒ Each pairwise term must be submodular
[Queyranne; Picard & Ratliff; ...]

64
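The regularity condition is easy to test mechanically for a given pairwise table. A tiny sketch (the example potential is made up):

```python
def is_regular(E_ij):
    """Pairwise term E_ij as a dict {(x_i, x_j): value} for x in {0,1}.
    Regular (submodular) iff E(0,0) + E(1,1) <= E(0,1) + E(1,0)."""
    return E_ij[(0, 0)] + E_ij[(1, 1)] <= E_ij[(0, 1)] + E_ij[(1, 0)]

# Ising-style smoothness term that penalizes disagreement:
smooth = {(0, 0): 0.0, (1, 1): 0.0, (0, 1): 1.0, (1, 0): 1.0}
print(is_regular(smooth))  # True: minimizable exactly via graph cut
```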


But …

E(x) = Σ_i E_i(x_i) + Σ_{ij} E_{ij}(x_i, x_j) + Σ_c E_c(x_c)
       unary preferences (texture) + pairwise smoothness potentials + higher-order P^n potentials (pixels in one tile/segment)  [Kohli et al. '09]

[Figure: MSRC-21 example (sky, building, grass, tree), from Int J Comput Vis (2009) 82: 302–324: augmenting the pairwise CRF with higher-order potentials defined on unsupervised segments clearly improves the segmentation, e.g., the branches of the tree]

66


Enforcing label consistency

Pixels in a superpixel should have the same label.

The higher-order term in E(x) is a concave function of a cardinality (how many pixels deviate) ⇒ submodular ☺

With more than 2 arguments: graph cut ??

[Figure: higher-order potentials for object segmentation, Int J Comput Vis (2009) 82: 302–324]

67


Higher-order functions as graph cuts?

Σ_i E_i(x_i) + Σ_{ij} E_{ij}(x_i, x_j) + Σ_c E_c(x_c)

General strategy: reduce to the pairwise case by adding auxiliary variables
! Works well for some particular potentials
  [Billionet & Minoux '85, Freedman & Drineas '05, Živný & Jeavons '10, ...]
! But: the necessary conditions are complex, not all submodular functions equal such graph cuts [Živný et al. '09],
  and possibly many extra nodes are added to the graph

68


Fast approximate minimization

! Not all submodular functions are graph cuts
! Avoid adding too many extra nodes
! Parametric maxflow [Fujishige & Iwata '99]
! Approximate by a series of graph cuts [Jegelka, Lin & Bilmes '11]

[Figure: running time (s, log scale) vs. n: minimum norm point algorithm ≈ O(n^4) vs. iterative approximate algorithm with parametric maxflow ≈ O(n^2)]

69


Other special cases

! Symmetric: F(S) = F(V \ S)
  Queyranne's algorithm: O(n^3) [Queyranne 1998]
! Concave of modular: F(S) = Σ_i g_i(Σ_{s∈S} w_i(s))
  [Kolmogorov '12, Stobbe & Krause '10, Kohli et al. '09]
! Sum of submodular functions, each with bounded support [Kolmogorov '12]

70


Submodular minimization

Minimization: unconstrained | constrained

71


Submodular Minimization

! Polynomial time
! Empirically better: minimum-norm point algorithm
! Provably better: special cases (graph cuts, concave, …)

What if we have constraints?
Hint: graph cuts are submodular functions…

Limited cases are doable:
• odd/even cardinality
• inclusion/exclusion of a set
• ring family
• …
General case: NP-hard, with polynomial lower bounds

72


Limitations of graph cuts

E(x; y) = Σ_i E_i(x_i) + Σ_{ij} E_{ij}(x_i, x_j)
          data fit        smoothness prior: cut in the grid

[Figure: graph cut segmentation with a shrunken boundary]

• The true boundary is not "short"
• Local information is scarce

Instead: prefer a congruous boundary ⇒ look at the entire cut at once

73


Graph cut as edge selection

[Figure: s-t grid graph; a cut C is a set of edges]

Standard: minimize the sum of weights  F(C) = Σ_{e∈C} w(e)  s.t. C is a cut

Now: prefer congruous cuts, i.e.,
minimize a submodular function F(C) s.t. C is a cut (not a sum of weights!)

74


Rewarding co-occurrence of edges

Sum of weights: use few edges (one group with 13 edges vs. many groups with 6 edges)

Submodular cost function: use few groups S_i of edges
F(C) = Σ_i F_i(C ∩ S_i)

[Figure: concave group cost as a function of |A|]

Optimization? An efficient iterative algorithm.

75


Results

[Figure: graph cut solution vs. cooperative cut solution on segmentation examples]

76


SFM & Combinatorial Constraints

[Figure: combinatorial structures in a graph: cut, matching, path, spanning tree]

… with minimum cost:
classic:  min_{S ∈ C} Σ_{e∈S} w(e)
now:      min_{S ∈ C} F(S) for submodular F

General case: very hard. [Goel et al. '09, Iwata & Nagano '09, Jegelka & Bilmes '11, ...]
Approximations: relaxation via the Lovász extension, or approximate F

77


Submodular Minimization

! Convex Lovász extension
! (High-order) polynomial time
! Empirically better: minimum-norm point algorithm
! Provably better: special cases
! Constraints: usually hard problems ⇒ approximations
! Applications: MAP inference, combinatorial regularizers, clustering, …

78


Optimization

Optimization: Minimization | Maximization | Online/adaptive optimization
Learning

79


Optimization

Maximization: unconstrained | constrained

80


Submodular function maximization

max_{S ⊆ V} F(S)

Applications: feature selection, covering, summarization, diversity, sensing

81


Two faces of submodular functions

Convex aspects ⇒ minimization!
Concave aspects ⇒ maximization!

82


Reminder: Diminishing returns

[Figure: two floorplans with coverage discs and a new disc S']

A = {S_1, S_2}:            F(A ∪ {S'}) − F(A)
                               ≥
B = {S_1, S_2, S_3, S_4}:  F(B ∪ {S'}) − F(B)

F(A) "intuitively" grows like a concave function of |A|

83


Maximizing submodular functions

Minimizing convex functions: polynomial-time solvable!
Maximizing convex functions: NP-hard!

Minimizing submodular functions: polynomial-time solvable!
Maximizing submodular functions: NP-hard!
But we can get approximation guarantees ☺

84


Maximizing submodular functions

! Suppose we want A* = argmax_{A ⊆ V} F(A) for submodular F
! Example: F(A) = U(A) − C(A), where U(A) is a submodular utility and C(A) a supermodular cost function
! In general: NP-hard. Moreover:
! If F(A) can take negative values:
  as hard to approximate as maximum independent set
  (i.e., NP-hard to get an O(n^{1−ε}) approximation)

[Figure: F(A) vs. |A| with an interior maximum]

85


Maximizing positive submodular functions
[Feige, Mirrokni, Vondrák '09; Buchbinder, Feldman, Naor, Schwartz '12]

Theorem
There is an efficient algorithm that, given a nonnegative submodular function F with F({}) = 0, returns a set A_LS such that
F(A_LS) ≥ 1/2 · max_A F(A)

! Picking a random set gives a 1/4 approximation (1/2 if F is symmetric!)
! We cannot get better than a 1/2 approximation unless P = NP

86


Optimization

Maximization: unconstrained | constrained

87


Scalarization vs. constrained maximization

Given a monotonic utility F(A) and a cost C(A), optimize:

Option 1 ("Scalarization"):          Option 2 ("Constrained maximization"):
max_{A ⊆ V} F(A) − C(A)              max_A F(A)  s.t.  C(A) ≤ B

Can get a 1/2 approximation          What is possible?
if F(A) − C(A) ≥ 0 for all sets A.
Positivity is a strong requirement ☹

88


Monotonicity

[Figure: placements A = {1,2} and B = {1,…,5} on the floorplan]

F is monotonic: adding sensors can only help
∀ A, s:  F(A ∪ {s}) − F(A) ≥ 0,   i.e.,   Δ(s | A) ≥ 0


Constrained maximization

! Given: finite set V of locations
! Want: A* ⊆ V such that A* = argmax_{|A| ≤ k} F(A)

Typically NP-hard!

[Figure: floorplan with selected sensor locations]

90


Exact maximization of SFs

! Mixed integer programming
  ! Series of mixed integer programs [Nemhauser et al. '81]
  ! Constraint generation [Kawahara et al. '09]
! Branch-and-bound: "Data-Correcting Algorithm" [Goldengorin et al. '99]

All algorithms are worst-case exponential!

91


Greedy algorithm

! Given: finite set V of locations
! Want: A* maximizing F(A) over sets of size at most k.  Typically NP-hard!

Greedy algorithm:
  Start with A = {}
  For i = 1 to k:
    s* ← argmax_s F(A ∪ {s})
    A ← A ∪ {s*}

How well can this simple heuristic do?

[Figure: floorplan with sensors]

92
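A minimal sketch of the greedy rule above, using a made-up coverage utility as a stand-in for F (illustrative only):

```python
def greedy_maximize(F, V, k):
    """Greedy maximization of a monotone submodular F under |A| <= k."""
    A = set()
    for _ in range(k):
        gains = {s: F(A | {s}) - F(A) for s in V - A}
        best = max(gains, key=gains.get)
        if gains[best] <= 0:           # no element helps any more
            break
        A.add(best)
    return A

# Toy coverage utility: F(A) = size of the union of covered regions
regions = {1: {1, 2, 3}, 2: {3, 4}, 3: {4, 5, 6, 7}, 4: {1, 2}}
F = lambda A: len(set().union(*(regions[i] for i in A))) if A else 0
print(greedy_maximize(F, set(regions), k=2))   # {1, 3}: covers 7 elements
```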


Performance of greedy

[Figure: information gain of optimal vs. greedy placements, on temperature data from a 54-node sensor network deployment]

Greedy is empirically close to optimal. Why?

93


One reason submodularity is useful

Theorem [Nemhauser, Fisher & Wolsey '78]
For monotonic submodular functions, the greedy algorithm gives a constant-factor approximation:
F(A_greedy) ≥ (1 − 1/e) F(A_opt)   (≈ 63%)

! The greedy algorithm gives a near-optimal solution!
! For information gain: this guarantee is the best possible unless P = NP [Krause & Guestrin '05]

94


Scaling up the greedy algorithm [Minoux '78]

In round i+1:
! have picked A_i = {s_1,…,s_i}
! pick s_{i+1} = argmax_s F(A_i ∪ {s}) − F(A_i),
  i.e., maximize the "marginal benefit" Δ(s | A_i) = F(A_i ∪ {s}) − F(A_i)

Key observation: submodularity implies
i ≤ j  ⇒  Δ(s | A_i) ≥ Δ(s | A_j)

Marginal benefits can never increase!  Δ(s | A_i) ≥ Δ(s | A_{i+1})

95


"Lazy" greedy algorithm [Minoux '78]

Lazy greedy algorithm:
§ First iteration as usual
§ Keep an ordered list of marginal benefits Δ_i from the previous iteration
§ Re-evaluate Δ_i only for the top element
§ If Δ_i stays on top, use it; otherwise re-sort

[Figure: priority list of elements a–e ordered by benefit Δ(s | A)]

Note: very easy to compute online bounds, lazy evaluations, etc.
[Leskovec, Krause et al. '07]

96
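A compact way to implement the lazy rule is a max-heap of stale gains that are only re-checked for the current top element. A sketch under the same assumptions as the greedy example above (F monotone submodular, cardinality budget k):

```python
import heapq

def lazy_greedy(F, V, k):
    """Lazy greedy [Minoux '78]: re-evaluate only the top (possibly stale) gain."""
    A, f_A = set(), 0.0
    heap = [(-(F({s}) - F(set())), s) for s in V]    # (negated gain, element)
    heapq.heapify(heap)
    while len(A) < k and heap:
        _, s = heapq.heappop(heap)
        fresh = F(A | {s}) - f_A                     # re-evaluate for current A
        if not heap or fresh >= -heap[0][0]:         # still on top: take it
            if fresh <= 0:
                break
            A.add(s)
            f_A += fresh
        else:                                        # otherwise push back the updated gain
            heapq.heappush(heap, (-fresh, s))
    return A
```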


Empirical improvements [Leskovec, Krause et al. '06]

[Figure, left: sensor placement, running time (minutes) vs. number of sensors selected (1–10): exhaustive search (all subsets) ≫ naive greedy ≫ fast (lazy) greedy; ~30x speedup]
[Figure, right: blog selection, running time (seconds) vs. number of blogs selected (1–10): exhaustive search ≫ naive greedy ≫ fast greedy; ~700x speedup]

Lower is better.

97


Object detection [Barinova et al. '10]

[Figure: Hough-transform detection. Line detection task / classic Hough transform; pedestrian detection task / Hough forest [5] transform. Voting elements cast votes x_j (x_j = index of the hypothesis explaining element j) for hypotheses y_i (y_i = 1: object i present; y_i = 0: not present). With multiple close objects, identifying peaks in the Hough image is a highly non-trivial problem.]

Illustrations courtesy of Pushmeet Kohli

98


Object detection

Joint MAP inference over hypotheses:
F(S) = Σ_i max_{j∈S} w_{i,j}
where w_{i,j} is the weight of voting element i w.r.t. hypothesis j

⇒ submodular maximization ☺

[Figure: voting elements assigned to hypotheses; x_j = index of the hypothesis explaining element j, y_i = 1 if object i is present]

Illustrations courtesy of Pushmeet Kohli

99
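The objective above has facility-location form, a standard monotone submodular function. A tiny sketch with made-up weights:

```python
import numpy as np

def detection_utility(S, W):
    """F(S) = sum_i max_{j in S} W[i, j]; facility-location form."""
    if not S:
        return 0.0
    return float(W[:, sorted(S)].max(axis=1).sum())

W = np.array([[0.9, 0.1],    # voting element 0 supports hypothesis 0
              [0.2, 0.8],    # element 1 supports hypothesis 1
              [0.5, 0.4]])
print(detection_utility({0}, W), detection_utility({0, 1}, W))  # 1.6 2.2
```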


Inference

Datasets from [Andriluka et al. CVPR 2008] (with strongly occluded pedestrians added)
Using the Hough forest trained in [Gall & Lempitsky CVPR '09]

Illustrations courtesy of Pushmeet Kohli


Results for pedestrian detection

[Figure: precision-recall curves on TUD-campus and TUD-crossing.
Blue = Hough transform + non-maximum suppression; light blue = greedy detection]

Submodularity for detection also in [Blaschko '11]


Network inference

How can we learn who influences whom?

102


Inferring diffusion networks
[Gomez Rodriguez, Leskovec, Krause ACM TKDE 2012]

[Figure: given cascade traces over nodes 1–5, infer the directed network]

Given traces of influence, we wish to infer a sparse directed network G = (V, E)
⇒ formulate as an optimization problem:
E* = argmax_{|E| ≤ k} F(E)

103


Estimation problem

! Many influence trees T are consistent with the data
! For cascade C_i, model P(C_i | T)
! Find a sparse graph that maximizes the likelihood of all observed cascades:

F(E) = Σ_i log max_{tree T ⊆ E} P(C_i | T)

⇒ The log-likelihood is monotonic submodular in the selected edges

104


Evaluation: Synthetic networks

[Figure: results for a 1024-node hierarchical Kronecker network with an exponential transmission model, and a 1000-node Forest Fire network (α = 1.1) with a power-law transmission model]

! Performance does not depend on the network structure:
  ! Synthetic networks: Forest Fire, Kronecker, etc.
  ! Transmission time distributions: exponential, power law
! Break-even point of > 90%

105


Diffusion Network
[Gomez Rodriguez, Leskovec, Krause ACM TKDE 2012]

Actual network inferred from 172 million articles from 1 million news sources

[Figure: inferred diffusion network over blogs and mainstream media]

106


Diffusion Network (small part)

[Figure: close-up of the inferred network; blogs and mainstream media]

107


Document summarization [Lin & Bilmes '11]

! Which sentences should we select that best summarize a document?

108


Marginal gain of a sentence

! Many natural notions of "document coverage" are submodular [Lin & Bilmes '11]

109


Document summarization

F(S) = R(S) + λ D(S)
       relevance   diversity

110


Relevance of a summary

F(S) = R(S) + λ D(S)

R(S) = Σ_i min{C_i(S), α C_i(V)}     how well sentence i is "covered" by S
C_i(S) = Σ_{j∈S} w_{i,j}             w_{i,j} = similarity between sentences i and j

111


Diversity of a summary

D(S) = Σ_{k=1}^{K} sqrt( Σ_{j ∈ P_k ∩ S} r_j )

r_j = (1/N) Σ_i w_{i,j}     relevance of sentence j to the document
P_1,…,P_K                   a clustering of the sentences in the document

Can be made query-specific, multi-resolution, etc.

112
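Putting the two pieces together, a literal transcription of R(S) + λ D(S); the similarity matrix, cluster assignment, α and λ are all placeholders, not values from the paper:

```python
import math

def summary_score(S, w, clusters, alpha=0.5, lam=1.0):
    """F(S) = R(S) + lambda * D(S) as defined above.
    w[i][j]: similarity of sentences i, j; clusters: list of sentence-index sets."""
    n = len(w)
    C = lambda i, T: sum(w[i][j] for j in T)                 # coverage of i by T
    R = sum(min(C(i, S), alpha * C(i, range(n))) for i in range(n))
    r = [sum(w[i][j] for i in range(n)) / n for j in range(n)]
    D = sum(math.sqrt(sum(r[j] for j in P & S)) for P in clusters)
    return R + lam * D
```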


Empirical results [Lin & Bilmes '11]

Best F1 score on the benchmark corpus DUC-07!

113


Submodular Sensing Problems
[with Guestrin, Leskovec, Singh, Sukhatme, …]

! Environmental monitoring [JAIR '08, ICRA '10]
! Experiment design [NIPS '10, '11]
! Water distribution networks [J WRPM '08 Best Paper]; top score in the BWSN competition
! Recommending blogs & news [KDD '07, '10 Best Paper]

All can be reduced to monotonic submodular maximization

114


More complex constraints

! So far: cardinality constraints
! Can one handle more complex constraints?

115


Example: Camera network

Ground set V = {1a, 1b, …, 5a, 5b}
Configuration: a subset of V, with a sensing quality model

A configuration is feasible if no camera is pointed in two directions at once

[Figure: cameras 1–5, each with two possible directions a and b]

116


Matroids

! Abstract notion of feasibility: independence

S is independent if …
  … |S| ≤ k                                            (uniform matroid)
  … S contains at most one element from each group     (partition matroid)
  … S contains no cycles                               (graphic matroid)

• S independent ⇒ any T ⊆ S is also independent
• Exchange property: S, U independent, |S| > |U|
  ⇒ some e ∈ S can be added to U: U ∪ {e} is independent
• All maximal independent sets have the same size

117




Example: Camera network

Ground set V = {1a, 1b, …, 5a, 5b}; a configuration is feasible if no camera points in two directions at once

This is a partition matroid:
P_1 = {1a, 1b}, …, P_5 = {5a, 5b}
Independence: |S ∩ P_i| ≤ 1

[Figure: cameras with two directions each]

119


Greedy algorithm for matroids

! Given: finite set V
! Want: A* = argmax_{A independent} F(A)

Greedy algorithm:
  Start with A = {}
  While ∃ s such that A ∪ {s} is independent:
    s* ← argmax_{s : A ∪ {s} independent} F(A ∪ {s})
    A ← A ∪ {s*}

120
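A sketch of the same greedy loop with feasibility restricted to a partition matroid (one element per part, matching the camera example; the utility and all names are illustrative):

```python
def greedy_partition_matroid(F, parts, capacity=1):
    """Greedy maximization of F over a partition matroid:
    at most `capacity` elements from each part."""
    A, used = set(), {i: 0 for i in range(len(parts))}
    part_of = {s: i for i, P in enumerate(parts) for s in P}
    while True:
        feasible = [s for s in part_of
                    if s not in A and used[part_of[s]] < capacity]
        if not feasible:
            return A
        gains = {s: F(A | {s}) - F(A) for s in feasible}
        s = max(gains, key=gains.get)
        if gains[s] <= 0:
            return A
        A.add(s)
        used[part_of[s]] += 1

# Cameras 1..3, each choosing direction "a" or "b" (hypothetical utilities):
parts = [{"1a", "1b"}, {"2a", "2b"}, {"3a", "3b"}]
quality = {"1a": 3, "1b": 1, "2a": 2, "2b": 2, "3a": 0, "3b": 4}
F = lambda A: sum(quality[s] for s in A)        # modular, hence submodular
print(greedy_partition_matroid(F, parts))        # picks one direction per camera
```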


Maximization over matroids

Theorem [Nemhauser, Fisher & Wolsey '78]
For monotonic submodular functions, the greedy algorithm gives a constant-factor approximation:
F(A_greedy) ≥ ½ F(A_opt)

! Greedy gives 1/(p+1) over the intersection of p matroids
! Can model rankings with p = 2!
! Can also obtain (1 − 1/e) for arbitrary matroids using the continuous greedy algorithm [Vondrák et al. '08]

121


Non-constant cost functions

! For each s ∈ V, let c(s) > 0 be its cost (e.g., length of a sentence; feature acquisition cost, …)
! Cost of a set: C(A) = Σ_{s∈A} c(s)
! Want to solve A* = argmax F(A)  s.t.  C(A) ≤ B

122


Cost-benefit optimization
[Wolsey '82, Sviridenko '04, Krause et al. '05]

Theorem [Krause and Guestrin '05]
Can efficiently find a set A such that
F(A) ≥ (1 − 1/√e) max_{C(A') ≤ B} F(A')    (≈ 39%)

Can still speed this up using lazy evaluations.

Note: one can also get
! a (1 − 1/e) approximation in time O(n^4) [Sviridenko '04]
! a (1 − 1/e) approximation for multiple linear constraints [Kulik '09]
! a 0.38/k approximation for k matroid and m linear constraints [Chekuri et al. '11]

123
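One common rule behind guarantees of this flavor is the benefit/cost greedy: repeatedly add the element with the largest marginal gain per unit cost that still fits the budget, and return the better of that solution and the single best feasible element. A sketch only; this is not necessarily the exact algorithm the theorem above analyzes:

```python
def cost_benefit_greedy(F, cost, V, budget):
    """Benefit/cost greedy under a knapsack constraint C(A) <= budget,
    returning the better of the greedy set and the best single element."""
    A, spent = set(), 0.0
    while True:
        feasible = [s for s in V - A if spent + cost[s] <= budget]
        if not feasible:
            break
        s = max(feasible, key=lambda s: (F(A | {s}) - F(A)) / cost[s])
        if F(A | {s}) - F(A) <= 0:
            break
        A.add(s)
        spent += cost[s]
    singles = [{s} for s in V if cost[s] <= budget]
    best_single = max(singles, key=F, default=set())
    return A if F(A) >= F(best_single) else best_single
```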


Summary: More complex constraints

! Approximate submodular maximization is possible under a variety of constraints:
  ! Matroid
  ! Knapsack
  ! Multiple matroid and knapsack constraints
  ! Path constraints (submodular orienteering)
  ! Connectedness (submodular Steiner)
  ! Robustness (minimax)
! Often the (best) algorithms are non-greedy

124


Key intuition for approximate maximization

For submodular functions, local maxima can't be too bad.

! E.g., all local maxima under cardinality constraints are within a factor 2 of the global maximum
! Key insight for more complex maximization
⇒ greedy, local search, simulated annealing for (non-monotone, constrained, ...) problems

125


Two faces of submodular functions

Convex aspects ⇒ minimization!   (cuts, clustering, similarity; MAP inference; structured sparsity)
Concave aspects ⇒ maximization!  (coverage, diversity; summarization; sensing)

126


Summary: Optimization

                 Maximization                           Minimization
Unconstrained    NP-hard, but well-approximable         Polynomial time! Generally inefficient
                 (if nonnegative)                       (n^6), but can exploit special cases
                                                        (cuts; symmetry; decomposable; ...)
Constrained      NP-hard but well-approximable;         NP-hard; hard to approximate,
                 "greedy-(like)" for cardinality        still useful algorithms
                 and matroid constraints;
                 non-greedy for more complex
                 (e.g., connectivity) constraints

127


What to do with submodular functions

Optimization: Minimization | Maximization | Online/adaptive optimization
Learning

128


What to do with submodular functions

Optimization: Minimization | Maximization | Online/adaptive optimization
Learning

129


Example 1: Valuation Functions

! For combinatorial auctions, show bidders various subsets of items and observe their bids
  (e.g., one bundle = $3, another = $88, and so on…)

Can we learn a bidder's utility function from few bids?

130


Example 2: Graph Evolution

! Want to track changes in a graph
! Instead of storing the entire graph at each time step, store some measurements
! # of measurements ≪ size of the graph


Random Graph Cut #1

! Choose a random partition of the vertices
! Count the total # of edges across the partition

[Figure: a random partition; cut values 13 and 14]

132


Random Graph Cut #2

! Choose another random partition of the vertices
! Count the total # of edges across the partition

[Figure: another random partition; cut values 13 and 12]

133


Symmetric Graph Cut Function

f(A) = sum of weights of edges between A and V \ A

• V = set of vertices
• One-to-one correspondence between graphs and cut functions

Can we learn a graph from the values of a few cuts?
[E.g., graph sketching, computational biology, …]

134


General Problem: Learning Set Functions

Base set V, set function F

! Wish to learn F from:
  ! a small collection of measurement sets
  ! their function values

135


Approximating submodular functions
[Goemans, Harvey, Kleinberg, Mirrokni '08]

! Pick m sets A_1, …, A_m and observe F(A_1), …, F(A_m)
! From this, want to approximate F by F̂ such that
  F̂(A) ≤ F(A) ≤ α F̂(A)   for all A

Theorem: Even if
! F is monotonic
! we may pick polynomially many A_i, chosen adaptively,
we cannot approximate better than α = n^{1/2} / log(n) unless we look at exponentially many sets A_i.
But α = n^{1/2} log(n) can be obtained efficiently.

136


Learning submodular functions
[Balcan, Harvey STOC '11]

! Sample m sets A_1, …, A_m from a distribution D; observe F(A_1), …, F(A_m)
! From this, want to generalize well
! F̂ is (α, ε, δ)-PMAC iff with probability 1 − δ:
  P_{A∼D}[ F̂(A) ≤ F(A) ≤ α F̂(A) ] ≥ 1 − ε

Theorem: cannot approximate better than α = n^{1/3} / log(n)
unless one looks at exponentially many samples A_i.
But α = n^{1/2} can be obtained efficiently.

137


What if we have structure?

! To learn effectively, we need additional assumptions beyond submodularity
! Sparsity in the Fourier domain [Stobbe & Krause '12]
  ! Sparsity: most coefficients ≈ 0
  ! "Submodular" compressive sensing
  ! Cuts and many other functions are sparse in the Fourier domain!
! Can also learn XOS valuations [Balcan et al. '12]

138


Compressive sensing with set functions
[Stobbe & Krause '12]

Theorem: Suppose the number of random measurements is proportional to the sparsity times log factors. Then we can recover f̂ by solving:

    min ||x||_1   s.t.   f_M = M x

Random Hadamard-Walsh basis functions satisfy RIP w.h.p.
Can also handle noisy observations
Many more details [see paper]

139
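The recovery problem min ||x||_1 s.t. f_M = M x is a standard basis-pursuit linear program. Below is a minimal sketch of that generic LP reformulation using scipy; it is not the authors' implementation, and in the paper the rows of M would be Hadamard-Walsh basis evaluations rather than the arbitrary matrix assumed here.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(M, y):
    """Solve  min ||x||_1  s.t.  M x = y  as a linear program.
    M: (m, n) measurement matrix; y: (m,) measured function values f_M."""
    m, n = M.shape
    # variables z = [x, u] with the constraint |x_i| <= u_i
    c = np.concatenate([np.zeros(n), np.ones(n)])     # minimize sum of u
    A_eq = np.hstack([M, np.zeros((m, n))])           # M x = y
    I = np.eye(n)
    A_ub = np.vstack([np.hstack([I, -I]),             #  x - u <= 0
                      np.hstack([-I, -I])])           # -x - u <= 0
    b_ub = np.zeros(2 * n)
    bounds = [(None, None)] * n + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
                  bounds=bounds, method="highs")
    return res.x[:n]                                   # recovered coefficients
```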


Graph Evolution Results

! Autonomous Systems Graph (from SNAP)
• Tracking evolution of a 128-vertex subgraph using random cuts
• Δ = number of differences between graphs

[Figure: reconstruction error (0 to 1) vs. number of measurements (0 to 1400), for week 1 to 3 (Δ=245), week 3 to 5 (Δ=122), week 5 to 7 (Δ=62), week 7 to 9 (Δ=161)]

! For low error, observing random cuts suffices

140


What to do with submodular functions

Optimization
Minimization
Maximization
Online/adaptive optim.
Learning

141


Learning to optimize

! Have seen how to
! optimize submodular functions
! learn submodular functions

What if we only want to learn enough to optimize?

142


Learning to optimize submodular functions

! Structured prediction with submodular functions
! Learn function parameters to achieve a target minimizer
! Application: MAP inference; summarization
! Online submodular optimization
! Learn to pick a sequence of sets to maximize a sequence of (unknown) submodular functions
! Application: Building algorithm portfolios
! Adaptive submodular optimization
! Gradually build up a set, taking into account feedback
! Application: Sequential experimental design

143


Structured Prediction with SFs

! Given training data {(S_1, x_1), ..., (S_n, x_n)} and a parameterized family of SFs F, can we find parameters θ such that

    S_i ≈ argmin_S F(S; θ, x_i)     (with margin!)

! Approaches
! Parameter learning in submodular MRFs [Taskar et al. '04] → Minimization
! Learning mixtures of submodular shells [Lin & Bilmes '12] → Maximization

144


Learning to summarize [Lin & Bilmes '12]

! Input:
! Collection of documents d_k and human summaries S_k
! Goal:
! Learn parameterized submodular relevance & diversity

    F(S; d_k) = Σ_i α_i R_i(S; d_k) + Σ_j β_j D_j(S; d_k)

(relevance & diversity components, instantiated for each document k; the parameters are what we want to learn)

! Want to achieve that for all k:

    S_k ≈ argmax_{S: |S| ≤ B} F(S; d_k)

! Can efficiently find a solution with bounded generalization error!

145


Learning mixtures for summarization

! Training data: documents with human summaries

146


Learning to optimize submodular functions

! Structured prediction with submodular functions
! Learn function parameters to achieve a target minimizer
! Application: MAP inference; summarization
! Online submodular optimization
! Learn to pick a sequence of sets to maximize a sequence of (unknown) submodular functions
! Application: Building algorithm portfolios
! Adaptive submodular optimization
! Gradually build up a set, taking into account feedback
! Application: Sequential experimental design

147


Online maximization of submodular functions
[Streeter, Golovin NIPS '08]

Over time t = 1, …, T:
Pick sets: A_1, A_2, A_3, …, A_T
SFs:       F_1, F_2, F_3, …, F_T
Reward:    r_1 = F_1(A_1), r_2, r_3, …, r_T
Total:     Σ_t r_t → max
(Observe either F_t, or only F_t(A_t))

Theorem
Can efficiently choose A_1, …, A_T s.t., in expectation, for any sequence F_i, as T → ∞ we get 'no-regret' over the 'omniscient' greedy algorithm

148
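The sketch below illustrates the idea behind the online greedy algorithm in its full-information variant: one multiplicative-weights ("Hedge") learner per slot, each rewarded with the marginal gain its slot contributes. This is a simplified illustration under the assumption that all F_t take values in [0, 1]; the names (Hedge, online_greedy) and the learning rate are ours, and the bandit-feedback case treated in the paper needs additional estimation machinery.

```python
import numpy as np

class Hedge:
    """Multiplicative-weights learner over the ground set (full information)."""
    def __init__(self, n, eta=0.1):
        self.w = np.ones(n)
        self.eta = eta

    def sample(self, rng):
        p = self.w / self.w.sum()
        return rng.choice(len(self.w), p=p)

    def update(self, rewards):
        # rewards: per-element rewards, assumed to lie in [0, 1]
        self.w *= np.exp(self.eta * np.asarray(rewards))

def online_greedy(V, k, functions, seed=0):
    """Pick a size-k set in every round; slot j is controlled by its own
    Hedge learner and is rewarded with the marginal gain of each element
    on top of the elements chosen by slots 1..j-1 in that round."""
    rng = np.random.default_rng(seed)
    learners = [Hedge(len(V)) for _ in range(k)]
    total = 0.0
    for F in functions:                              # F_t arrives over time
        A = set()
        for lrn in learners:
            idx = lrn.sample(rng)                    # this slot's proposal
            base = F(A)
            gains = [F(A | {e}) - base for e in V]   # full-information feedback
            lrn.update(np.clip(gains, 0.0, 1.0))
            A.add(V[idx])
        total += F(A)
    return total
```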


Application: Learning to solve SAT quickly
[Streeter, Golovin, Smith]

• Hybridization via task-switching schedules

[Figure: example schedule S interleaving runs of heuristics #1-#4 for 2, 4, 6, 2, and 10 seconds along the time axis]

• Need to learn which schedules work well!


Hybridizing Solvers
[Streeter & Golovin '08]

• V = {run heuristic h for t seconds : h ∈ H, t > 0}
• f_i(S) = Pr[schedule S completes job i within the time limit]
• This is a submodular function! ☺
• Task: select {S_i : i = 1, 2, …, T} online to maximize # instances solved
• This is an online submodular maximization problem! ☺
• The online greedy algorithm achieves no-(1−1/e)-regret
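One way to picture f_i(S) in code: treat Monte-Carlo samples of each heuristic's runtime on instance i as "scenarios" that a schedule either covers (some run finishes within its budget) or not; the coverage view makes submodularity in S apparent. The data layout and names below are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def schedule_success_prob(schedule, runtime_samples):
    """Estimate f_i(S) = Pr[schedule S solves instance i] from Monte-Carlo
    scenarios: sample j assigns each heuristic a runtime on this instance;
    the schedule covers scenario j if any of its runs finishes in time.

    schedule:        list of (heuristic, budget_in_seconds) pairs
    runtime_samples: dict heuristic -> np.array of sampled runtimes (same length)
    """
    n = len(next(iter(runtime_samples.values())))
    covered = np.zeros(n, dtype=bool)
    for h, budget in schedule:
        covered |= runtime_samples[h] <= budget
    return covered.mean()

# toy usage with hypothetical exponential runtimes
rng = np.random.default_rng(0)
samples = {"h1": rng.exponential(5.0, 1000), "h2": rng.exponential(2.0, 1000)}
print(schedule_success_prob([("h1", 2.0), ("h2", 4.0)], samples))
```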


Example Schedule
[Streeter, Golovin, Smith, AAAI '07 & CSP '08]

[Figure: example learned schedule (log scale)]

[Figure: results (2008)]


Other results on online submodular optimization

! Online submodular maximization
! No-(1−1/e)-regret for ranking (partition matroids) [Streeter, Golovin, Krause 2009]
! Distributed implementation [Golovin, Faulkner, Krause 2010]
! Improved bounds for bandits with linear combinations of SFs [Yue, Guestrin NIPS 2011]
! Online submodular coverage
! Min-cost / min-sum submodular cover [Streeter & Golovin NIPS 2008]
! Guillory & Bilmes [NIPS 2011]
! Online submodular minimization
! Unconstrained [Hazan & Kale NIPS 2009]
! Constrained [Jegelka & Bilmes ICML 2011]

153


What to do with submodular functions

Optimization
Minimization
Maximization
Online/adaptive optim.
Learning

154


Learning to optimize submodular functions

! Structured prediction with submodular functions
! Learn function parameters to achieve a target minimizer
! Application: MAP inference; summarization
! Online submodular optimization
! Learn to pick a sequence of sets to maximize a sequence of (unknown) submodular functions
! Application: Building algorithm portfolios
! Adaptive submodular optimization
! Gradually build up a set, taking into account feedback
! Application: Sequential experimental design

155


Adaptive Sensing / Diagnosis

[Figure: adaptive policy tree branching on test outcomes x_1, x_2, x_3, x_4 ∈ {0, 1}]

Want to effectively diagnose while minimizing the cost of testing!

Classical submodularity does not apply ☹
Can we generalize submodularity for sequential decision making?

156


Adaptive selection in diagnosis

! Prior over diseases P(Y)
! Deterministic test outcomes P(X_V | Y)
! Each test eliminates hypotheses y

[Figure: naive Bayes model with Y = "Sick" and tests X_1 = "Fever", X_2 = "Rash", X_3 = "Cough"; observing X_1 = 1, X_2 = 0/1, X_3 = 0 successively eliminates states y]

157


Problem Statement

Given:
! Items (tests, experiments, actions, …) V = {1, …, n}
! Associated with random variables X_1, …, X_n taking values in O
! Objective: f : 2^V × O^V → R

! Policy π maps observations x_A to the next item
Value of policy π:

    F(π) = E_{x_V}[ f(S(π, x_V), x_V) ],  where S(π, x_V) = tests run by π if the world is in state x_V

Want π* = argmax_π F(π)
NP-hard (also hard to approximate!)

158


Adaptive greedy algorithm

! Suppose we've seen X_A = x_A.
! Conditional expected benefit of adding item s:

    Δ(s | x_A) = E[ f(A ∪ {s}, x_V) − f(A, x_V) | x_A ]

(benefit if the world is in state x_V, conditional on the observations x_A)

Adaptive Greedy algorithm:
Start with A = ∅
For i = 1:k
! Pick s* ∈ argmax_s Δ(s | x_A)
! Observe x_{s*}
! Set A ← A ∪ {s*}

When does this adaptive greedy algorithm work??

159
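A minimal sketch of the adaptive greedy loop above, estimating the conditional expected benefit Δ(s | x_A) by Monte-Carlo over realizations consistent with the observations so far. The interfaces sample_xV, observe, and f are placeholders we introduce for illustration.

```python
import numpy as np

def adaptive_greedy(V, f, sample_xV, observe, k, n_samples=200):
    """Adaptive greedy: repeatedly pick the item with the largest estimated
    conditional expected benefit Delta(s | x_A), then observe its outcome.

    V:         list of items (tests)
    f:         f(A, xV) -> float, value of chosen set A under full realization xV
    sample_xV: callable(x_A) -> one full realization drawn from P(X_V | x_A)
    observe:   callable(item) -> the item's true outcome (actually run the test)
    """
    A, x_A = set(), {}
    for _ in range(k):
        # Monte-Carlo estimate of Delta(s | x_A) for each remaining item
        realizations = [sample_xV(x_A) for _ in range(n_samples)]
        def benefit(s):
            return np.mean([f(A | {s}, xV) - f(A, xV) for xV in realizations])
        s_star = max((s for s in V if s not in A), key=benefit)
        x_A[s_star] = observe(s_star)     # run the chosen test, record its outcome
        A.add(s_star)
    return A, x_A
```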


Adaptive submodularity
[Golovin & Krause, JAIR 2011]

Adaptive monotonicity:   Δ(s | x_A) ≥ 0
Adaptive submodularity:  Δ(s | x_A) ≥ Δ(s | x_B)  whenever x_B observes more than x_A

Theorem: If f is adaptive submodular and adaptive monotone w.r.t. the distribution P, then

    F(π_greedy) ≥ (1 − 1/e) F(π_opt)

Many other results about submodular set functions can also be "lifted" to the adaptive setting!

160


From sets to policies

Submodularity → Adaptive submodularity:
• Applies to: set functions → policies, value functions
• Marginal gain: F(s | A) = F(A ∪ {s}) − F(A)  →  Δ(s | x_A) = E[ f(A ∪ {s}, x_V) − f(A, x_V) | x_A ]
• Monotonicity: F(s | A) ≥ 0  →  Δ(s | x_A) ≥ 0
• Submodularity: A ⊆ B ⇒ F(s | A) ≥ F(s | B)  →  x_A ≼ x_B ⇒ Δ(s | x_A) ≥ Δ(s | x_B)
• Objective: max_A F(A)  →  max_π F(π)

Greedy algorithm / greedy policy provides:
• (1 − 1/e) for max. with cardinality constraint
• 1/(p+1) for p-independent systems
• log Q for min-cost-cover
• 4 for min-sum-cover

161


Optimal Diagnosis

! Prior over diseases P(Y)
! Deterministic test outcomes P(X_V | Y)
! How should we test to eliminate all incorrect hypotheses?

"Generalized binary search"
Equivalent to max. infogain

[Figure: naive Bayes model with Y = "Sick" and tests X_1 = "Fever", X_2 = "Rash", X_3 = "Cough"; outcomes X_1 = 1, X_2 = 0/1, X_3 = 0 eliminate hypotheses]

162


OD is Adaptive Submodular

Objective = probability mass of hypotheses you have ruled out.

[Figure: policy tree; after observing tests v and w (outcomes 0 and 1), consider adding test s]

    Δ(s | {}) = 2·g_0·b_0 / (g_0 + b_0)
    Δ(s | x_{v,w}) = 2·g_1·b_1 / (g_1 + b_1)

(g, b = probability mass of the remaining hypotheses for which test s comes out 1 / 0)

Not hard to show that Δ(s | {}) ≥ Δ(s | x_{v,w})

163
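For generalized binary search, the expected mass ruled out by a test has exactly the 2gb/(g+b) form above. A small sketch, assuming deterministic test outcomes per hypothesis; the dictionary-based representation and the name gbs_benefit are our own choices.

```python
def gbs_benefit(prior, predictions, s, observed):
    """Expected probability mass ruled out by running test s, given the
    outcomes observed so far (generalized binary search objective).

    prior:       dict hypothesis -> prior probability
    predictions: dict (hypothesis, test) -> deterministic outcome in {0, 1}
    observed:    dict test -> outcome already seen
    """
    # hypotheses still consistent with the observations
    alive = [y for y in prior
             if all(predictions[(y, t)] == o for t, o in observed.items())]
    g = sum(prior[y] for y in alive if predictions[(y, s)] == 1)  # mass predicting 1
    b = sum(prior[y] for y in alive if predictions[(y, s)] == 0)  # mass predicting 0
    if g + b == 0:
        return 0.0
    # outcome 1 rules out mass b, outcome 0 rules out mass g:
    return (g * b + b * g) / (g + b)          # = 2*g*b/(g+b), as on the slide
```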


Theoretical guarantees

[Figure: adaptive policy tree over test outcomes x_1, x_2, x_3 ∈ {0, 1}]

Garey & Graham, 1974; Loveland, 1985; Arkin et al., 1993; Kosaraju et al., 1999; Dasgupta, 2004; Guillory & Bilmes, 2009; Nowak, 2009; Gupta et al., 2010

With adaptive submodular analysis!

Result requires that tests are exact (no noise)!


What if there is noise?
[with Daniel Golovin, Deb Ray, NIPS '10]

! Prior over diseases P(Y)
! Noisy test outcomes P(X_V | Y)
! How should we test to learn about y (infer the MAP)?

[Figure: naive Bayes model with Y = "Sick" and noisy tests X_1 = "Fever", X_2 = "Rash", X_3 = "Cough"]

! Existing approaches:
! Generalized binary search?
! Maximize information gain?
! Maximize value of information?
Not adaptive submodular!

Theorem: All these approaches can have cost more than n/log n times the optimal cost!

→ Is there an adaptive submodular criterion??

165


Theoretical guarantees
[with Daniel Golovin, Deb Ray, NIPS '10]

Theorem: Equivalence class edge-cutting (EC²) is adaptive monotone and adaptive submodular.

Suppose P(x_V, h) ∈ {0} ∪ [δ, 1] for all x_V, h.
Then it holds that

    Cost(π_Greedy) ≤ O(log(1/δ)) · Cost(π*)

First approximation guarantees for nonmyopic VOI in general graphical models!

166


Example: The Iowa Gambling Task
[with Colin Camerer, Deb Ray]

What would you prefer?
[Figure: two gambles A) and B), each a distribution over outcomes −10$, 0$, +10$ with probabilities .7 / .3]

Various competing theories on how people make decisions under uncertainty:
! Maximize expected utility? [von Neumann & Morgenstern '47]
! Constant relative risk aversion? [Pratt '64]
! Portfolio optimization? [Hanoch & Levy '70]
! (Normalized) Prospect theory? [Kahneman & Tversky '79]

How should we design tests to distinguish theories?

167


Iowa Gambling as BED

Every possible test X_s = (g_{s,1}, g_{s,2}) is a pair of gambles
Theories are parameterized by θ
Each theory predicts a utility U(g, y, θ) for every gamble

    P(X_s = 1 | y, θ) = 1 / (1 + exp(U(g_{s,1}, y, θ) − U(g_{s,2}, y, θ)))

[Figure: graphical model with theory Y, parameters θ, and tests X_1 = (g_{1,1}, g_{1,2}), X_2 = (g_{2,1}, g_{2,2}), …, X_n = (g_{n,1}, g_{n,2}); plot of P(X_i = 1 | Δ_U) as a logistic function of the difference in utility Δ_U]

168
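The likelihood above is just a logistic function of the utility difference between the two gambles. A one-line sketch; which gamble the outcome X_s = 1 refers to follows the slide's formula, and U stands for any theory's (assumed) utility function.

```python
import math

def response_prob(U, g1, g2, y, theta):
    """P(X_s = 1 | y, theta) for the test X_s = (g1, g2), as on the slide:
    a logistic function of the difference in predicted utilities."""
    return 1.0 / (1.0 + math.exp(U(g1, y, theta) - U(g2, y, theta)))
```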


Simulation Results

[Figure: accuracy (0.2 to 1.0) vs. number of tests (0 to 30) for Adaptive Submodular BED (EC²), VOI, InfoGain, Uncertainty Sampling, Generalized binary search, and Random]

Adaptive submodular criterion (EC²) outperforms existing approaches

169


Experimental Study
[with Colin Camerer, Deb Ray]

Study with 57 naïve subjects
32,000 designs
40s per test ☹
Exploiting submodularity:

[Figure: number of subjects classified (0 to 30) per theory: expected value; mean, var., skewness; prospect theory; constant relative risk aversion]

! Strongest support for PT, with some heterogeneity
! Unexpectedly, no support for CRRA


Interactive submodular coverage
[Guillory & Bilmes, ICML '10]

! Alternative formalization of adaptive optimization
! Addresses the worst-case setting
! Applications to (noisy) active learning, viral marketing [Guillory & Bilmes, ICML '11]

171


Other directions

! Game theory
! Equilibria in cooperative (supermodular) games / fair allocations
! Price of anarchy in non-cooperative games
! Incentive-compatible submodular optimization
! New algorithms for submodular maximization
! Robust submodular optimization
! Generalizations of submodular functions
! L♮-convex functions / discrete convex analysis
! XOS / subadditive functions
! Efficient minimization of subclasses

172


Further resources

! submodularity.org
! References
! Matlab Toolbox for Submodular Optimization
! discml.cc
! NIPS Workshops on Discrete Optimization in Machine Learning
! Videos of invited talks on videolectures.net
! Submit to DISCML 2012! ☺
! ...

173


Conclusions

! Discrete optimization is abundant in applications
! Fortunately, some of those problems have structure: submodularity
! Submodularity can be exploited to develop efficient, scalable algorithms with strong guarantees
! Can handle complex constraints
! Can learn to optimize (online, adaptive, …)

174
