27.10.2014 Views

slides

slides

slides

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

Dynamic Programming Algorithms<br />

in Semiring and Hypergraph Frameworks<br />

(University of Pennsylvania)<br />

<br />

Fall 2007


Guest Lectures: Outline<br />

• I plan to teach 3 topics on Algorithms in NLP/MT<br />

• Dynamic Programming: An Advanced Survey<br />

• Semiring and Hypergraph Frameworks<br />

• Viterbi and Dijkstra/Knuth Algorithms<br />

• k-best Algorithms<br />

• k-best parsing and MT decoding<br />

• other k-best problems and algorithms<br />

• Perceptron-like Algorithms<br />

Liang Huang<br />

2<br />

Dynamic Programming


Dynamic Programming<br />

Liang Huang<br />

3<br />

Dynamic Programming


Dynamic Programming<br />

• Dynamic Programming is everywhere in NLP<br />

•<br />

• CKY Algorithm for Context-free Parsing<br />

•<br />

Viterbi Algorithm for Hidden Markov Models<br />

Forward-Backward and Inside-Outside Algorithms<br />

• A* Parsing 3<br />

Liang Huang<br />

Dynamic Programming


Dynamic Programming<br />

• Dynamic Programming is everywhere in NLP<br />

•<br />

• CKY Algorithm for Context-free Parsing<br />

•<br />

• A* Parsing<br />

• Also everywhere in Machine Translation<br />

• Decoding: Phrase-based, Syntax-based, etc.<br />

• Alignment: IBM models, HMM model, etc.<br />

Viterbi Algorithm for Hidden Markov Models<br />

Forward-Backward and Inside-Outside Algorithms<br />

Liang Huang<br />

3<br />

Dynamic Programming


Dynamic Programming<br />

• Dynamic Programming is everywhere in NLP<br />

•<br />

• CKY Algorithm for Context-free Parsing<br />

•<br />

• A* Parsing<br />

• Also everywhere in Machine Translation<br />

• Decoding: Phrase-based, Syntax-based, etc.<br />

• Alignment: IBM models, HMM model, etc.<br />

• Focus on Optimization Problems<br />

Liang Huang<br />

Viterbi Algorithm for Hidden Markov Models<br />

Forward-Backward and Inside-Outside Algorithms<br />

3<br />

Dynamic Programming


Two Dimensional Survey<br />

traversing order<br />

search space<br />

Viterbi<br />

Generalized<br />

Viterbi<br />

Dijkstra<br />

Knuth<br />

Liang Huang<br />

4<br />

Dynamic Programming


Graphs in NLP<br />

Liang Huang<br />

5<br />

Dynamic Programming


Motivations for Semirings<br />

• in a weighted graph, we need two operators:<br />

•<br />

•<br />

•<br />

extension (multiplicative) and summary (additive)<br />

the weight of a path is the product of edge weights<br />

the weight of a vertex is the summary of path weights<br />

e1<br />

e2<br />

d(π 1 )= ⊗<br />

e i ∈π 1<br />

w(e i )=w(e 1 ) ⊗ w(e 2 ) ⊗ w(e 3 )<br />

e3<br />

...<br />

...<br />

d(t) = ⊕ π i<br />

w(π i )<br />

= w(p 1 ) ⊕ w(p 2 ) ⊕···<br />

Liang Huang<br />

6<br />

Dynamic Programming


Definitions<br />

A monoid is a triple (A, ⊗, 1) where<br />

1. ⊗ is a closed associative binary operator on the set A,<br />

2. 1istheidentity element for ⊗, i.e., for all a ∈ A, a ⊗ 1=1 ⊗ a = a.<br />

A monoid is commutative if ⊗ is commutative.<br />

Liang Huang<br />

7<br />

Dynamic Programming


Definitions<br />

A monoid is a triple (A, ⊗, 1) where<br />

1. ⊗ is a closed associative binary operator on the set A,<br />

2. 1istheidentity element for ⊗, i.e., for all a ∈ A, a ⊗ 1=1 ⊗ a = a.<br />

A monoid is commutative if ⊗ is commutative. ([0, 1], +, 0)<br />

([0, 1], , 1)<br />

([0, 1], max, 0)<br />

Liang Huang<br />

7<br />

Dynamic Programming


Definitions<br />

A monoid is a triple (A, ⊗, 1) where<br />

1. ⊗ is a closed associative binary operator on the set A,<br />

2. 1istheidentity element for ⊗, i.e., for all a ∈ A, a ⊗ 1=1 ⊗ a = a.<br />

A monoid is commutative if ⊗ is commutative.<br />

A semiring is a 5-tuple R =(A, ⊕, ⊗, 0, 1) such that<br />

1. (A, ⊕, 0) is a commutative monoid.<br />

([0, 1], +, 0)<br />

([0, 1], , 1)<br />

([0, 1], max, 0)<br />

2. (A, ⊗, 1) is a monoid.<br />

3. ⊗ distributes over ⊕: for all a, b, c in A,<br />

(a ⊕ b) ⊗ c =(a ⊗ c) ⊕ (b ⊗ c),<br />

c ⊗ (a ⊕ b) =(c ⊗ a) ⊕ (c ⊗ b).<br />

4. 0isanannihilator for ⊗: for all a in A, 0 ⊗ a = a ⊗ 0=0.<br />

Liang Huang<br />

7<br />

Dynamic Programming


Definitions<br />

A monoid is a triple (A, ⊗, 1) where<br />

1. ⊗ is a closed associative binary operator on the set A,<br />

2. 1istheidentity element for ⊗, i.e., for all a ∈ A, a ⊗ 1=1 ⊗ a = a.<br />

A monoid is commutative if ⊗ is commutative.<br />

A semiring is a 5-tuple R =(A, ⊕, ⊗, 0, 1) such that<br />

1. (A, ⊕, 0) is a commutative monoid.<br />

2. (A, ⊗, 1) is a monoid.<br />

3. ⊗ distributes over ⊕: for all a, b, c in A,<br />

([0, 1], +, 0)<br />

([0, 1], , 1)<br />

([0, 1], max, 0)<br />

([0, 1], max, , 0, 1)<br />

(a ⊕ b) ⊗ c =(a ⊗ c) ⊕ (b ⊗ c),<br />

c ⊗ (a ⊕ b) =(c ⊗ a) ⊕ (c ⊗ b).<br />

4. 0isanannihilator for ⊗: for all a in A, 0 ⊗ a = a ⊗ 0=0.<br />

Liang Huang<br />

7<br />

Dynamic Programming


Definitions<br />

A monoid is a triple (A, ⊗, 1) where<br />

1. ⊗ is a closed associative binary operator on the set A,<br />

2. 1istheidentity element for ⊗, i.e., for all a ∈ A, a ⊗ 1=1 ⊗ a = a.<br />

A monoid is commutative if ⊗ is commutative.<br />

A semiring is a 5-tuple R =(A, ⊕, ⊗, 0, 1) such that<br />

1. (A, ⊕, 0) is a commutative monoid.<br />

2. (A, ⊗, 1) is a monoid.<br />

3. ⊗ distributes over ⊕: for all a, b, c in A,<br />

(a ⊕ b) ⊗ c =(a ⊗ c) ⊕ (b ⊗ c),<br />

c ⊗ (a ⊕ b) =(c ⊗ a) ⊕ (c ⊗ b).<br />

([0, 1], +, 0)<br />

([0, 1], , 1)<br />

([0, 1], max, 0)<br />

([0, 1], max, , 0, 1)<br />

([0, 1], +, , 0, 1)<br />

4. 0isanannihilator for ⊗: for all a in A, 0 ⊗ a = a ⊗ 0=0.<br />

Liang Huang<br />

7<br />

Dynamic Programming


Examples<br />

Semiring Set ⊕ ⊗ 0 1 intuition/application<br />

Boolean {0, 1} ∨ ∧ 0 1 logical deduction, recognition<br />

Viterbi [0, 1] max × 0 1 prob. of the best derivation<br />

Inside R + ∪{+∞} + × 0 1 prob. of a string<br />

Real R ∪{+∞} min + +∞ 0 shortest-distance<br />

Tropical R + ∪{+∞} min + +∞ 0 with non-negative weights<br />

Counting N + × 0 1 number of paths<br />

how about ?<br />

Liang Huang<br />

8<br />

Dynamic Programming


Ordering<br />

Liang Huang<br />

9<br />

Dynamic Programming


Ordering<br />

• idempotent 9<br />

A semiring (A, ⊕, ⊗, 0, 1) is idempotent if for all a in A, a ⊕ a = a.<br />

Liang Huang<br />

Dynamic Programming


Ordering<br />

• idempotent<br />

A semiring (A, ⊕, ⊗, 0, 1) is idempotent if for all a in A, a ⊕ a = a.<br />

• comparison<br />

(a ≤ b) ⇔ (a ⊕ b = a) defines a partial ordering.<br />

• examples: boolean, viterbi, tropical, real, ...<br />

({0, 1}, ∨, ∧, 0, 1) (R + ∪{+∞}, min, +, +∞, 0)<br />

([0, 1], max, ⊗, 0, 1) (R ∪{+∞}, min, +, +∞, 0)<br />

Liang Huang<br />

9<br />

Dynamic Programming


Ordering<br />

• idempotent<br />

A semiring (A, ⊕, ⊗, 0, 1) is idempotent if for all a in A, a ⊕ a = a.<br />

• comparison<br />

(a ≤ b) ⇔ (a ⊕ b = a) defines a partial ordering.<br />

• examples: boolean, viterbi, tropical, real, ...<br />

({0, 1}, ∨, ∧, 0, 1) (R + ∪{+∞}, min, +, +∞, 0)<br />

([0, 1], max, ⊗, 0, 1) (R ∪{+∞}, min, +, +∞, 0)<br />

• total-order for optimization problems<br />

A semiring is totally-ordered if ⊕ defines a total ordering.<br />

• examples: all of the above 9<br />

Liang Huang<br />

Dynamic Programming


Monotonicity<br />

Liang Huang<br />

10<br />

Dynamic Programming


Monotonicity<br />

• monotonicity 10<br />

Liang Huang<br />

Dynamic Programming


Monotonicity<br />

• monotonicity 10<br />

Let K =(A, ⊕, ⊗, 0, 1) be a semiring, and ≤ a partial ordering over A.<br />

We say K is monotonic if for all a, b, c ∈ A<br />

(a ≤ b) ⇒ (a ⊗ c ≤ b ⊗ c) (a ≤ b) ⇒ (c ⊗ a ≤ c ⊗ b)<br />

Liang Huang<br />

Dynamic Programming


• monotonicity<br />

Monotonicity<br />

Let K =(A, ⊕, ⊗, 0, 1) be a semiring, and ≤ a partial ordering over A.<br />

We say K is monotonic if for all a, b, c ∈ A<br />

(a ≤ b) ⇒ (a ⊗ c ≤ b ⊗ c) (a ≤ b) ⇒ (c ⊗ a ≤ c ⊗ b)<br />

• optimal substructure in dynamic programming<br />

Liang Huang<br />

10<br />

Dynamic Programming


• monotonicity<br />

Monotonicity<br />

Let K =(A, ⊕, ⊗, 0, 1) be a semiring, and ≤ a partial ordering over A.<br />

We say K is monotonic if for all a, b, c ∈ A<br />

(a ≤ b) ⇒ (a ⊗ c ≤ b ⊗ c) (a ≤ b) ⇒ (c ⊗ a ≤ c ⊗ b)<br />

• optimal substructure in dynamic programming<br />

A: b ⊗ c<br />

Liang Huang<br />

10<br />

Dynamic Programming


• monotonicity<br />

Monotonicity<br />

Let K =(A, ⊕, ⊗, 0, 1) be a semiring, and ≤ a partial ordering over A.<br />

We say K is monotonic if for all a, b, c ∈ A<br />

(a ≤ b) ⇒ (a ⊗ c ≤ b ⊗ c) (a ≤ b) ⇒ (c ⊗ a ≤ c ⊗ b)<br />

• optimal substructure in dynamic programming<br />

A: b ⊗ c<br />

A: b’ ⊗ c<br />

b ⊗ c<br />

<br />

Liang Huang<br />

10<br />

Dynamic Programming


• monotonicity<br />

Monotonicity<br />

Let K =(A, ⊕, ⊗, 0, 1) be a semiring, and ≤ a partial ordering over A.<br />

We say K is monotonic if for all a, b, c ∈ A<br />

(a ≤ b) ⇒ (a ⊗ c ≤ b ⊗ c) (a ≤ b) ⇒ (c ⊗ a ≤ c ⊗ b)<br />

• optimal substructure in dynamic programming<br />

A: b ⊗ c<br />

• [lemma] idempotent => monotone<br />

A: b’ ⊗ c<br />

b ⊗ c<br />

<br />

Liang Huang<br />

10<br />

Dynamic Programming


• monotonicity<br />

Monotonicity<br />

Let K =(A, ⊕, ⊗, 0, 1) be a semiring, and ≤ a partial ordering over A.<br />

We say K is monotonic if for all a, b, c ∈ A<br />

(a ≤ b) ⇒ (a ⊗ c ≤ b ⊗ c) (a ≤ b) ⇒ (c ⊗ a ≤ c ⊗ b)<br />

• optimal substructure in dynamic programming<br />

A: b ⊗ c<br />

A: b’ ⊗ c<br />

b ⊗ c<br />

<br />

• [lemma] idempotent => monotone<br />

•<br />

Liang Huang<br />

our focus, totally-ordered semirings, are monotone<br />

10<br />

Dynamic Programming


• monotonicity<br />

Monotonicity<br />

Let K =(A, ⊕, ⊗, 0, 1) be a semiring, and ≤ a partial ordering over A.<br />

We say K is monotonic if for all a, b, c ∈ A<br />

(a ≤ b) ⇒ (a ⊗ c ≤ b ⊗ c) (a ≤ b) ⇒ (c ⊗ a ≤ c ⊗ b)<br />

• optimal substructure in dynamic programming<br />

A: b ⊗ c<br />

A: b’ ⊗ c<br />

b ⊗ c<br />

<br />

• [lemma] idempotent => monotone<br />

•<br />

Liang Huang<br />

our focus, totally-ordered semirings, are monotone<br />

10<br />

free lunch!<br />

Dynamic Programming


Two Dimensional Survey<br />

traversing order<br />

search space<br />

Viterbi<br />

Generalized<br />

Viterbi<br />

Dijkstra<br />

Knuth<br />

Liang Huang<br />

11<br />

Dynamic Programming


DP on Graphs<br />

• optimization problems on graphs<br />

=> generic shortest-path problem<br />

• weighted directed graph G=(V, E) with a function w<br />

that assigns each edge a weight from a semiring<br />

• compute the best weight of the target vertex t<br />

• generic update along edge (u, v)<br />

w(u, v) d(v) ⊕ = d(u) ⊗ w(u, v)<br />

• how to avoid cyclic updates?<br />

•<br />

only update when d(u) is fixed<br />

d(v) ← d(v) ⊕ (d(u) ⊗ w(u, v))<br />

Liang Huang<br />

12<br />

Dynamic Programming


Viterbi Algorithm for DAGs<br />

1. topological sort<br />

2. visit each vertex v in sorted order and do updates<br />

• for each incoming edge (u, v) in E<br />

• use d(u) to update d(v):<br />

•<br />

key observation: d(u) is fixed to optimal at this time<br />

w(u, v)<br />

d(v) ⊕ = d(u) ⊗ w(u, v)<br />

• time complexity: O( V + E )<br />

Liang Huang<br />

13<br />

Dynamic Programming


Variant 1: forward-update<br />

1. topological sort<br />

2. visit each vertex v in sorted order and do updates<br />

• for each outgoing edge (v, u) in E<br />

• use d(v) to update d(u):<br />

•<br />

d(u) ⊕ = d(v) ⊗ w(v, u)<br />

key observation: d(v) is fixed to optimal at this time<br />

w(v, u)<br />

• time complexity: O( V + E )<br />

Liang Huang<br />

14<br />

Dynamic Programming


Examples<br />

Liang Huang<br />

15<br />

Dynamic Programming


Examples<br />

• [Number of Paths in a DAG]<br />

Liang Huang<br />

15<br />

Dynamic Programming


Examples<br />

• [Number of Paths in a DAG]<br />

•<br />

•<br />

just use the counting semiring (N, +, , 0, 1)<br />

note: this is not an optimization problem!<br />

Liang Huang<br />

15<br />

Dynamic Programming


Examples<br />

• [Number of Paths in a DAG]<br />

•<br />

•<br />

just use the counting semiring (N, +, , 0, 1)<br />

note: this is not an optimization problem!<br />

• [Longest Path in a DAG] 15<br />

Liang Huang<br />

Dynamic Programming


Examples<br />

• [Number of Paths in a DAG]<br />

•<br />

•<br />

• [Longest Path in a DAG]<br />

just use the counting semiring (N, +, , 0, 1)<br />

note: this is not an optimization problem!<br />

• just use the semiring 15<br />

(R ∪{−∞}, max, +, −∞, 0)<br />

Liang Huang<br />

Dynamic Programming


Examples<br />

• [Number of Paths in a DAG]<br />

•<br />

•<br />

• [Longest Path in a DAG]<br />

• just use the semiring<br />

•<br />

just use the counting semiring (N, +, , 0, 1)<br />

note: this is not an optimization problem!<br />

(R ∪{−∞}, max, +, −∞, 0)<br />

[Part-of-Speech Tagging with a Hidden Markov Model]<br />

Liang Huang<br />

15<br />

Dynamic Programming


Examples<br />

• [Number of Paths in a DAG]<br />

•<br />

•<br />

• [Longest Path in a DAG]<br />

• just use the semiring<br />

•<br />

just use the counting semiring (N, +, , 0, 1)<br />

note: this is not an optimization problem!<br />

(R ∪{−∞}, max, +, −∞, 0)<br />

[Part-of-Speech Tagging with a Hidden Markov Model]<br />

Liang Huang<br />

15<br />

Dynamic Programming


Example: Word-based Decoding<br />

Liang Huang<br />

16<br />

Dynamic Programming


Example: Word Alignment<br />

Liang Huang<br />

17<br />

Dynamic Programming


Dijkstra Algorithm<br />

Liang Huang<br />

18<br />

Dynamic Programming


Dijkstra Algorithm<br />

• Dijkstra does not require acyclicity<br />

•<br />

•<br />

instead of topological order, we use best-first order<br />

but this requires superiority of the semiring<br />

Let K =(A, ⊕, ⊗, 0, 1) be a semiring, and ≤ a partial ordering over A.<br />

We say K is superior if for all a, b ∈ A<br />

a ≤ a ⊗ b, b ≤ a ⊗ b.<br />

• basically, combination always gets worse<br />

• or, no negative edge in a graph<br />

Liang Huang<br />

18<br />

Dynamic Programming


Dijkstra Algorithm<br />

• Dijkstra does not require acyclicity<br />

•<br />

•<br />

instead of topological order, we use best-first order<br />

but this requires superiority of the semiring<br />

Let K =(A, ⊕, ⊗, 0, 1) be a semiring, and ≤ a partial ordering over A.<br />

We say K is superior if for all a, b ∈ A<br />

a ≤ a ⊗ b, b ≤ a ⊗ b.<br />

• basically, combination always gets worse<br />

• or, no negative edge in a graph<br />

d(u)<br />

w(e)<br />

d(u) ⊗ w(e)<br />

Liang Huang<br />

18<br />

Dynamic Programming


Dijkstra Algorithm<br />

• Dijkstra does not require acyclicity<br />

•<br />

•<br />

instead of topological order, we use best-first order<br />

but this requires superiority of the semiring<br />

Let K =(A, ⊕, ⊗, 0, 1) be a semiring, and ≤ a partial ordering over A.<br />

We say K is superior if for all a, b ∈ A<br />

d(u)<br />

• basically, combination always gets worse<br />

• or, no negative edge in a graph<br />

w(e)<br />

a ≤ a ⊗ b, b ≤ a ⊗ b.<br />

d(u) ⊗ w(e)<br />

({0, 1}, ∨, ∧, 0, 1)<br />

([0, 1], max, ×, 0, 1)<br />

(R + ∪{+∞}, min, +, +∞, 0)<br />

(R ∪{+∞}, min, +, +∞, 0)<br />

Liang Huang<br />

18<br />

Dynamic Programming


Dijkstra Algorithm<br />

• Dijkstra does not require acyclicity<br />

•<br />

•<br />

instead of topological order, we use best-first order<br />

but this requires superiority of the semiring<br />

Let K =(A, ⊕, ⊗, 0, 1) be a semiring, and ≤ a partial ordering over A.<br />

We say K is superior if for all a, b ∈ A<br />

d(u)<br />

• basically, combination always gets worse<br />

• or, no negative edge in a graph<br />

w(e)<br />

a ≤ a ⊗ b, b ≤ a ⊗ b.<br />

d(u) ⊗ w(e)<br />

({0, 1}, ∨, ∧, 0, 1)<br />

([0, 1], max, ×, 0, 1)<br />

(R + ∪{+∞}, min, +, +∞, 0)<br />

(R ∪{+∞}, min, +, +∞, 0)<br />

Liang Huang<br />

18<br />

Dynamic Programming


Dijkstra Algorithm<br />

• Dijkstra does not require acyclicity<br />

•<br />

•<br />

instead of topological order, we use best-first order<br />

but this requires superiority of the semiring<br />

Let K =(A, ⊕, ⊗, 0, 1) be a semiring, and ≤ a partial ordering over A.<br />

We say K is superior if for all a, b ∈ A<br />

d(u)<br />

• basically, combination always gets worse<br />

• or, no negative edge in a graph<br />

w(e)<br />

a ≤ a ⊗ b, b ≤ a ⊗ b.<br />

d(u) ⊗ w(e)<br />

({0, 1}, ∨, ∧, 0, 1)<br />

([0, 1], max, ×, 0, 1)<br />

(R + ∪{+∞}, min, +, +∞, 0)<br />

(R ∪{+∞}, min, +, +∞, 0)<br />

Liang Huang<br />

18<br />

Dynamic Programming


Dijkstra Algorithm<br />

• Dijkstra does not require acyclicity<br />

•<br />

•<br />

instead of topological order, we use best-first order<br />

but this requires superiority of the semiring<br />

Let K =(A, ⊕, ⊗, 0, 1) be a semiring, and ≤ a partial ordering over A.<br />

We say K is superior if for all a, b ∈ A<br />

d(u)<br />

• basically, combination always gets worse<br />

• or, no negative edge in a graph<br />

w(e)<br />

a ≤ a ⊗ b, b ≤ a ⊗ b.<br />

d(u) ⊗ w(e)<br />

({0, 1}, ∨, ∧, 0, 1)<br />

([0, 1], max, ×, 0, 1)<br />

(R + ∪{+∞}, min, +, +∞, 0)<br />

(R ∪{+∞}, min, +, +∞, 0)<br />

Liang Huang<br />

18<br />

Dynamic Programming


Dijkstra Algorithm<br />

• Dijkstra does not require acyclicity<br />

•<br />

•<br />

instead of topological order, we use best-first order<br />

but this requires superiority of the semiring<br />

Let K =(A, ⊕, ⊗, 0, 1) be a semiring, and ≤ a partial ordering over A.<br />

We say K is superior if for all a, b ∈ A<br />

d(u)<br />

• basically, combination always gets worse<br />

• or, no negative edge in a graph<br />

w(e)<br />

a ≤ a ⊗ b, b ≤ a ⊗ b.<br />

d(u) ⊗ w(e)<br />

({0, 1}, ∨, ∧, 0, 1)<br />

([0, 1], max, ×, 0, 1)<br />

(R + ∪{+∞}, min, +, +∞, 0)<br />

(R ∪{+∞}, min, +, +∞, 0)<br />

Liang Huang<br />

18<br />

Dynamic Programming


Dijkstra Algorithm<br />

• keep a cut (S : V - S) where S vertices are fixed<br />

• maintain a priority queue Q of V - S vertices<br />

• each iteration choose the best vertex v from Q<br />

•<br />

move v to S, and use d(v) to forward-update others<br />

...<br />

d(u) ⊕ = d(v) ⊗ w(v, u)<br />

S<br />

V - S<br />

time complexity:<br />

O((V+E) lgV) (binary heap)<br />

O(V lgV + E) (fib. heap)<br />

Liang Huang<br />

19<br />

Dynamic Programming


Dijkstra Algorithm<br />

• keep a cut (S : V - S) where S vertices are fixed<br />

• maintain a priority queue Q of V - S vertices<br />

• each iteration choose the best vertex v from Q<br />

•<br />

move v to S, and use d(v) to forward-update others<br />

...<br />

d(u) ⊕ = d(v) ⊗ w(v, u)<br />

S<br />

V - S<br />

time complexity:<br />

O((V+E) lgV) (binary heap)<br />

O(V lgV + E) (fib. heap)<br />

Liang Huang<br />

19<br />

Dynamic Programming


Dijkstra Algorithm<br />

• keep a cut (S : V - S) where S vertices are fixed<br />

• maintain a priority queue Q of V - S vertices<br />

• each iteration choose the best vertex v from Q<br />

•<br />

move v to S, and use d(v) to forward-update others<br />

w(v, u)<br />

d(u) ⊕ = d(v) ⊗ w(v, u)<br />

...<br />

S<br />

V - S<br />

time complexity:<br />

O((V+E) lgV) (binary heap)<br />

O(V lgV + E) (fib. heap)<br />

Liang Huang<br />

19<br />

Dynamic Programming


Viterbi vs. Dijkstra<br />

• structural vs. algebraic constraints<br />

•<br />

Dijkstra only applicable to optimization problems<br />

monotonic optimization problems<br />

Liang Huang<br />

20<br />

Dynamic Programming


Viterbi vs. Dijkstra<br />

• structural vs. algebraic constraints<br />

•<br />

Dijkstra only applicable to optimization problems<br />

monotonic optimization problems<br />

Liang Huang<br />

20<br />

Dynamic Programming


Viterbi vs. Dijkstra<br />

• structural vs. algebraic constraints<br />

•<br />

Dijkstra only applicable to optimization problems<br />

monotonic optimization problems<br />

Liang Huang<br />

20<br />

Dynamic Programming


Viterbi vs. Dijkstra<br />

• structural vs. algebraic constraints<br />

•<br />

Dijkstra only applicable to optimization problems<br />

monotonic optimization problems<br />

many<br />

NLP<br />

problems<br />

Liang Huang<br />

20<br />

Dynamic Programming


Viterbi vs. Dijkstra<br />

• structural vs. algebraic constraints<br />

•<br />

Dijkstra only applicable to optimization problems<br />

monotonic optimization problems<br />

many<br />

NLP<br />

problems<br />

forward-backward<br />

(Inside semiring)<br />

Liang Huang<br />

20<br />

Dynamic Programming


Viterbi vs. Dijkstra<br />

• structural vs. algebraic constraints<br />

•<br />

Dijkstra only applicable to optimization problems<br />

monotonic optimization problems<br />

many<br />

NLP<br />

problems<br />

forward-backward<br />

(Inside semiring)<br />

Liang Huang<br />

max-margin<br />

models<br />

20<br />

Dynamic Programming


Viterbi vs. Dijkstra<br />

• structural vs. algebraic constraints<br />

•<br />

Dijkstra only applicable to optimization problems<br />

monotonic optimization problems<br />

many<br />

NLP<br />

problems<br />

forward-backward<br />

(Inside semiring)<br />

Liang Huang<br />

max-margin<br />

models<br />

20<br />

cyclic FSMs/<br />

grammars<br />

Dynamic Programming


What if both fail?<br />

monotonic optimization problems<br />

many<br />

NLP<br />

problems<br />

generalized Bellman-Ford<br />

(CLR, 1990; Mohri, 2002)<br />

or, first do strongly-connected components (SCC)<br />

which gives a DAG; use Viterbi globally on this SCC-DAG;<br />

use Bellman-Ford locally within each SCC<br />

Liang Huang<br />

21<br />

Dynamic Programming


What if both work?<br />

monotonic optimization problems<br />

many<br />

NLP<br />

problems<br />

full Dijkstra is slower than Viterbi<br />

O((V + E) lgV) vs. O(V + E)<br />

but it can finish as early as the target vertex is popped<br />

a (V + E) lgV vs. V + E<br />

Q: how to (magically) reduce a?<br />

Liang Huang<br />

22<br />

Dynamic Programming


A* Search<br />

d(v)<br />

h(v)<br />

• d(v): the distance from source s to v<br />

• h(v): the distance from v to target t<br />

• (v) must be an optimistic estimate of h(v): (v) h(v)<br />

• now, prioritize the queue by d(v) ⊗ (v)<br />

• Dijkstra is a special case where (v) = <br />

• also requires d(v) ⊗ (v) never improves (superior)<br />

• hope: d(t) ⊗ (t) = d(t) can be popped sooner<br />

Liang Huang<br />

23<br />

(v)<br />

Dynamic Programming


Two Dimensional Survey<br />

traversing order<br />

search space<br />

Viterbi<br />

Generalized<br />

Viterbi<br />

Dijkstra<br />

Knuth<br />

Liang Huang<br />

24<br />

Dynamic Programming


Two Dimensional Survey<br />

traversing order<br />

search space<br />

Viterbi<br />

Generalized<br />

Viterbi<br />

Dijkstra<br />

Knuth<br />

Liang Huang<br />

24<br />

Dynamic Programming


Background: CFG and Parsing<br />

Liang Huang<br />

25<br />

Dynamic Programming


(Directed) Hypergraphs<br />

• a generalization of graphs<br />

•<br />

• e = (T(e), h(e), f e). arity |e| = |T(e)|<br />

• a totally-ordered weight set R<br />

• we borrow the operator to be the comparison<br />

• weight function f e : R |e| to R<br />

• generalizes the ⊗ operator in semirings<br />

edge => hyperedge: several vertices to one vertex<br />

tails<br />

Liang Huang<br />

1<br />

2<br />

fe<br />

head<br />

d(v) ⊕ = f e (d(u 1 ),d(u 2 ))<br />

26<br />

Dynamic Programming


Hypergraphs and Deduction<br />

tails<br />

: a<br />

1 2<br />

: b<br />

: a : b<br />

1 2<br />

antecedents<br />

fe<br />

: fe (a,b)<br />

: fe (a,b)<br />

fe<br />

head<br />

consequent<br />

Liang Huang<br />

27<br />

(Nederhof, 2003)<br />

Dynamic Programming


Hypergraphs and Deduction<br />

tails<br />

: a<br />

1 2<br />

: b<br />

: a : b<br />

1 2<br />

antecedents<br />

fe<br />

: fe (a,b)<br />

: fe (a,b)<br />

fe<br />

head<br />

consequent<br />

(B, i, k)<br />

: a<br />

(C, k, j)<br />

: b<br />

1 2<br />

Liang Huang<br />

fe<br />

: a b Pr(A B C)<br />

(A, i, j)<br />

27<br />

(Nederhof, 2003)<br />

Dynamic Programming


Hypergraphs and Deduction<br />

tails<br />

: a<br />

1 2<br />

: b<br />

: a : b<br />

1 2<br />

antecedents<br />

fe<br />

: fe (a,b)<br />

: fe (a,b)<br />

fe<br />

head<br />

consequent<br />

(B, i, k)<br />

: a<br />

(C, k, j)<br />

: b<br />

1 2<br />

fe<br />

(B, i, k) (C, k, j)<br />

(A, i, j)<br />

AB C<br />

: a b Pr(A B C)<br />

(A, i, j)<br />

(Nederhof, 2003)<br />

Liang Huang<br />

27<br />

Dynamic Programming


Weight Functions and Semirings<br />

Liang Huang<br />

28<br />

Dynamic Programming


Weight Functions and Semirings<br />

d(u)<br />

w(e)<br />

d(u) ⊗ w(e)<br />

Liang Huang<br />

28<br />

Dynamic Programming


Weight Functions and Semirings<br />

d(u)<br />

w(e)<br />

d(u) ⊗ w(e)<br />

d(u)<br />

fe<br />

fe(d(u))<br />

Liang Huang<br />

28<br />

Dynamic Programming


Weight Functions and Semirings<br />

d(u)<br />

d(u)<br />

w(e)<br />

fe<br />

d(u) ⊗ w(e)<br />

fe(d(u))<br />

fe(a) = a ⊗ w(e)<br />

Liang Huang<br />

28<br />

Dynamic Programming


Weight Functions and Semirings<br />

d(u)<br />

d(u)<br />

w(e)<br />

fe<br />

d(u) ⊗ w(e)<br />

fe(d(u))<br />

fe(a) = a ⊗ w(e)<br />

1<br />

tails<br />

2<br />

...<br />

fe<br />

head<br />

k<br />

Liang Huang<br />

28<br />

Dynamic Programming


Weight Functions and Semirings<br />

d(u)<br />

d(u)<br />

w(e)<br />

fe<br />

d(u) ⊗ w(e)<br />

fe(d(u))<br />

fe(a) = a ⊗ w(e)<br />

tails<br />

1<br />

2<br />

...<br />

fe<br />

fe(a1, ..., ak)<br />

head<br />

k<br />

Liang Huang<br />

28<br />

Dynamic Programming


Weight Functions and Semirings<br />

d(u)<br />

d(u)<br />

w(e)<br />

fe<br />

d(u) ⊗ w(e)<br />

fe(d(u))<br />

fe(a) = a ⊗ w(e)<br />

tails<br />

1<br />

2<br />

...<br />

fe<br />

fe(a1, ..., ak) = a1 ⊗ ... ⊗ ak ⊗ w(e)<br />

head<br />

k<br />

Liang Huang<br />

28<br />

Dynamic Programming


Weight Functions and Semirings<br />

d(u)<br />

d(u)<br />

w(e)<br />

fe<br />

d(u) ⊗ w(e)<br />

fe(d(u))<br />

fe(a) = a ⊗ w(e)<br />

tails<br />

1<br />

2<br />

...<br />

fe<br />

fe(a1, ..., ak) = a1 ⊗ ... ⊗ ak ⊗ w(e)<br />

head<br />

k<br />

Liang Huang<br />

28<br />

Dynamic Programming


Weight Functions and Semirings<br />

d(u)<br />

w(e)<br />

d(u) ⊗ w(e)<br />

fe(a) = a ⊗ w(e)<br />

d(u)<br />

fe<br />

fe(d(u))<br />

semiringcomposed<br />

tails<br />

1<br />

2<br />

...<br />

fe<br />

fe(a1, ..., ak) = a1 ⊗ ... ⊗ ak ⊗ w(e)<br />

head<br />

k<br />

Liang Huang<br />

28<br />

Dynamic Programming


Weight Functions and Semirings<br />

d(u)<br />

w(e)<br />

d(u) ⊗ w(e)<br />

fe(a) = a ⊗ w(e)<br />

d(u)<br />

fe<br />

fe(d(u))<br />

semiringcomposed<br />

tails<br />

1<br />

2<br />

...<br />

fe<br />

fe(a1, ..., ak) = a1 ⊗ ... ⊗ ak ⊗ w(e)<br />

head<br />

k<br />

also extend monotonicity and<br />

superiority to weight functions<br />

Liang Huang<br />

28<br />

Dynamic Programming


Two Dimensional Survey<br />

traversing order<br />

search space<br />

Viterbi<br />

Generalized<br />

Viterbi<br />

Dijkstra<br />

Knuth<br />

Liang Huang<br />

29<br />

Dynamic Programming


Viterbi Algorithm for DAGs<br />

1. topological sort<br />

2. visit each vertex v in sorted order and do updates<br />

• for each incoming edge (u, v) in E<br />

• use d(u) to update d(v):<br />

•<br />

key observation: d(u) is fixed to optimal at this time<br />

w(u, v)<br />

d(v) ⊕ = d(u) ⊗ w(u, v)<br />

• time complexity: O( V + E )<br />

Liang Huang<br />

30<br />

Dynamic Programming


Viterbi Algorithm for DAHs<br />

1. topological sort<br />

2. visit each vertex v in sorted order and do updates<br />

• for each incoming hyperedge e = ((u 1, .., u|e|), v, fe)<br />

• use d(u i)’s to update d(v)<br />

• key observation: d(ui)’s are fixed to optimal at this time<br />

1<br />

2<br />

fe<br />

d(v) ⊕ = f e (d(u 1 ), ··· ,d(u |e| ))<br />

• time complexity: O( V + E ) (assuming constant arity)<br />

Liang Huang<br />

31<br />

Dynamic Programming


Example: CKY Parsing<br />

• parsing with CFGs in Chomsky Normal Form (CNF)<br />

• typical instance of the generalized Viterbi for DAHs<br />

• many variants of CKY ~ various topological ordering<br />

(S, 1, n) (S, 1, n) (S, 1, n)<br />

Liang Huang<br />

O(n 3 |P|)<br />

with CKY pseudo-code<br />

32<br />

Dynamic Programming


Example: CKY Parsing<br />

• parsing with CFGs in Chomsky Normal Form (CNF)<br />

• typical instance of the generalized Viterbi for DAHs<br />

• many variants of CKY ~ various topological ordering<br />

(S, 1, n) (S, 1, n) (S, 1, n)<br />

Liang Huang<br />

O(n 3 |P|)<br />

with CKY pseudo-code<br />

32<br />

Dynamic Programming


Example: CKY Parsing<br />

• parsing with CFGs in Chomsky Normal Form (CNF)<br />

• typical instance of the generalized Viterbi for DAHs<br />

• many variants of CKY ~ various topological ordering<br />

(S, 1, n) (S, 1, n) (S, 1, n)<br />

bottom-up<br />

Liang Huang<br />

O(n 3 |P|)<br />

with CKY pseudo-code<br />

32<br />

Dynamic Programming


Example: CKY Parsing<br />

• parsing with CFGs in Chomsky Normal Form (CNF)<br />

• typical instance of the generalized Viterbi for DAHs<br />

• many variants of CKY ~ various topological ordering<br />

(S, 1, n) (S, 1, n) (S, 1, n)<br />

bottom-up<br />

Liang Huang<br />

O(n 3 |P|)<br />

with CKY pseudo-code<br />

32<br />

Dynamic Programming


Example: CKY Parsing<br />

• parsing with CFGs in Chomsky Normal Form (CNF)<br />

• typical instance of the generalized Viterbi for DAHs<br />

• many variants of CKY ~ various topological ordering<br />

(S, 1, n) (S, 1, n) (S, 1, n)<br />

bottom-up<br />

left-to-right<br />

Liang Huang<br />

O(n 3 |P|)<br />

with CKY pseudo-code<br />

32<br />

Dynamic Programming


Example: CKY Parsing<br />

• parsing with CFGs in Chomsky Normal Form (CNF)<br />

• typical instance of the generalized Viterbi for DAHs<br />

• many variants of CKY ~ various topological ordering<br />

(S, 1, n) (S, 1, n) (S, 1, n)<br />

bottom-up<br />

left-to-right<br />

Liang Huang<br />

O(n 3 |P|)<br />

with CKY pseudo-code<br />

32<br />

Dynamic Programming


Example: CKY Parsing<br />

• parsing with CFGs in Chomsky Normal Form (CNF)<br />

• typical instance of the generalized Viterbi for DAHs<br />

• many variants of CKY ~ various topological ordering<br />

(S, 1, n) (S, 1, n) (S, 1, n)<br />

bottom-up left-to-right right-to-left<br />

Liang Huang<br />

O(n 3 |P|)<br />

with CKY pseudo-code<br />

32<br />

Dynamic Programming


Forward Variant for DAHs<br />

1. topological sort<br />

2. visit each vertex v in sorted order and do updates<br />

• for each outgoing hyperedge e = ((u 1, .., u|e|), h(e), fe)<br />

• if d(ui)’s have all been fixed to optimal<br />

• use d(ui)’s to update d(h(e))<br />

v = i<br />

1 fe<br />

2 =<br />

• time complexity: O( V + E )<br />

Liang Huang<br />

33<br />

Dynamic Programming


Forward Variant for DAHs<br />

1. topological sort<br />

2. visit each vertex v in sorted order and do updates<br />

• for each outgoing hyperedge e = ((u 1, .., u|e|), h(e), fe)<br />

• if d(ui)’s have all been fixed to optimal<br />

• use d(ui)’s to update d(h(e))<br />

v = i<br />

1 fe<br />

2 =<br />

• time complexity: O( V + E )<br />

Liang Huang<br />

33<br />

Dynamic Programming


Forward Variant for DAHs<br />

1. topological sort<br />

2. visit each vertex v in sorted order and do updates<br />

• for each outgoing hyperedge e = ((u 1, .., u|e|), h(e), fe)<br />

• if d(ui)’s have all been fixed to optimal<br />

• use d(ui)’s to update d(h(e))<br />

v = i<br />

Q: how to avoid repeated checking?<br />

maintain a counter r[e] for each e:<br />

how many tails yet to be fixed?<br />

fire this hyperedge only if r[e]=0<br />

• time complexity: O( V + E )<br />

Liang Huang<br />

33<br />

2 =<br />

1 fe<br />

Dynamic Programming


Example: Treebank Parsers<br />

• State-of-the-art statistical parsers<br />

• (Collins, 1999; Charniak, 2000)<br />

•<br />

• can’t do backward updates<br />

• don’t know how to decompose a big item<br />

• forward update from vertex (X, i, j)<br />

no fixed grammar (every production is possible)<br />

• check all vertices like (Y, j, k) or (Y, k, i) in the chart (fixed)<br />

• try combine them to form bigger item (Z, i, k) or (Z, k, j)<br />

Liang Huang<br />

34<br />

Dynamic Programming


Dijkstra Algorithm<br />

• keep a cut (S : V - S) where S vertices are fixed<br />

• maintain a priority queue Q of V - S vertices<br />

• each iteration choose the best vertex v from Q<br />

•<br />

move v to S, and use d(v) to forward-update others<br />

...<br />

d(u) ⊕ = d(v) ⊗ w(v, u)<br />

S<br />

V - S<br />

time complexity:<br />

O((V+E) lgV) (binary heap)<br />

O(V lgV + E) (fib. heap)<br />

Liang Huang<br />

35<br />

Dynamic Programming


Dijkstra Algorithm<br />

• keep a cut (S : V - S) where S vertices are fixed<br />

• maintain a priority queue Q of V - S vertices<br />

• each iteration choose the best vertex v from Q<br />

•<br />

move v to S, and use d(v) to forward-update others<br />

...<br />

d(u) ⊕ = d(v) ⊗ w(v, u)<br />

S<br />

V - S<br />

time complexity:<br />

O((V+E) lgV) (binary heap)<br />

O(V lgV + E) (fib. heap)<br />

Liang Huang<br />

35<br />

Dynamic Programming


Dijkstra Algorithm<br />

• keep a cut (S : V - S) where S vertices are fixed<br />

• maintain a priority queue Q of V - S vertices<br />

• each iteration choose the best vertex v from Q<br />

•<br />

move v to S, and use d(v) to forward-update others<br />

w(v, u)<br />

d(u) ⊕ = d(v) ⊗ w(v, u)<br />

...<br />

S<br />

V - S<br />

time complexity:<br />

O((V+E) lgV) (binary heap)<br />

O(V lgV + E) (fib. heap)<br />

Liang Huang<br />

35<br />

Dynamic Programming


Knuth (1977) Algorithm<br />

• keep a cut (S : V - S) where S vertices are fixed<br />

• maintain a priority queue Q of V - S vertices<br />

• each iteration choose the best vertex v from Q<br />

•<br />

move v to S, and use d(v) to forward-update others<br />

...<br />

1<br />

S<br />

V - S<br />

time complexity:<br />

O((V+E) lgV) (binary heap)<br />

O(V lgV + E) (fib. heap)<br />

Liang Huang<br />

36<br />

Dynamic Programming


Knuth (1977) Algorithm<br />

• keep a cut (S : V - S) where S vertices are fixed<br />

• maintain a priority queue Q of V - S vertices<br />

• each iteration choose the best vertex v from Q<br />

•<br />

move v to S, and use d(v) to forward-update others<br />

...<br />

1<br />

S<br />

V - S<br />

time complexity:<br />

O((V+E) lgV) (binary heap)<br />

O(V lgV + E) (fib. heap)<br />

Liang Huang<br />

36<br />

Dynamic Programming


Knuth (1977) Algorithm<br />

• keep a cut (S : V - S) where S vertices are fixed<br />

• maintain a priority queue Q of V - S vertices<br />

• each iteration choose the best vertex v from Q<br />

•<br />

move v to S, and use d(v) to forward-update others<br />

...<br />

1 fe<br />

S<br />

V - S<br />

time complexity:<br />

O((V+E) lgV) (binary heap)<br />

O(V lgV + E) (fib. heap)<br />

Liang Huang<br />

36<br />

Dynamic Programming


Knuth (1977) Algorithm<br />

• keep a cut (S : V - S) where S vertices are fixed<br />

• maintain a priority queue Q of V - S vertices<br />

• each iteration choose the best vertex v from Q<br />

•<br />

move v to S, and use d(v) to forward-update others<br />

...<br />

1 fe<br />

S<br />

V - S<br />

time complexity:<br />

O((V+E) lgV) (binary heap)<br />

O(V lgV + E) (fib. heap)<br />

Liang Huang<br />

36<br />

Dynamic Programming


Example: A* Parsing<br />

• Use A* search on top of the Knuth Algorithm<br />

•<br />

Showed significant speed up with carefully designed<br />

heuristic functions (Klein and Manning, 2003)<br />

[open problem] can you still define heuristic function<br />

if weight functions are not semiring-composed?<br />

Liang Huang<br />

37<br />

Dynamic Programming


Example: A* Parsing<br />

• Use A* search on top of the Knuth Algorithm<br />

•<br />

Showed significant speed up with carefully designed<br />

heuristic functions (Klein and Manning, 2003)<br />

[open problem] can you still define heuristic function<br />

if weight functions are not semiring-composed?<br />

Liang Huang<br />

37<br />

Dynamic Programming


Same Picture Again<br />

monotonic optimization problems<br />

many<br />

NLP<br />

problems<br />

Liang Huang<br />

38<br />

Dynamic Programming


Same Picture Again<br />

monotonic optimization problems<br />

many<br />

NLP<br />

problems<br />

PCFG parsing<br />

with CNF<br />

Liang Huang<br />

38<br />

Dynamic Programming


Same Picture Again<br />

monotonic optimization problems<br />

many<br />

NLP<br />

problems<br />

Inside-Outside Alg.<br />

(Inside semiring)<br />

PCFG parsing<br />

with CNF<br />

Liang Huang<br />

38<br />

Dynamic Programming


Same Picture Again<br />

monotonic optimization problems<br />

many<br />

NLP<br />

problems<br />

Inside-Outside Alg.<br />

(Inside semiring)<br />

max-margin<br />

parsing<br />

PCFG parsing<br />

with CNF<br />

Liang Huang<br />

38<br />

Dynamic Programming


Same Picture Again<br />

monotonic optimization problems<br />

many<br />

NLP<br />

problems<br />

Inside-Outside Alg.<br />

(Inside semiring)<br />

max-margin<br />

parsing<br />

PCFG parsing<br />

with CNF<br />

cyclic<br />

grammars<br />

Liang Huang<br />

38<br />

Dynamic Programming


Same Picture Again<br />

monotonic optimization problems<br />

many<br />

NLP<br />

problems<br />

Inside-Outside Alg.<br />

(Inside semiring)<br />

max-margin<br />

parsing<br />

PCFG parsing<br />

with CNF<br />

cyclic<br />

grammars<br />

generalized<br />

generalized<br />

Bellman-Ford<br />

(open)<br />

Liang Huang<br />

38<br />

Dynamic Programming


Conclusion<br />

• surveyed two general frameworks and two<br />

important algorithms in DP<br />

• adapted the theory of semirings to weight functions<br />

• demonstrated advantages of modularity<br />

• focused on optimization problems<br />

• covered typical NLP applications<br />

• a better understanding of these theories helps NLP<br />

• a generic DP language for NLP: Dyna (Eisner et al.)<br />

• suggested some open problems<br />

Liang Huang<br />

39<br />

Dynamic Programming

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!