slides
slides
slides
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
Dynamic Programming Algorithms<br />
in Semiring and Hypergraph Frameworks<br />
(University of Pennsylvania)<br />
<br />
Fall 2007
Guest Lectures: Outline<br />
• I plan to teach 3 topics on Algorithms in NLP/MT<br />
• Dynamic Programming: An Advanced Survey<br />
• Semiring and Hypergraph Frameworks<br />
• Viterbi and Dijkstra/Knuth Algorithms<br />
• k-best Algorithms<br />
• k-best parsing and MT decoding<br />
• other k-best problems and algorithms<br />
• Perceptron-like Algorithms<br />
Liang Huang<br />
2<br />
Dynamic Programming
Dynamic Programming<br />
Liang Huang<br />
3<br />
Dynamic Programming
Dynamic Programming<br />
• Dynamic Programming is everywhere in NLP<br />
•<br />
• CKY Algorithm for Context-free Parsing<br />
•<br />
Viterbi Algorithm for Hidden Markov Models<br />
Forward-Backward and Inside-Outside Algorithms<br />
• A* Parsing 3<br />
Liang Huang<br />
Dynamic Programming
Dynamic Programming<br />
• Dynamic Programming is everywhere in NLP<br />
•<br />
• CKY Algorithm for Context-free Parsing<br />
•<br />
• A* Parsing<br />
• Also everywhere in Machine Translation<br />
• Decoding: Phrase-based, Syntax-based, etc.<br />
• Alignment: IBM models, HMM model, etc.<br />
Viterbi Algorithm for Hidden Markov Models<br />
Forward-Backward and Inside-Outside Algorithms<br />
Liang Huang<br />
3<br />
Dynamic Programming
Dynamic Programming<br />
• Dynamic Programming is everywhere in NLP<br />
•<br />
• CKY Algorithm for Context-free Parsing<br />
•<br />
• A* Parsing<br />
• Also everywhere in Machine Translation<br />
• Decoding: Phrase-based, Syntax-based, etc.<br />
• Alignment: IBM models, HMM model, etc.<br />
• Focus on Optimization Problems<br />
Liang Huang<br />
Viterbi Algorithm for Hidden Markov Models<br />
Forward-Backward and Inside-Outside Algorithms<br />
3<br />
Dynamic Programming
Two Dimensional Survey<br />
traversing order<br />
search space<br />
Viterbi<br />
Generalized<br />
Viterbi<br />
Dijkstra<br />
Knuth<br />
Liang Huang<br />
4<br />
Dynamic Programming
Graphs in NLP<br />
Liang Huang<br />
5<br />
Dynamic Programming
Motivations for Semirings<br />
• in a weighted graph, we need two operators:<br />
•<br />
•<br />
•<br />
extension (multiplicative) and summary (additive)<br />
the weight of a path is the product of edge weights<br />
the weight of a vertex is the summary of path weights<br />
e1<br />
e2<br />
d(π 1 )= ⊗<br />
e i ∈π 1<br />
w(e i )=w(e 1 ) ⊗ w(e 2 ) ⊗ w(e 3 )<br />
e3<br />
...<br />
...<br />
d(t) = ⊕ π i<br />
w(π i )<br />
= w(p 1 ) ⊕ w(p 2 ) ⊕···<br />
Liang Huang<br />
6<br />
Dynamic Programming
Definitions<br />
A monoid is a triple (A, ⊗, 1) where<br />
1. ⊗ is a closed associative binary operator on the set A,<br />
2. 1istheidentity element for ⊗, i.e., for all a ∈ A, a ⊗ 1=1 ⊗ a = a.<br />
A monoid is commutative if ⊗ is commutative.<br />
Liang Huang<br />
7<br />
Dynamic Programming
Definitions<br />
A monoid is a triple (A, ⊗, 1) where<br />
1. ⊗ is a closed associative binary operator on the set A,<br />
2. 1istheidentity element for ⊗, i.e., for all a ∈ A, a ⊗ 1=1 ⊗ a = a.<br />
A monoid is commutative if ⊗ is commutative. ([0, 1], +, 0)<br />
([0, 1], , 1)<br />
([0, 1], max, 0)<br />
Liang Huang<br />
7<br />
Dynamic Programming
Definitions<br />
A monoid is a triple (A, ⊗, 1) where<br />
1. ⊗ is a closed associative binary operator on the set A,<br />
2. 1istheidentity element for ⊗, i.e., for all a ∈ A, a ⊗ 1=1 ⊗ a = a.<br />
A monoid is commutative if ⊗ is commutative.<br />
A semiring is a 5-tuple R =(A, ⊕, ⊗, 0, 1) such that<br />
1. (A, ⊕, 0) is a commutative monoid.<br />
([0, 1], +, 0)<br />
([0, 1], , 1)<br />
([0, 1], max, 0)<br />
2. (A, ⊗, 1) is a monoid.<br />
3. ⊗ distributes over ⊕: for all a, b, c in A,<br />
(a ⊕ b) ⊗ c =(a ⊗ c) ⊕ (b ⊗ c),<br />
c ⊗ (a ⊕ b) =(c ⊗ a) ⊕ (c ⊗ b).<br />
4. 0isanannihilator for ⊗: for all a in A, 0 ⊗ a = a ⊗ 0=0.<br />
Liang Huang<br />
7<br />
Dynamic Programming
Definitions<br />
A monoid is a triple (A, ⊗, 1) where<br />
1. ⊗ is a closed associative binary operator on the set A,<br />
2. 1istheidentity element for ⊗, i.e., for all a ∈ A, a ⊗ 1=1 ⊗ a = a.<br />
A monoid is commutative if ⊗ is commutative.<br />
A semiring is a 5-tuple R =(A, ⊕, ⊗, 0, 1) such that<br />
1. (A, ⊕, 0) is a commutative monoid.<br />
2. (A, ⊗, 1) is a monoid.<br />
3. ⊗ distributes over ⊕: for all a, b, c in A,<br />
([0, 1], +, 0)<br />
([0, 1], , 1)<br />
([0, 1], max, 0)<br />
([0, 1], max, , 0, 1)<br />
(a ⊕ b) ⊗ c =(a ⊗ c) ⊕ (b ⊗ c),<br />
c ⊗ (a ⊕ b) =(c ⊗ a) ⊕ (c ⊗ b).<br />
4. 0isanannihilator for ⊗: for all a in A, 0 ⊗ a = a ⊗ 0=0.<br />
Liang Huang<br />
7<br />
Dynamic Programming
Definitions<br />
A monoid is a triple (A, ⊗, 1) where<br />
1. ⊗ is a closed associative binary operator on the set A,<br />
2. 1istheidentity element for ⊗, i.e., for all a ∈ A, a ⊗ 1=1 ⊗ a = a.<br />
A monoid is commutative if ⊗ is commutative.<br />
A semiring is a 5-tuple R =(A, ⊕, ⊗, 0, 1) such that<br />
1. (A, ⊕, 0) is a commutative monoid.<br />
2. (A, ⊗, 1) is a monoid.<br />
3. ⊗ distributes over ⊕: for all a, b, c in A,<br />
(a ⊕ b) ⊗ c =(a ⊗ c) ⊕ (b ⊗ c),<br />
c ⊗ (a ⊕ b) =(c ⊗ a) ⊕ (c ⊗ b).<br />
([0, 1], +, 0)<br />
([0, 1], , 1)<br />
([0, 1], max, 0)<br />
([0, 1], max, , 0, 1)<br />
([0, 1], +, , 0, 1)<br />
4. 0isanannihilator for ⊗: for all a in A, 0 ⊗ a = a ⊗ 0=0.<br />
Liang Huang<br />
7<br />
Dynamic Programming
Examples<br />
Semiring Set ⊕ ⊗ 0 1 intuition/application<br />
Boolean {0, 1} ∨ ∧ 0 1 logical deduction, recognition<br />
Viterbi [0, 1] max × 0 1 prob. of the best derivation<br />
Inside R + ∪{+∞} + × 0 1 prob. of a string<br />
Real R ∪{+∞} min + +∞ 0 shortest-distance<br />
Tropical R + ∪{+∞} min + +∞ 0 with non-negative weights<br />
Counting N + × 0 1 number of paths<br />
how about ?<br />
Liang Huang<br />
8<br />
Dynamic Programming
Ordering<br />
Liang Huang<br />
9<br />
Dynamic Programming
Ordering<br />
• idempotent 9<br />
A semiring (A, ⊕, ⊗, 0, 1) is idempotent if for all a in A, a ⊕ a = a.<br />
Liang Huang<br />
Dynamic Programming
Ordering<br />
• idempotent<br />
A semiring (A, ⊕, ⊗, 0, 1) is idempotent if for all a in A, a ⊕ a = a.<br />
• comparison<br />
(a ≤ b) ⇔ (a ⊕ b = a) defines a partial ordering.<br />
• examples: boolean, viterbi, tropical, real, ...<br />
({0, 1}, ∨, ∧, 0, 1) (R + ∪{+∞}, min, +, +∞, 0)<br />
([0, 1], max, ⊗, 0, 1) (R ∪{+∞}, min, +, +∞, 0)<br />
Liang Huang<br />
9<br />
Dynamic Programming
Ordering<br />
• idempotent<br />
A semiring (A, ⊕, ⊗, 0, 1) is idempotent if for all a in A, a ⊕ a = a.<br />
• comparison<br />
(a ≤ b) ⇔ (a ⊕ b = a) defines a partial ordering.<br />
• examples: boolean, viterbi, tropical, real, ...<br />
({0, 1}, ∨, ∧, 0, 1) (R + ∪{+∞}, min, +, +∞, 0)<br />
([0, 1], max, ⊗, 0, 1) (R ∪{+∞}, min, +, +∞, 0)<br />
• total-order for optimization problems<br />
A semiring is totally-ordered if ⊕ defines a total ordering.<br />
• examples: all of the above 9<br />
Liang Huang<br />
Dynamic Programming
Monotonicity<br />
Liang Huang<br />
10<br />
Dynamic Programming
Monotonicity<br />
• monotonicity 10<br />
Liang Huang<br />
Dynamic Programming
Monotonicity<br />
• monotonicity 10<br />
Let K =(A, ⊕, ⊗, 0, 1) be a semiring, and ≤ a partial ordering over A.<br />
We say K is monotonic if for all a, b, c ∈ A<br />
(a ≤ b) ⇒ (a ⊗ c ≤ b ⊗ c) (a ≤ b) ⇒ (c ⊗ a ≤ c ⊗ b)<br />
Liang Huang<br />
Dynamic Programming
• monotonicity<br />
Monotonicity<br />
Let K =(A, ⊕, ⊗, 0, 1) be a semiring, and ≤ a partial ordering over A.<br />
We say K is monotonic if for all a, b, c ∈ A<br />
(a ≤ b) ⇒ (a ⊗ c ≤ b ⊗ c) (a ≤ b) ⇒ (c ⊗ a ≤ c ⊗ b)<br />
• optimal substructure in dynamic programming<br />
Liang Huang<br />
10<br />
Dynamic Programming
• monotonicity<br />
Monotonicity<br />
Let K =(A, ⊕, ⊗, 0, 1) be a semiring, and ≤ a partial ordering over A.<br />
We say K is monotonic if for all a, b, c ∈ A<br />
(a ≤ b) ⇒ (a ⊗ c ≤ b ⊗ c) (a ≤ b) ⇒ (c ⊗ a ≤ c ⊗ b)<br />
• optimal substructure in dynamic programming<br />
A: b ⊗ c<br />
Liang Huang<br />
10<br />
Dynamic Programming
• monotonicity<br />
Monotonicity<br />
Let K =(A, ⊕, ⊗, 0, 1) be a semiring, and ≤ a partial ordering over A.<br />
We say K is monotonic if for all a, b, c ∈ A<br />
(a ≤ b) ⇒ (a ⊗ c ≤ b ⊗ c) (a ≤ b) ⇒ (c ⊗ a ≤ c ⊗ b)<br />
• optimal substructure in dynamic programming<br />
A: b ⊗ c<br />
A: b’ ⊗ c<br />
b ⊗ c<br />
<br />
Liang Huang<br />
10<br />
Dynamic Programming
• monotonicity<br />
Monotonicity<br />
Let K =(A, ⊕, ⊗, 0, 1) be a semiring, and ≤ a partial ordering over A.<br />
We say K is monotonic if for all a, b, c ∈ A<br />
(a ≤ b) ⇒ (a ⊗ c ≤ b ⊗ c) (a ≤ b) ⇒ (c ⊗ a ≤ c ⊗ b)<br />
• optimal substructure in dynamic programming<br />
A: b ⊗ c<br />
• [lemma] idempotent => monotone<br />
A: b’ ⊗ c<br />
b ⊗ c<br />
<br />
Liang Huang<br />
10<br />
Dynamic Programming
• monotonicity<br />
Monotonicity<br />
Let K =(A, ⊕, ⊗, 0, 1) be a semiring, and ≤ a partial ordering over A.<br />
We say K is monotonic if for all a, b, c ∈ A<br />
(a ≤ b) ⇒ (a ⊗ c ≤ b ⊗ c) (a ≤ b) ⇒ (c ⊗ a ≤ c ⊗ b)<br />
• optimal substructure in dynamic programming<br />
A: b ⊗ c<br />
A: b’ ⊗ c<br />
b ⊗ c<br />
<br />
• [lemma] idempotent => monotone<br />
•<br />
Liang Huang<br />
our focus, totally-ordered semirings, are monotone<br />
10<br />
Dynamic Programming
• monotonicity<br />
Monotonicity<br />
Let K =(A, ⊕, ⊗, 0, 1) be a semiring, and ≤ a partial ordering over A.<br />
We say K is monotonic if for all a, b, c ∈ A<br />
(a ≤ b) ⇒ (a ⊗ c ≤ b ⊗ c) (a ≤ b) ⇒ (c ⊗ a ≤ c ⊗ b)<br />
• optimal substructure in dynamic programming<br />
A: b ⊗ c<br />
A: b’ ⊗ c<br />
b ⊗ c<br />
<br />
• [lemma] idempotent => monotone<br />
•<br />
Liang Huang<br />
our focus, totally-ordered semirings, are monotone<br />
10<br />
free lunch!<br />
Dynamic Programming
Two Dimensional Survey<br />
traversing order<br />
search space<br />
Viterbi<br />
Generalized<br />
Viterbi<br />
Dijkstra<br />
Knuth<br />
Liang Huang<br />
11<br />
Dynamic Programming
DP on Graphs<br />
• optimization problems on graphs<br />
=> generic shortest-path problem<br />
• weighted directed graph G=(V, E) with a function w<br />
that assigns each edge a weight from a semiring<br />
• compute the best weight of the target vertex t<br />
• generic update along edge (u, v)<br />
w(u, v) d(v) ⊕ = d(u) ⊗ w(u, v)<br />
• how to avoid cyclic updates?<br />
•<br />
only update when d(u) is fixed<br />
d(v) ← d(v) ⊕ (d(u) ⊗ w(u, v))<br />
Liang Huang<br />
12<br />
Dynamic Programming
Viterbi Algorithm for DAGs<br />
1. topological sort<br />
2. visit each vertex v in sorted order and do updates<br />
• for each incoming edge (u, v) in E<br />
• use d(u) to update d(v):<br />
•<br />
key observation: d(u) is fixed to optimal at this time<br />
w(u, v)<br />
d(v) ⊕ = d(u) ⊗ w(u, v)<br />
• time complexity: O( V + E )<br />
Liang Huang<br />
13<br />
Dynamic Programming
Variant 1: forward-update<br />
1. topological sort<br />
2. visit each vertex v in sorted order and do updates<br />
• for each outgoing edge (v, u) in E<br />
• use d(v) to update d(u):<br />
•<br />
d(u) ⊕ = d(v) ⊗ w(v, u)<br />
key observation: d(v) is fixed to optimal at this time<br />
w(v, u)<br />
• time complexity: O( V + E )<br />
Liang Huang<br />
14<br />
Dynamic Programming
Examples<br />
Liang Huang<br />
15<br />
Dynamic Programming
Examples<br />
• [Number of Paths in a DAG]<br />
Liang Huang<br />
15<br />
Dynamic Programming
Examples<br />
• [Number of Paths in a DAG]<br />
•<br />
•<br />
just use the counting semiring (N, +, , 0, 1)<br />
note: this is not an optimization problem!<br />
Liang Huang<br />
15<br />
Dynamic Programming
Examples<br />
• [Number of Paths in a DAG]<br />
•<br />
•<br />
just use the counting semiring (N, +, , 0, 1)<br />
note: this is not an optimization problem!<br />
• [Longest Path in a DAG] 15<br />
Liang Huang<br />
Dynamic Programming
Examples<br />
• [Number of Paths in a DAG]<br />
•<br />
•<br />
• [Longest Path in a DAG]<br />
just use the counting semiring (N, +, , 0, 1)<br />
note: this is not an optimization problem!<br />
• just use the semiring 15<br />
(R ∪{−∞}, max, +, −∞, 0)<br />
Liang Huang<br />
Dynamic Programming
Examples<br />
• [Number of Paths in a DAG]<br />
•<br />
•<br />
• [Longest Path in a DAG]<br />
• just use the semiring<br />
•<br />
just use the counting semiring (N, +, , 0, 1)<br />
note: this is not an optimization problem!<br />
(R ∪{−∞}, max, +, −∞, 0)<br />
[Part-of-Speech Tagging with a Hidden Markov Model]<br />
Liang Huang<br />
15<br />
Dynamic Programming
Examples<br />
• [Number of Paths in a DAG]<br />
•<br />
•<br />
• [Longest Path in a DAG]<br />
• just use the semiring<br />
•<br />
just use the counting semiring (N, +, , 0, 1)<br />
note: this is not an optimization problem!<br />
(R ∪{−∞}, max, +, −∞, 0)<br />
[Part-of-Speech Tagging with a Hidden Markov Model]<br />
Liang Huang<br />
15<br />
Dynamic Programming
Example: Word-based Decoding<br />
Liang Huang<br />
16<br />
Dynamic Programming
Example: Word Alignment<br />
Liang Huang<br />
17<br />
Dynamic Programming
Dijkstra Algorithm<br />
Liang Huang<br />
18<br />
Dynamic Programming
Dijkstra Algorithm<br />
• Dijkstra does not require acyclicity<br />
•<br />
•<br />
instead of topological order, we use best-first order<br />
but this requires superiority of the semiring<br />
Let K =(A, ⊕, ⊗, 0, 1) be a semiring, and ≤ a partial ordering over A.<br />
We say K is superior if for all a, b ∈ A<br />
a ≤ a ⊗ b, b ≤ a ⊗ b.<br />
• basically, combination always gets worse<br />
• or, no negative edge in a graph<br />
Liang Huang<br />
18<br />
Dynamic Programming
Dijkstra Algorithm<br />
• Dijkstra does not require acyclicity<br />
•<br />
•<br />
instead of topological order, we use best-first order<br />
but this requires superiority of the semiring<br />
Let K =(A, ⊕, ⊗, 0, 1) be a semiring, and ≤ a partial ordering over A.<br />
We say K is superior if for all a, b ∈ A<br />
a ≤ a ⊗ b, b ≤ a ⊗ b.<br />
• basically, combination always gets worse<br />
• or, no negative edge in a graph<br />
d(u)<br />
w(e)<br />
d(u) ⊗ w(e)<br />
Liang Huang<br />
18<br />
Dynamic Programming
Dijkstra Algorithm<br />
• Dijkstra does not require acyclicity<br />
•<br />
•<br />
instead of topological order, we use best-first order<br />
but this requires superiority of the semiring<br />
Let K =(A, ⊕, ⊗, 0, 1) be a semiring, and ≤ a partial ordering over A.<br />
We say K is superior if for all a, b ∈ A<br />
d(u)<br />
• basically, combination always gets worse<br />
• or, no negative edge in a graph<br />
w(e)<br />
a ≤ a ⊗ b, b ≤ a ⊗ b.<br />
d(u) ⊗ w(e)<br />
({0, 1}, ∨, ∧, 0, 1)<br />
([0, 1], max, ×, 0, 1)<br />
(R + ∪{+∞}, min, +, +∞, 0)<br />
(R ∪{+∞}, min, +, +∞, 0)<br />
Liang Huang<br />
18<br />
Dynamic Programming
Dijkstra Algorithm<br />
• Dijkstra does not require acyclicity<br />
•<br />
•<br />
instead of topological order, we use best-first order<br />
but this requires superiority of the semiring<br />
Let K =(A, ⊕, ⊗, 0, 1) be a semiring, and ≤ a partial ordering over A.<br />
We say K is superior if for all a, b ∈ A<br />
d(u)<br />
• basically, combination always gets worse<br />
• or, no negative edge in a graph<br />
w(e)<br />
a ≤ a ⊗ b, b ≤ a ⊗ b.<br />
d(u) ⊗ w(e)<br />
({0, 1}, ∨, ∧, 0, 1)<br />
([0, 1], max, ×, 0, 1)<br />
(R + ∪{+∞}, min, +, +∞, 0)<br />
(R ∪{+∞}, min, +, +∞, 0)<br />
Liang Huang<br />
18<br />
Dynamic Programming
Dijkstra Algorithm<br />
• Dijkstra does not require acyclicity<br />
•<br />
•<br />
instead of topological order, we use best-first order<br />
but this requires superiority of the semiring<br />
Let K =(A, ⊕, ⊗, 0, 1) be a semiring, and ≤ a partial ordering over A.<br />
We say K is superior if for all a, b ∈ A<br />
d(u)<br />
• basically, combination always gets worse<br />
• or, no negative edge in a graph<br />
w(e)<br />
a ≤ a ⊗ b, b ≤ a ⊗ b.<br />
d(u) ⊗ w(e)<br />
({0, 1}, ∨, ∧, 0, 1)<br />
([0, 1], max, ×, 0, 1)<br />
(R + ∪{+∞}, min, +, +∞, 0)<br />
(R ∪{+∞}, min, +, +∞, 0)<br />
Liang Huang<br />
18<br />
Dynamic Programming
Dijkstra Algorithm<br />
• Dijkstra does not require acyclicity<br />
•<br />
•<br />
instead of topological order, we use best-first order<br />
but this requires superiority of the semiring<br />
Let K =(A, ⊕, ⊗, 0, 1) be a semiring, and ≤ a partial ordering over A.<br />
We say K is superior if for all a, b ∈ A<br />
d(u)<br />
• basically, combination always gets worse<br />
• or, no negative edge in a graph<br />
w(e)<br />
a ≤ a ⊗ b, b ≤ a ⊗ b.<br />
d(u) ⊗ w(e)<br />
({0, 1}, ∨, ∧, 0, 1)<br />
([0, 1], max, ×, 0, 1)<br />
(R + ∪{+∞}, min, +, +∞, 0)<br />
(R ∪{+∞}, min, +, +∞, 0)<br />
Liang Huang<br />
18<br />
Dynamic Programming
Dijkstra Algorithm<br />
• Dijkstra does not require acyclicity<br />
•<br />
•<br />
instead of topological order, we use best-first order<br />
but this requires superiority of the semiring<br />
Let K =(A, ⊕, ⊗, 0, 1) be a semiring, and ≤ a partial ordering over A.<br />
We say K is superior if for all a, b ∈ A<br />
d(u)<br />
• basically, combination always gets worse<br />
• or, no negative edge in a graph<br />
w(e)<br />
a ≤ a ⊗ b, b ≤ a ⊗ b.<br />
d(u) ⊗ w(e)<br />
({0, 1}, ∨, ∧, 0, 1)<br />
([0, 1], max, ×, 0, 1)<br />
(R + ∪{+∞}, min, +, +∞, 0)<br />
(R ∪{+∞}, min, +, +∞, 0)<br />
Liang Huang<br />
18<br />
Dynamic Programming
Dijkstra Algorithm<br />
• keep a cut (S : V - S) where S vertices are fixed<br />
• maintain a priority queue Q of V - S vertices<br />
• each iteration choose the best vertex v from Q<br />
•<br />
move v to S, and use d(v) to forward-update others<br />
...<br />
d(u) ⊕ = d(v) ⊗ w(v, u)<br />
S<br />
V - S<br />
time complexity:<br />
O((V+E) lgV) (binary heap)<br />
O(V lgV + E) (fib. heap)<br />
Liang Huang<br />
19<br />
Dynamic Programming
Dijkstra Algorithm<br />
• keep a cut (S : V - S) where S vertices are fixed<br />
• maintain a priority queue Q of V - S vertices<br />
• each iteration choose the best vertex v from Q<br />
•<br />
move v to S, and use d(v) to forward-update others<br />
...<br />
d(u) ⊕ = d(v) ⊗ w(v, u)<br />
S<br />
V - S<br />
time complexity:<br />
O((V+E) lgV) (binary heap)<br />
O(V lgV + E) (fib. heap)<br />
Liang Huang<br />
19<br />
Dynamic Programming
Dijkstra Algorithm<br />
• keep a cut (S : V - S) where S vertices are fixed<br />
• maintain a priority queue Q of V - S vertices<br />
• each iteration choose the best vertex v from Q<br />
•<br />
move v to S, and use d(v) to forward-update others<br />
w(v, u)<br />
d(u) ⊕ = d(v) ⊗ w(v, u)<br />
...<br />
S<br />
V - S<br />
time complexity:<br />
O((V+E) lgV) (binary heap)<br />
O(V lgV + E) (fib. heap)<br />
Liang Huang<br />
19<br />
Dynamic Programming
Viterbi vs. Dijkstra<br />
• structural vs. algebraic constraints<br />
•<br />
Dijkstra only applicable to optimization problems<br />
monotonic optimization problems<br />
Liang Huang<br />
20<br />
Dynamic Programming
Viterbi vs. Dijkstra<br />
• structural vs. algebraic constraints<br />
•<br />
Dijkstra only applicable to optimization problems<br />
monotonic optimization problems<br />
Liang Huang<br />
20<br />
Dynamic Programming
Viterbi vs. Dijkstra<br />
• structural vs. algebraic constraints<br />
•<br />
Dijkstra only applicable to optimization problems<br />
monotonic optimization problems<br />
Liang Huang<br />
20<br />
Dynamic Programming
Viterbi vs. Dijkstra<br />
• structural vs. algebraic constraints<br />
•<br />
Dijkstra only applicable to optimization problems<br />
monotonic optimization problems<br />
many<br />
NLP<br />
problems<br />
Liang Huang<br />
20<br />
Dynamic Programming
Viterbi vs. Dijkstra<br />
• structural vs. algebraic constraints<br />
•<br />
Dijkstra only applicable to optimization problems<br />
monotonic optimization problems<br />
many<br />
NLP<br />
problems<br />
forward-backward<br />
(Inside semiring)<br />
Liang Huang<br />
20<br />
Dynamic Programming
Viterbi vs. Dijkstra<br />
• structural vs. algebraic constraints<br />
•<br />
Dijkstra only applicable to optimization problems<br />
monotonic optimization problems<br />
many<br />
NLP<br />
problems<br />
forward-backward<br />
(Inside semiring)<br />
Liang Huang<br />
max-margin<br />
models<br />
20<br />
Dynamic Programming
Viterbi vs. Dijkstra<br />
• structural vs. algebraic constraints<br />
•<br />
Dijkstra only applicable to optimization problems<br />
monotonic optimization problems<br />
many<br />
NLP<br />
problems<br />
forward-backward<br />
(Inside semiring)<br />
Liang Huang<br />
max-margin<br />
models<br />
20<br />
cyclic FSMs/<br />
grammars<br />
Dynamic Programming
What if both fail?<br />
monotonic optimization problems<br />
many<br />
NLP<br />
problems<br />
generalized Bellman-Ford<br />
(CLR, 1990; Mohri, 2002)<br />
or, first do strongly-connected components (SCC)<br />
which gives a DAG; use Viterbi globally on this SCC-DAG;<br />
use Bellman-Ford locally within each SCC<br />
Liang Huang<br />
21<br />
Dynamic Programming
What if both work?<br />
monotonic optimization problems<br />
many<br />
NLP<br />
problems<br />
full Dijkstra is slower than Viterbi<br />
O((V + E) lgV) vs. O(V + E)<br />
but it can finish as early as the target vertex is popped<br />
a (V + E) lgV vs. V + E<br />
Q: how to (magically) reduce a?<br />
Liang Huang<br />
22<br />
Dynamic Programming
A* Search<br />
d(v)<br />
h(v)<br />
• d(v): the distance from source s to v<br />
• h(v): the distance from v to target t<br />
• (v) must be an optimistic estimate of h(v): (v) h(v)<br />
• now, prioritize the queue by d(v) ⊗ (v)<br />
• Dijkstra is a special case where (v) = <br />
• also requires d(v) ⊗ (v) never improves (superior)<br />
• hope: d(t) ⊗ (t) = d(t) can be popped sooner<br />
Liang Huang<br />
23<br />
(v)<br />
Dynamic Programming
Two Dimensional Survey<br />
traversing order<br />
search space<br />
Viterbi<br />
Generalized<br />
Viterbi<br />
Dijkstra<br />
Knuth<br />
Liang Huang<br />
24<br />
Dynamic Programming
Two Dimensional Survey<br />
traversing order<br />
search space<br />
Viterbi<br />
Generalized<br />
Viterbi<br />
Dijkstra<br />
Knuth<br />
Liang Huang<br />
24<br />
Dynamic Programming
Background: CFG and Parsing<br />
Liang Huang<br />
25<br />
Dynamic Programming
(Directed) Hypergraphs<br />
• a generalization of graphs<br />
•<br />
• e = (T(e), h(e), f e). arity |e| = |T(e)|<br />
• a totally-ordered weight set R<br />
• we borrow the operator to be the comparison<br />
• weight function f e : R |e| to R<br />
• generalizes the ⊗ operator in semirings<br />
edge => hyperedge: several vertices to one vertex<br />
tails<br />
Liang Huang<br />
1<br />
2<br />
fe<br />
head<br />
d(v) ⊕ = f e (d(u 1 ),d(u 2 ))<br />
26<br />
Dynamic Programming
Hypergraphs and Deduction<br />
tails<br />
: a<br />
1 2<br />
: b<br />
: a : b<br />
1 2<br />
antecedents<br />
fe<br />
: fe (a,b)<br />
: fe (a,b)<br />
fe<br />
head<br />
consequent<br />
Liang Huang<br />
27<br />
(Nederhof, 2003)<br />
Dynamic Programming
Hypergraphs and Deduction<br />
tails<br />
: a<br />
1 2<br />
: b<br />
: a : b<br />
1 2<br />
antecedents<br />
fe<br />
: fe (a,b)<br />
: fe (a,b)<br />
fe<br />
head<br />
consequent<br />
(B, i, k)<br />
: a<br />
(C, k, j)<br />
: b<br />
1 2<br />
Liang Huang<br />
fe<br />
: a b Pr(A B C)<br />
(A, i, j)<br />
27<br />
(Nederhof, 2003)<br />
Dynamic Programming
Hypergraphs and Deduction<br />
tails<br />
: a<br />
1 2<br />
: b<br />
: a : b<br />
1 2<br />
antecedents<br />
fe<br />
: fe (a,b)<br />
: fe (a,b)<br />
fe<br />
head<br />
consequent<br />
(B, i, k)<br />
: a<br />
(C, k, j)<br />
: b<br />
1 2<br />
fe<br />
(B, i, k) (C, k, j)<br />
(A, i, j)<br />
AB C<br />
: a b Pr(A B C)<br />
(A, i, j)<br />
(Nederhof, 2003)<br />
Liang Huang<br />
27<br />
Dynamic Programming
Weight Functions and Semirings<br />
Liang Huang<br />
28<br />
Dynamic Programming
Weight Functions and Semirings<br />
d(u)<br />
w(e)<br />
d(u) ⊗ w(e)<br />
Liang Huang<br />
28<br />
Dynamic Programming
Weight Functions and Semirings<br />
d(u)<br />
w(e)<br />
d(u) ⊗ w(e)<br />
d(u)<br />
fe<br />
fe(d(u))<br />
Liang Huang<br />
28<br />
Dynamic Programming
Weight Functions and Semirings<br />
d(u)<br />
d(u)<br />
w(e)<br />
fe<br />
d(u) ⊗ w(e)<br />
fe(d(u))<br />
fe(a) = a ⊗ w(e)<br />
Liang Huang<br />
28<br />
Dynamic Programming
Weight Functions and Semirings<br />
d(u)<br />
d(u)<br />
w(e)<br />
fe<br />
d(u) ⊗ w(e)<br />
fe(d(u))<br />
fe(a) = a ⊗ w(e)<br />
1<br />
tails<br />
2<br />
...<br />
fe<br />
head<br />
k<br />
Liang Huang<br />
28<br />
Dynamic Programming
Weight Functions and Semirings<br />
d(u)<br />
d(u)<br />
w(e)<br />
fe<br />
d(u) ⊗ w(e)<br />
fe(d(u))<br />
fe(a) = a ⊗ w(e)<br />
tails<br />
1<br />
2<br />
...<br />
fe<br />
fe(a1, ..., ak)<br />
head<br />
k<br />
Liang Huang<br />
28<br />
Dynamic Programming
Weight Functions and Semirings<br />
d(u)<br />
d(u)<br />
w(e)<br />
fe<br />
d(u) ⊗ w(e)<br />
fe(d(u))<br />
fe(a) = a ⊗ w(e)<br />
tails<br />
1<br />
2<br />
...<br />
fe<br />
fe(a1, ..., ak) = a1 ⊗ ... ⊗ ak ⊗ w(e)<br />
head<br />
k<br />
Liang Huang<br />
28<br />
Dynamic Programming
Weight Functions and Semirings<br />
d(u)<br />
d(u)<br />
w(e)<br />
fe<br />
d(u) ⊗ w(e)<br />
fe(d(u))<br />
fe(a) = a ⊗ w(e)<br />
tails<br />
1<br />
2<br />
...<br />
fe<br />
fe(a1, ..., ak) = a1 ⊗ ... ⊗ ak ⊗ w(e)<br />
head<br />
k<br />
Liang Huang<br />
28<br />
Dynamic Programming
Weight Functions and Semirings<br />
d(u)<br />
w(e)<br />
d(u) ⊗ w(e)<br />
fe(a) = a ⊗ w(e)<br />
d(u)<br />
fe<br />
fe(d(u))<br />
semiringcomposed<br />
tails<br />
1<br />
2<br />
...<br />
fe<br />
fe(a1, ..., ak) = a1 ⊗ ... ⊗ ak ⊗ w(e)<br />
head<br />
k<br />
Liang Huang<br />
28<br />
Dynamic Programming
Weight Functions and Semirings<br />
d(u)<br />
w(e)<br />
d(u) ⊗ w(e)<br />
fe(a) = a ⊗ w(e)<br />
d(u)<br />
fe<br />
fe(d(u))<br />
semiringcomposed<br />
tails<br />
1<br />
2<br />
...<br />
fe<br />
fe(a1, ..., ak) = a1 ⊗ ... ⊗ ak ⊗ w(e)<br />
head<br />
k<br />
also extend monotonicity and<br />
superiority to weight functions<br />
Liang Huang<br />
28<br />
Dynamic Programming
Two Dimensional Survey<br />
traversing order<br />
search space<br />
Viterbi<br />
Generalized<br />
Viterbi<br />
Dijkstra<br />
Knuth<br />
Liang Huang<br />
29<br />
Dynamic Programming
Viterbi Algorithm for DAGs<br />
1. topological sort<br />
2. visit each vertex v in sorted order and do updates<br />
• for each incoming edge (u, v) in E<br />
• use d(u) to update d(v):<br />
•<br />
key observation: d(u) is fixed to optimal at this time<br />
w(u, v)<br />
d(v) ⊕ = d(u) ⊗ w(u, v)<br />
• time complexity: O( V + E )<br />
Liang Huang<br />
30<br />
Dynamic Programming
Viterbi Algorithm for DAHs<br />
1. topological sort<br />
2. visit each vertex v in sorted order and do updates<br />
• for each incoming hyperedge e = ((u 1, .., u|e|), v, fe)<br />
• use d(u i)’s to update d(v)<br />
• key observation: d(ui)’s are fixed to optimal at this time<br />
1<br />
2<br />
fe<br />
d(v) ⊕ = f e (d(u 1 ), ··· ,d(u |e| ))<br />
• time complexity: O( V + E ) (assuming constant arity)<br />
Liang Huang<br />
31<br />
Dynamic Programming
Example: CKY Parsing<br />
• parsing with CFGs in Chomsky Normal Form (CNF)<br />
• typical instance of the generalized Viterbi for DAHs<br />
• many variants of CKY ~ various topological ordering<br />
(S, 1, n) (S, 1, n) (S, 1, n)<br />
Liang Huang<br />
O(n 3 |P|)<br />
with CKY pseudo-code<br />
32<br />
Dynamic Programming
Example: CKY Parsing<br />
• parsing with CFGs in Chomsky Normal Form (CNF)<br />
• typical instance of the generalized Viterbi for DAHs<br />
• many variants of CKY ~ various topological ordering<br />
(S, 1, n) (S, 1, n) (S, 1, n)<br />
Liang Huang<br />
O(n 3 |P|)<br />
with CKY pseudo-code<br />
32<br />
Dynamic Programming
Example: CKY Parsing<br />
• parsing with CFGs in Chomsky Normal Form (CNF)<br />
• typical instance of the generalized Viterbi for DAHs<br />
• many variants of CKY ~ various topological ordering<br />
(S, 1, n) (S, 1, n) (S, 1, n)<br />
bottom-up<br />
Liang Huang<br />
O(n 3 |P|)<br />
with CKY pseudo-code<br />
32<br />
Dynamic Programming
Example: CKY Parsing<br />
• parsing with CFGs in Chomsky Normal Form (CNF)<br />
• typical instance of the generalized Viterbi for DAHs<br />
• many variants of CKY ~ various topological ordering<br />
(S, 1, n) (S, 1, n) (S, 1, n)<br />
bottom-up<br />
Liang Huang<br />
O(n 3 |P|)<br />
with CKY pseudo-code<br />
32<br />
Dynamic Programming
Example: CKY Parsing<br />
• parsing with CFGs in Chomsky Normal Form (CNF)<br />
• typical instance of the generalized Viterbi for DAHs<br />
• many variants of CKY ~ various topological ordering<br />
(S, 1, n) (S, 1, n) (S, 1, n)<br />
bottom-up<br />
left-to-right<br />
Liang Huang<br />
O(n 3 |P|)<br />
with CKY pseudo-code<br />
32<br />
Dynamic Programming
Example: CKY Parsing<br />
• parsing with CFGs in Chomsky Normal Form (CNF)<br />
• typical instance of the generalized Viterbi for DAHs<br />
• many variants of CKY ~ various topological ordering<br />
(S, 1, n) (S, 1, n) (S, 1, n)<br />
bottom-up<br />
left-to-right<br />
Liang Huang<br />
O(n 3 |P|)<br />
with CKY pseudo-code<br />
32<br />
Dynamic Programming
Example: CKY Parsing<br />
• parsing with CFGs in Chomsky Normal Form (CNF)<br />
• typical instance of the generalized Viterbi for DAHs<br />
• many variants of CKY ~ various topological ordering<br />
(S, 1, n) (S, 1, n) (S, 1, n)<br />
bottom-up left-to-right right-to-left<br />
Liang Huang<br />
O(n 3 |P|)<br />
with CKY pseudo-code<br />
32<br />
Dynamic Programming
Forward Variant for DAHs<br />
1. topological sort<br />
2. visit each vertex v in sorted order and do updates<br />
• for each outgoing hyperedge e = ((u 1, .., u|e|), h(e), fe)<br />
• if d(ui)’s have all been fixed to optimal<br />
• use d(ui)’s to update d(h(e))<br />
v = i<br />
1 fe<br />
2 =<br />
• time complexity: O( V + E )<br />
Liang Huang<br />
33<br />
Dynamic Programming
Forward Variant for DAHs<br />
1. topological sort<br />
2. visit each vertex v in sorted order and do updates<br />
• for each outgoing hyperedge e = ((u 1, .., u|e|), h(e), fe)<br />
• if d(ui)’s have all been fixed to optimal<br />
• use d(ui)’s to update d(h(e))<br />
v = i<br />
1 fe<br />
2 =<br />
• time complexity: O( V + E )<br />
Liang Huang<br />
33<br />
Dynamic Programming
Forward Variant for DAHs<br />
1. topological sort<br />
2. visit each vertex v in sorted order and do updates<br />
• for each outgoing hyperedge e = ((u 1, .., u|e|), h(e), fe)<br />
• if d(ui)’s have all been fixed to optimal<br />
• use d(ui)’s to update d(h(e))<br />
v = i<br />
Q: how to avoid repeated checking?<br />
maintain a counter r[e] for each e:<br />
how many tails yet to be fixed?<br />
fire this hyperedge only if r[e]=0<br />
• time complexity: O( V + E )<br />
Liang Huang<br />
33<br />
2 =<br />
1 fe<br />
Dynamic Programming
Example: Treebank Parsers<br />
• State-of-the-art statistical parsers<br />
• (Collins, 1999; Charniak, 2000)<br />
•<br />
• can’t do backward updates<br />
• don’t know how to decompose a big item<br />
• forward update from vertex (X, i, j)<br />
no fixed grammar (every production is possible)<br />
• check all vertices like (Y, j, k) or (Y, k, i) in the chart (fixed)<br />
• try combine them to form bigger item (Z, i, k) or (Z, k, j)<br />
Liang Huang<br />
34<br />
Dynamic Programming
Dijkstra Algorithm<br />
• keep a cut (S : V - S) where S vertices are fixed<br />
• maintain a priority queue Q of V - S vertices<br />
• each iteration choose the best vertex v from Q<br />
•<br />
move v to S, and use d(v) to forward-update others<br />
...<br />
d(u) ⊕ = d(v) ⊗ w(v, u)<br />
S<br />
V - S<br />
time complexity:<br />
O((V+E) lgV) (binary heap)<br />
O(V lgV + E) (fib. heap)<br />
Liang Huang<br />
35<br />
Dynamic Programming
Dijkstra Algorithm<br />
• keep a cut (S : V - S) where S vertices are fixed<br />
• maintain a priority queue Q of V - S vertices<br />
• each iteration choose the best vertex v from Q<br />
•<br />
move v to S, and use d(v) to forward-update others<br />
...<br />
d(u) ⊕ = d(v) ⊗ w(v, u)<br />
S<br />
V - S<br />
time complexity:<br />
O((V+E) lgV) (binary heap)<br />
O(V lgV + E) (fib. heap)<br />
Liang Huang<br />
35<br />
Dynamic Programming
Dijkstra Algorithm<br />
• keep a cut (S : V - S) where S vertices are fixed<br />
• maintain a priority queue Q of V - S vertices<br />
• each iteration choose the best vertex v from Q<br />
•<br />
move v to S, and use d(v) to forward-update others<br />
w(v, u)<br />
d(u) ⊕ = d(v) ⊗ w(v, u)<br />
...<br />
S<br />
V - S<br />
time complexity:<br />
O((V+E) lgV) (binary heap)<br />
O(V lgV + E) (fib. heap)<br />
Liang Huang<br />
35<br />
Dynamic Programming
Knuth (1977) Algorithm<br />
• keep a cut (S : V - S) where S vertices are fixed<br />
• maintain a priority queue Q of V - S vertices<br />
• each iteration choose the best vertex v from Q<br />
•<br />
move v to S, and use d(v) to forward-update others<br />
...<br />
1<br />
S<br />
V - S<br />
time complexity:<br />
O((V+E) lgV) (binary heap)<br />
O(V lgV + E) (fib. heap)<br />
Liang Huang<br />
36<br />
Dynamic Programming
Knuth (1977) Algorithm<br />
• keep a cut (S : V - S) where S vertices are fixed<br />
• maintain a priority queue Q of V - S vertices<br />
• each iteration choose the best vertex v from Q<br />
•<br />
move v to S, and use d(v) to forward-update others<br />
...<br />
1<br />
S<br />
V - S<br />
time complexity:<br />
O((V+E) lgV) (binary heap)<br />
O(V lgV + E) (fib. heap)<br />
Liang Huang<br />
36<br />
Dynamic Programming
Knuth (1977) Algorithm<br />
• keep a cut (S : V - S) where S vertices are fixed<br />
• maintain a priority queue Q of V - S vertices<br />
• each iteration choose the best vertex v from Q<br />
•<br />
move v to S, and use d(v) to forward-update others<br />
...<br />
1 fe<br />
S<br />
V - S<br />
time complexity:<br />
O((V+E) lgV) (binary heap)<br />
O(V lgV + E) (fib. heap)<br />
Liang Huang<br />
36<br />
Dynamic Programming
Knuth (1977) Algorithm<br />
• keep a cut (S : V - S) where S vertices are fixed<br />
• maintain a priority queue Q of V - S vertices<br />
• each iteration choose the best vertex v from Q<br />
•<br />
move v to S, and use d(v) to forward-update others<br />
...<br />
1 fe<br />
S<br />
V - S<br />
time complexity:<br />
O((V+E) lgV) (binary heap)<br />
O(V lgV + E) (fib. heap)<br />
Liang Huang<br />
36<br />
Dynamic Programming
Example: A* Parsing<br />
• Use A* search on top of the Knuth Algorithm<br />
•<br />
Showed significant speed up with carefully designed<br />
heuristic functions (Klein and Manning, 2003)<br />
[open problem] can you still define heuristic function<br />
if weight functions are not semiring-composed?<br />
Liang Huang<br />
37<br />
Dynamic Programming
Example: A* Parsing<br />
• Use A* search on top of the Knuth Algorithm<br />
•<br />
Showed significant speed up with carefully designed<br />
heuristic functions (Klein and Manning, 2003)<br />
[open problem] can you still define heuristic function<br />
if weight functions are not semiring-composed?<br />
Liang Huang<br />
37<br />
Dynamic Programming
Same Picture Again<br />
monotonic optimization problems<br />
many<br />
NLP<br />
problems<br />
Liang Huang<br />
38<br />
Dynamic Programming
Same Picture Again<br />
monotonic optimization problems<br />
many<br />
NLP<br />
problems<br />
PCFG parsing<br />
with CNF<br />
Liang Huang<br />
38<br />
Dynamic Programming
Same Picture Again<br />
monotonic optimization problems<br />
many<br />
NLP<br />
problems<br />
Inside-Outside Alg.<br />
(Inside semiring)<br />
PCFG parsing<br />
with CNF<br />
Liang Huang<br />
38<br />
Dynamic Programming
Same Picture Again<br />
monotonic optimization problems<br />
many<br />
NLP<br />
problems<br />
Inside-Outside Alg.<br />
(Inside semiring)<br />
max-margin<br />
parsing<br />
PCFG parsing<br />
with CNF<br />
Liang Huang<br />
38<br />
Dynamic Programming
Same Picture Again<br />
monotonic optimization problems<br />
many<br />
NLP<br />
problems<br />
Inside-Outside Alg.<br />
(Inside semiring)<br />
max-margin<br />
parsing<br />
PCFG parsing<br />
with CNF<br />
cyclic<br />
grammars<br />
Liang Huang<br />
38<br />
Dynamic Programming
Same Picture Again<br />
monotonic optimization problems<br />
many<br />
NLP<br />
problems<br />
Inside-Outside Alg.<br />
(Inside semiring)<br />
max-margin<br />
parsing<br />
PCFG parsing<br />
with CNF<br />
cyclic<br />
grammars<br />
generalized<br />
generalized<br />
Bellman-Ford<br />
(open)<br />
Liang Huang<br />
38<br />
Dynamic Programming
Conclusion<br />
• surveyed two general frameworks and two<br />
important algorithms in DP<br />
• adapted the theory of semirings to weight functions<br />
• demonstrated advantages of modularity<br />
• focused on optimization problems<br />
• covered typical NLP applications<br />
• a better understanding of these theories helps NLP<br />
• a generic DP language for NLP: Dyna (Eisner et al.)<br />
• suggested some open problems<br />
Liang Huang<br />
39<br />
Dynamic Programming