A Tree to String Transducer

Machine translation 


K. Yamada, K. Knight: A Syntax-Based Statistical Translation Model. ACL 2001.

• The input sentence is preprocessed by a syntactic parser 

• The channel performs operations on each node of the parse tree: 

– reordering child nodes 

– inserting extra words at each node 

– translating leaf words 

• The output of the model is a string.

March 7, 2011 T2S 1


An Example ∗

[Figure: Parse Tree(E) for "he adores listening to music" passes through four steps (Reorder, Insert, Translate, Take Leaves) and yields Sentence(J).]

• Reorder: PRP VB1 VB2 becomes PRP VB2 VB1; VB TO becomes TO VB; TO MN becomes MN TO

• Insert: ha, no, ga and desu are added

• Translate: he → kare, music → ongaku, to → wo, listening → kiku, adores → daisuki

• Take Leaves: Kare ha ongaku wo kiku no ga daisuki desu

∗ Source: http://www.isi.edu/natural-language/people/cs562-8-22-06.pdf
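The four steps of the example can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the `Node` class, the three lookup dictionaries and the function `transduce` are hypothetical names, the channel decisions are fixed by hand rather than drawn from probability tables, and the noun is labelled NN as in the r-table slide.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str
    word: Optional[str] = None              # set on leaf nodes only
    children: List["Node"] = field(default_factory=list)


# Channel decisions for the running example (fixed by hand, not sampled).
REORDER = {                                  # child labels -> new child order
    ("PRP", "VB1", "VB2"): (0, 2, 1),
    ("VB", "TO"): (1, 0),
    ("TO", "NN"): (1, 0),
}
INSERT_RIGHT = {                             # (parent label, node label) -> word
    ("VB", "PRP"): "ha",
    ("VB2", "VB"): "no",
    ("VB", "VB2"): "ga",
    ("VB", "VB1"): "desu",
}
TRANSLATE = {"he": "kare", "music": "ongaku", "to": "wo",
             "listening": "kiku", "adores": "daisuki"}


def transduce(node: Node, parent: str = "TOP") -> List[str]:
    """Reorder, insert, translate and read off the leaves in one pass."""
    if node.word is not None:                # leaf: translate the word
        words = [TRANSLATE[node.word]]
    else:                                    # internal node: reorder children
        labels = tuple(child.label for child in node.children)
        order = REORDER.get(labels, tuple(range(len(labels))))
        words = []
        for i in order:
            words.extend(transduce(node.children[i], node.label))
    inserted = INSERT_RIGHT.get((parent, node.label))
    if inserted is not None:                 # inserted word goes to the right
        words.append(inserted)
    return words


tree = Node("VB", children=[
    Node("PRP", "he"),
    Node("VB1", "adores"),
    Node("VB2", children=[
        Node("VB", "listening"),
        Node("TO", children=[Node("TO", "to"), Node("NN", "music")]),
    ]),
])
sentence = " ".join(transduce(tree))
print(sentence)  # kare ha ongaku wo kiku no ga daisuki desu
```

Folding the four steps into one recursive pass is a design shortcut for brevity; the model itself treats them as separate stochastic operations at each node.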


⇒ The reordering is decided according to the r-table

original order    reordering     P(reorder)
PRP VB1 VB2       PRP VB1 VB2    0.074
                  PRP VB2 VB1    0.723
                  VB1 PRP VB2    0.061
                  · · ·          · · ·
VB TO             VB TO          0.252
                  TO VB          0.749
TO NN             TO NN          0.107
                  NN TO          0.893
· · ·             · · ·          · · ·

Before reordering: VB(PRP He, VB1 adores, VB2(VB listening, TO(TO to, NN music)))

After reordering: VB(PRP He, VB2(TO(NN music, TO to), VB listening), VB1 adores)

Reordering probability: 0.723 · 0.749 · 0.893 = 0.484
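The reordering probability is the product of one r-table entry per internal node. A minimal sketch, assuming the table is stored as a nested dictionary (the names `R_TABLE` and `reorder_prob` are illustrative) and containing only the entries shown on the slide:

```python
from math import prod

# r-table excerpt from the slide: original child order -> {reordering: probability}
R_TABLE = {
    ("PRP", "VB1", "VB2"): {
        ("PRP", "VB1", "VB2"): 0.074,
        ("PRP", "VB2", "VB1"): 0.723,
        ("VB1", "PRP", "VB2"): 0.061,
    },
    ("VB", "TO"): {("VB", "TO"): 0.252, ("TO", "VB"): 0.749},
    ("TO", "NN"): {("TO", "NN"): 0.107, ("NN", "TO"): 0.893},
}


def reorder_prob(choices):
    """Multiply the r-table entries for a list of (original, reordered) pairs."""
    return prod(R_TABLE[original][reordered] for original, reordered in choices)


# The reorderings chosen in the example, one per internal node.
example = [
    (("PRP", "VB1", "VB2"), ("PRP", "VB2", "VB1")),
    (("VB", "TO"), ("TO", "VB")),
    (("TO", "NN"), ("NN", "TO")),
]
p_reorder = reorder_prob(example)
print(round(p_reorder, 3))  # 0.484
```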



⇒ The insertion of a new node is decided according to the n-table

parent     TOP    VB     VB     VB     TO     TO     · · ·
node       VB     VB     PRP    TO     TO     NN     · · ·
P(None)    0.735  0.687  0.344  0.709  0.900  0.800  · · ·
P(Left)    0.004  0.061  0.004  0.030  0.003  0.096  · · ·
P(Right)   0.260  0.252  0.652  0.261  0.007  0.104  · · ·

w       P(ins-w)
ha      0.219
ta      0.131
wo      0.099
no      0.094
ni      0.080
te      0.078
ga      0.062
· · ·   · · ·
desu    0.0007

Before insertion: VB(PRP He, VB2(TO(NN music, TO to), VB listening), VB1 adores)

After insertion (inserted words in brackets): VB(PRP He [ha], VB2(TO(NN music, TO to), VB listening [no]) [ga], VB1 adores [desu])

Insertion probability: (0.652 · 0.219) · (0.252 · 0.094) · (0.252 · 0.062) · (0.252 · 0.0007) · 0.735 · 0.709 · 0.900 · 0.800 = 3.498e-9
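The insertion probability can be recomputed node by node. The sketch below is an illustration under two assumptions: the dictionary layout and the names `N_TABLE`, `INS_WORD` and `insertion_prob` are invented here, and the subscripts of VB1 and VB2 are dropped when looking up the n-table, which is what the factors 0.252 in the slide's arithmetic imply.

```python
from math import prod

# n-table excerpt: (parent label, node label) -> insertion position probabilities
N_TABLE = {
    ("TOP", "VB"): {"none": 0.735, "left": 0.004, "right": 0.260},
    ("VB", "VB"):  {"none": 0.687, "left": 0.061, "right": 0.252},
    ("VB", "PRP"): {"none": 0.344, "left": 0.004, "right": 0.652},
    ("VB", "TO"):  {"none": 0.709, "left": 0.030, "right": 0.261},
    ("TO", "TO"):  {"none": 0.900, "left": 0.003, "right": 0.007},
    ("TO", "NN"):  {"none": 0.800, "left": 0.096, "right": 0.104},
}
# Probability of the inserted word itself (excerpt from the slide).
INS_WORD = {"ha": 0.219, "no": 0.094, "ga": 0.062, "desu": 0.0007}


def insertion_prob(decisions):
    """decisions: one (parent, node, position, word) tuple per tree node;
    word is None when position is "none"."""
    return prod(
        N_TABLE[(parent, node)][position] * (INS_WORD[word] if word else 1.0)
        for parent, node, position, word in decisions
    )


# One decision for each of the eight nodes of the example tree.
example = [
    ("VB", "PRP", "right", "ha"),    # PRP (he)
    ("VB", "VB", "right", "no"),     # VB (listening)
    ("VB", "VB", "right", "ga"),     # VB2
    ("VB", "VB", "right", "desu"),   # VB1 (adores)
    ("TOP", "VB", "none", None),     # root VB
    ("VB", "TO", "none", None),      # inner TO
    ("TO", "TO", "none", None),      # TO (to)
    ("TO", "NN", "none", None),      # NN (music)
]
p_insert = insertion_prob(example)
print(f"{p_insert:.3e}")  # 3.498e-09
```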



⇒ The translation is decided according to the t-table

adores           he             listening      music            to             · · ·
daisuki 1.000    kare 0.952     kiku 0.333     ongaku 0.900     ni   0.216     · · ·
                 NULL 0.016     kii  0.333     naru   0.100     NULL 0.204
                 nani 0.005     mi   0.333                      to   0.133
                 · · ·          · · ·          · · ·            · · ·

Before translation: VB(PRP He [ha], VB2(TO(NN music, TO to), VB listening [no]) [ga], VB1 adores [desu])

After translation: VB(PRP kare [ha], VB2(TO(NN ongaku, TO wo), VB kiku [no]) [ga], VB1 daisuki [desu])

Translation probability: 0.952 · 0.900 · 0.038 · 0.333 · 1.000 = 0.0108
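The translation probability is the product of one t-table entry per leaf word, including t(kiku | listening) = 0.333, which is needed to arrive at 0.0108. In the sketch below the names `T_TABLE` and `translation_prob` are illustrative, and the value 0.038 for to → wo is taken from the slide's product because the table excerpt does not list it.

```python
from math import prod

# t-table excerpt: English word -> {foreign word: probability}
T_TABLE = {
    "adores": {"daisuki": 1.000},
    "he": {"kare": 0.952, "NULL": 0.016, "nani": 0.005},
    "listening": {"kiku": 0.333, "kii": 0.333, "mi": 0.333},
    "music": {"ongaku": 0.900, "naru": 0.100},
    # "wo": 0.038 comes from the slide's product, not from the table excerpt.
    "to": {"ni": 0.216, "NULL": 0.204, "to": 0.133, "wo": 0.038},
}


def translation_prob(pairs):
    """Multiply t(foreign | english) over all (english, foreign) leaf pairs."""
    return prod(T_TABLE[english][foreign] for english, foreign in pairs)


example = [("he", "kare"), ("music", "ongaku"), ("to", "wo"),
           ("listening", "kiku"), ("adores", "daisuki")]
p_translate = translation_prob(example)
print(round(p_translate, 4))  # 0.0108
```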



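Formal description

• Goal: Transform an English parse tree $E$ into a French sentence $f$

• Definitions

- $E$ consists of nodes $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$

- $f$ consists of words $f_1, f_2, \ldots, f_m$

- $\theta_i = (\nu_i, \rho_i, \tau_i)$ is a set of values of random variables associated with $\varepsilon_i$

- $\theta = \theta_1, \theta_2, \ldots, \theta_n$ is the set of all random variables associated with a parse tree $E = \varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$

$$P(f \mid E) = \sum_{\theta:\, \mathrm{Str}(\theta(E)) = f} P(\theta \mid E)$$

$$\begin{aligned}
P(\theta \mid E) &= P(\theta_1, \theta_2, \ldots, \theta_n \mid \varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n) \\
&= \prod_{i=1}^{n} P(\theta_i \mid \theta_1, \theta_2, \ldots, \theta_{i-1}, \varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n) \\
&\approx \prod_{i=1}^{n} P(\theta_i \mid \varepsilon_i)
\end{aligned}$$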



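Formal description

where

$$\begin{aligned}
P(\theta_i \mid \varepsilon_i) &= P(\nu_i, \rho_i, \tau_i \mid \varepsilon_i) \approx P(\nu_i \mid \varepsilon_i)\, P(\rho_i \mid \varepsilon_i)\, P(\tau_i \mid \varepsilon_i) \\
&= P(\nu_i \mid N(\varepsilon_i))\, P(\rho_i \mid R(\varepsilon_i))\, P(\tau_i \mid T(\varepsilon_i)) \\
&= n(\nu_i \mid N(\varepsilon_i))\, r(\rho_i \mid R(\varepsilon_i))\, t(\tau_i \mid T(\varepsilon_i))
\end{aligned}$$

$n(\nu \mid N(\varepsilon)) \equiv n(\nu \mid N)$, $r(\rho \mid R(\varepsilon)) \equiv r(\rho \mid R)$, $t(\tau \mid T(\varepsilon)) \equiv t(\tau \mid T)$ are the parameters of the model

For example:

• $n(\nu \mid N) = P(\text{right}, \text{ha} \mid \text{VB} - \text{PRP})$

• $r(\rho \mid R) = P(\text{PRP} - \text{VB2} - \text{VB1} \mid \text{PRP} - \text{VB1} - \text{VB2})$

$$P(f \mid E) = \sum_{\theta:\, \mathrm{Str}(\theta(E)) = f} \; \prod_{i=1}^{n} n(\nu_i \mid N(\varepsilon_i))\, r(\rho_i \mid R(\varepsilon_i))\, t(\tau_i \mid T(\varepsilon_i))$$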
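For a single θ the inner product factorizes into an insertion, a reordering and a translation part, so P(θ|E) for the running example is the product of the three stage probabilities computed on the earlier slides. A small sketch; the variable names are illustrative, and the inputs are the rounded values from the slides:

```python
# Stage probabilities of the running example, as given on the slides.
p_reorder = 0.484       # product of r(rho_i | R) over the nodes
p_insert = 3.498e-9     # product of n(nu_i | N) over the nodes
p_translate = 0.0108    # product of t(tau_i | T) over the nodes

# P(theta | E) = prod_i n(nu_i | N) * r(rho_i | R) * t(tau_i | T)
p_theta = p_insert * p_reorder * p_translate
print(f"{p_theta:.3e}")  # 1.828e-11
```

P(f|E) would additionally sum this quantity over every θ whose output string is f; the example covers only one such θ.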



Estimation of the parameters 

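1. Initialize all probability tables: $n(\nu \mid N)$, $r(\rho \mid R)$ and $t(\tau \mid T)$

2. Reset all counters: $c(\nu, N)$, $c(\rho, R)$ and $c(\tau, T)$

3. For each pair $\langle E, f \rangle$ in the training corpus

For all $\theta$ such that $f = \mathrm{Str}(\theta(E))$

- Let $cnt = P(\theta \mid E) \,/\, \sum_{\theta':\, \mathrm{Str}(\theta'(E)) = f} P(\theta' \mid E)$

- For $i = 1 \ldots n$,

$c(\nu_i, N(\varepsilon_i)) \mathrel{+}= cnt$

$c(\rho_i, R(\varepsilon_i)) \mathrel{+}= cnt$

$c(\tau_i, T(\varepsilon_i)) \mathrel{+}= cnt$

4. For each $\langle \nu, N \rangle$, $\langle \rho, R \rangle$, and $\langle \tau, T \rangle$

$n(\nu \mid N) = c(\nu, N) \,/\, \sum_{\nu'} c(\nu', N)$

$r(\rho \mid R) = c(\rho, R) \,/\, \sum_{\rho'} c(\rho', R)$

$t(\tau \mid T) = c(\tau, T) \,/\, \sum_{\tau'} c(\tau', T)$

5. Repeat steps 2-4 for several iterations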
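Steps 2 to 4 can be sketched as one function, assuming the derivations θ compatible with each training pair have already been enumerated. The data layout is an assumption made for this illustration: each θ is a list of (table, value, condition) events, and the function name `em_iteration` and the toy corpus are hypothetical.

```python
from collections import defaultdict
from math import prod


def em_iteration(corpus, tables):
    """One pass of steps 2-4.

    corpus: list of training pairs; each pair is the list of all derivations
            theta with Str(theta(E)) = f; each theta is a list of
            (table name, value, condition) events.
    tables: {"n": {(value, condition): prob}, "r": {...}, "t": {...}}
    """
    counts = {name: defaultdict(float) for name in tables}        # step 2
    for derivations in corpus:                                     # step 3
        scores = [prod(tables[tb][(value, cond)] for tb, value, cond in theta)
                  for theta in derivations]
        total = sum(scores)
        if total == 0.0:
            continue                                               # pair has zero probability
        for theta, score in zip(derivations, scores):
            cnt = score / total                                    # posterior weight of theta
            for tb, value, cond in theta:
                counts[tb][(value, cond)] += cnt
    new_tables = {}
    for name, table_counts in counts.items():                      # step 4
        norm = defaultdict(float)
        for (value, cond), c in table_counts.items():
            norm[cond] += c
        new_tables[name] = {
            (value, cond): (c / norm[cond] if norm[cond] else 0.0)
            for (value, cond), c in table_counts.items()
        }
    return new_tables


# Toy data: only the t-table is used; "a", "b" and "x" are placeholders.
tables = {"n": {}, "r": {}, "t": {("a", "x"): 0.5, ("b", "x"): 0.5}}
corpus = [
    [[("t", "a", "x")], [("t", "b", "x")]],   # pair 1: two compatible derivations
    [[("t", "a", "x")]],                      # pair 2: one compatible derivation
]
updated = em_iteration(corpus, tables)
print(updated["t"])  # {('a', 'x'): 0.75, ('b', 'x'): 0.25}
```

Enumerating every θ explicitly is exponential in general, which is why a graph-based procedure is used for efficient training.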



Efficient EM training 

The EM algorithm uses a graph structure for a pair $\langle E, f \rangle$

• A major-node $v(\varepsilon_i, f_k^l)$ shows a pairing of a subtree of $E$ and a substring of $f$

• Each major-node connects to several ν-subnodes $v(\nu; \varepsilon_i, f_k^l)$, showing which value of $\nu$ is selected. The arc has weight $P(\nu \mid \varepsilon_i)$

• A ν-subnode $v(\nu; \varepsilon_i, f_k^l)$ connects to a final-node with weight $P(\tau \mid \varepsilon_i)$ if $\varepsilon_i$ is a terminal node

• A ν-subnode connects to several ρ-subnodes $v(\rho; \nu, \varepsilon_i, f_k^l)$ with weight $P(\rho \mid \varepsilon_i)$

• A ρ-subnode is connected to π-subnodes $v(\pi; \rho, \nu, \varepsilon_i, f_k^l)$ with weight 1.0. The variable $\pi$ shows a particular way of partitioning $f_k^l$

• A π-subnode is connected to major-nodes corresponding to children of $\varepsilon_i$ with weight 1.0. A major-node can be connected from different π-subnodes.

[Figure: graph fragment showing the chain major-node → ν-subnode → ρ-subnode → π-subnode → major-node, with arcs labelled $P(\nu \mid \varepsilon)$ and $P(\rho \mid \varepsilon)$.]



Efficient EM training 

• A trace starting from the graph root, selecting one of the arcs from major-nodes, ν-subnodes and ρ-subnodes, and all the arcs from π-subnodes, corresponds to a particular $\theta$

• The product of the weights on the trace corresponds to $P(\theta \mid E)$

• An estimation algorithm similar to the inside-outside algorithm can be defined.

• The time complexity is $O(n^3 |\nu| |\rho| |\pi|)$
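The inside half of such an algorithm can be sketched on a toy graph. Major-nodes, ν-subnodes and ρ-subnodes choose one outgoing arc, so their inside value is a weighted sum; π-subnodes follow all outgoing arcs, so theirs is a product. The value at the root is then the sum over all traces of the product of their weights. The graph below, its weights and the names `GRAPH` and `inside` are hypothetical, and the outside pass and the count updates are left out.

```python
from functools import lru_cache
from math import prod

# Toy graph: node name -> (kind, [(arc weight, child name), ...]).
# "or" nodes select one arc, "and" nodes (pi-subnodes) follow every arc.
GRAPH = {
    "major:root": ("or",  [(0.6, "nu:root")]),
    "nu:root":    ("or",  [(0.5, "rho:root")]),
    "rho:root":   ("or",  [(1.0, "pi:root")]),
    "pi:root":    ("and", [(1.0, "major:A"), (1.0, "major:B")]),
    "major:A":    ("or",  [(0.9, "nu:A1"), (0.1, "nu:A2")]),
    "nu:A1":      ("or",  [(0.5, "final")]),
    "nu:A2":      ("or",  [(0.1, "final")]),
    "major:B":    ("or",  [(0.8, "nu:B")]),
    "nu:B":       ("or",  [(0.25, "final")]),
    "final":      ("final", []),
}


@lru_cache(maxsize=None)      # major-nodes can be shared between pi-subnodes
def inside(node):
    """Sum, over all traces below node, of the product of the arc weights."""
    kind, arcs = GRAPH[node]
    if kind == "final":
        return 1.0
    terms = [weight * inside(child) for weight, child in arcs]
    return prod(terms) if kind == "and" else sum(terms)


p_root = inside("major:root")
print(round(p_root, 4))  # 0.0276
```

Caching each node's value means every node is evaluated once, which is the source of the polynomial running time compared with enumerating traces one by one.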



Decoder description 

K. Yamada, K. Knight: A Decoder for Syntax-based Statistical MT. ACL 2002.

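Modifications to the original TM for phrasal translations:

• Fertility $\mu$ is used to allow 1-to-N mapping:

$$t(\tau \mid T) = t(f_1 f_2 \ldots f_l \mid e) = \mu(l \mid e) \prod_{i=1}^{l} t(f_i \mid e)$$

• Direct translation $\phi$ of an English phrase $e_1 e_2 \ldots e_m$ to a foreign phrase $f_1 f_2 \ldots f_l$ at non-terminal tree nodes:

$$ph(\phi \mid \Phi) = t(f_1 f_2 \ldots f_l \mid e_1 e_2 \ldots e_m) = \mu(l \mid e_1 e_2 \ldots e_m) \prod_{i=1}^{l} t(f_i \mid e_1 e_2 \ldots e_m)$$

• Linear mix (if $\varepsilon_i$ is non-terminal):

$$P(\theta_i \mid \varepsilon_i) = \lambda_{\Phi_i}\, ph(\phi_i \mid \Phi_i) + (1 - \lambda_{\Phi_i})\, r(\rho_i \mid R_i)\, n(\nu_i \mid N_i)$$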
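The fertility-based translation probability and the linear mix can be sketched as two small functions. Everything concrete here is assumed for illustration: the function names, the dictionary layout of the tables, the English phrase, and all probability values.

```python
from math import prod


def phrase_prob(foreign_words, english, mu, t):
    """mu(l | e) * prod_i t(f_i | e), where e is a word or a whole phrase."""
    l = len(foreign_words)
    return mu[(l, english)] * prod(t[(f, english)] for f in foreign_words)


def node_prob(lam, ph, r, n):
    """Linear mix at a non-terminal node: lam * ph + (1 - lam) * r * n."""
    return lam * ph + (1.0 - lam) * r * n


# Hypothetical tables: the phrase "listening to" translated as two words.
mu = {(2, "listening to"): 0.5}
t = {("wo", "listening to"): 0.4, ("kiku", "listening to"): 0.1}

ph = phrase_prob(["wo", "kiku"], "listening to", mu, t)   # 0.5 * 0.4 * 0.1
p_node = node_prob(lam=0.25, ph=ph, r=0.5, n=0.5)         # 0.25 * ph + 0.75 * 0.25
print(ph, p_node)
```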



Decoder description 

• Given a French sentence, the decoder finds the most plausible English parse tree

• Idea: a mechanism similar to normal parsing is used

• Steps:

1. Start from an English context-free grammar and incorporate the channel operations into it

2. For each non-lexical rule (such as “VP → VB NP PP”), supplement the grammar with reordered rules; their probabilities are taken from the r-table

3. Add rules such as “VP → VP X” and “X → word”; their probabilities are taken from the n-table

4. For each lexical rule in the English grammar, add rules such as “englishWord → foreignWord”

5. Parse a string of foreign words

6. Undo reordering operations and remove leaf nodes with foreign words

7. Among all possible trees, pick the one for which the product of the LM and TM probabilities is the highest
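Steps 2 and 4 amount to expanding the grammar from the probability tables. A minimal sketch under stated assumptions: the function names and the tuple format of the rules are invented here, the table excerpts are those from the earlier slides, and attaching t-table probabilities to the lexical rules is an assumption, since the slide does not say where they come from.

```python
# Excerpts of the r-table and t-table from the earlier slides.
R_TABLE = {
    ("PRP", "VB1", "VB2"): {
        ("PRP", "VB1", "VB2"): 0.074,
        ("PRP", "VB2", "VB1"): 0.723,
        ("VB1", "PRP", "VB2"): 0.061,
    },
}
T_TABLE = {"he": {"kare": 0.952, "NULL": 0.016, "nani": 0.005}}


def reordered_rules(lhs, rhs, r_table):
    """Step 2: one rule (lhs, reordered rhs, probability) per r-table entry."""
    return [(lhs, new_rhs, p) for new_rhs, p in r_table.get(tuple(rhs), {}).items()]


def lexical_rules(t_table):
    """Step 4: one rule englishWord -> foreignWord per t-table entry.
    Using the t-table probability for the rule is an assumption."""
    return [(english, foreign, p)
            for english, options in t_table.items()
            for foreign, p in options.items()]


rules = reordered_rules("VB", ["PRP", "VB1", "VB2"], R_TABLE)
lexicon = lexical_rules(T_TABLE)
for rule in rules + lexicon:
    print(rule)
```

A chart parser run over the foreign sentence with the expanded grammar would then cover step 5; that part is not sketched here.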

