Induction of Word and Phrase Alignments for Automatic Document ...

leverage the vast number of 〈document, abstract〉 pairs that are freely available on the internet and in other collections, and to create word- and phrase-aligned 〈document, abstract〉 corpora automatically.

This paper presents a statistical model for learning such alignments in a completely unsupervised manner. The model is based on an extension of a hidden Markov model, in which multiple emissions are made in the course of one transition. We have described efficient algorithms in this framework, all based on dynamic programming. Using this framework, we have experimented with complex models of movement and lexical correspondences. Unlike the approaches used in machine translation, where only very simple models are used, we have shown how to efficiently and effectively leverage such disparate knowledge sources as WordNet, syntax trees, and identity models. We have empirically demonstrated that our model is able to learn the complex structure of 〈document, abstract〉 pairs. Our system outperforms competing approaches, including the standard machine translation alignment models (Brown et al., 1993; Vogel, Ney, and Tillmann, 1996) and the state-of-the-art Cut & Paste summary alignment technique (Jing, 2002).
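
As a concrete illustration of what "multiple emissions in the course of one transition" means computationally, the following is a minimal sketch of a forward recursion for such a model. The state set, the `trans` and `emit` tables, the `<start>` symbol, and the phrase-length cap are illustrative assumptions of ours, not the paper's actual parameterization:

```python
from collections import defaultdict

def forward(summary, states, trans, emit, max_phrase_len=3):
    """Forward DP for an HMM whose transitions each emit a multi-token
    phrase of the summary (multiple emissions per transition).

    summary        -- list of summary tokens
    states         -- list of hidden states (e.g., document phrases)
    trans          -- trans[s_prev][s]: transition probability
    emit           -- emit(phrase, s): probability that s emits phrase
    max_phrase_len -- cap on emitted-phrase length, to bound the DP
    """
    n = len(summary)
    # alpha[j][s] = prob. of generating summary[:j] with the chain in s.
    alpha = [defaultdict(float) for _ in range(n + 1)]
    alpha[0]["<start>"] = 1.0
    for j in range(1, n + 1):
        for k in range(max(0, j - max_phrase_len), j):
            phrase = tuple(summary[k:j])  # tokens k..j-1, emitted at once
            for s in states:
                e = emit(phrase, s)
                if e == 0.0:
                    continue
                for s_prev, a in alpha[k].items():
                    alpha[j][s] += a * trans.get(s_prev, {}).get(s, 0.0) * e
    return alpha
```

Each cell sums over how many summary tokens the final transition emitted, which is what keeps the recursion polynomial despite the phrase emissions.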

We have analyzed two sources of error in our model: issues of null-generated summary words and of lexical identity. Within the model itself, we have already suggested two major sources of error in our alignment procedure. Clearly, more work needs to be done to fix these problems. One approach that we believe will be particularly fruitful would be to add a fifth model to the linearly interpolated rewrite model, based on lists of synonyms automatically extracted from large corpora. Additionally, incorporating some sort of weak coreference knowledge into the model might help with the second class of errors made by the model.
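
To make the suggested extension concrete, adding a synonym-based component is just one more term in the linear interpolation. The sketch below is ours: the uniform-over-synonyms distribution, the function names, and the weight handling are illustrative assumptions, not the paper's formulation.

```python
def interpolated_rewrite(models, weights):
    """Linearly interpolated rewrite model:
    p(s | d) = sum_i w_i * p_i(s | d), with weights summing to one."""
    def p(summary_phrase, doc_phrase):
        return sum(w * m(summary_phrase, doc_phrase)
                   for w, m in zip(weights, models))
    return p

def synonym_model(synonyms):
    """Hypothetical fifth component: uniform mass over a document
    phrase's synonyms, as might be extracted from large corpora."""
    def p(summary_phrase, doc_phrase):
        syns = synonyms.get(doc_phrase, ())
        return 1.0 / len(syns) if summary_phrase in syns else 0.0
    return p
```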

One obvious aspect of our method that may reduce its general usefulness is the computation time. In fact, we found that despite the efficient dynamic programming algorithms available for this model, the state space and output alphabet are simply so large and complex that we were forced to first map documents down to extracts before we could process them (and even so, computation took roughly 1,000 processor hours). Though we have not done so in this work, we do believe that there is room for improvement computationally, as well. One obvious first approach would be to run a simpler model for the first iteration (for example, Model 1 from machine translation (Brown et al., 1993), which tends to be very "recall oriented") and use this to seed subsequent iterations of the more complex model. By doing so, one could recreate the extracts at each iteration, using the previous iteration's parameters to make better and shorter extracts. Similarly, one might only allow summary words to align to words found in their corresponding extract sentences, which would serve to significantly speed up training and, combined with the parameterized extracts, might not hurt performance. A final option, but one that we do not advocate, would be to give up on phrases and train the model in a word-to-word fashion. This could be coupled with heuristic phrasal creation as is done in machine translation (Och and Ney, 2000), but by doing so, one completely loses the probabilistic interpretation that makes this model so pleasing.
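
The Model 1 seeding idea can be sketched compactly. Below is a generic EM loop for IBM Model 1 (Brown et al., 1993), written here in a document-to-summary direction under our own naming; it is a sketch of the proposed cheap first pass, not the training code used in this work.

```python
from collections import defaultdict

def model1_em(pairs, iterations=5):
    """EM for IBM Model 1. The resulting word-rewrite table t could
    seed later iterations of a richer alignment model.

    pairs -- list of (document_tokens, summary_tokens) pairs
    Returns t, where t[d][s] approximates p(summary word s | doc word d).
    """
    # Unnormalized uniform start; Model 1's log-likelihood has a single
    # global maximum, so the initialization barely matters.
    t = defaultdict(lambda: defaultdict(lambda: 1.0))
    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)
        for doc, summ in pairs:  # E-step: expected alignment counts
            for s in summ:
                z = sum(t[d][s] for d in doc)
                for d in doc:
                    c = t[d][s] / z
                    count[d][s] += c
                    total[d] += c
        for d in count:          # M-step: renormalize per document word
            for s in count[d]:
                t[d][s] = count[d][s] / total[d]
    return t
```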

Aside from computational considerations, the most obvious future effort in the vein of this model is to incorporate it into a full document summarization system. Since this can be done in many ways, including training extraction systems, compression systems, and headline generation systems, we leave this to future work so that we could focus specifically on the alignment task in this paper. Nevertheless, the true usefulness of this model will be borne out by its application to true summarization tasks.
