12.12.2012 Views

Induction of Word and Phrase Alignments for Automatic Document ...

Induction of Word and Phrase Alignments for Automatic Document ...

Induction of Word and Phrase Alignments for Automatic Document ...

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

Computational Linguistics Volume 1, Number 1<br />

Connecting Point has become the single largest Mac retailer after tripling it ’s Macintosh sales since January 1989 .<br />

Connecting Point Systems tripled it ’s sales <strong>of</strong> Apple Macintosh systems since last January . It is now the single largest seller <strong>of</strong> Macintosh .<br />

Figure 1<br />

Example alignment <strong>of</strong> a single abstract sentence with two document sentences.<br />

tively simple alignment between a document fragment <strong>and</strong> its corresponding abstract<br />

fragment from our corpus. In this example, a single abstract sentence (shown along<br />

the top <strong>of</strong> the figure) corresponds to exactly two document sentences (shown along the<br />

bottom <strong>of</strong> the figure). If we are able to automatically generate such alignments, one can<br />

envision the development <strong>of</strong> models <strong>of</strong> summarization that take into account effects<br />

<strong>of</strong> word choice, phrasal <strong>and</strong> sentence reordering, <strong>and</strong> content selection. Such models<br />

could be simultaneously linguistically motivated <strong>and</strong> data-driven. Furthermore, such<br />

alignments are potentially useful <strong>for</strong> current-day summarization techniques, including<br />

sentence extraction, headline generation <strong>and</strong> document compression.<br />

A close examination <strong>of</strong> the alignment shown in Figure 1 leads us to three observations<br />

about the nature <strong>of</strong> the relationship between a document <strong>and</strong> its abstract, <strong>and</strong><br />

hence about the alignment itself:<br />

• <strong>Alignments</strong> can occur at the granularity <strong>of</strong> words <strong>and</strong> <strong>of</strong> phrases.<br />

• The ordering <strong>of</strong> phrases in an abstract can be different from the ordering <strong>of</strong><br />

phrases in the document.<br />

• Some abstract words do not have direct correspondents in the document, <strong>and</strong><br />

many document words are never used in an abstract.<br />

In order to develop an alignment model that could potentially recreate such an<br />

alignment, we need our model to be able to operate both at the word level <strong>and</strong> at the<br />

phrase level, we need it to be able to allow arbitrary reorderings, <strong>and</strong> we need it to be<br />

able to account <strong>for</strong> words on both the document <strong>and</strong> abstract side that have no direct correspondence.<br />

In this paper, we develop an alignment model that is capable <strong>of</strong> learning<br />

all these aspects <strong>of</strong> the alignment problem in a completely unsupervised fashion.<br />

1.2 Shortcomings <strong>of</strong> current summarization models<br />

Current state <strong>of</strong> the art automatic single-document summarization systems employ one<br />

<strong>of</strong> three techniques: sentence extraction, bag-<strong>of</strong>-words headline generation, or document<br />

compression. Sentence extraction systems take full sentences from a document<br />

<strong>and</strong> concatenate them to <strong>for</strong>m a summary. Research in sentence extraction can be traced<br />

back to work in the mid fifties <strong>and</strong> late sixties by Luhn (1956) <strong>and</strong> Edmundson (1969).<br />

Recent techniques are startlingly not terribly divergent from these original methods;<br />

see (Mani <strong>and</strong> Maybury, 1999; Marcu, 2000; Mani, 2001) <strong>for</strong> a comprehensive overview.<br />

Headline generation systems, on the other h<strong>and</strong>, typically extract individual words<br />

from a document to produce a very short headline-style summary; see (Banko, Mittal,<br />

<strong>and</strong> Witbrock, 2000; Berger <strong>and</strong> Mittal, 2000; Schwartz, Zajic, <strong>and</strong> Dorr, 2002) <strong>for</strong> representative<br />

examples. Between these two extremes, there has been a relatively modest<br />

amount <strong>of</strong> work in sentence simplification (Ch<strong>and</strong>rasekar, Doran, <strong>and</strong> Bangalore, 1996;<br />

Mahesh, 1997; Carroll et al., 1998; Grefenstette, 1998; Jing, 2000; Knight <strong>and</strong> Marcu,<br />

2002) <strong>and</strong> document compression (Daumé III <strong>and</strong> Marcu, 2002; Daumé III <strong>and</strong> Marcu,<br />

2004; Zajic, Dorr, <strong>and</strong> Schwartz, 2004) in which words, phrases <strong>and</strong> sentences are selected<br />

in an extraction process.<br />

2

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!