29.01.2014 Views

Extractive Summarization of Development Emails

Extractive Summarization of Development Emails

Extractive Summarization of Development Emails

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

- the produced summary<br />

- the short version <strong>of</strong> the summary<br />

- the length <strong>of</strong> the summary<br />

- a FeaturesExtractor instance<br />

- a global tagger<br />

Class Summary behavior:<br />

- summUpMail() appends to the same text all the email sentences marked as relevant, and passes the obtained<br />

summary to MailSummaryView<br />

- makeShortSummary() creates a summary considering only sentences whose relevance leaf is at higher levels<br />

<strong>of</strong> the tree, until reaching 60% <strong>of</strong> the length <strong>of</strong> the original message.<br />

The task <strong>of</strong> class FeaturesExtractor, in a nutshell, is to implement the algorithm outputted by Weka Explorer.<br />

FeaturesExtractor took as input a Mail object and a tagger, removed the lines <strong>of</strong> the format “On ,<br />

wrote:" right before a replying-email and the replying-lines starting with ">", and stored:<br />

- the subject <strong>of</strong> the email<br />

- the passed tagger which will be used to individuate the several grammatical parts <strong>of</strong> speech<br />

- a sentence-id hash map<br />

- a matrix where, for each sentence <strong>of</strong> the email, all the values <strong>of</strong> the extracted relevant features are inserted<br />

Dealing with parts <strong>of</strong> speech<br />

As we saw at Section 3.5, among the important features for the determination <strong>of</strong> the relevance <strong>of</strong> a sentence there is<br />

also num_verbs_norm. To understand how many verbs there are in a sentence, we used the function tagString(<<br />

st ring >) <strong>of</strong> the MaxentTagger we stored. This tool is a part-<strong>of</strong>-speech tagger suitable for Java programming,<br />

developed by Stanford natural Language Processing Group 21 . We installed it in Eclipse very easily by importing<br />

stanford-postagger.jar in the Referenced Libraries, and by creating a folder tagger where to place the English<br />

vocabulary files.<br />

Implementation <strong>of</strong> the relevance tree<br />

The function determineRelevance() computes through a long i f − chain the relevance <strong>of</strong> a sentence by accessing<br />

the information stored in the features matrix. The result (1 if the sentence is calculated to be “relevant", 0 otherwise)<br />

is then stored back in the last column <strong>of</strong> the matrix itself.<br />

How to<br />

Using such a plug-in is really intuitive. The user:<br />

1. opens, in Eclipse editor, REmail and REmail Summary views<br />

2. lets CouchDB start (if a shared server among developers does not already exist)<br />

3. clicks on some class to visualize the related emails in the REmail box view<br />

4. reads the extractive summary in REmail Summary box view.<br />

5. If the user judges that the produced summary is still too long, we provide a “Shorten" button that returns a<br />

skimmed summary which is long at most 60% <strong>of</strong> the original text. We remind, however, that this is a completely<br />

deterministic process, surely not aimed at preserving the human generated-likeness. This functionality is active<br />

only for emails longer than 4 sentences.<br />

21 http://nlp.stanford.edu/s<strong>of</strong>tware/tagger.shtml<br />

24

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!