Extractive Summarization of Development Emails
- the produced summary
- the short version of the summary
- the length of the summary
- a FeaturesExtractor instance
- a global tagger
Class Summary behavior:
- summUpMail() concatenates all the email sentences marked as relevant into a single text and passes the resulting summary to MailSummaryView
- makeShortSummary() creates a summary from only those sentences whose relevance leaf lies at the higher levels of the tree, until it reaches 60% of the length of the original message.
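The two methods described above can be sketched as follows. This is only an illustration under stated assumptions, not the plug-in's actual code: the Sentence container, its leafDepth field, the character-based length measure and the way the 60% budget is enforced are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SummarySketch {
    /** Hypothetical container: one email sentence plus its classification data. */
    static final class Sentence {
        final int position;     // index of the sentence in the original email
        final String text;
        final boolean relevant; // outcome of the relevance tree
        final int leafDepth;    // assumed: depth of the tree leaf that classified it

        Sentence(int position, String text, boolean relevant, int leafDepth) {
            this.position = position;
            this.text = text;
            this.relevant = relevant;
            this.leafDepth = leafDepth;
        }
    }

    /** Joins all sentences marked as relevant, in their original order. */
    static String summUpMail(List<Sentence> mail) {
        List<Sentence> relevant = new ArrayList<>();
        for (Sentence s : mail) {
            if (s.relevant) relevant.add(s);
        }
        return join(relevant);
    }

    /**
     * Keeps relevant sentences classified closest to the root of the tree,
     * stopping before the summary exceeds 60% of the original length
     * (measured in characters here, which is an assumption).
     */
    static String makeShortSummary(List<Sentence> mail, int originalLength) {
        int budget = (int) (0.6 * originalLength);
        List<Sentence> candidates = new ArrayList<>();
        for (Sentence s : mail) {
            if (s.relevant) candidates.add(s);
        }
        // Shallow leaves first; List.sort is stable, so ties keep email order.
        candidates.sort(Comparator.comparingInt((Sentence s) -> s.leafDepth));

        List<Sentence> chosen = new ArrayList<>();
        int used = 0;
        for (Sentence s : candidates) {
            int cost = s.text.length() + (chosen.isEmpty() ? 0 : 1); // +1 for the separator
            if (used + cost > budget) break;
            chosen.add(s);
            used += cost;
        }
        // Restore the original reading order before joining.
        chosen.sort(Comparator.comparingInt((Sentence s) -> s.position));
        return join(chosen);
    }

    private static String join(List<Sentence> sentences) {
        StringBuilder sb = new StringBuilder();
        for (Sentence s : sentences) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(s.text);
        }
        return sb.toString();
    }
}
```

Sorting by leaf depth and then restoring the original order keeps the short summary readable while still preferring the sentences that the tree classified earliest.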
The task of class FeaturesExtractor, in a nutshell, is to implement the algorithm output by Weka Explorer.
FeaturesExtractor takes as input a Mail object and a tagger, removes the lines of the format “On ..., ... wrote:” that appear right before a quoted email as well as the quoted lines starting with ">", and stores:
- the subject of the email
- the passed tagger, which is used to identify the grammatical parts of speech
- a sentence-id hash map
- a matrix in which, for each sentence of the email, the values of all the extracted relevant features are inserted
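The removal of quoted material could look like the sketch below. The class name, the method name and the regular expression for the attribution line are assumptions made for illustration; the actual FeaturesExtractor may detect these lines differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class QuoteStripper {
    // Assumed shape of the attribution line: starts with "On", ends with "wrote:".
    private static final Pattern ATTRIBUTION = Pattern.compile("^On\\b.*\\bwrote:\\s*$");

    /** Drops quoted lines (starting with ">") and the attribution line right before them. */
    static String stripQuotedText(String body) {
        String[] lines = body.split("\\r?\\n", -1);
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i];
            if (line.startsWith(">")) {
                continue; // quoted line from the email being replied to
            }
            boolean beforeQuote = i + 1 < lines.length && lines[i + 1].startsWith(">");
            if (beforeQuote && ATTRIBUTION.matcher(line.trim()).matches()) {
                continue; // attribution line introducing the quoted email
            }
            kept.add(line);
        }
        return String.join("\n", kept);
    }
}
```

Requiring a quoted line right after the attribution line avoids deleting ordinary sentences that merely happen to begin with "On" and end with "wrote:".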
Dealing with parts of speech
As we saw in Section 3.5, num_verbs_norm is among the important features for determining the relevance of a sentence. To count the verbs in a sentence, we used the function tagString(<string>) of the MaxentTagger we stored. This tool is a part-of-speech tagger for Java, developed by the Stanford Natural Language Processing Group 21. We installed it in Eclipse by importing stanford-postagger.jar into the Referenced Libraries and by creating a folder named tagger in which to place the English vocabulary files.
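As an illustration of how the verbs can be counted once a sentence has been tagged, the sketch below works on the string returned by tagString(). It assumes that every token comes back as word, separator, tag, and that verbs carry Penn Treebank tags beginning with VB; the separator is a parameter because it may depend on the tagger version and configuration. The class and method names are hypothetical, and the normalization behind num_verbs_norm is not reproduced here.

```java
class VerbCounter {
    /**
     * Counts the tokens whose part-of-speech tag starts with "VB".
     *
     * @param tagged    output of the tagger, e.g. "fixes_VBZ the_DT bug_NN"
     * @param separator character between word and tag (assumed, e.g. '_')
     */
    static int countVerbs(String tagged, char separator) {
        int verbs = 0;
        for (String token : tagged.trim().split("\\s+")) {
            int cut = token.lastIndexOf(separator);
            if (cut < 0) {
                continue; // token without a tag, nothing to inspect
            }
            String tag = token.substring(cut + 1);
            if (tag.startsWith("VB")) {
                verbs++;
            }
        }
        return verbs;
    }
}
```

With a stored tagger, a call such as countVerbs(tagger.tagString(sentence), '_') would give the raw verb count for one sentence.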
Implementation of the relevance tree
The function determineRelevance() computes, through a long if-chain, the relevance of a sentence by accessing the information stored in the features matrix. The result (1 if the sentence is calculated to be “relevant”, 0 otherwise) is then stored back in the last column of the matrix itself.
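The general shape of such an if-chain is sketched below. Only num_verbs_norm is a feature named in the text; the second feature, the column layout and every threshold are invented for illustration and do not correspond to the tree produced by Weka.

```java
class RelevanceSketch {
    static final int NUM_VERBS_NORM = 0;  // feature named in the text
    static final int SUBJ_WORDS_NORM = 1; // hypothetical second feature

    /** Classifies every row and writes 1 (relevant) or 0 into its last column. */
    static void determineRelevance(double[][] features) {
        for (double[] row : features) {
            int relevant;
            // Each nested test mirrors one split of a decision tree.
            // All thresholds below are placeholders, not learned values.
            if (row[SUBJ_WORDS_NORM] > 0.5) {
                relevant = 1;
            } else if (row[NUM_VERBS_NORM] <= 0.2) {
                relevant = 0;
            } else if (row[SUBJ_WORDS_NORM] > 0.1) {
                relevant = 1;
            } else {
                relevant = 0;
            }
            row[row.length - 1] = relevant; // result stored back in the matrix
        }
    }
}
```

Hard-coding the tree this way keeps classification free of any run-time dependency on Weka, at the cost of editing the chain by hand whenever the model is retrained.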
How to
Using the plug-in is straightforward. The user:
1. opens the REmail and REmail Summary views in the Eclipse editor
2. lets CouchDB start (if a server shared among developers does not already exist)
3. clicks on a class to display the related emails in the REmail box view
4. reads the extractive summary in the REmail Summary box view
5. If the user judges that the produced summary is still too long, we provide a “Shorten” button that returns a skimmed summary at most 60% as long as the original text. We remind the reader, however, that this is a completely deterministic process that is not aimed at preserving the resemblance to a human-generated summary. This functionality is active only for emails longer than 4 sentences.
21 http://nlp.stanford.edu/software/tagger.shtml