Extractive Summarization of Development Emails

Figure 10. Partial view of joined ArgoUML + FreeNet benchmark (rows 1-20, columns I-Q)

Since we had to extract the values of all the features from every sentence of every analyzed email thread and write them into the prepared CSV files, doing this by hand would have taken far too long. We therefore implemented a Python program called "CsvHandler" to automate the filling. We relied heavily on the nltk library (footnote 13), which is well suited to tokenizing sentences, tagging words with their grammatical part of speech, and handling stop-words.

To split a sentence into words we mostly used the function nltk.regexp_tokenize(text, pattern), which takes a string and tokenizes it according to the given regular expression. We employed the following one:

    PATTERN = r'''(?x)
          \w+\://\S+
        | i\.e\.
        | e\.g\.
        | \w+(/\w+)+
        | \w+(\\\w+)+
        | \w+((-|_|\.)(?!(as|java|php|c|h)\W|\w+\()\w+)+
        | \w+
        | -\d+
        '''

To tag the adjectives, nouns, and verbs in a sentence we used the function nltk.pos_tag(), which takes a list of individual words and returns a list of (word, part-of-speech tag) pairs (e.g. 'VB' for verbs, 'NN' for nouns). To recognize stop-words we used the stopwords corpus from nltk.corpus.

After running this program on all the email threads, we merged all the results into a single CSV file, which we used as the input benchmark for the machine learning step.

3.5 Machine Learning

There exist many precise and technical definitions of machine learning. In simple terms, it is a branch of artificial intelligence that aims to design algorithms allowing computers to evolve behaviors by exploiting empirical data. In our research, by giving our benchmark as input to one of these algorithms, we should be able to determine which features matter most in human summarization, and use that result to build summaries automatically with our system. We performed this task with the support of the Weka Explorer platform (footnote 14).
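The tokenization step of CsvHandler described above can be sketched as follows. This is only a minimal illustration, not the thesis code: it uses the standard-library re module as a stand-in for nltk.regexp_tokenize, the function name tokenize is hypothetical, and the comments inside the pattern are one possible reading of each alternative rather than the authors' own documentation.

```python
import re

# The pattern from the text, laid out in verbose mode.
# Comments are an interpretation, not taken from the thesis.
PATTERN = r'''(?x)
      \w+\://\S+        # URLs such as http://nltk.org/
    | i\.e\.            # the abbreviation "i.e."
    | e\.g\.            # the abbreviation "e.g."
    | \w+(/\w+)+        # slash-separated paths
    | \w+(\\\w+)+       # backslash-separated paths
    | \w+((-|_|\.)(?!(as|java|php|c|h)\W|\w+\()\w+)+   # compound names
    | \w+               # plain words
    | -\d+              # negative integers
'''


def tokenize(sentence):
    """Return the tokens of `sentence` in order of appearance.

    re.finditer with group(0) yields the whole match, so the capturing
    groups inside PATTERN do not affect the returned tokens.
    """
    return [match.group(0) for match in re.finditer(PATTERN, sentence)]


if __name__ == "__main__":
    print(tokenize("See http://nltk.org/ for e.g. the foo_bar tokenizer"))
```

Because the alternatives are tried from left to right, URLs and the abbreviations "i.e." and "e.g." are kept as single tokens before the generic \w+ alternative has a chance to split them.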
We opened as input file our CSV containing the merged information from ArgoUML and FreeNet, and we removed redundant features that could distort the results (such as the individual vote of each participant, which is already included in their sum stored in column "hmn_score"). We started the environment with the following settings:

• We chose ClassifierSubsetEval (footnote 15) as attribute evaluator; it uses a classifier to estimate the "merit" of a set of attributes.
• The search method was BestFirst, "which explores a graph by expanding the most promising node chosen according to a specified rule" (footnote 16).
• We set "class" as the feature to evaluate.

Figure 11. Weka processing platform

Weka Explorer returned a relevance tree [Figure 12] stating that the 6 attributes which determine the relevance of a sentence are chars, num_nouns_norm, num_stopw_norm, num_verbs_norm, rel_pos_norm, and subj_words_norm. Thanks to the tree we could determine whether a sentence should be included in the summary simply by considering its values for these 6 attributes and following the conditions written in the tree, starting from the root. If we arrive at a leaf labeled "relevant", we include the sentence in the summary; otherwise we do not.

13 http://nltk.org/
14 http://www.cs.waikato.ac.nz/ml/weka/
15 http://bio.informatics.indiana.edu/ml_docs/weka/weka.attributeSelection.ClassifierSubsetEval.html
16 http://en.wikipedia.org/wiki/Best-first_search
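The walk through the relevance tree described above, from the root down to a leaf, can be sketched as follows. This is a hedged illustration only: the class names Node and Leaf, the functions classify and is_in_summary, the label "irrelevant", and above all the toy tree with its thresholds are assumptions made for the example. They do not reproduce the tree of Figure 12.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    label: str  # "relevant", or the assumed counterpart "irrelevant"


@dataclass
class Node:
    attribute: str               # e.g. "chars" or "rel_pos_norm"
    threshold: float             # split value (placeholder in the toy tree)
    low: Union["Node", Leaf]     # branch taken when value <= threshold
    high: Union["Node", Leaf]    # branch taken when value > threshold


def classify(tree: Union[Node, Leaf], features: Dict[str, float]) -> str:
    """Follow the conditions from the root until a leaf is reached."""
    node = tree
    while isinstance(node, Node):
        value = features[node.attribute]
        node = node.low if value <= node.threshold else node.high
    return node.label


def is_in_summary(tree: Union[Node, Leaf], features: Dict[str, float]) -> bool:
    """A sentence is kept only if the walk ends at a "relevant" leaf."""
    return classify(tree, features) == "relevant"


# Toy tree with made-up thresholds, NOT the tree produced by Weka.
TOY_TREE = Node(
    "chars", 40.0,
    low=Leaf("irrelevant"),
    high=Node("rel_pos_norm", 0.5,
              low=Leaf("relevant"),
              high=Leaf("irrelevant")),
)

if __name__ == "__main__":
    sentence = {"chars": 80, "rel_pos_norm": 0.2}
    print(is_in_summary(TOY_TREE, sentence))
```

Representing the tree as data rather than as nested if statements means that a different tree exported from Weka could be plugged in without changing the traversal code.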