28.11.2012 Views

Design and development of a concept-based multi ... - Citeseer

Design and development of a concept-based multi ... - Citeseer

Design and development of a concept-based multi ... - Citeseer

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

Shiyan Ou, Christopher S.G. Khoo <strong>and</strong> Dion H. Goh<br />

2. Literature review<br />

Summarization approaches can be divided broadly into extractive <strong>and</strong> abstractive approaches. A commonly<br />

used extractive approach is statistics-<strong>based</strong> sentence extraction. Statistical <strong>and</strong> linguistic features<br />

used in sentence extraction include frequent keywords, title keywords, cue phrases, sentence position,<br />

sentence length, <strong>and</strong> so on [4–6]. Sometimes, cohesive links such as lexical chain, co-reference <strong>and</strong><br />

word co-occurrence are also used to extract internally linked sentences <strong>and</strong> thus increase the cohesion<br />

<strong>and</strong> fluency <strong>of</strong> the summaries [7, 8]. Although extractive approaches are easy to implement, the resulting<br />

summaries <strong>of</strong>ten contain redundancy <strong>and</strong> lack cohesion <strong>and</strong> coherence. These weaknesses become<br />

more serious in <strong>multi</strong>-document summarization because the extracted sentences are from different<br />

sources, have different writing styles, <strong>of</strong>ten contain repeated information, <strong>and</strong> lack context. To reduce<br />

redundancy in <strong>multi</strong>-document summaries, some summarization systems, such as MEAD [9], XDoX<br />

[10] <strong>and</strong> MultiGen [11], clustered documents (or sentences) <strong>and</strong> extracted representative sentences from<br />

each cluster as components <strong>of</strong> the summary. In addition, the Maximal Marginal Relevance (MMR) metric<br />

was used by Carbonell <strong>and</strong> Goldstein [12] to minimize the redundancy <strong>and</strong> maximize the diversity<br />

among the extracted text passages (i.e. phrases, sentences, segments, or paragraphs).<br />

In comparison to extractive approaches, abstractive approaches involve text abstraction <strong>and</strong> generation<br />

to produce more coherent <strong>and</strong> concise summaries. Thus abstractive approaches seem more<br />

appropriate for <strong>multi</strong>-document summarization [13]. Real abstractive approaches that completely<br />

imitate human abstracting behavior are difficult to achieve with current natural language processing<br />

techniques [1]. Current abstractive approaches are in reality hybrid approaches involving both<br />

extractive <strong>and</strong> abstractive techniques. Abstractive approaches for <strong>multi</strong>-document summarization<br />

focus mainly on similarities <strong>and</strong> differences across documents, which can be identified <strong>and</strong> synthesized<br />

using various methods. The MultiGen summarizer identified similar words or phrases<br />

across documents through syntactic comparisons <strong>and</strong> converted them into fluent sentences using<br />

natural language generation techniques [11]. Lin [14] identified similar <strong>concept</strong>s <strong>based</strong> on a lexical<br />

thesaurus WordNet <strong>and</strong> generalized these <strong>concept</strong>s using a broader unifying <strong>concept</strong>. Mckeown <strong>and</strong><br />

Radev [15] extracted salient information using template-<strong>based</strong> information extraction <strong>and</strong> combined<br />

the instantiated slots in different templates using various content planning operators (e.g. agreement<br />

<strong>and</strong> contradiction). Zhang et al. [16] added the sentences that have specific cross-document rhetorical<br />

relationships (e.g. equivalence <strong>and</strong> contradiction) into a baseline summary generated using a<br />

sentence extraction method to improve the quality <strong>of</strong> the summary. Afantenos et al. [17] created a<br />

set <strong>of</strong> topic-specific templates using an information extraction system <strong>and</strong> connected these templates<br />

according to synchronic rhetorical relations (e.g. identity, elaboration, contradiction, equivalence)<br />

<strong>and</strong> diachronic rhetorical relations (e.g. continuation, stability).<br />

However, most <strong>of</strong> these studies identified similarities <strong>and</strong> differences using low-level text analysis,<br />

i.e. mainly <strong>based</strong> on lexical, syntactic <strong>and</strong> rhetorical relations between text units (e.g. words, phrases,<br />

<strong>and</strong> sentences). It is desirable to identify similarities <strong>and</strong> differences at a more semantic <strong>and</strong> contextual<br />

level. Thus, this study identified similarities <strong>and</strong> differences focusing on research <strong>concept</strong>s <strong>and</strong><br />

relationships. In sociological studies, the research <strong>concept</strong>s <strong>of</strong>ten represent elements <strong>of</strong> society <strong>and</strong><br />

human behavior whereas the relationships are semantic relations between research <strong>concept</strong>s investigated<br />

by researchers. This study adopts a combination <strong>of</strong> abstractive <strong>and</strong> extractive approaches –<br />

identifying more important sections using discourse parsing, extracting research <strong>concept</strong>s <strong>and</strong><br />

relationships using information extraction techniques, integrating <strong>concept</strong>s <strong>and</strong> relationships using<br />

syntactic analysis, combining the four kinds <strong>of</strong> information using the variable-<strong>based</strong> framework, <strong>and</strong><br />

organizing the integrated <strong>concept</strong>s using a taxonomy to generate a <strong>multi</strong>-document summary.<br />

3. Multi-document summarization system<br />

The summarization system has a blackboard architecture with five modules (shown in Figure 1). Each<br />

module accomplishes one summarization step. A knowledge base was used as a central repository for<br />

Journal <strong>of</strong> Information Science, XX (X) 2007, pp. 1–19 © CILIP, DOI: 10.1177/0165551507084630 3

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!