Design and development of a concept-based multi ... - Citeseer
Design and development of a concept-based multi ... - Citeseer
Design and development of a concept-based multi ... - Citeseer
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
Shiyan Ou, Christopher S.G. Khoo <strong>and</strong> Dion H. Goh<br />
2. Literature review<br />
Summarization approaches can be divided broadly into extractive <strong>and</strong> abstractive approaches. A commonly<br />
used extractive approach is statistics-<strong>based</strong> sentence extraction. Statistical <strong>and</strong> linguistic features<br />
used in sentence extraction include frequent keywords, title keywords, cue phrases, sentence position,<br />
sentence length, <strong>and</strong> so on [4–6]. Sometimes, cohesive links such as lexical chain, co-reference <strong>and</strong><br />
word co-occurrence are also used to extract internally linked sentences <strong>and</strong> thus increase the cohesion<br />
<strong>and</strong> fluency <strong>of</strong> the summaries [7, 8]. Although extractive approaches are easy to implement, the resulting<br />
summaries <strong>of</strong>ten contain redundancy <strong>and</strong> lack cohesion <strong>and</strong> coherence. These weaknesses become<br />
more serious in <strong>multi</strong>-document summarization because the extracted sentences are from different<br />
sources, have different writing styles, <strong>of</strong>ten contain repeated information, <strong>and</strong> lack context. To reduce<br />
redundancy in <strong>multi</strong>-document summaries, some summarization systems, such as MEAD [9], XDoX<br />
[10] <strong>and</strong> MultiGen [11], clustered documents (or sentences) <strong>and</strong> extracted representative sentences from<br />
each cluster as components <strong>of</strong> the summary. In addition, the Maximal Marginal Relevance (MMR) metric<br />
was used by Carbonell <strong>and</strong> Goldstein [12] to minimize the redundancy <strong>and</strong> maximize the diversity<br />
among the extracted text passages (i.e. phrases, sentences, segments, or paragraphs).<br />
In comparison to extractive approaches, abstractive approaches involve text abstraction <strong>and</strong> generation<br />
to produce more coherent <strong>and</strong> concise summaries. Thus abstractive approaches seem more<br />
appropriate for <strong>multi</strong>-document summarization [13]. Real abstractive approaches that completely<br />
imitate human abstracting behavior are difficult to achieve with current natural language processing<br />
techniques [1]. Current abstractive approaches are in reality hybrid approaches involving both<br />
extractive <strong>and</strong> abstractive techniques. Abstractive approaches for <strong>multi</strong>-document summarization<br />
focus mainly on similarities <strong>and</strong> differences across documents, which can be identified <strong>and</strong> synthesized<br />
using various methods. The MultiGen summarizer identified similar words or phrases<br />
across documents through syntactic comparisons <strong>and</strong> converted them into fluent sentences using<br />
natural language generation techniques [11]. Lin [14] identified similar <strong>concept</strong>s <strong>based</strong> on a lexical<br />
thesaurus WordNet <strong>and</strong> generalized these <strong>concept</strong>s using a broader unifying <strong>concept</strong>. Mckeown <strong>and</strong><br />
Radev [15] extracted salient information using template-<strong>based</strong> information extraction <strong>and</strong> combined<br />
the instantiated slots in different templates using various content planning operators (e.g. agreement<br />
<strong>and</strong> contradiction). Zhang et al. [16] added the sentences that have specific cross-document rhetorical<br />
relationships (e.g. equivalence <strong>and</strong> contradiction) into a baseline summary generated using a<br />
sentence extraction method to improve the quality <strong>of</strong> the summary. Afantenos et al. [17] created a<br />
set <strong>of</strong> topic-specific templates using an information extraction system <strong>and</strong> connected these templates<br />
according to synchronic rhetorical relations (e.g. identity, elaboration, contradiction, equivalence)<br />
<strong>and</strong> diachronic rhetorical relations (e.g. continuation, stability).<br />
However, most <strong>of</strong> these studies identified similarities <strong>and</strong> differences using low-level text analysis,<br />
i.e. mainly <strong>based</strong> on lexical, syntactic <strong>and</strong> rhetorical relations between text units (e.g. words, phrases,<br />
<strong>and</strong> sentences). It is desirable to identify similarities <strong>and</strong> differences at a more semantic <strong>and</strong> contextual<br />
level. Thus, this study identified similarities <strong>and</strong> differences focusing on research <strong>concept</strong>s <strong>and</strong><br />
relationships. In sociological studies, the research <strong>concept</strong>s <strong>of</strong>ten represent elements <strong>of</strong> society <strong>and</strong><br />
human behavior whereas the relationships are semantic relations between research <strong>concept</strong>s investigated<br />
by researchers. This study adopts a combination <strong>of</strong> abstractive <strong>and</strong> extractive approaches –<br />
identifying more important sections using discourse parsing, extracting research <strong>concept</strong>s <strong>and</strong><br />
relationships using information extraction techniques, integrating <strong>concept</strong>s <strong>and</strong> relationships using<br />
syntactic analysis, combining the four kinds <strong>of</strong> information using the variable-<strong>based</strong> framework, <strong>and</strong><br />
organizing the integrated <strong>concept</strong>s using a taxonomy to generate a <strong>multi</strong>-document summary.<br />
3. Multi-document summarization system<br />
The summarization system has a blackboard architecture with five modules (shown in Figure 1). Each<br />
module accomplishes one summarization step. A knowledge base was used as a central repository for<br />
Journal <strong>of</strong> Information Science, XX (X) 2007, pp. 1–19 © CILIP, DOI: 10.1177/0165551507084630 3