02.04.2013 Views

Special Issue on Information Explosion

Special Issue on Information Explosion

Special Issue on Information Explosion

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

<str<strong>on</strong>g>Special</str<strong>on</strong>g> <str<strong>on</strong>g>Issue</str<strong>on</strong>g> <strong>on</strong> Informati<strong>on</strong> Explosi<strong>on</strong> 209<br />

Fig. 2 An Open Search-engine Infrastructure: TSUBAKI 5)<br />

dren’s physical strength decline.” We need much more powerful, next-generati<strong>on</strong><br />

search facilities to exploit the accumulated informati<strong>on</strong> <strong>on</strong> the Web. Our Infoplosi<strong>on</strong><br />

Project puts stress <strong>on</strong> research infrastructures, and has c<strong>on</strong>structed an<br />

open search-engine infrastructure, called TSUBAKI developed by Kurohashi and<br />

Shinzato (Kyoto University), for next-generati<strong>on</strong> search technologies based <strong>on</strong><br />

deep natural language processing (NLP). 5) TSUBAKI is equipped with 100 milli<strong>on</strong><br />

Japanese Web pages, cleaned up and marked up with sentence boundaries<br />

for each page, and preprocessed with the morphological analyses, syntactic analyses<br />

and syn<strong>on</strong>ymous expressi<strong>on</strong> analyses for sentences. The resultant enriched<br />

data is stored in an XML-based format. In additi<strong>on</strong> to words, syn<strong>on</strong>ymous expressi<strong>on</strong>s<br />

and dependency relati<strong>on</strong>s are registered in the index, which enables<br />

TSUBAKI to handle complex natural language expressi<strong>on</strong>s in queries and documents<br />

and to identify syn<strong>on</strong>ymous expressi<strong>on</strong>s (Fig. 2).<br />

TSUBAKI is not <strong>on</strong>ly used through ordinary Web browsers, but it also<br />

provides an open access to its Web corpus and search APIs. A large text corpus<br />

is essential to knowledge extracti<strong>on</strong> and enhancement of NLP, and the XMLbased<br />

100-milli<strong>on</strong>-page corpus has been widely used in the research community.<br />

Furthermore, white-boxed TSUBAKI APIs are supporting the next-generati<strong>on</strong><br />

search research activities, such as clustering-based Web informati<strong>on</strong> summarizati<strong>on</strong>,<br />

or opini<strong>on</strong> and sentiment overview.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!