27.06.2013 Views

Proceedings of the 12th European Conference on Knowledge ...


Taha Osman, Dhavalkumar Thakker and Matt Nathan

inferencing. In this contribution, we focus on the technologies that are directly related to knowledge management, namely information extraction (IE), ontology engineering, and the knowledgebase. Next, we document our experience in exploiting and integrating these technologies to build an intelligent browsing engine for PA Images, with a view to contributing to the methodology of building semantic-driven knowledge management frameworks for this class of applications.

4.1 Information extraction

[Figure 1: Semantic Annotation System. Components shown: image captions; GATE-based IE system (gazetteer of known entities, JAPE grammar of context rules, disambiguation/summarisation); annotated image captions; PA Knowledgebase (PA Images ontology: schema and data; PA Images view); Linked Data Cloud. Flow labels: what to store, what to extract, confirmation, entities of interest, learned facts.]

It is fundamental to semantic retrieval systems that there is a mechanism to annotate images with descriptive metadata (entities) representing key concepts and relations of the investigated application domain. In our framework, as illustrated in Figure 1, the metadata generation process is handled by the GATE-based text mining system (Cunningham, 2002), which takes advantage of the rich, domain-specific PA Images knowledgebase. The PA Images knowledgebase consists of ontologies (schema) and the data that populates that schema.
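The split between schema and data can be illustrated with a minimal Python sketch. The class names, instance identifiers and helper functions below are hypothetical and chosen only for illustration; the actual PA Images ontology is not described at this level of detail in the text.

```python
# Hypothetical schema: each class maps to its parent class (None for the root).
SCHEMA = {
    "Thing": None,
    "Place": "Thing",
    "City": "Place",
    "Organisation": "Thing",
    "Club": "Organisation",
}

# Hypothetical data: each instance identifier maps to a class from the schema.
DATA = {
    "id:Liverpool_City": "City",
    "id:Liverpool_FC": "Club",
}


def is_a(cls, ancestor):
    """Walk up the class hierarchy and report whether cls falls under ancestor."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SCHEMA[cls]
    return False


def instances_of(ancestor):
    """Return, in sorted order, every instance whose class falls under ancestor."""
    return sorted(i for i, c in DATA.items() if is_a(c, ancestor))
```

In this sketch a query against a general class such as `Place` also returns instances of its subclasses, which is the kind of inference a schema makes possible over otherwise flat data.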

4.1.1 Entity recognition and disambiguation

The company receives images from a number of contracted photographers and photo agencies with minimal metadata, such as a short description of the main event in the image (headline) and a more elaborate description of the image with detailed event information (caption). The text mining system operates on the headline and caption to extract entities of interest.

The IE system utilises GATE for text mining and contains three components: gazetteers, a JAPE grammar and a disambiguation module. The gazetteers are lists of known entities that the system utilises during the initialisation process. The ontology influences the decision on what information needs to be stored in these gazetteers. The JAPE grammar rules detect additional entities and, at the same time, confirm the entities detected by the gazetteers.
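The interplay between gazetteer lookup and context rules can be sketched in plain Python. This is not GATE or JAPE code: the gazetteer entries, the single regular-expression rule and all function names are assumptions made for illustration, standing in for the real gazetteer lists and JAPE grammar.

```python
import re

# Hypothetical gazetteer: surface label -> class labels known for that label.
GAZETTEER = {
    "Liverpool": {"City", "Club"},
    "Premier League": {"Competition"},
}

# Hypothetical context rule in the spirit of a JAPE pattern:
# a capitalised phrase directly followed by "FC" is annotated as a Club.
CLUB_RULE = re.compile(r"\b((?:[A-Z][a-z]+ )*[A-Z][a-z]+) FC\b")


def gazetteer_lookup(text):
    """Return (label, class) pairs for every gazetteer label found in the text."""
    found = set()
    for label, classes in GAZETTEER.items():
        if re.search(r"\b" + re.escape(label) + r"\b", text):
            found.update((label, cls) for cls in classes)
    return found


def apply_context_rules(text):
    """Return (label, class) pairs produced by the context rule."""
    return {(m.group(1), "Club") for m in CLUB_RULE.finditer(text)}


def annotate(text):
    """Merge both sources; an annotation is confirmed when both sources agree."""
    from_gazetteer = gazetteer_lookup(text)
    from_rules = apply_context_rules(text)
    return {
        ann: {
            "known": ann in from_gazetteer,
            "confirmed": ann in from_gazetteer and ann in from_rules,
        }
        for ann in from_gazetteer | from_rules
    }
```

For a caption such as "Liverpool FC beat Everton FC in the Premier League", the rule confirms the gazetteer's Club reading of "Liverpool" and also detects "Everton" as a new Club that the gazetteer did not list. The City reading of "Liverpool" remains unconfirmed, which is the kind of case passed on for disambiguation.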

The disambiguation module deals with any ambiguity generated by the previous components. Here, ambiguity refers to cases where the same piece of text is either given two class labels (e.g. “Liverpool” as City and Club) or assigned two or more entity identifiers (e.g. “Premier League” as “id:Premier_Legue_Darts” and “id:Premier_Legue_Football”). We made a design decision whereby the IE system contains limited knowledge relating to the entities, as the gazetteers contain only alternate labels and classifications of the entities. The knowledgebase, by contrast, also contains the relationships between these entities. Therefore, the knowledgebase plays a crucial role in resolving ambiguity, as it has more embedded intelligence than the IE system.
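One way the relationships held in the knowledgebase could resolve such ambiguity is sketched below in Python. The scoring scheme (counting relations shared with other entities found in the same caption), the entity identifiers and the function name are all assumptions for illustration; the text does not specify the actual resolution algorithm.

```python
# Hypothetical knowledgebase relations: entity id -> ids of related entities.
KB_RELATIONS = {
    "id:Liverpool_FC": {"id:Anfield", "id:Premier_League_Football"},
    "id:Liverpool_City": {"id:Merseyside", "id:River_Mersey"},
}


def disambiguate(candidates, context_entities):
    """Pick the candidate sharing the most relations with the caption's other entities.

    Returns None when there is no supporting relation or when candidates tie,
    so that unresolved cases are left explicit rather than guessed.
    """
    scores = {
        cand: len(KB_RELATIONS.get(cand, set()) & context_entities)
        for cand in candidates
    }
    best = max(scores.values(), default=0)
    winners = [cand for cand, score in scores.items() if score == best]
    if best == 0 or len(winners) != 1:
        return None
    return winners[0]
```

With the caption context `{"id:Anfield"}`, the club reading of "Liverpool" is selected; with an empty context the function returns `None`, reflecting that the gazetteers alone hold too little knowledge to decide.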

The information extracted by the IE system generally carries a confidence rating that indicates how confident the system is in the extraction. For example, a higher level of confidence is assigned to extracted entities that are already known to the gazetteers and for which there is sufficient context in the text to confirm the validity of the extraction. The summarisation module filters the results for
