
The table title and the dimension summary header are rarely full sentences; most resemble noun phrases. They therefore need to be partitioned into pieces of an appropriate size, phrases or single words, that can serve as labels to be matched against the ancestor set. This study developed three different ways of partitioning in order to find the best method of building the candidate set: the candidates can be partitioned into NLP chunks; they can be partitioned into single WordNet synsets, from which the full chunk that would serve as a label is later retrieved; or a combination of the previous two can be used.

1) Case #1: NLP chunks

Chunking any sentence or phrase, in this case the table title and the dimension summary header, depends on the part-of-speech tags of the words it contains. The resulting chunks are considered candidates for the label that covers the dimension headers. We use the statistical parser of "Statistical parsing of English sentences" [17] to divide the candidates into appropriate noun phrases, verb phrases, and single words. For example, for the table title in Fig. 2, the chunks are {Cancer, new cases, selected primary site of cancer, sex}, which become the candidates in the candidate set.

2) Case #2: single words

Each candidate in the candidate set will be a single WordNet synset that is found in the table title and the dimension summary header.

3) Case #3: sliding window chunks

A sliding window is run over the table title and the dimension summary header. The candidates in the candidate set are the windows of 1, 2, or 3 WordNet synsets that this pass produces.
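A minimal sketch of Cases #2 and #3, assuming NLTK's WordNet interface: a window of one word reproduces the single-synset candidates of Case #2, and windows of one to three words give the sliding-window chunks of Case #3 (the helper name and the punctuation stripping are illustrative choices).

from nltk.corpus import wordnet as wn

def window_candidates(text, max_window=3):
    # Keep only words that carry a WordNet synset, then emit every window
    # of 1..max_window consecutive words as a candidate.
    words = [w for w in (t.strip(",.") for t in text.split()) if wn.synsets(w)]
    return [" ".join(words[i:i + n])
            for n in range(1, max_window + 1)
            for i in range(len(words) - n + 1)]

# Case #2 is the special case window_candidates(title, max_window=1).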

C. Assigning a Label from the Candidate Set

This step computes the semantic similarity between the list of ancestors derived in step A and the candidates in the candidate set derived in step B, using the semantic score presented in "WordNet-based semantic similarity measurement" [18]. The semantic similarities are computed between the ancestors in the ancestor set and the candidates in the candidate set, beginning with the highest-scoring ancestor and the candidates that appear after the prepositions in the table title. The candidate that matches one of the ancestors is selected as the label. The procedure has two branches, depending on the type of candidates in the candidate set, and is presented in the following algorithm.

1:  for each ancestor a in {a1, ..., an}
2:      for each candidate c in {c1, ..., cm}
3:          divide c into its WordNet synsets {w1, ..., wk}
4:          for each w
5:              identify the LCA between a and w, the depth d1 for a and d2 for w
6:              Sim = 2 · depth(LCA) / (d1 + d2)
7:          end
8:      end
9:      if Case #1 or Case #3 then {
            compute the similarity for the ancestor and the full chunk
10:         p1 = sum of the word-level scores Sim
11:         p2 = max of the word-level scores Sim
12:         T = (p1 + p2) / (l1 + l2)
13:         if T ≥ certain threshold then {
14:             L = c
15:             quit loop }
16:     else if Case #2 then {
17:         if (Max Sim ≥ certain threshold) then {
18:             L = c
19:             quit loop }

When the candidates are single WordNet synsets, the semantic similarity between each ancestor and each candidate is measured according to Wu and Palmer [19], as shown in (2), where depth(LCA) is the depth in the WordNet taxonomy of the least common ancestor of the two WordNet synsets, i.e., the ancestor and the candidate, d1 is the depth of the ancestor, and d2 is the depth of the candidate word.

Sim = 2 · depth(LCA) / (d1 + d2)    (2)
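Equation (2) can be transcribed directly against NLTK's WordNet taxonomy; taking the first noun sense of each word and using max_depth for the depths are assumptions of this sketch (NLTK's built-in wup_similarity implements the same Wu and Palmer measure with its own depth bookkeeping).

from nltk.corpus import wordnet as wn

def wu_palmer(ancestor_word, candidate_word):
    s1 = wn.synsets(ancestor_word, pos=wn.NOUN)[0]   # first noun sense of each word
    s2 = wn.synsets(candidate_word, pos=wn.NOUN)[0]
    lca = s1.lowest_common_hypernyms(s2)[0]          # least common ancestor in WordNet
    return 2 * lca.max_depth() / (s1.max_depth() + s2.max_depth())

# wu_palmer("disease", "cancer") scores high because cancer isA disease in WordNet.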

When the candidates are chunks, the chunks must be divided into single WordNet synsets and their semantic similarity computed against the ancestors, followed by a similarity computation for the full chunk. This extra step is needed because a semantic similarity function cannot compare a phrase with a word directly: the chunk contains more than one WordNet synset while the ancestor is a single WordNet synset. It is therefore necessary to ensure that the semantic similarity measure considers a group of WordNet synsets taken together.

Since (2) computes the similarity between two WordNet synsets, the similarity scores must be aggregated when the candidate is a phrase. Following [18], the total similarity T is the sum of p1, the summation of the similarities between the words of the phrase and the ancestor, and p2, the highest similarity score between the ancestor and a word in the phrase, divided by the sum of their lengths l1 and l2.

T = (p1 + p2) / (l1 + l2)    (3)
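Putting (2) and (3) together, the label-assignment branch of the algorithm can be sketched as follows, reusing wu_palmer from the previous sketch; treating l1 as the number of synset-bearing words in the chunk, l2 = 1 for the single-synset ancestor, and the threshold value 0.8 are assumptions, since the paper leaves them implicit.

def total_similarity(ancestor_word, chunk):
    # (3): p1 sums the word-level scores, p2 is the best single score;
    # l1 is the number of synset-bearing words in the chunk, l2 = 1 for
    # the single-synset ancestor (an assumption; the paper leaves the
    # lengths implicit).
    words = [w for w in chunk.split() if wn.synsets(w, pos=wn.NOUN)]
    if not words:
        return 0.0
    sims = [wu_palmer(ancestor_word, w) for w in words]
    return (sum(sims) + max(sims)) / (len(words) + 1)

def assign_label(ancestors, candidates, threshold=0.8):
    # Ancestors are assumed sorted best-first; the first candidate whose
    # total similarity clears the (placeholder) threshold becomes the label L.
    for a in ancestors:
        for c in candidates:
            if total_similarity(a, c) >= threshold:
                return c
    return None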

VI. EXPERIMENTAL RESULTS

Our system was tested with multidimensional tables from Statistics Canada's website of summary tables [20] and from the Statistics Austria website [21]. These tables are freely accessible, belong to national statistical agencies that continue to publish in HTML, and contain invaluable quantities of isA relationships. More importantly, they cover a wide range of topics. Our test dataset contained 305 randomly selected, domain-independent tables with 781 dimensions, covering such topics as education, construction, household, travel, and languages. By applying the algorithms, we were able to extract 92% of the isA relationships successfully.

Different techniques were implemented to reach the goal of this research. This section of the paper looks deeper into the results to determine their effectiveness. First, the preprocessing step described in Section V was found crucial to

