29.05.2013 Views

RR_03_02


Table No   Position on x axis   Position on y axis   Inclusion   Proximity   Area
1          1                    1                    0           2, 3        ?
2          1                    2                    0           1, 3, 20    ?
3          2                    2                    0           1, 2, 20    ?
4          3                    3                    3           8, 9        ?
5          4                    4                    4           6, 7        ?
6          5                    4                    4           5, 7        ?
7          4                    5                    4           4, 5, 6     ?
---        ---                  ---                  ---         ---         ---
---        ---                  ---                  ---         ---         ---

• Level in the Table of Contents (TOC): determines at which level the label representing each table should be displayed.
    o Labels coming from tables that have the same inclusion criterion should be at the same level.
• Merging criteria (when tables can safely be merged):
    o Only tables that have the same inclusion criterion can be merged, provided that:
        • Merging takes place only at the lowest levels in the hierarchy
        • The tables share identical sides
        • Merging may occur between left/right or top/bottom neighbors
        • No border exists
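The merging criteria above can be expressed as a single predicate. The following is a minimal sketch, not the authors' implementation: the `Box` record, its pixel coordinates, the `is_leaf` flag, and the reading of "a border exists" as "either table draws a border" are all assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical rendered rectangle of one table (pixel units assumed)."""
    table_no: int
    inclusion: int            # value of the Inclusion column for this table
    left: int
    top: int
    right: int
    bottom: int
    has_border: bool = False  # assumption: True if the table draws a border
    is_leaf: bool = True      # assumption: True if no table is nested inside


def shares_identical_side(a: Box, b: Box) -> bool:
    """True if a and b touch along one full, equal-length side."""
    # Left/right neighbors: a vertical edge coincides and the heights match.
    side_by_side = (a.right == b.left or b.right == a.left) and \
        (a.top, a.bottom) == (b.top, b.bottom)
    # Top/bottom neighbors: a horizontal edge coincides and the widths match.
    stacked = (a.bottom == b.top or b.bottom == a.top) and \
        (a.left, a.right) == (b.left, b.right)
    return side_by_side or stacked


def can_merge(a: Box, b: Box) -> bool:
    """Apply the four merging criteria from the list above."""
    return (
        a.inclusion == b.inclusion              # same inclusion criterion
        and a.is_leaf and b.is_leaf             # lowest level of the hierarchy
        and shares_identical_side(a, b)         # identical shared side
        and not (a.has_border or b.has_border)  # no border present
    )


# Illustrative coordinates only; they are not taken from the example page.
t5 = Box(5, 4, left=0, top=0, right=100, bottom=50)
t6 = Box(6, 4, left=100, top=0, right=200, bottom=50)
t7 = Box(7, 4, left=0, top=50, right=150, bottom=80)
print(can_merge(t5, t6), can_merge(t5, t7))
```

Here `t5` and `t6` sit side by side with equal heights, so they qualify, whereas `t7` touches `t5` from below but is wider, so the shared side is not identical.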

This algorithm produces a map such as the one shown in Table 1, which gives a partial analysis of the example web page. The map can determine how the content is displayed, and in which order, based on a best-guess scenario that exploits accurate layout information about the web page. In the absence of such information, the content would have to be displayed based on other criteria, such as semantic relationships [8] or the absolute size of the tables in terms of the number of words, among others. The algorithm can also help in picking the most important part of the content based on geometric position. For example, Tables 4 and 8 are strategically placed in the upper middle part of the page, so chances are that they are the centerpieces of the page. In addition, the map allows us to label other tables: Table 2 is a sidebar, Table 20 is the lower bar, and so on.
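To make the use of the map concrete, the sketch below encodes the seven rows listed in Table 1 and derives a display order and a nesting depth from them. It rests on assumptions that the text does not confirm: the Inclusion value is read as the number of the enclosing table (0 meaning none), the display order is taken to be top-to-bottom and then left-to-right, and the names `MapRow`, `display_order`, and `depth` are hypothetical. The Area column is unknown ("?") in the source, so it is left as `None`.

```python
from typing import NamedTuple, Optional, Tuple


class MapRow(NamedTuple):
    table_no: int
    x: int                      # position on x axis
    y: int                      # position on y axis
    inclusion: int              # assumed: number of the enclosing table, 0 = none
    proximity: Tuple[int, ...]  # neighboring table numbers
    area: Optional[int] = None  # "?" in the source table


# The seven rows that Table 1 lists; the elided rows are not reproduced.
ROWS = [
    MapRow(1, 1, 1, 0, (2, 3)),
    MapRow(2, 1, 2, 0, (1, 3, 20)),
    MapRow(3, 2, 2, 0, (1, 2, 20)),
    MapRow(4, 3, 3, 3, (8, 9)),
    MapRow(5, 4, 4, 4, (6, 7)),
    MapRow(6, 5, 4, 4, (5, 7)),
    MapRow(7, 4, 5, 4, (4, 5, 6)),
]
BY_NO = {row.table_no: row for row in ROWS}


def display_order(rows):
    """Assumed ordering: top-to-bottom, then left-to-right."""
    return [r.table_no for r in sorted(rows, key=lambda r: (r.y, r.x))]


def depth(table_no):
    """Count how many enclosing tables lie above this one in the hierarchy."""
    levels = 0
    parent = BY_NO[table_no].inclusion
    while parent != 0:
        levels += 1
        parent = BY_NO[parent].inclusion
    return levels


# Print an indented outline in display order.
for no in display_order(ROWS):
    print("  " * depth(no) + f"Table {no}")
```

Under these assumptions, tables with the same inclusion value end up at the same depth, which matches the rule that their labels belong at the same TOC level.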

It is also interesting to see how the semantic relatedness factor [8] helps in determining which tables should be merged. This is something that we will explore in the future.

Table 1: A map based on Figure 1<br />


4. Extension of the Argument<br />

Estimating accurate layout information from HTML pages may be hard, but it is not impossible. It will require writing a parser that is based on geometric analysis of web pages and that can deliver considerably richer information about the content items, their relationships with each other, and their relative importance within the context of the page than is available from a traditional parser.
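As a small first step toward such a parser, the nesting relation between tables can already be recovered from the markup alone. The sketch below is an illustration under stated assumptions, not the geometry-based parser itself: it numbers tables in document order and records, for each one, the number of its enclosing table (0 if none), which is how the Inclusion column is interpreted here. The class name `TableNestingParser` is hypothetical, and actual x/y positions would additionally require a layout engine, which is not shown.

```python
from html.parser import HTMLParser


class TableNestingParser(HTMLParser):
    """Record, for each <table>, the number of the table that encloses it."""

    def __init__(self):
        super().__init__()
        self.inclusion = {}  # table number -> enclosing table number (0 = none)
        self._open = []      # stack of currently open table numbers
        self._count = 0      # tables seen so far, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._count += 1
            parent = self._open[-1] if self._open else 0
            self.inclusion[self._count] = parent
            self._open.append(self._count)

    def handle_endtag(self, tag):
        if tag == "table" and self._open:
            self._open.pop()


# Illustrative markup: two top-level tables, the second containing a third.
SAMPLE = (
    "<table><tr><td>a</td></tr></table>"
    "<table><tr><td>"
    "<table><tr><td>b</td></tr></table>"
    "</td></tr></table>"
)
parser = TableNestingParser()
parser.feed(SAMPLE)
print(parser.inclusion)
```

A traditional parser stops at this kind of structural information; the geometric analysis argued for here would add rendered positions and sizes on top of it.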

The next question is how to represent the rich set of information gathered from the geometry-based parser. This information will help in determining the "sense" and meaning of the main "message", which in turn will help in making the re-authoring more accurate. Recently, XML has been used successfully in many applications to mark up important information according to application-specific vocabularies. To properly exploit the flexibility inherent in XML, industry associations, e-commerce consortia, and the W3C need to develop their own vocabularies and be able to transform information marked up in XML from one vocabulary to another. Two W3C Recommendations, XSLT (Extensible Stylesheet Language Transformations) and XPath (the XML Path Language), meet that need. They provide a powerful implementation of a tree-oriented transformation language for transmuting instances of XML using one vocabulary into simple text, the legacy HTML vocabulary, or XML instances using any other vocabulary imaginable. It is probably better to use the XSLT language, which itself uses XPath, to specify how an implementation of an XSLT processor is to create a desired output from a given marked-up input.
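To illustrate the idea of marking up the map in XML and selecting parts of it with path expressions, the sketch below uses Python's standard `xml.etree.ElementTree` module, which supports a limited subset of XPath. The `<page>` and `<table>` vocabulary and its attribute names are hypothetical and chosen only for this example; the values come from rows 4 to 7 of Table 1. A full XSLT transformation would need an XSLT processor, which the Python standard library does not provide, so the conversion to plain text is done by hand here.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary: one <table> element per row of the map.
PAGE_XML = """
<page>
  <table no="4" x="3" y="3" inclusion="3"/>
  <table no="5" x="4" y="4" inclusion="4"/>
  <table no="6" x="5" y="4" inclusion="4"/>
  <table no="7" x="4" y="5" inclusion="4"/>
</page>
"""

root = ET.fromstring(PAGE_XML)

# XPath-style predicate: every <table> whose inclusion attribute equals "4".
nested = root.findall("./table[@inclusion='4']")
nested_numbers = [t.get("no") for t in nested]

# Hand-written stand-in for a transformation into simple text.
lines = [f"Table {t.get('no')} at ({t.get('x')}, {t.get('y')})" for t in nested]
print("\n".join(lines))
```

The same selection could be written as a template match in an XSLT stylesheet; the point of the sketch is only that a path expression over the marked-up map is enough to pick out the tables sharing one inclusion value.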

This representation of information will probably take a form where the content of each page is represented in a
