RR_03_02
Table No | Position on x axis | Position on y axis | Inclusion | Proximity | Area<br />
1 | 1 | 1 | 0 | 2, 3 | ?<br />
2 | 1 | 2 | 0 | 1, 3, 20 | ?<br />
3 | 2 | 2 | 0 | 1, 2, 20 | ?<br />
4 | 3 | 3 | 3 | 8, 9 | ?<br />
5 | 4 | 4 | 4 | 6, 7 | ?<br />
6 | 5 | 4 | 4 | 5, 7 | ?<br />
7 | 4 | 5 | 4 | 4, 5, 6 | ?<br />
--- | --- | --- | --- | --- | ---<br />
--- | --- | --- | --- | --- | ---<br />
• Level in the Table of Contents (TOC): determines the<br />
level at which the label representing each<br />
table should be displayed.<br />
o Labels coming from tables that have<br />
the same inclusion criterion should be at<br />
the same level.<br />
• Merging criteria (when tables can safely be<br />
merged):<br />
o Only tables that have the same<br />
inclusion criterion can be merged,<br />
and then:<br />
• only at the lowest levels of the<br />
hierarchy;<br />
• only when they share an identical<br />
side;<br />
• as either left/right or<br />
top/bottom neighbors;<br />
• but not if a border exists.<br />
This algorithm provides a map as shown in Table 1,<br />
which gives a partial analysis of the example web page.<br />
This map can decide how the content is to be displayed,<br />
and in which order, on a best-guess basis by<br />
exploiting accurate layout information of the web page.<br />
In the absence of such information, the content would<br />
be displayed based on other criteria, such as<br />
semantic relationship [8], the absolute size of the tables in<br />
terms of the number of words, and others. This algorithm<br />
can also help us in picking the most important part of the<br />
content based on geometric position. For example, Tables<br />
4 and 8 are strategically placed in the upper middle part of<br />
the page, so chances are that they are the centerpieces of<br />
the page. In addition, this allows us to label other<br />
tables: for example, Table 2 as a sidebar, Table 20 as the lower<br />
bar, and so on.<br />
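As a rough illustration, the rows of Table 1 can be held as simple records and grouped by their inclusion value, which is the rule the TOC-level criterion relies on. This is only a sketch under stated assumptions: the names `MapRow`, `TABLE_MAP`, and `group_by_inclusion` are hypothetical, the unknown Area column is left out, and the inclusion value is compared for equality only, not interpreted.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MapRow:
    """One row of the layout map (field names are illustrative)."""
    table_no: int
    x: int            # position on the x axis
    y: int            # position on the y axis
    inclusion: int    # inclusion value, used here only for equality tests
    proximity: tuple  # numbers of the neighboring tables


# Rows copied from Table 1; the Area column is "?" there and is omitted.
TABLE_MAP = [
    MapRow(1, 1, 1, 0, (2, 3)),
    MapRow(2, 1, 2, 0, (1, 3, 20)),
    MapRow(3, 2, 2, 0, (1, 2, 20)),
    MapRow(4, 3, 3, 3, (8, 9)),
    MapRow(5, 4, 4, 4, (6, 7)),
    MapRow(6, 5, 4, 4, (5, 7)),
    MapRow(7, 4, 5, 4, (4, 5, 6)),
]


def group_by_inclusion(rows):
    """Group table numbers so that equal inclusion values share a TOC level."""
    groups = defaultdict(list)
    for row in rows:
        groups[row.inclusion].append(row.table_no)
    return dict(groups)


toc_groups = group_by_inclusion(TABLE_MAP)
```

With the seven rows shown, Tables 1 to 3 fall into one group, Table 4 into a group of its own, and Tables 5 to 7 into a third.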
It would also be interesting to see how the semantic<br />
relatedness factor [8] helps in determining which tables<br />
should be merged. This is something that we will explore<br />
in the future.<br />
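Leaving semantic relatedness aside, the purely geometric merging criteria listed earlier can be sketched as a predicate over two table rectangles. Everything in this sketch is an assumption made for illustration: the `TableBox` fields, the pixel-style coordinates, the reading of "identical sides" as a fully coinciding edge, and the use of a per-table `has_border` flag to stand in for "a border exists".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableBox:
    """Hypothetical rectangle for a rendered table."""
    left: int
    top: int
    right: int
    bottom: int
    inclusion: int            # same meaning as the Inclusion column of the map
    is_leaf: bool = True      # True if the table is at the lowest hierarchy level
    has_border: bool = False  # True if the table is drawn with a border


def share_identical_side(a, b):
    """True if a and b touch along an edge of identical length and position."""
    side_by_side = (a.top == b.top and a.bottom == b.bottom
                    and (a.right == b.left or b.right == a.left))
    stacked = (a.left == b.left and a.right == b.right
               and (a.bottom == b.top or b.bottom == a.top))
    return side_by_side or stacked


def can_merge(a, b):
    """Apply the four merging criteria to a pair of tables."""
    return (a.inclusion == b.inclusion                # same inclusion criterion
            and a.is_leaf and b.is_leaf               # lowest levels only
            and not (a.has_border or b.has_border)    # no border present
            and share_identical_side(a, b))           # left/right or top/bottom


left_cell = TableBox(0, 0, 10, 5, inclusion=4)
right_cell = TableBox(10, 0, 20, 5, inclusion=4)
taller_cell = TableBox(10, 0, 20, 8, inclusion=4)
```

Here `can_merge(left_cell, right_cell)` holds because the two boxes meet along a full common edge, while `can_merge(left_cell, taller_cell)` does not because the touching edges differ in length.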
Table 1: A map based on Figure 1<br />
4. Extension of the Argument<br />
Estimating accurate layout information from HTML pages<br />
may be hard, but not impossible. This will require writing<br />
a parser that will be based on geometric analysis of web<br />
pages and will be able to deliver considerably richer
information about the content items, their relationships with
each other, and their relative importance within the
context of that page than is
available from a traditional parser.
The next question is how to represent the rich set of
information gathered from the geometry-based parser. The
information collected from such a parser
will help in determining the "sense" and meaning of the
main "message", which in turn will help in making the re-authoring
more accurate. Recently, XML has been
successfully used in many applications to mark up<br />
important information according to application-specific<br />
vocabularies. To properly exploit the flexibility inherent in
the power of XML, industry associations, e-commerce<br />
consortia, and the W3C need to develop their own<br />
vocabularies to be able to transform information marked<br />
up in XML from one vocabulary to another. Two W3C<br />
Recommendations, XSLT (the Extensible Stylesheet<br />
Language Transformations) and XPath (the XML Path<br />
Language), meet that need. They provide a powerful<br />
implementation of a tree-oriented transformation language<br />
for transmuting instances of XML using one vocabulary<br />
into either simple text, the legacy HTML vocabulary, or<br />
XML instances using any other vocabulary imaginable. It
is probably better to use the XSLT language, which itself
uses XPath, to specify how an implementation of an<br />
XSLT processor is to create a desired output from a given<br />
marked-up input.<br />
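To make the idea of an XML representation concrete, the sketch below serializes a few rows of the layout map into XML and selects tables with a path expression. The vocabulary (`page`, `table`, and the attribute names) is invented for illustration and is not taken from the paper. Python's standard `xml.etree.ElementTree` module is used; it supports only a limited subset of XPath and does not implement XSLT, so this shows the selection step only, not a stylesheet transformation.

```python
import xml.etree.ElementTree as ET

# (table number, x position, y position, inclusion) for some rows of Table 1.
ROWS = [
    (1, 1, 1, 0),
    (4, 3, 3, 3),
    (5, 4, 4, 4),
    (6, 5, 4, 4),
    (7, 4, 5, 4),
]


def build_page_map(rows):
    """Build a <page> element with one <table> child per map row."""
    page = ET.Element("page")
    for table_no, x, y, inclusion in rows:
        ET.SubElement(page, "table", {
            "no": str(table_no),
            "x": str(x),
            "y": str(y),
            "inclusion": str(inclusion),
        })
    return page


page = build_page_map(ROWS)

# Path expression with an attribute predicate: all tables whose
# inclusion value is 4, i.e. candidates for the same TOC level.
same_level = [t.get("no") for t in page.findall("./table[@inclusion='4']")]

# Serialized form that a separate XSLT processor could take as input.
xml_text = ET.tostring(page, encoding="unicode")
```

A full XSLT stylesheet would express the same selection as a template match and could then emit HTML or another XML vocabulary for the target device.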
This representation of information will probably take a<br />
form where the content of each page is represented in a