RR_03_02

Table No   Position on x axis   Position on y axis   Inclusion   Proximity   Area
1          1                    1                    0           2, 3        ?
2          1                    2                    0           1, 3, 20    ?
3          2                    2                    0           1, 2, 20    ?
4          3                    3                    3           8, 9        ?
5          4                    4                    4           6, 7        ?
6          5                    4                    4           5, 7        ?
7          4                    5                    4           4, 5, 6     ?
---        ---                  ---                  ---         ---         ---
---        ---                  ---                  ---         ---         ---

• Level in the Table of Contents (TOC): determines at which level the
  label representing each table should be displayed.
    o Labels coming from tables that have the same inclusion
      criterion should be at the same level.
• Merging criteria (when tables can safely be merged):
    o Only tables that have the same inclusion criterion can be
      merged, provided that the merge happens:
        • only at the lowest levels in the hierarchy
        • only when the tables share identical sides
        • between either left/right or top/bottom neighbors
        • but not if a border exists
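The merging criteria above can be sketched as a pairwise check. This is only an illustrative sketch, not the authors' implementation: the `Box` class, its pixel-rectangle fields, and the `is_leaf` and `has_border` flags are hypothetical names, and it assumes that the inclusion value is the number of the enclosing table and that a border on either table blocks the merge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical rendered table: a rectangle plus the attributes the criteria mention."""
    table_no: int
    left: int
    top: int
    right: int
    bottom: int
    inclusion: int     # assumed: number of the enclosing table, 0 for none
    is_leaf: bool      # assumed: True when the table holds no nested tables
    has_border: bool   # assumed: True when a visible border is drawn


def shares_identical_side(a: Box, b: Box) -> bool:
    """True if a and b touch along one full side of equal length."""
    side_by_side = (a.top, a.bottom) == (b.top, b.bottom) and (
        a.right == b.left or b.right == a.left
    )
    stacked = (a.left, a.right) == (b.left, b.right) and (
        a.bottom == b.top or b.bottom == a.top
    )
    return side_by_side or stacked


def can_merge(a: Box, b: Box) -> bool:
    """Apply the merging criteria listed above to one pair of tables."""
    return (
        a.inclusion == b.inclusion               # same inclusion criterion
        and a.is_leaf and b.is_leaf              # lowest level of the hierarchy only
        and shares_identical_side(a, b)          # left/right or top/bottom neighbors
        and not (a.has_border or b.has_border)   # a border blocks the merge
    )


# Hypothetical pixel rectangles for Tables 5, 6 and 7 (not taken from the paper).
t5 = Box(5, 300, 200, 400, 260, inclusion=4, is_leaf=True, has_border=False)
t6 = Box(6, 400, 200, 500, 260, inclusion=4, is_leaf=True, has_border=False)
t7 = Box(7, 300, 260, 400, 320, inclusion=4, is_leaf=True, has_border=False)

print(can_merge(t5, t6))  # left/right neighbors sharing a full side
print(can_merge(t5, t7))  # top/bottom neighbors sharing a full side
print(can_merge(t6, t7))  # these two touch only at a corner
```

Comparing whole sides (rather than any overlap) is one reading of "identical sides"; a looser test would need a tolerance for rounding in the rendered coordinates.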

This algorithm provides a map as shown in Table 1, which gives a
partial analysis of the example web page. By exploiting accurate
layout information about the web page, this map can decide how the
content is to be displayed and in which order, based on the best-guess
scenario. In the absence of such information, the content would have
been displayed based on other criteria, such as semantic relationship
[8], the absolute size of the tables in terms of the number of words,
and others. This algorithm can also help us pick the most important
part of the content based on geometric position. For example, Tables 4
and 8 are strategically placed in the upper middle part of the page, so
chances are that they are the centerpieces of the page. In addition,
the map allows us to label other tables: for example, Table 2 is a
sidebar, Table 20 is the lower bar, and so on.
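To make the map concrete, the rows of Table 1 can be held as records and queried. The sketch below is an assumption-laden illustration rather than the algorithm itself: `MapRow`, `toc_level`, `labels_by_level`, and `display_order` are hypothetical names, the inclusion value is read as the number of the enclosing table (0 meaning none), and a top-to-bottom, left-to-right reading order is assumed for display.

```python
from __future__ import annotations

from collections import defaultdict
from typing import NamedTuple


class MapRow(NamedTuple):
    table_no: int
    x: int
    y: int
    inclusion: int               # assumed: number of the enclosing table, 0 for none
    proximity: tuple[int, ...]   # numbers of the neighboring tables


# Rows copied from Table 1 (the Area column is unknown there and is omitted).
ROWS = [
    MapRow(1, 1, 1, 0, (2, 3)),
    MapRow(2, 1, 2, 0, (1, 3, 20)),
    MapRow(3, 2, 2, 0, (1, 2, 20)),
    MapRow(4, 3, 3, 3, (8, 9)),
    MapRow(5, 4, 4, 4, (6, 7)),
    MapRow(6, 5, 4, 4, (5, 7)),
    MapRow(7, 4, 5, 4, (4, 5, 6)),
]


def toc_level(table_no: int, by_no: dict[int, MapRow]) -> int:
    """Count how many enclosing tables lie above this one."""
    level = 0
    parent = by_no[table_no].inclusion
    while parent != 0:
        level += 1
        if parent not in by_no:   # the map is partial, so stop at unknown parents
            break
        parent = by_no[parent].inclusion
    return level


def labels_by_level(rows: list[MapRow]) -> dict[int, list[int]]:
    """Group table labels so that equal inclusion values land on the same level."""
    by_no = {row.table_no: row for row in rows}
    levels: dict[int, list[int]] = defaultdict(list)
    for row in rows:
        levels[toc_level(row.table_no, by_no)].append(row.table_no)
    return dict(levels)


def display_order(rows: list[MapRow]) -> list[int]:
    """Assumed reading order: top to bottom, then left to right."""
    return [row.table_no for row in sorted(rows, key=lambda r: (r.y, r.x))]


print(labels_by_level(ROWS))
print(display_order(ROWS))
```

Because two tables with the same inclusion value have the same enclosing table, they always receive the same level, which matches the TOC rule stated earlier.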

It is also interesting to see how the semantic relatedness factor [8]
helps in determining which tables should be merged. This is something
that we will explore in the future.

Table 1: A map based on Figure 1 


4. Extension of the Argument 

Estimating accurate layout information from HTML pages may be hard, but
it is not impossible. It requires writing a parser based on geometric
analysis of web pages. Compared with a traditional parser, such a
parser can deliver considerably richer information about the content
items, their relationships with each other, and their relative
importance within the context of the page.

The next question is how to represent the rich set of information
gathered by the geometry-based parser. This information will help in
determining the "sense" and meaning of the main "message", which in
turn will help in making the re-authoring more accurate. Recently, XML
has been used successfully in many applications to mark up important
information according to application-specific vocabularies. To
properly exploit the flexibility inherent in the power of XML, industry
associations, e-commerce consortia, and the W3C need to develop their
own vocabularies and to be able to transform information marked up in
XML from one vocabulary to another. Two W3C Recommendations, XSLT (the
Extensible Stylesheet Language Transformations) and XPath (the XML Path
Language), meet that need. They provide a powerful implementation of a
tree-oriented transformation language for transmuting instances of XML
using one vocabulary into either simple text, the legacy HTML
vocabulary, or XML instances using any other vocabulary imaginable. It
is probably better to use the XSLT language, which itself uses XPath,
to specify how an implementation of an XSLT processor is to create a
desired output from a given marked-up input.
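As a rough illustration of marking up the geometric map in XML and querying it with path expressions, the sketch below uses Python's standard `xml.etree.ElementTree` module. The `page`, `table`, and `near` element names and their attributes are a hypothetical vocabulary invented for this example, not one proposed in the text. ElementTree supports only a limited subset of XPath and performs no XSLT transformation, so it stands in here for a full XSLT processor.

```python
import xml.etree.ElementTree as ET

# Rows copied from Table 1: (table no, x, y, inclusion, proximity).
ROWS = [
    (1, 1, 1, 0, (2, 3)),
    (2, 1, 2, 0, (1, 3, 20)),
    (3, 2, 2, 0, (1, 2, 20)),
    (4, 3, 3, 3, (8, 9)),
    (5, 4, 4, 4, (6, 7)),
    (6, 5, 4, 4, (5, 7)),
    (7, 4, 5, 4, (4, 5, 6)),
]


def build_page_map(rows) -> ET.Element:
    """Mark up the map in a hypothetical page/table/near vocabulary."""
    page = ET.Element("page")
    for no, x, y, inclusion, proximity in rows:
        table = ET.SubElement(
            page,
            "table",
            {"no": str(no), "x": str(x), "y": str(y), "inclusion": str(inclusion)},
        )
        for neighbour in proximity:
            ET.SubElement(table, "near", {"ref": str(neighbour)})
    return page


def tables_included_in(page: ET.Element, parent_no: int) -> list:
    """Select tables by inclusion value with an attribute predicate."""
    path = f"./table[@inclusion='{parent_no}']"
    return [int(t.get("no")) for t in page.findall(path)]


def neighbours_of(page: ET.Element, table_no: int) -> list:
    """Follow a path from one table element to its proximity entries."""
    path = f"./table[@no='{table_no}']/near"
    return [int(n.get("ref")) for n in page.findall(path)]


page = build_page_map(ROWS)
print(ET.tostring(page, encoding="unicode"))
print(tables_included_in(page, 4))
print(neighbours_of(page, 7))
```

The same path expressions could serve as match or select patterns in an XSLT stylesheet that turns this markup into text or HTML.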

This representation of information will probably take a 

form where the content of each page is represented in a
