RR_03_02

Assuming Accurate Layout Information is Available: How do we Interpret the Content Flow in HTML Documents?

Hassan Alam and Fuad Rahman§

BCL Technologies Inc.

fuad@bcltechnologies.com

Abstract

Accessing multi-modal HTML documents, such as web pages, by devices of lower form factor is a growing practice. There are three problems with using a small handheld device for accessing web pages: the display area is small, the network speed is low, and the device capability is lower. Often the devices are unable to run IE or other browsers, and when they can, the browsers have lower levels of functionality. A direct consequence is that web pages written for desktop devices with HTML 4 with Cascading Style Sheets (CSS), image maps, and Flash multi-media plug-ins are often not viewable on these smaller devices. In order to solve this problem, researchers have been using techniques such as transcoding and re-authoring to re-format web pages and re-flow them to the small devices. Some of these techniques use secondary information about the HTML data structure and content association in terms of the semantics of the web pages, but none has access to accurate layout information before the page is rendered. Assuming we do have access to this type of information, we can take a whole new perspective on this web page re-authoring problem. This paper discusses some ideas about the possible use of accurate layout information while automatically re-authoring web pages.

1. Introduction

Web pages written in HTML do not provide adequate information about the final layout. This is only known when the page is rendered using one of the standard web browsers such as IE®, Netscape® or Opera®. This is because many of the elements in web pages are dynamic. For example, the width of a column may vary based on the browser window size. This makes calculating accurate layout information very difficult. But what if we did have access to this information? How do we use it to improve web page re-authoring? This paper discusses some ideas about how we can exploit this information to produce better re-flow of web pages targeted to small handheld devices.
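The column width example can be made concrete with a small sketch. The code below is only an illustration under simplifying assumptions: the function name `resolve_column_widths` and the three kinds of width specification (fixed pixels, a percentage of the window, or automatic) are hypothetical, and the model ignores content-driven sizing, borders, and padding that a real browser takes into account. It shows why the same markup yields different final column widths in a desktop window and on a handheld screen.

```python
def resolve_column_widths(specs, viewport_px):
    """Resolve column width specifications to pixels for one window width.

    Each spec is an int (fixed pixels), a string such as "25%"
    (share of the window width), or None (automatic).
    Simplified model: automatic columns split the leftover space equally.
    """
    widths = [None] * len(specs)
    used = 0
    for i, spec in enumerate(specs):
        if isinstance(spec, int):
            widths[i] = spec
        elif isinstance(spec, str) and spec.endswith("%"):
            widths[i] = viewport_px * float(spec[:-1]) / 100.0
        if widths[i] is not None:
            used += widths[i]

    # Columns without an explicit width share whatever space remains.
    auto = [i for i, w in enumerate(widths) if w is None]
    remaining = max(viewport_px - used, 0)
    for i in auto:
        widths[i] = remaining / len(auto)
    return widths


columns = [150, "25%", None]
print(resolve_column_widths(columns, 1024))  # [150, 256.0, 618.0]
print(resolve_column_widths(columns, 240))   # [150, 60.0, 30.0]
```

On the 240-pixel window the automatic column shrinks to 30 pixels, which hints at why a page designed for the desktop may become unreadable on a small device even when the browser can render it.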

2. Related Work 

There is a growing demand for viewing web pages on small screen devices. Mobile viewing allows users to keep in touch with the rest of the world while on the move, so web page re-authoring can be of great interest. Another motivation is to summarize web content to help in rapid viewing, as time is a very important commodity and the amount of information available these days makes it impossible to browse through entire web sites [1]. Often there is a demand for alternative browsing, such as the use of voice [2, 3]. Most email browsing software now supports HTML pages, which makes the problem of universal email accessibility a problem of web page manipulation. Last, but not least, web page manipulation is often required in order to extract content and transfer it to other formats, such as PDF. In this scenario, web page manipulation supported with document analysis techniques is the best choice.

Before we start, here is a very brief outline of past work. Over the years, researchers have proposed different solutions to the problem of web page re-authoring. The possible solutions can be categorized as:

• By Hand: Manipulate web sites by hand.

• Transcoding: Automatically replace tags with device- and target-specific tags. In this case, the content flow is identical to the flow in the HTML data structure [4].

• Re-authoring: Automatically capture the structure and intelligently re-flow content in a flow different from that of the original document. There are many variations of this approach:

    • Table of Contents (TOC): The document is re-created based on extracted content and a heading is created. This heading hides a hyperlink which, if selected, loads the details associated with the headline [5].
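The difference between transcoding and TOC-style re-authoring can be sketched in a few lines. The code below is a minimal illustration and not the method of any of the cited systems: the tag mapping `TAG_MAP`, the class names, and the `#secN` link targets are all hypothetical. The transcoder swaps tags while keeping the content in document order; the TOC builder changes the flow by pulling headings into a list of links and storing the associated details separately.

```python
from html.parser import HTMLParser

# Hypothetical mapping from desktop HTML tags to a reduced target tag set.
# Tags not listed here are dropped, but their text content is kept.
TAG_MAP = {"h1": "b", "h2": "b", "p": "p"}


class Transcoder(HTMLParser):
    """Replace tags one for one; the content order is left unchanged."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in TAG_MAP:
            self.out.append(f"<{TAG_MAP[tag]}>")  # attributes are discarded

    def handle_endtag(self, tag):
        if tag in TAG_MAP:
            self.out.append(f"</{TAG_MAP[tag]}>")

    def handle_data(self, data):
        self.out.append(data)


class SectionCollector(HTMLParser):
    """Group the text that follows each heading under that heading."""

    HEADINGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.sections = []  # each entry is [heading_text, body_text]
        self.in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.sections.append(["", ""])
            self.in_heading = True

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self.in_heading = False

    def handle_data(self, data):
        if not self.sections:
            return  # text before the first heading is ignored in this sketch
        self.sections[-1][0 if self.in_heading else 1] += data


def transcode(html):
    parser = Transcoder()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


def build_toc(html):
    parser = SectionCollector()
    parser.feed(html)
    parser.close()
    toc = [f'<a href="#sec{i}">{head.strip()}</a>'
           for i, (head, _) in enumerate(parser.sections)]
    details = {f"sec{i}": body.strip()
               for i, (_, body) in enumerate(parser.sections)}
    return toc, details


page = "<h1>News</h1><p>Rain today.</p><h2>Sport</h2><p>Home win.</p>"
print(transcode(page))
toc, details = build_toc(page)
print(toc)
print(details)
```

Neither sketch uses any layout information: both work purely from the HTML data structure, which is the limitation of existing techniques that this paper sets out to address.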

§ Corresponding author. The authors gratefully acknowledge the support of a Small Business Innovation Research (SBIR) award under the supervision of US Army Communications-Electronics Command (CECOM).
