29.05.2013 Views

RR_03_02


Assuming Accurate Layout Information is Available: How do we Interpret the Content Flow in HTML Documents?

Hassan Alam and Fuad Rahman§
BCL Technologies Inc.
fuad@bcltechnologies.com

Abstract

Accessing multi-modal HTML documents, such as web pages, from devices with a smaller form factor is a growing practice. Using a small handheld device to access web pages raises three problems: the display area is small, the network speed is low, and the device capability is lower. Often these devices are unable to run IE or other browsers, and when they can, the browsers have lower levels of functionality. A direct consequence is that web pages written for desktop devices in HTML 4 with Cascading Style Sheets (CSS), image maps, and Flash multimedia plug-ins are often not viewable on these smaller devices. To solve this problem, researchers have been using techniques such as transcoding and re-authoring to re-format web pages and re-flow them to the small devices. Some of these techniques use secondary information about the HTML data structure and content association in terms of the semantics of the web pages, but none has access to accurate layout information before the page is rendered. Assuming we do have access to this type of information, we can take a whole new perspective on the web page re-authoring problem. This paper discusses some ideas about the possible use of accurate layout information while automatically re-authoring web pages.

1. Introduction

Web pages written in HTML do not provide adequate information about the final layout. The layout is known only when the page is rendered by one of the standard web browsers such as IE®, Netscape® or Opera®, because many of the elements in a web page are dynamic. For example, the width of a column may vary with the browser window size. This makes calculating accurate layout information very difficult. But what if we did have access to this information? How could we use it to improve web page re-authoring? This paper discusses some ideas about how we can exploit this information to produce a better re-flow of web pages targeted to small handheld devices.
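To make the column-width example concrete, here is a minimal sketch in Python, not taken from the paper, of how the same width specifications resolve to different pixel widths at different window sizes. The function name `resolve_column_widths`, the spec format, and the sample window widths are all illustrative assumptions; real browsers apply far more elaborate table-layout rules.

```python
def resolve_column_widths(specs, viewport_px):
    """Resolve column width specs against a given window width.

    specs: list of widths, each either an int (fixed pixels) or a
    string such as "25%" (a share of the window width). This format
    is a simplification assumed for the sketch.
    """
    widths = []
    for spec in specs:
        if isinstance(spec, str) and spec.endswith("%"):
            # Percentage columns scale with the window.
            widths.append(viewport_px * int(spec[:-1]) // 100)
        else:
            # Fixed columns keep their size regardless of the window.
            widths.append(int(spec))
    return widths


columns = ["25%", 200, "50%"]  # hypothetical three-column page

desktop = resolve_column_widths(columns, 1024)
handheld = resolve_column_widths(columns, 240)

print(desktop)   # [256, 200, 512]
print(handheld)  # [60, 200, 120]
print(sum(handheld) > 240)  # True: the columns overflow the small window
```

On the narrow window the fixed 200-pixel column takes most of the screen and the columns together overflow it, which is why the layout cannot be read off the HTML source without knowing the rendering context.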

2. Related Work

There is a growing demand for viewing web pages on small-screen devices. Mobile viewing allows users to keep in touch with the rest of the world while on the move, so web page re-authoring is of great interest. Another motivation is to summarize web content for rapid viewing: time is a very important commodity, and the amount of information now available makes it impossible to browse through entire web sites [1]. There is often a demand for alternative modes of browsing, such as voice [2, 3]. Most email software now supports HTML pages, which makes universal email accessibility a problem of web page manipulation. Last but not least, web page manipulation is often required in order to extract content and transfer it to other formats, such as PDF. In this scenario, web page manipulation supported by document analysis techniques is the best choice.

Before we start, here is a very brief outline of past work. Over the years, researchers have proposed different solutions to the problem of web page re-authoring. The possible solutions can be categorized as:

• By Hand: Manipulate web sites by hand.
• Transcoding: Automatically replace tags with device- and target-specific tags. In this case, the content flow is identical to the flow in the HTML data structure [4].
• Re-authoring: Automatically capture the structure and intelligently re-flow the content in a flow different from that of the original document. There are many variations of this approach:
    • Table of Content (TOC): The document is re-created based on extracted content, and a heading is created. This heading hides a hyperlink which, if selected, loads the details associated with the headline [5].
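The difference between transcoding and TOC-style re-authoring in the list above can be sketched in a few lines of Python using the standard `html.parser` module. This is an illustration under stated assumptions, not the method of any cited system: the tag mapping `TAG_MAP`, the class names `Transcoder` and `TocBuilder`, and the `#s0`-style anchors are all hypothetical. The transcoder swaps tags one for one and keeps the source order, while the TOC builder regroups the content under its headings.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical tag mapping for a small-screen target; not from the paper.
TAG_MAP = {"h1": "b", "h2": "b", "table": "div", "tr": "div", "td": "span"}
HEADINGS = ("h1", "h2", "h3")


class Transcoder(HTMLParser):
    """Rewrites tags one for one, so content order matches the source."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Attributes are dropped to keep the sketch short.
        self.out.append(f"<{TAG_MAP.get(tag, tag)}>")

    def handle_endtag(self, tag):
        self.out.append(f"</{TAG_MAP.get(tag, tag)}>")

    def handle_data(self, data):
        self.out.append(escape(data))


class TocBuilder(HTMLParser):
    """Splits a page into [heading, body] sections for a TOC-style view."""

    def __init__(self):
        super().__init__()
        self.sections = []
        self.in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self.sections.append(["", ""])
            self.in_heading = True

    def handle_endtag(self, tag):
        if tag in HEADINGS:
            self.in_heading = False

    def handle_data(self, data):
        if not self.sections:
            return  # text before the first heading is ignored in this sketch
        slot = 0 if self.in_heading else 1
        self.sections[-1][slot] += data


def transcode(html):
    parser = Transcoder()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


def build_toc(html):
    """Return (heading links, details keyed by anchor name)."""
    parser = TocBuilder()
    parser.feed(html)
    parser.close()
    links = [
        f'<a href="#s{i}">{escape(head.strip())}</a>'
        for i, (head, _) in enumerate(parser.sections)
    ]
    details = {f"s{i}": body.strip() for i, (_, body) in enumerate(parser.sections)}
    return links, details


page = "<h1>News</h1><p>Story text.</p><h2>Sports</h2><p>Scores.</p>"
print(transcode(page))
links, details = build_toc(page)
print(links)
print(details["s1"])
```

For the sample page, `transcode` returns the same content in the same order with only the heading tags replaced, whereas `build_toc` yields a short list of heading links that a small screen can show first, with each section's details held back until its link is selected.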

§ Corresponding author. The authors gratefully acknowledge the support of a Small Business Innovation Research (SBIR) award under the supervision of the US Army Communications-Electronics Command (CECOM).
