RR_03_02
Assuming Accurate Layout Information is Available: How do we Interpret the<br />
Content Flow in HTML Documents?<br />
Abstract<br />
Accessing multi-modal HTML documents, such as web<br />
pages, from devices of smaller form factor is a growing<br />
practice. There are three problems with using a small<br />
handheld device for accessing web pages: the display area<br />
is small, the network speed is low, and the device<br />
capability is lower. Often the devices are unable to run IE<br />
or other browsers, and when they can, the browsers have<br />
lower levels of functionality. A direct consequence is that<br />
web pages written for desktop devices with HTML 4 with<br />
Cascading Style Sheets (CSS), image maps, and Flash<br />
multi-media plug-ins are often not viewable on these<br />
smaller devices. In order to solve this problem,<br />
researchers have been using techniques such as<br />
transcoding and re-authoring to re-format web pages and<br />
re-flow them to the small devices. Some of these<br />
techniques use secondary information about the HTML<br />
data structure and content association in terms of<br />
semantics of the web pages, but none has access to<br />
accurate layout information before the page is rendered.<br />
Assuming we do have access to this type of information,<br />
we can take a whole new perspective on this web page<br />
re-authoring problem. This paper discusses some ideas about<br />
the possible use of accurate layout information while<br />
automatically re-authoring web pages.<br />
1. Introduction<br />
Web pages written in HTML do not provide adequate<br />
information about the final layout. This is only known<br />
when the page is rendered using one of the standard web<br />
page browsers such as IE®, Netscape® or Opera®. This is<br />
because many of the elements in web pages are dynamic.<br />
For example, the width of a column may vary based on the<br />
browser window size. This makes calculating accurate<br />
layout information very difficult. But what if we did have<br />
access to this information? How do we use this to improve<br />
web page re-authoring? This paper discusses some ideas<br />
about how we can exploit this information to produce<br />
better re-flow of web pages targeted to small handheld<br />
devices.<br />
Hassan Alam and Fuad Rahman§<br />
BCL Technologies Inc.<br />
fuad@bcltechnologies.com<br />
2. Related Work<br />
There is a growing demand for viewing web pages on<br />
small screen devices. Mobile viewing allows keeping in<br />
touch with the rest of the world while on the move. So<br />
web page re-authoring can be of great interest. Another<br />
motivation is to summarize web content to help in rapid<br />
viewing, as time is a very important commodity and the<br />
amount of information available these days makes it<br />
impossible to browse through entire web sites [1]. Often<br />
there is a demand for alternative browsing, such as the use<br />
of voice [2, 3]. Most email browsing software now supports<br />
HTML pages, which makes the problem of universal email<br />
accessibility a problem of web page manipulation. Last<br />
but not least, web page manipulation is often required<br />
in order to extract content and transfer it to other formats,<br />
such as PDF and others. In this scenario, web page<br />
manipulation supported with document analysis<br />
techniques is the best choice.<br />
Before we start, here is a very brief outline of past<br />
work. Over the years, researchers have proposed different<br />
solutions to the problem of web page re-authoring. The<br />
possible solutions can be categorized as:<br />
• By Hand: Manipulate web sites by hand.<br />
• Transcoding: Automatically replace tags with<br />
device and target specific tags. In this case, the<br />
content flow is identical to the flow in the HTML<br />
data structure [4].<br />
• Re-authoring: Automatically capture the structure<br />
and intelligently re-flow content in a different flow<br />
other than in the original document. There are<br />
many variations of this approach:<br />
• Table of Content (TOC): the document is<br />
re-created based on extracted content and<br />
a heading is created. This heading hides a<br />
hyperlink, which if selected, can load up<br />
details associated with the headline [5].<br />
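The difference between transcoding and TOC-style re-authoring described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of the cited systems: the tag mapping `TAG_MAP`, the choice of `h1` to `h3` as section boundaries, the `#s0`-style link targets, and the function names `transcode` and `build_toc` are all hypothetical. The first function swaps tags one-for-one and leaves the content flow unchanged; the second discards the original flow and keeps only headings as links to separately stored details.

```python
from html.parser import HTMLParser

# Hypothetical mapping from desktop tags to tags a small device can render.
TAG_MAP = {"table": "div", "tr": "div", "td": "p", "font": "span"}


class Transcoder(HTMLParser):
    """Transcoding: replace tags one-for-one, keep the original content order."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Attributes are dropped to keep the sketch short.
        self.out.append(f"<{TAG_MAP.get(tag, tag)}>")

    def handle_endtag(self, tag):
        self.out.append(f"</{TAG_MAP.get(tag, tag)}>")

    def handle_data(self, data):
        self.out.append(data)


def transcode(html):
    parser = Transcoder()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


class SectionCollector(HTMLParser):
    """Collect [heading, body] pairs, starting a new section at each heading."""

    HEADINGS = {"h1", "h2", "h3"}  # assumed section boundaries

    def __init__(self):
        super().__init__()
        self.sections = []
        self.in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.sections.append(["", ""])
            self.in_heading = True

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self.in_heading = False

    def handle_data(self, data):
        if not self.sections:
            return  # text before the first heading is ignored in this sketch
        index = 0 if self.in_heading else 1
        self.sections[-1][index] += data


def build_toc(html):
    """TOC re-authoring: headings become links; details are stored separately."""
    collector = SectionCollector()
    collector.feed(html)
    collector.close()
    toc = [f'<a href="#s{i}">{head.strip()}</a>'
           for i, (head, _) in enumerate(collector.sections)]
    details = {f"s{i}": body.strip()
               for i, (_, body) in enumerate(collector.sections)}
    return toc, details
```

For a page such as `<h2>News</h2><p>Story one.</p>`, `build_toc` returns a one-entry link list plus a dictionary holding the body text, so a small device can show the headings first and fetch the details only when a link is selected.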
§ Corresponding author. The authors gratefully acknowledge the support of a Small Business Innovation Research (SBIR) award under the supervision<br />
of US Army Communications-Electronics Command (CECOM).<br />