
D3.1 Deliverable
Description of the state-of-the-art

CANTATA

Project number: ITEA05010
Document version no.: 1.0
Status: Final
Edited by: Dominique Segers, Barco, Belgium
Thursday, 26 April 2007

ITEA Roadmap domains:
Major: Services & Software creation
Minor: Cyber Enterprise

ITEA Roadmap technology categories:
Major: Content
Minor: Data and content management


History:

Version  Date        Remarks
v0.10    8/11/2006   Initial document start by Dominique Segers, Barco
v0.11    21/11/2006  First compilation by Dominique Segers, Barco
v0.12    21/11/2006  Second compilation by Dominique Segers, Barco
v0.13    22/11/2006  Edit after input from CodaSystem
v0.14    15/12/2006  New structure
v0.15    19/12/2006  New structure and sections by Juana Sánchez, Telefónica
v0.16    20/12/2006  Edit after new input from Telefónica
v0.17    12/01/2007  Edit after new input from CodaSystem
v0.18    18/01/2007  Edit after input from Solid
v0.19    06/02/2007  Edit after review from Egbert, LogicaCMG
v1.0     26/04/2007  Final approval by the PMT

Contributors:
Dominique Segers, Barco
Ismael Fuentes, I&IMS
Juana Sánchez Pérez, Telefónica
John de Vet, iLab
Jorma Palo, Solid
Johannes Peltola, VTT
Raoul Djeutane, CodaSystem
Gorka Marcos Ortego, VicomTech
Nicolas Damien, Centre Henri Tudor

This document will be treated as strictly confidential. It will only be public to those who have signed the ITEA Declaration of Non-Disclosure.


TABLE OF CONTENTS

1 Introduction
1.1 The Aim of the activity
1.2 Potential Partners contributions
2 State-of-the-art User Interfaces of applications and services
2.1 State-of-the-art UI of applications on mobile phones
2.1.1 Introduction
2.1.2 Video applications on mobile phones
2.1.2.1 VideoImpression - Mobile Edition [53]
2.1.3 Photo applications on mobile phones
2.1.3.1 PhotoBase Deluxe - Mobile Edition [54]
2.1.4 Video Surveillance over IP
2.1.4.1 IRIS [55]
2.1.4.2 The 3rdi Security System [56]
2.1.4.3 D-Link DCS-2120 Wireless Internet Camera with 3G Mobile Video Support [57]
2.1.4.4 NIOO VISIO [58]
2.1.5 Video Surveillance over IP with content analysis on server
2.1.5.1 Visio Wave [59]
2.1.5.2 3rdeye - Video Surveillance on Your Mobile [60]
2.1.6 Interactive composition and scene mixing
2.2 UI of services for IP-enabled TV and Set-Top Boxes
2.2.1 On-line services for IP-enabled TV and Set-Top Boxes
2.2.2 Flash-based content adaptation in Set-Top Boxes
3 State-of-the-art Compression Algorithms
3.1 Motion JPEG-2000 and Wireless (Part 11) JPEG-2000
3.1.1 Introduction
3.1.2 Scope and Features of Motion JPEG-2000
3.1.3 Scope and Features of Wireless JPEG-2000
3.1.4 Video Coding with Motion Compensated Prediction
3.2 Codification technologies
3.2.1 Introduction
3.2.2 MPEG-1 and MPEG-2
3.2.3 MPEG4
3.2.3.1 MPEG-4 architecture
3.2.3.2 CODECS (MPEG-4 Visual and MPEG-4 Audio)
3.2.3.3 MPEG-4 Systems (BIFS)
3.2.3.4 MPEG-4 Part 20 (LASeR and SAF [44])
3.3 Additional formats for most power devices (future)
3.3.1 VC1 [21]
3.3.2 Device-oriented screens
3.4 Analysis of state-of-the-art image compression algorithms for medical applications
3.4.1 Still image compression such as JPEG, JPEG-LS and JPEG-2000
3.4.2 Intra-frame image compression such as MJPEG-2000
3.4.3 Inter-frame image compression such as MPEG-4 AVC
4 User Interface Adaptation
4.1 Introduction
4.2 MPEG-4 Advanced Content visualization technologies
4.2.1 Software BIFS reproducers
4.2.2 GPAC: Osmo4
4.2.3 IBM: M4Play
4.2.4 Envivio TV
4.2.5 Bitmanagement: BS Contact MPEG-4
4.2.6 Octaga Professional
4.2.7 Digimax: MAXPEG Player
4.2.8 COSMOS
4.3 UI adaptation based on XML
4.3.1 UI adaptation based on XML transformation
4.3.2 Adaptation via XML publishing servers
4.3.3 Adaptation based on the definition & identification of the device
4.3.3.1 Composite Capabilities / Preference Profiles
4.3.3.2 UAPROF (OMA)
4.3.3.3 Device Description Repository
4.3.4 XML based UI adaptation
4.3.4.1 UIML User Interface Meta Language
4.3.4.2 AUIML
4.3.4.3 XIML (eXtensible Interface Markup Language)
4.3.4.4 XUL
4.3.4.5 TERESA XML
4.3.4.6 USIXML
4.3.4.7 AAIML [43]
4.3.4.8 XForms and RIML
4.3.4.9 MPEG-21
4.4 Device ontology
4.5 Agent-based user interface adaptation
5 State-of-the-art system architecture
5.1 DLNA
5.2 mTag
5.3 Content retrieval and device management
6 References


1 Introduction

1.1 The Aim of the activity

The activity preparing the production of Deliverable 3.1 encompasses all the topics addressed in WP3:

• Topic 3.1 Device-oriented UI adaptation.
• Topic 3.2 User-oriented UI adaptation.
• Topic 3.3 Content-oriented UI adaptation.
• Topic 3.4 Presentation and interaction with users.

It should also map onto the different domains that are targeted within the CANTATA project:

• Multimedia consumer.
• Medical Imagery.
• Surveillance.

Deliverable D3.1 thus aims at establishing a state-of-the-art analysis of the WP technologies that is as complete as possible.

1.2 Potential Partners' contributions:

• Barco.
• I&IMS.
• Telefonica.
• iLab.
• Solid.
• VTT.
• CodaSystem.
• VicomTech.
• Centre Henri Tudor.

Remark from VTT:
Since VTT is still an unfunded partner and WP2 management work takes all spare resources, VTT cannot promise to participate in WP3 until it has received funding. VTT may receive funding early in 2007 if all goes well.


2 State-of-the-art User Interfaces of applications and services

2.1 State-of-the-art UI of applications on mobile phones

2.1.1 Introduction

This section describes the state of the art of UIs of video applications on mobile phones. It presents some existing video applications that run on mobile phones and describes their principal functionality.

2.1.2 Video applications on mobile phones

The applications below give an overview of what is currently done concerning video applications on mobile phones.

2.1.2.1 VideoImpression - Mobile Edition [53]

VideoImpression is a solution developed by ArcSoft. This application allows users to create and share custom mini-movies featuring their own videos, photos and slide shows, with custom animated titles, credit screens, soundtracks and scene transitions.


These are the principal functionalities:

• Capture video on your mobile device.
• Play back video you record, download, or receive from friends.
• Trim video clips.
• Combine multiple clips together.
• Add transition effects between clips.
• Add titles and credits.
• Share your movies via infrared, Bluetooth, email, or MMS.
• File format support: ASF, 3GP, MP4 for video; PCM, ADPCM, MP3, and AMR for audio.
• Video codec support: H.263, MPEG-4.


2.1.3 Photo applications on mobile phones

2.1.3.1 PhotoBase Deluxe - Mobile Edition [54]

PhotoBase is another application developed by ArcSoft. These are the key features of this application:

ArcSoft Panorama Maker
Designed specifically for low-profile devices, your customers can capture multiple photos and have them automatically stitched together.

Auto Red-eye Removal
Give your customers this quick fix that instantly and automatically removes pesky red-eye.

Still Image Capture
When using a camera phone, it is important to have an intuitive application that allows users to capture stunning pictures. The Still Image Capture component offers several quality enhancement options for your images on the device. Components include:

• White Balance (hardware solution).
• Brightness and Contrast.
• Digital Zoom.
• JPEG Encoding.

Edit and Enhancement
A variety of editing and enhancement functions are provided, such as red-eye removal, crop and rotate. Users can edit their photos before they store or share them.

Media Management and Sharing
With PhotoBase Deluxe, your customers can manage their photos when they are on the go. This application provides a complete solution, allowing your customers to sort, album, display, and label their images. Instantly create a slide show with cool transition effects and sound. Users can share their images through Bluetooth, MMS, infrared and email.

Fun Features
PhotoBase Deluxe provides a variety of fun features and content. The Panorama Maker feature provides instant photo stitching capabilities to your mobile device. Add clip art, fun frames, and text to any image. Download more content for special holidays and occasions.


2.1.4 Video Surveillance over IP

2.1.4.1 IRIS [55]

IRIS cameras are able to transmit live or recorded video to your mobile phone over a standard mobile phone network. When you want to look at what's going on, just use the IRIS viewing software on your mobile phone to connect to your camera via the IRIS Control Centre.

IRIS cameras can also detect, through their sensors, when an intruder has entered your home. When an alarm is triggered on your camera, the IRIS Control Centre sends you a text message alert. You can then look at a recording of the event that set the camera off or see what's happening now.

2.1.4.2 The 3rdi Security System [56]

3rdi cameras can detect when an intruder has entered your home using infrared and motion sensors. When an alarm is triggered on your camera, the 3rdi control centre sends you a text message alert. You can then look at a recording of the event that triggered the camera or see what's happening now. Even if your phone is switched off when the alert is sent to you, video of the event is stored at the 3rdi control centre for up to 30 days, so you can look at it when it's most convenient for you.

You can also see what's happening at the camera location by simply accessing it via your mobile phone.

2.1.4.3 D-Link DCS-2120 Wireless Internet Camera with 3G Mobile Video Support [57]

The DCS-2120 is a wireless Internet security camera developed by D-Link which allows a place to be watched over and observed remotely. It can connect to your network through a Fast Ethernet port. This camera can also send alert messages (e-mails) if it detects suspicious movement.

Here are the specifications of this camera.

3G mobile video from your phone and more
The DCS-2120 offers both consumers and small businesses a flexible and convenient way to remotely monitor a home or office in real time from anywhere within a mobile phone's 3G service area. When used in conjunction with the email alert system, mobile users can now view a camera feed without a notebook PC and wireless hotspot. This live video feed can then be accessed through 3G cellular networks by compatible cell phones*.

In addition to cellular phone monitoring, the 3GPP/ISMA video format also enables streaming playback on a computer. The camera is also viewable from any Internet Streaming Media Alliance (ISMA) compatible device and offers support for RealPlayer® 10.5 and QuickTime® 6.5 viewing. The DCS-2120 supports resolutions up to 640x480 at up to 30 fps, depending on the selected compression rate.


Convenient management options
D-Link's IP surveillance camera management software is included to enhance the functionality of the DCS-2120. Manage and monitor up to sixteen compatible cameras simultaneously with this program. IP surveillance can be used to archive video straight to a hard drive or network-attached storage devices, play back video, and set up motion detection to trigger video/audio recording or send e-mail alerts. Alternatively, it is possible to access and control the DCS-2120 via the web using Internet Explorer. As you watch remote video obtained by the DCS-2120, it is possible to take snapshots directly from the web browser to a local hard drive, making it ideal for capturing any moment no matter where you are.

[Figure: diagram of this system]


2.1.4.4 NIOO VISIO [58]

NIOO VISIO is a solution developed by Neion Graphics which enables remote visualization without any constraint. This application allows the user to connect to one of many cameras, zoom, and remotely capture photographs, from a PDA or smartphone. It allows you, for example, to see what is happening at your home when you are not present.

Conclusion:
From the example applications above, we conclude that the video applications which exist today allow videos and pictures to be created, generated and managed. However, these applications cannot manage the media or take any action that depends directly on the content of the media.

2.1.5 Video Surveillance over IP with content analysis on server

Other solutions exist that are based on video content analysis and generate an action depending on what happens in front of a camera. One example is a solution developed by Visio Wave.


2.1.5.1 Visio Wave [59]

Visio Wave developed a solution for video content analysis based on the scheme below.

Many cameras are connected to a server that analyses the video. When there is a problem, an alert is generated and sent to a PDA, PC or some other device, and the end user who has the device can connect directly to the remote camera and see what is happening at that moment.


2.1.5.2 3rdeye - Video Surveillance on Your Mobile [60]

3rdeye is a video surveillance system for the mobile phone developed by the Romanian company Cratima. With the help of a mobile phone and of the 3rdeye system, you can view live images from any location watched by a video camera. This location can be your own home, office, vacation home, store or even a parking space. The quality of the images is high, thanks to the GPRS transmission mode.

3rdeye's Architecture
3rdeye consists of two applications:

• The video server (to which the monitoring video cameras are connected).
• The client application, which runs on the user's mobile phone.

The server application is itself divided into two components: the Video Grabbing Server, which receives the images straight from the video cameras and sends them to the Video Streaming Server, and the Video Streaming Server, which is responsible for properly sending the received images on to the client application on the mobile phone.

A two-way exchange of information takes place between the two basic software modules of the Video Surveillance Server. The Video Grabbing Server grabs video images from the video cameras and sends them, in digital format, to the Video Streaming Server (in order to prepare the video streams for the clients), while the Video Streaming Server sends back to the Video Grabbing Server the commands and control information received from the client application.

The client application can be configured to connect to Video Surveillance Servers that have either a fixed IP address or a dynamically allocated one (e.g. dial-up). When the server has a fixed IP address, the client application connects straight to the Video Surveillance Server.

When the server's IP address is dynamically allocated (a different IP address from one connection to another), the client application first interrogates the Fixed IP Address Server, which is permanently connected to the Internet, in order to obtain the IP address of the Video Surveillance Server to which it is about to connect.
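Purely as an illustration of this rendezvous step, the sketch below shows how a client could resolve a dynamically allocated server address through a fixed-address lookup server; all host names, ports, identifiers and the one-line lookup protocol are hypothetical, not Cratima's actual interface.

```python
# Hypothetical sketch of the rendezvous step: ask a fixed-address server for the
# current IP of the Video Surveillance Server, then connect to it directly.
import socket

FIXED_ADDRESS_SERVER = ("rendezvous.example.net", 5000)   # hypothetical host/port
CAMERA_SITE_ID = "home-01"                                # hypothetical site identifier

def resolve_surveillance_server():
    """Query the Fixed IP Address Server for the surveillance server's address."""
    with socket.create_connection(FIXED_ADDRESS_SERVER, timeout=5) as s:
        s.sendall(f"LOOKUP {CAMERA_SITE_ID}\n".encode())
        host, port = s.recv(256).decode().strip().split(":")  # e.g. "203.0.113.7:6000"
        return host, int(port)

def connect_to_surveillance_server():
    """From here on the client talks to the Video Surveillance Server exactly as it
    would if the server had a fixed IP address."""
    host, port = resolve_surveillance_server()
    return socket.create_connection((host, port), timeout=5)
```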

3rdeye's Functionality
3rdeye allows you to watch in real time, on your Java-enabled mobile phone (not necessarily 3G), the images provided by the video cameras connected to the video server, and to control the position of the video cameras (pan/tilt/zoom).


The connection to the video server is made through the Internet, using a GPRS connection (not necessarily, as already stated, a 3G connection, nor a smart phone). The received image can be presented in full-screen or normal view and has multiple display modes: full frame, 1:2, 1:1 (in the latter case, the application is designed to have a scroll and an auto-detection feature).

The moment a client application connects to the server, it immediately sends the server information about the maximum size of the phone's display, so that the server automatically adjusts the video images (width x height). Using an advanced technology developed by Cratima Software, based on a proprietary motion detection and tracking algorithm, the Video Grabbing Server records all the events that occur, along with the corresponding motion images.

All the recorded events can be viewed from the client's 3rdeye mobile phone application.

3rdeye's Applicability
3rdeye has multiple usages and, being developed from end to end by Cratima, can be customized for every client's needs:

• Managing employee conduct and duties from remote locations.
• Off-site monitoring of homes, cottages, shops, offices, factories, warehouses, cars and boats.
• Child care monitoring at home, nurseries, kindergartens and schools, or observing the well-being of senior citizens and disabled people.
• Pet/weather watch; snow or traffic conditions; construction site video surveillance.

Conclusion:
From our study, we can conclude that for the moment there is no solution for video content analysis on the mobile phone itself. The solutions which exist now allow media to be modified and managed. There are solutions which include mobile phones in their platform, but the content analysis is done elsewhere.

2.1.6 Interactive composition and scene mixing

The scene is the composition of the audiovisual elements that are shown to the user. Initially it is generated in the corresponding server. The user, interacting with his device, acts on the scene elements, updating the scene according to his preferences. The scene composition, and therefore the way of interacting with it, can be done in different ways.

Composition or mixing in the server
The scene and all the components that compose it are mixed into one single stream that is sent to the client. When the user interacts with the application to modify the scene, the server receives the corresponding orders to compose the scene again and sends it as a single stream for each user.

The bandwidth is proportional to the number of users, because every user is served with a video stream specially encoded for him.


A powerful video server is needed, able to decode the video elements, compose them and encode the result in real time. In these terms, it must simultaneously encode as many streams as there are users.

Current technologies:

Video editing tools. There are some tools able to carry out this complete process. However, these tools are designed for video post-production. Some of them allow video to be generated in real time for live broadcasts, but all of them have a graphical operator interface and lack a programmatic interface (API), so they are not suitable for providing interactivity with the user.

Decoding and mixing using a frame server. The frame mixing can be done with a frame server. Frame servers are oriented towards video post-production, but there are some developments that allow a certain degree of personalization, although the interactivity is limited.

AviSynth is a frame server composed of APIs that can be used both by the player and by the video server. In this case, it has to be installed on the video server for VoD or on the multicast transmitter for TV channels.

Composition or mixing in the client
The videos that compose the scene are sent as independent streams to the user, and the client device mixes the video streams. When the user interacts with the application/player to modify the scene, the server just receives the stream control requests for the user's streams.

The bandwidth is proportional to the number of users, multiplied by the number of streams that every user is viewing.

This solution is suitable for multicast environments, because a stream is not encoded separately for every user.

A video server is needed with the capacity to run as many video processes as there are users, multiplied by the number of videos that every user can play simultaneously.

Current technologies:

• Decoding and mixing using a frame server.
AviSynth is a frame server that does not need a graphical interface. It can be used with a player by means of a script on the client.

• VRML (Virtual Reality Modeling Language) or X3D.
VRML made it possible to visualize 3D scenes (with content) on the web. However, remote access to big and complex scenes where the bandwidth is limited is a weakness, since much data must be transferred before the user can interact with the scene elements and manage them.
BIFS, being a compressed binary format that is encapsulated and streamed, reduces this weakness, so the user can interact with the scene elements that are available, improving the user experience.

• MPEG-4: BIFS and LASeR.
BIFS is the MPEG-4 scene description protocol used to compose MPEG-4 objects, describe the interaction between them and animate them. BIFS is a binary format for 2D or 3D content.
LASeR is the protocol proposed in the MPEG-4 standard to provide capabilities similar to BIFS for devices with fewer resources, such as PDAs and mobile phones.


Composition or mixing in the client and server
This is a mixed approach, trying to take the best of each alternative. It consists of encoding several independent elements of the scene as one. This must be done with the elements that do not require separate interaction with each of them. It reduces the number of streams per user (and therefore the bandwidth) without reducing the interaction possibilities.

This encoding should be done just once and the result stored for later use. In this way, it is guaranteed that the server will not need a high processing capacity and that the interactivity will not be penalized by delays.
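The bandwidth trade-off between the three mixing approaches can be made concrete with a small back-of-the-envelope calculation. The sketch below is illustrative only: the per-stream bitrates, the number of elements per scene and the number of elements that can be pre-grouped are hypothetical values, not figures measured in the project.

```python
# Rough bandwidth comparison of the three mixing approaches (illustrative values only).

def server_mixing(users, mixed_kbps=500):
    # One personalised, fully mixed stream is encoded and sent per user.
    return users * mixed_kbps

def client_mixing(users, elements=4, elem_kbps=200):
    # Every scene element is sent as an independent stream to every user.
    return users * elements * elem_kbps

def hybrid_mixing(users, elements=4, elem_kbps=200, grouped=2):
    # Elements that need no separate interaction are pre-encoded into one stream,
    # reducing the per-user stream count without losing interactivity.
    return users * (elements - grouped + 1) * elem_kbps

for users in (10, 100, 1000):
    print(f"{users:5d} users: "
          f"server {server_mixing(users)} kbit/s, "
          f"client {client_mixing(users)} kbit/s, "
          f"hybrid {hybrid_mixing(users)} kbit/s")
```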


2.2 UI of services for IP-enabled TV and Set-Top Boxes

2.2.1 On-line services for IP-enabled TV and Set-Top Boxes

Services which can be directly rendered on IP-enabled TV screens are based on HTML/XML technology. Service providers are able to define the layout and style of the user interface of their services (infotainment, travel, shopping, etc.).

• CE-HTML is a remote UI protocol for such services, with a core based on XHTML; it is developed by CEA and has been adopted by DLNA [ref CEA-2014]. Version 1.0 has been available since June 2006. It allows existing Internet content to be easily re-purposed for a variety of CE devices (see also device-based UI adaptation). Content-based adaptations to the user interface can be communicated via this protocol.

• T-Navi is an IP-based information service (conforming to HTML 4.0) developed by Matsushita. The T-Navi services have only been available in Japan, since 2006. T-Navi-enabled sets are available from Panasonic (Viera series) and Toshiba.

• acTVila, a successor to T-Navi, will be launched in Japan in February 2007 as an IP- and HTML-based television service combining text-based information with video, with plans to provide a streaming-based video-on-demand service by the end of 2007. acTVila service providers will have the freedom to create their own UI style by changing colors and website layout; for example, video-on-demand services can have a different layout from information-based services. Brand-specific logos or designs can also be applied to the UI.

2.2.2 Flash-based content adaptation in Set-Top Boxes

NDS and Bluestreak together bring middleware for set-top boxes to the market using UPnP multimedia streaming and Macromedia Flash as the user interface engine. Flash allows the set-top box makers to add dynamic elements to the user interface (e.g. animations) adapted to the different media categories (types of content) being watched. User-oriented UI adaptation can also be supported: customization based on user preferences, for example by choosing from a list of predefined skins.


3 State-of-the-art Compression Algorithms

3.1 Motion JPEG-2000 and Wireless (Part 11) JPEG-2000

3.1.1 Introduction

The JPEG-2000 standardization effort [1] demonstrated that state-of-the-art coding performance can be obtained in still-image compression with a coding architecture that enables a rich set of features for the compressed bitstream. In particular, unlike the previous JPEG standard, JPEG-2000 provided a precise rate-control mechanism based on embedded coding of wavelet coefficients. Moreover, multiple qualities and multiple resolutions of the same picture are possible within JPEG-2000 based on selective decoding of portions of the compressed bitstream. Additionally, it should be emphasized that, for image and video transmission over error-prone channels, the embedded nature of JPEG-2000 allows for a layered content protection against channel errors [2].

In the area of motion-compensated video compression, similar functionalities have long been pursued, mainly via the use of extensions of the basic MPEG coding structure [3]. In terms of related systems with immediate industrial applicability, i.e. scalable video coding standards, this resulted in the fine-granularity scalable video coding extension of MPEG-4 video (MPEG-4 FGS) [4]. However, MPEG-4 FGS left much to be desired. In particular, the compression efficiency of FGS was not as good as that of the equivalent non-scalable (baseline) coder. In addition, the use of the conventional closed-loop video coding structure of MPEG-like coders hindered the scalability functionalities.

As a result, recent research efforts on scalable video coding were targeted at the extension of open-loop coding systems, such as JPEG-2000, to video coding. Although an extension of the basic technology of JPEG-2000 to three dimensions is a feasible task by extending its transform and coding modules to three dimensions [5], this does not guarantee the highest possible coding efficiency since motion-compensation tools are not included. Moreover, the end-to-end delay of such a coding system is substantially increased in comparison to the corresponding frame-by-frame compression. Although the delay problem manifests itself in motion-compensated video coding as well, in this case the compression efficiency is significantly increased by the use of motion-compensated prediction. This may override the high-delay detriment in applications for which achieving a low end-to-end delay is not a critical issue.

In this section, we present an overview of the fundamental tools behind scalable image and video coding that are suitable for transmission environments with losses. Our presentation is divided into two parts: sections 3.1.2 and 3.1.3 are dedicated to the description of the features of Motion JPEG-2000 and the upcoming Wireless JPEG-2000 standard, as they represent the state-of-the-art in intra-frame video coding for ideal and lossy-transmission frameworks, respectively. Inter-frame video coding architectures involving motion-compensated prediction are treated in section 3.1.4.


3.1.2 Scope and Features of Motion JPEG-2000

Motion JPEG-2000 (or MJPEG-2000) is an extension of the baseline (Part 1) JPEG-2000 standard that supports video data. Intra-frame coding is supported based on the Embedded Block Coding with Optimized Truncation (EBCOT) algorithm of JPEG-2000 (i.e. without motion-compensated prediction). Lossy and lossless compression is provided with one codec and, for every video frame, similarly to JPEG-2000, scalability in resolution and quality is available from a single compressed bitstream. The input sample depth can be up to 32 bits per color component, while the maximum frame width and height is up to 2^32 - 1 pixels. The output bitrate for each frame can be controlled based on a constant-bitrate (CBR) scheme. Alternatively, variable-bitrate (VBR) schemes can be used, which provide uniform quality across time with high efficiency. For the integration of the various bitstreams into one stream, an MPEG-4 based file format is used, which appropriately tags the various bitstreams to ensure correct synchronization of audio and video. This format provides the capability for metadata embedding, and moreover multi-component, multi-sampling formats are supported, e.g. YUV 4:2:2, RGB 4:4:4, etc.

In general, although intra-frame algorithms do not provide the highest coding efficiency for video data, MJPEG-2000 intra-frame coding provides important functionality requirements that are difficult to satisfy with inter-frame video coding based on motion-compensated prediction. For example, intra-frame coding greatly facilitates video editing, individual frame access, fast browsing with enhanced forward/backward capabilities, etc. In addition, in terms of complexity requirements and overall delay, intra-frame algorithms are always preferred over inter-frame algorithms since they have lower memory requirements (typically up to only one input frame), no motion estimation/motion compensation is performed at the encoder or decoder, and the maximum delay corresponds to the delay incurred by the end-to-end processing of one input frame.
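To make the intra-frame, per-frame rate-control idea tangible, here is a minimal sketch that encodes each frame of a short synthetic sequence as an independent JPEG-2000 code-stream at a fixed compression ratio (a crude stand-in for CBR allocation). It assumes Pillow built with OpenJPEG support; the frame size, the synthetic frame content and the 40:1 target ratio are arbitrary values chosen for illustration, and no MJ2 file or audio synchronization is produced.

```python
# Intra-frame, JPEG-2000-style coding of a short synthetic sequence (sketch only).
# Requires Pillow with OpenJPEG support; this is NOT an MJ2 file writer.
from PIL import Image
import numpy as np

FRAMES, W, H = 8, 320, 240          # hypothetical sequence parameters
TARGET_RATIO = [40]                 # ~40:1 compression for every frame (CBR-like)

for t in range(FRAMES):
    # Synthetic frame: a shifting gradient stands in for camera input.
    x = np.linspace(0, 255, W, dtype=np.uint8)
    frame = np.tile(np.roll(x, 5 * t), (H, 1))
    img = Image.fromarray(frame, mode="L")

    # Each frame is coded independently: no motion estimation, one-frame memory,
    # and any frame can be decoded on its own (easy editing and random access).
    img.save(f"frame_{t:04d}.jp2",
             quality_mode="rates",         # rate-driven allocation per frame
             quality_layers=TARGET_RATIO,  # single quality layer at ~40:1
             irreversible=True)            # lossy 9/7 wavelet path
```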

3.1.3 Scope and Features of Wireless JPEG-2000

Wireless JPEG-2000 (a.k.a. JPWL) [6] is an upcoming extension of the JPEG-2000 standard. JPWL defines a set of tools and methods to achieve the efficient transmission of JPEG-2000 bitstreams over an error-prone wireless network. Wireless networks are characterized by the frequent occurrence of transmission errors along with a low bandwidth, thus putting strong constraints on the transmission of digital images. Since JPEG-2000 provides high compression efficiency, it is a good candidate for wireless multimedia applications. Moreover, due to its high scalability, JPEG-2000 enables a wide range of quality-of-service (QoS) strategies for network operators. However, to be suitable for wireless multimedia applications, JPEG-2000 has to be robust to transmission errors.

The baseline JPEG-2000 standard defines error resilience tools to improve performance over noisy channels. However, these tools only detect where errors occur, conceal the erroneous data, and resynchronize the decoder. More specifically, they do not correct transmission errors. Furthermore, these tools do not apply to the image headers, which are the most important parts of the codestream. For these reasons, they are not sufficient in the context of wireless transmissions.

[Figure: JPWL system description]

For the purpose of efficient transmission over wireless networks, JPWL defines other mechanisms for error protection and correction. These mechanisms extend the elements of the core coding system described in baseline (Part 1) JPEG-2000. These extensions are backward compatible in the sense that decoders which implement Part 1 are able to decode the part of the data that conforms to Part 1 while skipping the extensions defined by JPWL.

The JPWL system is illustrated in the figure above [6]. Basically, JPWL provides a generic file format for robust transmission of JPEG-2000 bitstreams over error-prone networks without being linked to a specific network, error-resilient coder or transport protocol. Additionally, JPWL provides a generic format for the description of the degree of sensitivity to transmission errors of the different parts of the bitstream, and a generic format for the description of the locations of residual errors in the codestream.

Thus, basically, the JPWL standard signals the use of informative tools in order to protect the codestream against transmission errors. These tools include techniques such as error-resilient entropy coding, FEC codes, UEP and data partitioning/interleaving. It is important to point out that these informative tools are not defined in the standard. Instead, they are registered with the JPWL registration authority. Upon registration, each tool is assigned an ID, which uniquely identifies it. When encountering a JPWL codestream, the decoder can identify the tool(s) which have been used to protect this codestream by parsing the standardized JPWL markers and by querying the registration authority. The decoder can then take the appropriate steps to decode the codestream, e.g. acquire or download the appropriate error-resilience tool.
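To make the backward-compatibility mechanism concrete, the following sketch walks the marker segments of a JPEG-2000 main header and simply records the segments it does not recognize instead of rejecting them, which is essentially how a Part 1 decoder can skip JPWL extension segments. It relies only on the generic JPEG-2000 marker-segment layout (a two-byte 0xFFxx marker followed by a two-byte length field for parameterized segments); the list of known markers is deliberately abbreviated and the specific JPWL marker codes are not spelled out here.

```python
# Walk the main-header marker segments of a JPEG-2000 codestream (sketch only).
# Unknown marker segments (e.g. JPWL extensions) are skipped, not treated as errors.
import struct

KNOWN_PART1 = {0xFF51: "SIZ", 0xFF52: "COD", 0xFF5C: "QCD", 0xFF64: "COM"}  # abbreviated
SOC, SOT = 0xFF4F, 0xFF90   # start of codestream / start of tile-part

def skipped_extensions(buf: bytes):
    """Return the marker codes of main-header segments a Part 1 decoder would skip."""
    assert struct.unpack(">H", buf[0:2])[0] == SOC, "missing SOC marker"
    pos, unknown = 2, []
    while pos + 4 <= len(buf):
        marker, length = struct.unpack(">HH", buf[pos:pos + 4])
        if marker == SOT:                 # end of the main header
            break
        if marker not in KNOWN_PART1:
            unknown.append(hex(marker))   # e.g. a JPWL extension segment
        pos += 2 + length                 # length covers itself but not the marker
    return unknown
```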


3.1.4 Video Coding with Motion Compensated Prediction

In this section, we review the conventional closed-loop video coding structure as well as the recently introduced open-loop video coding schemes that perform a temporal decomposition using motion-compensated temporal filtering. Both have been used in the related literature [3] [7] to provide working video coding systems with scalability properties.

All the currently standardized video coding schemes are based on a structure in which the two-dimensional spatial transform and quantization are applied to the error frame coming from closed-loop temporal prediction. A simple structure describing such architectures is shown in the "Hybrid video compression scheme" figure (a) (see further on). The operation of temporal prediction P typically involves block-based motion-compensated prediction (MCP). The decoder receives the motion vector information and the compressed error frame C_t and performs the identical loop using this information in order to replicate MCP within the P operator. Hence, in the decoding process (seen in the dashed area of the "Hybrid video compression scheme" figure (a)), the reconstructed frame at time instant t can be written as:

\[
\mathring{A}_t = P\,\mathring{A}_{t-1} + T_S^{-1} Q_S^{-1} C_t, \qquad \mathring{A}_0 = T_S^{-1} Q_S^{-1} C_0. \tag{0.1}
\]

The recursive operation given by (0.1) creates the well-known drift effect between the encoder and decoder if different information is used between the two sides, i.e. if C_t ≠ Q_S T_S H_t at any time instant t in the decoder. This is not uncommon in practical systems, since transmission errors or loss of compressed data due to limited channel capacity can be a dominant scenario in wireless or IP-based networks, where a number of clients compete for the available network resources. In general, the capability to seamlessly adapt the compression bitrate without transcoding, i.e. SNR scalability, is a very useful feature for such network environments. Solutions for SNR scalability based on the coding structure of the "Hybrid video compression scheme" figure basically try to remove the prediction drift by artificially reducing, at the encoder side, the bitrate of the compressed information C_t to a base layer for which the network can guarantee the correct transmission [3]. An example of such a codec is MPEG-4 FGS [4].

This however reduces the prediction efficiency [3], thereby leading to degraded coding efficiency for SNR scalability. To overcome this drawback, techniques that include a certain amount of enhancement-layer information in the prediction loop have been proposed. For example, leaky prediction [8] gracefully decays the enhancement information introduced in the prediction loop in order to limit the error propagation and accumulation. Scalable coding schemes employing this technique achieve notable coding gains over the standard MPEG-4 FGS [4] and a good trade-off between low drift errors and high coding efficiency [8] [9]. Progressive Fine Granularity Scalable (PFGS) coding [10] also yields significant improvements over MPEG-4 FGS by introducing two prediction loops with different quality references. A generic PFGS coding framework employing multiple prediction loops with different quality references and careful drift control leads to considerable coding gains over MPEG-4 FGS, as reported in [11] [12].
MPEG-4 FGS, as reported in [11] [12].


To address the issues of efficient video transmission, several proposals suggested an open-loop system, depicted in the "Motion-compensated temporal filtering" figure (b) (see further on), which incorporates recursive temporal filtering. This can be perceived as a temporal wavelet transform with motion compensation [13], i.e. motion-compensated temporal filtering (MCTF). This scheme begins with a separation of the input into even and odd temporal frames (temporal split). Then the temporal predictor performs MCP to match the information of frame A_2t+1 with the information present in frame A_2t. Subsequently, the MCU operator U inverts the information of the prediction error back to frame A_2t, thereby producing, for each pair of input frames, an error frame H_t and an updated frame L_t. The MCU operator either performs motion compensation using the inverse vector set produced by the predictor [14], or generates a new vector set by backward motion estimation [15]. The process iterates on the L_t frames, which are now at half the temporal sampling rate (following the multilevel operation of conventional lifting), thereby forming a hierarchy of temporal levels for the input video. The decoder performs the mirror operation: the scheme in the "Motion-compensated temporal filtering" figure (b) operates from right to left, the signs of the P and U operators are inverted, and a temporal merge occurs at the end to join the reconstructed frames. As a result, having performed the reconstruction of the L_t, denoted by L°_t, at the decoder we have:

A°_2t = L°_t - U(T_S^-1 Q_S^-1 C_t),        A°_2t+1 = P(A°_2t) + T_S^-1 Q_S^-1 C_t        (0.2)

where A°_2t and A°_2t+1 denote the reconstructed frames at time instants 2t and 2t+1. As seen from (0.2), even if C_t ≠ Q_S T_S H_t at the decoder, the error affects the reconstructed frames A°_2t and A°_2t+1 only locally and does not propagate linearly in time over the reconstructed video. Error propagation may occur only across the temporal levels, through the reconstructed L°_t frames. However, after the generation of the temporal decomposition, embedded coding may be applied in each group of frames (GOP) by prioritizing the information of the higher temporal levels based on a dyadic-scaling framework, i.e. following the same principle of prioritization of information used in wavelet-based SNR-scalable image coding [6]. Hence, the effect of error propagation in the temporal pyramid is limited and seamless video-quality adaptation can be obtained in SNR scalability [7][16]. In fact, experimental results obtained with SNR-scalable MCTF video coders, as well as results obtained with other state-of-the-art algorithms [17][18], suggest that this coding architecture can be comparable in the rate-distortion sense to an equivalent non-scalable coder that uses the closed-loop structure. However, one significant disadvantage of this type of technique for real-time communications concerns the end-to-end codec delay. In particular, following the analysis of [19], it can be shown that for a GOP of N frames (where N is typically 16 or 32 for a frame rate of 30 or 60 frames per second, respectively), the required end-to-end delay in terms of number of decoded frames can be as high as N/2 + 1 frames.
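For illustration only, the following sketch implements one temporal level of the lifting structure described above, with the motion-compensated prediction and update operators P and U replaced by the identity (i.e. a Haar-like temporal transform); real MCTF coders use motion-compensated versions of these operators. The function names are assumptions made here, not taken from the cited papers.

    def mctf_analysis(frames):
        """One temporal level of (motion-free) lifting over frame pairs.

        P and U of the text are reduced to the identity here, which turns the
        transform into a Haar-like temporal wavelet. Returns the low-pass
        frames L_t and the high-pass (prediction error) frames H_t.
        """
        L, H = [], []
        for t in range(0, len(frames) - 1, 2):
            A_even, A_odd = frames[t], frames[t + 1]
            H_t = A_odd - A_even            # H_t = A_{2t+1} - P(A_{2t})
            L_t = A_even + 0.5 * H_t        # L_t = A_{2t}  + U(H_t)
            L.append(L_t)
            H.append(H_t)
        return L, H

    def mctf_synthesis(L, H):
        """Mirror operation at the decoder, cf. equation (0.2)."""
        frames = []
        for L_t, H_t in zip(L, H):
            A_even = L_t - 0.5 * H_t        # A_{2t}   = L_t - U(H_t)
            A_odd = A_even + H_t            # A_{2t+1} = P(A_{2t}) + H_t
            frames += [A_even, A_odd]
        return frames

    # Perfect reconstruction on toy scalar "frames":
    assert mctf_synthesis(*mctf_analysis([1.0, 2.0, 3.0, 5.0])) == [1.0, 2.0, 3.0, 5.0]

Iterating mctf_analysis on the returned L frames would build the hierarchy of temporal levels mentioned in the text.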


[Figure: (a) The hybrid video compression scheme. (b) Motion-compensated temporal filtering.]

Notations:
A_t is the input video frame at time instant t = 0, ..., 2t, 2t+1
A°_t is the reconstructed frame
H_t is the error frame, whereas L_t is the updated frame
C_t denotes the transformed and quantized error frame obtained by using the spatial operators T_S and Q_S, respectively
P denotes the temporal prediction
U denotes the temporal update.

Our description of motion-compensated state-of-the-art video coders is concluded with the presentation of two indicative coding systems that represent the current state of the art in the closed-loop and open-loop temporal prediction structures, namely the Advanced Video Coder (AVC), also called the H.264 coder, which was jointly standardized by MPEG and ITU-T [20], and the motion-compensated embedded zero-block coder (MC-EZBC) of [17]. While the AVC is a non-scalable coding scheme, optimized for a certain set of quantization parameters, the MC-EZBC has the capability of simultaneous scalability in bitrate, resolution and SNR.



3.2 Codification technologies
3.2.1 Introduction
Video coding is a necessary element for compressing video, making the most of the available network and storage capacity.
MPEG (Moving Picture Experts Group) is an ISO/IEC working group in charge of developing standards for audio and video coding. The first standard, MPEG-1, was the basis for formats such as Video CD and MP3. Later, the MPEG-2 standard became the basis of products like DVD and digital TV set-top boxes. The most recently defined standard is MPEG-4, a multimedia standard for wired and wireless networks, defined for representing real and synthetic audio-visual objects. In addition, MPEG-7 was created to describe and locate audio-visual content, and MPEG-21 defines the multimedia framework.

3.2.2 MPEG-1 and MPEG-2
MPEG-1
Used for video streaming at a bit rate of approximately 1.5 Mbit/s, and oriented towards digital storage, especially on CD-ROM.
MPEG-2
This is a more advanced video compression technique offering better bit rates and better compression. It allows the coding of progressive and interlaced video sequences up to HDTV level.
The most important audio codec defined in the MPEG-2 standard (Part 7) is AAC (Advanced Audio Coding). AAC defines a format for multi-channel audio coding and achieves similar quality to other codecs at a lower bit rate.
MPEG-2 Video is a video compression standard with bit rates between 4 and 10 Mbit/s. It defines 5 profiles, referring to the complexity of the compression algorithm, and 4 levels, referring to the resolution of the original video. Main Profile at Main Level (MP@ML) is the most used combination.
MPEG-2 Systems defines two multiplexing schemes: the "Program Stream", compatible with MPEG-1, and the "Transport Stream", which allows multiple streams with independent origins to be sent together.
MPEG-2 is the most successful standard for multimedia representation on the market; digital entertainment relies mainly on MPEG-2. The main conceptual innovation in MPEG-2 is scalable video coding.


MPEG-4 defines new functionality and new capabilities and is, probably, the future standard for multimedia applications.
MPEG-4 adds an important conceptual advance in the representation of multimedia content: the object-based representation model. This model considers that audio-visual content describes a world composed of elements called objects. The audio-visual scene is the composition of independent objects, each with its own coding, characteristics and behaviour. Since the elements are encoded individually, they can also be accessed individually. This architecture provides a complete range of interactive possibilities.
MPEG-4 retains the characteristics of MPEG-1 and MPEG-2 with better video coding, and adds new features such as advanced 3D graphics support (textures, animations, etc.) for 3D scenes, object-oriented files (audio, video, 3D objects, streaming text), and support for DRM (Digital Rights Management).

3.2.3 MPEG-4
3.2.3.1 MPEG-4 architecture
The MPEG-4 architecture is composed of the following parts:
• MPEG-4 Systems: Specifies the global architecture of the standard and defines how MPEG-4 Visual and MPEG-4 Audio are integrated. MPEG-4 Systems introduces the concept of BIFS (BInary Format for Scenes). BIFS defines the interaction between objects.
• DMIF (Delivery Multimedia Integration Framework): This part defines the streaming of advanced content, or "Rich Media".
• MPEG-4 Visual: This part defines the representation of natural and synthetic video content.
• MPEG-4 Audio: This part defines the representation of natural and synthetic audio content.

3.2.3.2 CODECS (MPEG-4 Visual and MPEG-4 Audio)
A codec (COder-DECoder) is the algorithm that defines how to encode and decode video and audio content in order to reduce its size or the bandwidth necessary for its transmission, with as little loss of quality as possible.
The audio codec, MPEG-4 AAC (Advanced Audio Coding), is an extension of MPEG-2 AAC (MPEG-2 Part 7).
The main video codecs are:
• The codecs included in Part 2 of the standard, especially those bound to the Simple Profile (SP) and Advanced Simple Profile (ASP).
• H.264/AVC (Advanced Video Coding)/MPEG-4 Part 10. MPEG-4 AVC provides much more efficient video compression than the others, while offering more flexibility for applications.


MPEG-4 Scalable Video Coding (SVC) is a forthcoming extension of the MPEG-4 AVC standard. SVC uses the same video stream (a single encoding of the content) for different devices on different networks. SVC provides scalability in three dimensions:
• Spatial scalability: selecting a suitable resolution.
• Temporal scalability: selecting the frame rate.
• Quality scalability: selecting the bit rate.
MPEG-4 SVC generates a base layer compatible with MPEG-4 AVC, plus one or more additional layers. The base layer contains the minimum quality, frame rate and resolution, and the following layers increase the quality and/or resolution and/or frame rate.
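The following minimal sketch illustrates the general idea of extracting an operating point from such a layered stream by dropping the enhancement layers that exceed the capabilities of the target device; the layer metadata, names and thresholds are invented for this example and do not reflect the SVC bitstream syntax.

    # Illustrative layer descriptions: (name, width, height, fps, kbit/s).
    LAYERS = [
        ("base",  320, 180, 15,  200),    # AVC-compatible base layer
        ("enh-1", 640, 360, 30,  600),    # adds resolution and frame rate
        ("enh-2", 1280, 720, 30, 1800),   # adds further resolution and quality
    ]

    def select_layers(max_width, max_height, max_fps, max_kbps):
        """Keep the base layer plus every enhancement layer the device can use.

        Each layer depends on the previous ones, so we stop at the first layer
        that exceeds any of the device constraints.
        """
        selected = []
        for name, w, h, fps, kbps in LAYERS:
            if w > max_width or h > max_height or fps > max_fps or kbps > max_kbps:
                break
            selected.append(name)
        return selected

    # A phone limited to 640x360 at 30 fps and 1 Mbit/s receives base + enh-1.
    print(select_layers(640, 360, 30, 1000))   # ['base', 'enh-1']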

3.2.3.3 MPEG-4 Systems (BIFS)
Exploring other possibilities for advanced devices, it has been demonstrated that ISO/IEC 14496-11 "Scene description and application engine" (also known as BIFS) is another good alternative; however, interoperability between the two formats is necessary, i.e. the ability to produce ISO/IEC 14496-20 and ISO/IEC 14496-11 content at the same time.
ISO/IEC 14496-11 specifies the coded representation of interactive audio-visual scenes and applications. It specifies the following tools:
• The coded representation of the spatio-temporal positioning of audio-visual objects, as well as their behaviour in response to interaction (scene description).
• The coded representation of synthetic two-dimensional (2D) or three-dimensional (3D) objects that can be manifested audibly and/or visually.
• The Extensible MPEG-4 Textual (XMT) format, a textual representation of the multimedia content described in ISO/IEC 14496 using the Extensible Markup Language (XML), and a system-level description of an application engine (format, delivery, lifecycle, and behaviour of downloadable Java byte-code applications).

3.2.3.4 MPEG-4 Part 20 (LASeR and SAF [44])
Because of the resource limitations of mobile phones, smartphones, PDAs, set-top boxes and older desktop or portable PCs, requirements need to be optimized so as to accommodate all devices with one compatible format that permits interoperability across the different cases. To that end, we are exploring all emerging audio, video and stream formats to find the best choice.


It seems that the best current choice would be ISO/IEC 14496 (also known as MPEG-4) and its primary parts: ISO/IEC 14496-1 "Systems" [45], ISO/IEC 14496-2 "Visual" [46], ISO/IEC 14496-3 "Audio" [47], ISO/IEC 14496-10 "Advanced Video Coding" [48] and ISO/IEC 14496-20 "Lightweight Application Scene Representation (LASeR) and Simple Aggregation Format (SAF)" [49]. Optionally, we analyse ISO/IEC 14496-11 "Scene description and application engine" (also known as BIFS) [50], but at present it cannot be adapted to the limited resources of constrained devices such as mobile phones.
The fundamental parts of the optimum formats we have found are:
• ISO/IEC 14496-20, which defines a scene description format (LASeR) and an aggregation format (SAF) suitable for representing and delivering rich-media services to resource-constrained devices such as mobile phones. A rich-media service is a dynamic, interactive collection of multimedia data such as audio, video, graphics, and text. Services range from movies enriched with vector graphic overlays and interactivity (possibly enhanced with closed captions) to complex multi-step services with fluid interaction and different media types at each step.

• LASeR aims at fulfilling all the requirements of rich-media services at the scene description level. LASeR supports:
o An optimized set of objects inherited from SVG to describe rich-media scenes.
o A small set of key compatible extensions over SVG.
o The ability to encode and transmit a LASeR stream and then reconstruct SVG content.
o Dynamic updating of the scene to achieve a reactive, smooth and continuous service.
o Simple yet efficient compression to improve delivery and parsing times, as well as storage size, one of the design goals being to allow both a direct implementation of the SDL as documented and a decoder compliant with ISO/IEC 23001-1 "Binary MPEG format for XML" to decode the LASeR bitstream.
o An efficient interface with audio and visual streams, with frame-accurate synchronization.
o Use of any font format, including the OpenType industry standard.
o Easy conversion from other popular rich-media formats in order to leverage existing content and developer communities.
Information taken from http://www.mpeg-laser.com.

Introduction
LASeR is a scene description format, where a scene is a spatial, temporal and behavioural composition of audio media, visual media, graphics elements and text. LASeR is binary, or compressed, like BIFS or Flash, as opposed to textual scene descriptions such as XMT, VRML or SVG. LASeR stands for Lightweight Application Scene Representation.
Application Scene Representation.


SAF is a streaming-ready format for packaging scenes and media together and streaming them over protocols such as HTTP/TCP. SAF services include:
• A simple multiplex for elementary streams (media, fonts or scenes).
• Synchronization and packaging signalling.
SAF stands for Simple Aggregation Format. LASeR and SAF have been designed for use in mobile, interactive applications.

Why LASeR?
The decision to create yet another standard for scene description was taken after a thorough survey of the available open or de-facto standards: BIFS, Flash and SVG Tiny (SVGT). Profiling of BIFS was attempted in order to create a subset small enough to be used on mobile phones, to no avail. Flash is proprietary and is too heavy for most mobile phones. SVGT 1.1 is getting some traction, but on the one hand SVGT 1.1 does not have AV interfaces or dynamicity, and on the other hand its successor, SVGT 1.2, is still in flux; while it will feature AV interfaces, it will still lack dynamicity, compression and streaming, and is significantly heavier than SVGT 1.1. Also, SVGT in general relies on a host of other standards such as DOM, SMIL, ECMAScript, XHTML, CSS and MIME multipart, and managing such a pile of standards is a true challenge in terms of interoperability.

Why SAF?
The decision to create yet another standard for the distribution of mobile content was taken after implementing and trying interactive services on small devices, based on RTP/RTSP or on MP4/3GP download (progressive or not) over TCP/HTTP. In most cases, the need for a simpler, lighter solution was obvious. In order to package efficiently, download progressively or stream a scene with a few media, RTP is overkill, and MP4/3GP is not well suited to the job: MP4/3GP is a file format, and it can only be used for progressive download through special cases (the moov atom placed at the front of the file, media interleaved in time order). In addition, MP4/3GP has a host of features that burden a mobile implementation for no reason. In order to reduce the design time of SAF and obtain almost immediate validation, SAF was designed around a simple configuration of a proven technology: the MPEG-4 Systems Sync Layer. As a bonus, this makes an RTP payload format for SAF available for free through RFC 3640.
In summary, SAF has the minimal/optimal set of features for the job, and can be mapped easily onto other transport mechanisms (RTP, MP4/3GP, MPEG-2 TS, ...).

Requirements of LASeR
The requirements which structure the design of LASeR are:
1 Support efficient and compact representation of scene data supporting at least the subset of the SVGT 1.1 object set functionality. (Today LASeR is aligned as much as possible with SVGT 1.2.)
2 Allow an easy conversion from other graphics formats (e.g. BIFS, SMIL/SVG, PDF, Flash, ...).
3 Provide efficient coding, to be suitable for the mobile environment.
4 Allow separate streams for 2D and 3D content.
5 Allow the representation of scalable scenes.
6 Allow the representation of adaptable scenes, for use within the MPEG-21 DIA framework.
7 Be extensible in an efficient manner.
8 Allow the definition of small profiles.
9 Allow the representation of error-resilient scenes.
10 Allow encoding modes that are easily reconfigurable and signalled in band.
11 Provide an optimal balance between compression efficiency and the complexity and memory footprint of the decoder and compositor code.
12 Allow integer-only implementations of decoding and rendering.
13 Allow saving and restoring several scene states. The saving and restoring shall be triggerable either by the server or by the user.
14 Allow low-complexity profiles implementable on the Java MIDP platform.
15 Allow the representation of differential scenes, i.e. scenes meant to build on top of another scene.
16 Allow interaction through the available input devices, such as a mobile keypad or pen, and support the input of strings.
17 Allow safe implementations of the scene decoder.
In addition, it is deemed crucial that LASeR is designed in such a way that implementations can:
• Be as small as possible.
• Be as fast as possible.
• Require as little runtime memory as possible.
• Be implementable at least partially in hardware.

Requirements for Simple Aggregation Format (SAF)
The requirements which structure the design of SAF are:
1 Provide a simple aggregation mechanism for Access Units of various media into aggregated packets (video, audio, graphics, images, text/fonts, ...).
2 Allow a synchronized presentation of the various media elements in a packet or a sequence of such aggregated packets.
3 Be as bit-efficient as possible.
4 Be byte-aligned.
5 Be easily transported over popular interactive transport protocols (e.g. HTTP).
6 Be easily mapped onto popular streaming protocols (e.g. the MPEG-4 RTP payload format, RFC 3640).
7 Be extensible in an efficient manner.
8 Allow the management of pre-loaded objects, which enables the server to anticipate the downloading of the corresponding objects to improve the user experience.


What is LASeR?
LASeR is:
• An SVGT scene tree, with an SVG rendering model.
• An update protocol, allowing actions on the scene tree such as inserting an object, deleting an object, replacing an object or changing a property: this is the key to the design of dynamic services and a fluid user experience. This update protocol can also be seen as a kind of micro-scripting language (an illustrative sketch of such tree updates follows after this list).
• OpenType text and fonts, including downloadable/streamable fonts.
• A binary encoding which, coupled with the update protocol, allows the incremental loading/streaming of scenes with excellent bandwidth usage.
• A few LASeR extensions to improve the support of input devices, the flexibility of event processing without a full scripting language, and simple axis-aligned rectangular clipping.
Because of the above, LASeR may also have:
• A micro-DOM or JSR 226 interface, since the scene tree is almost purely SVG, thus allowing the design of complete applications on top of the LASeR engine. The micro-DOM interface also makes it possible to use ECMAScript with LASeR scenes.
• Easy conversion of Flash content to LASeR, because the update protocol is similar to that of Flash.
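To make the update-protocol idea concrete, the following sketch applies insert/delete/replace commands to a toy scene tree; the command names and the dictionary-based representation are illustrative assumptions and have nothing to do with the actual LASeR binary syntax.

    # A toy scene tree: element id -> {attribute: value, "children": [...]}.
    scene = {
        "root":  {"children": ["title"]},
        "title": {"text": "Hello", "x": 10, "y": 20, "children": []},
    }

    def apply_update(scene, command):
        """Apply one update command to the scene tree (illustrative only)."""
        op = command["op"]
        if op == "insert":
            scene[command["id"]] = command["element"]
            scene[command["parent"]]["children"].append(command["id"])
        elif op == "delete":
            del scene[command["id"]]
            for node in scene.values():
                if command["id"] in node.get("children", []):
                    node["children"].remove(command["id"])
        elif op == "replace_attr":
            scene[command["id"]][command["attr"]] = command["value"]
        return scene

    # Streaming a small update instead of a whole new scene keeps bandwidth low.
    apply_update(scene, {"op": "replace_attr", "id": "title",
                         "attr": "text", "value": "Goodbye"})
    print(scene["title"]["text"])   # Goodbye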

What is SAF?
SAF is:
• A fixed configuration of the MPEG-4 Systems Sync Layer, providing an easy yet powerful way of packaging elementary streams.
• A simplified stream description mechanism.
• A simple multiplex for several media, font and scene streams.
SAF streams may be:
• Packaged in RTP/RTSP using the payload format defined in RFC 3640.
• Packaged in MP4/3GP files using a mapping defined with SAF.
• Packaged in an MPEG-2 Transport Stream using the SL mapping defined in ISO/IEC 14496-8.
Although this format appears to carry a patent fee, we do not think this is a problem, since it also appears to be our best solution. We are currently waiting for the release of the final reference software to check its viability and stability.


3.3 Additional formats for more powerful devices (future)
3.3.1 VC-1 [21]
Other aspects to introduce in the near future are higher resolutions, to cover high-definition content without resource penalties. For this, we find that the SMPTE 421M video codec, "VC-1 Compressed Video Bitstream Format and Decoding Process" (known as VC-1), is a great choice for covering these large resolutions as well.
VC-1 minimizes the complexity of decoding high-definition (HD) content through improved intermediate-stage processing and more robust transforms. As a result, VC-1 decodes HD video twice as fast as H.264, while offering two to three times better compression than MPEG-2.
Since VC-1 is optimized for decoding performance, it ensures a superior playback experience across the widest possible array of systems regardless of bit rate or resolution. These systems range from the PC (where VC-1 playback at 1080p is possible) to set-top boxes, gaming systems, and even wireless handsets.
VC-1 offers superior quality across a wide variety of content types and bit rates, which has been well documented by independent sources:
• DV Magazine found VC-1 to be superior to both MPEG-2 and MPEG-4.
• TANDBERG Television found that VC-1 produces significantly better quality than MPEG-2 and comparable quality to H.264. These results were presented at the 2003 International Broadcasting Convention (IBC).
• c't Magazine, Germany's premier audio-video magazine, compared various codec standards, including VC-1, H.264 and MPEG-4, and selected VC-1 as producing the best subjective and objective quality for HD video.
• The European Broadcasting Union (EBU) found that VC-1 had the most consistent quality in tests that compared VC-1, RealMedia V9, the Envivio MPEG-4 encoder, and the Apple MPEG-4 encoder.

3.3.2 Device-oriented screens
Analysing hundreds of devices by major manufacturers (Nokia, Sony Ericsson, Motorola, Fujitsu, BenQ-Siemens, Samsung, Alcatel, Philips, Acer, HP, BlackBerry, Qtek (HTC), Palm, ...), we find that the square pixel is the most commonly used proportion for representing pixel information on their screens (typically TFT-based). Because many sources (primarily documentaries and films) are recorded in panoramic formats, we think it is important to accommodate all of these in their original aspect ratio while reducing the bitstream size and complexity.
We think it is important to create all formats automatically and simultaneously on dedicated servers so as to provide the same information in real time. With this, we can define a standard way of transmitting all information to all devices independently of their processing power, and control the total server power needed to create each channel.


Finally, we find that the following resolutions are desirable in order to take advantage of the physical screens of the devices we analysed:

Aspect    Lowest   Low      Medium   High     Highest  Ultra    HD 1*     HD 2*
Devices   M        M        M+P      P+C      P+C      C+T      C+T       C+T
4:3       128x96   176x132  240x180  320x240  480x360  640x480  ---       ---
16:9      128x72   176x99   240x135  320x180  480x270  640x360  1280x720  1920x1080

*: Optional, for the future.
M: Mobiles & Smartphones.
P: PDAs.
C: Computers.
T: TV & Advanced Set-Top Boxes.

Considering all this, we need a minimum source resolution of 640x480 for a 4:3 aspect ratio, with six simultaneous compressed (or live) streams, to accommodate all possible devices; and 640x360 for a 16:9 aspect ratio with six streams under current requirements, or 1920x1080 with eight simultaneous streams to cover the maximum future HD. Because of server resource consumption, we prefer to limit the final resolutions generated for computers and to upsample the source video on the destination computer to a possibly large screen resolution.
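As an illustration of how these figures can be used operationally, the short sketch below encodes the resolution ladder from the table above and selects the largest stream that fits a given device screen; the function name and the selection rule are assumptions made for this example.

    # Resolution ladder from the table above, as (width, height) per aspect ratio.
    LADDER = {
        "4:3":  [(128, 96), (176, 132), (240, 180), (320, 240), (480, 360), (640, 480)],
        "16:9": [(128, 72), (176, 99), (240, 135), (320, 180), (480, 270), (640, 360),
                 (1280, 720), (1920, 1080)],
    }

    def pick_stream(aspect, screen_w, screen_h):
        """Choose the largest pre-encoded stream that fits the device screen."""
        best = LADDER[aspect][0]
        for w, h in LADDER[aspect]:
            if w <= screen_w and h <= screen_h:
                best = (w, h)
        return best

    # Each aspect ratio requires one simultaneous stream per ladder entry:
    print(len(LADDER["4:3"]), len(LADDER["16:9"]))   # 6 and 8 streams, as above
    print(pick_stream("16:9", 320, 240))             # (320, 180) for a QVGA phone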

3.4 Analysis of state-of-the-art image compression algorithms for medical applications
3.4.1 Still image compression such as JPEG, JPEG-LS and JPEG-2000
The following results were obtained using optimized software for JPEG2000, JPEG-LS and lossless JPEG compression running on a Pentium IV at 3 GHz. A set of greyscale medical images of size SXGA (1280x1024 pixels) was compressed.

Performance measurements, lossless mode

CODEC           Throughput (Mbit/s)   Throughput (fps for SXGA)   Processing time / frame   Average CR (1:x)   Coded stream BW (Mbit/s)
JPEG2000        22                    0.7                         1420 ms                   3.4                6.5
JPEG-LS         62                    2.0                         500 ms                    2.9                21.5
Lossless JPEG   230                   7.3                         137 ms                    1.7                135.3


Performance measurements, lossy mode

CODEC             Throughput (Mbit/s)   Throughput (fps for SXGA)   Processing time / frame   Average CR (1:x)   Coded stream BW (Mbit/s)
JPEG2000 @ 10:1   20                    0.6                         1667 ms                   10                 2
JPEG2000 @ 20:1   22                    0.7                         1429 ms                   20                 1.1
JPEG @ 10:1       650                   20.7                        48 ms                     10                 65
JPEG @ 20:1       800                   25.4                        39 ms                     20                 40

Discussion

There are two reasons why existing state-of-the-art still-image compression algorithms are not suitable for our (real-time) application. First of all, the throughput (frame rate) of these compression algorithms is too low. JPEG2000, for instance, only achieves an average of 0.7 frames per second, which is not acceptable. JPEG-LS and lossless JPEG perform better, but even 7 frames per second is too low to allow fluent interaction between user and application and to show medical video sequences. The second reason is that the compression ratios of these algorithms are still too low. Even the best algorithm (JPEG2000) only achieves an average compression ratio of 3.4 on medical images. Current wireless networks (802.11g) have a theoretical bandwidth of 54 Mbit per second, but the actual throughput is closer to 20 Mbit per second. If we want to transmit medical colour images of size 1600x1200 (which is rather low for medical imaging), the size per image is 1600 x 1200 x 3 = 5,760,000 bytes, or 46,080,000 bits, since there are three colour planes. This means that a compression ratio of 3.4 (about 13.6 Mbit per compressed image) would only allow roughly 20 / 13.6, i.e. about 1.5 images per second, to be sent over the wireless network. Again, this is too low to display medical video data and to allow fluent interaction between the user and the medical software application.
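The back-of-the-envelope calculation above can be restated compactly; the sketch below only re-expresses the numbers already given in the text (the function name is chosen here for illustration).

    def achievable_fps(width, height, bytes_per_pixel, compression_ratio, link_mbit_s):
        """Images per second that fit through a link of the given capacity."""
        raw_bits = width * height * bytes_per_pixel * 8
        compressed_bits = raw_bits / compression_ratio
        return (link_mbit_s * 1_000_000) / compressed_bits

    # 1600x1200 colour images, lossless JPEG2000 (CR ~3.4), 20 Mbit/s effective 802.11g:
    print(round(achievable_fps(1600, 1200, 3, 3.4, 20), 1))   # ~1.5 images per second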

The above discussion concerned lossless compression. If we were to switch to lossy compression, the problem of the low compression ratio would be solved. However, at the same time uncontrolled artifacts and distortions would be introduced in the medical images due to the lossy nature of the compression algorithms. This is unacceptable: to date, the general opinion in the medical imaging community is that lossy compression may not a priori be applied to medical images that will be used for diagnosis. Lossy compression in medical imaging is only allowed to reduce the size of archived images, or if one can prove that the lossy nature cannot influence the clinical image quality (which no one has been able to prove so far). Moreover, even with lossy compression, there is still the problem of the limited throughput (frame rate) of the existing compression algorithms.
(framerate) <strong>of</strong> <strong>the</strong> existing compression algorithms.


Note that the performance results presented above are in line with results reported by others and available on the Internet (such as for the highly optimized 'Kakadu' implementation of JPEG2000).

3.4.2 Intra-frame image compression such as MJPEG-2000
Motion JPEG2000 uses only key-frame (intra-frame) compression, allowing each frame to be accessed independently. The advantage of applying frame-by-frame compression is that computationally expensive motion estimation is avoided. The disadvantage is that the compression ratio of algorithms using only intra-frame compression will be significantly lower than that of inter-frame based algorithms. One can see intra-frame video algorithms as an extension of still-image compression algorithms.
The same drawbacks exist for this type of algorithm: the compression ratio is still insufficient when working in lossless mode, and the throughput is too low to really support medical video sequences (at typical medical image resolutions).

3.4.3 Inter-frame image compression such as MPEG-4 AVC
Inter-frame video compression provides a very high compression ratio, especially when used in lossy mode (which it is designed for). Both the closed-loop and the open-loop video codec architectures require complex hierarchical block-based motion models in order to efficiently reduce the uncertainty about the true motion and to improve the compression efficiency. Employing complex motion models, however, reduces the chances of attaining real-time video encoding. Additionally, opting for a classical video codec introduces a delay as high as N/2 + 1 frames, where N is the GOP size (typically 16 or 32 for a frame rate of 30 or 60 frames per second, respectively). For a system running at 30 frames per second with a GOP of N = 32, the delay introduced by the compression would therefore be 17 frames, or roughly 567 milliseconds. It is obvious that interaction between the user and the software application generating the image data is practically impossible with a delay of half a second. For example, such a delay would mean that the display system responds to any action of the user (such as clicking a button, rotating a medical image, performing window/level adjustment, moving a window, ...) with a delay of more than half a second. For off-line analysis of medical images (MRI, CT, etc.) this is not a problem; for vision-aided surgery it certainly is.
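The delay figures quoted above follow directly from the N/2 + 1 bound discussed earlier; the following minimal sketch simply restates that arithmetic (the function name is illustrative).

    def codec_delay(gop_size, fps):
        """End-to-end codec delay, in frames and milliseconds, for a GOP of gop_size."""
        delay_frames = gop_size // 2 + 1
        return delay_frames, delay_frames / fps * 1000.0

    print(codec_delay(32, 30))   # ~ (17 frames, 567 ms): too long for interactive use
    print(codec_delay(16, 30))   # ~ (9 frames, 300 ms)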


4 User Interface Adaptation
4.1 Introduction
User interface adaptation is an issue as old as the history of computing devices and the ways of interacting with them.
In recent years, more and more entertainment and professional services and applications have been developed that can be used (interfaced) and accessed through different devices.

Before analysing the state of the art of this type of application, some criteria should be established in order to narrow the scope of the analysis. To this end, the following classification, based on the way the adaptation is done, is proposed:
a) Customized adaptation: This category comprises user interfaces adapted manually. The main advantage of these applications is that the adaptation is perfectly suited to the final needs of the device. Each interface is redefined in a manual or semi-automatic way in order to obtain exactly the appearance it should have. The cost, the lack of use of standards and the impossibility of launching automatic processes are the main disadvantages.
b) Adaptation based on standard adaptation solutions or on generic standard tools: This kind of adaptation is based on standard or semi-standard tools that allow the automatic adaptation of the interfaces. The adaptation process can be fully standardized or can be based on generic standard transformation tools (e.g. XSLT), but both cases have a common feature: the adaptation is based on solutions that facilitate interoperability and serialization, although sometimes the price is a loss of granularity in the adaptation process.

4.2 MPEG-4 Advanced Content visualization technologies
4.2.1 Software BIFS players
On the market there are several developments that support MPEG-4 Systems (BIFS), coming both from universities acting as research institutions and from companies engaged in research and development and in commercializing services.
The standard MPEG-4 BIFS specification guarantees interoperability. In this way, MP4 content generated with any tool that follows the standard will be playable on any BIFS-compatible device.
However, most of the players do not implement 100% of the BIFS nodes, which implies that interoperability is not completely achieved.

4.2.2 GPAC: Osmo4
Osmo4 is part of the GPAC (Project on Advanced Content) framework developed by ENST, the French national telecommunications engineering school. GPAC allows the generation of 2D and 3D advanced content using the MP4Box tool and its playback using Osmo4.
GPAC is distributed under the LGPL (Lesser General Public License).
Characteristics:
• It supports several multimedia formats, from simple content (avi, mov, mpg) to 2D/3D advanced content.
• It supports playback of local files, HTTP download-and-play, and RTP/RTSP streaming over UDP (unicast or multicast) or TCP.
• Video and audio presentation is based on open-source plugins. A decoder development kit (DDK) is available to connect the player with the necessary codec.
• Playback control: play, pause and fast-forward.
• Graphic features: antialiasing, zoom, rendering-area resizing, full screen.
Osmo4 allows:
• Playback of animated cartoons (downloaded or streamed).
• Interactive and synchronized mixing of graphics, text, video and audio.
• Partial MPEG-7 and MPEG-21 support: metadata, encryption, watermarking, DRM.

4.2.3 IBM: M4Play
IBM has developed an MPEG-4 toolkit. It consists of a set of Java classes and APIs that allow the generation and playback of MPEG-4 advanced content. The toolkit is distributed under a commercial licence.
The M4Play player is part of the toolkit and its characteristics are as follows:
Characteristics:
• Based on Java: multi-platform.
• Two versions:
o Stand-alone application.
o Applet embeddable in an HTML page.
• It supports RTP/RTSP streaming and local file playback.
• It can play:
o MP4 according to the ISMA specifications.
o MP4 including MPEG-4 Systems.
o AVI: MPEG-4 Simple Profile video (.cmp, .m4v, .263).
o AAC: Low-Complexity Profile audio (.aac, .adif, .adts).
o MP3: MPEG-1 Audio Layer III (.mp3).
o MP3: MPEG-1 and third audio level MPEG. (.mp3).


4.2.4 Envivio TV
Envivio has developed and commercializes an MPEG-4 player for set-top boxes, PCs and PDAs.
Characteristics:
• It can be installed as:
o A stand-alone player.
o A plugin for well-known players (QuickTime v4.1.2 or later, RealNetworks v7.0 or later, and Windows Media Player v6.4 or later).
• Portable C/C++ code for set-top boxes and mobile telephones.
• Compliant with the 2D BIFS specification.
• Playback of local or streamed MP4 files.
• Protocols: RTP, RTCP, or RTSP over UDP or through HTTP tunnels, unicast and multicast.
The stand-alone player version can be integrated or ported to any device, including set-top boxes, PCs, PDAs and video game consoles.
Envivio has been certified by RealNetworks and is part of the automatic update programme as the MPEG-4 plugin for the RealNetworks player v8.0 and later.

4.2.5 Bitmanagement: BS Contact MPEG-4
Bitmanagement has developed an MPEG-4 player with 2D and 3D support. The implementation covers more than 80% of the MPEG-4 nodes. This player is being used in several European projects of Telefónica I+D. The MPEG consortium has requested to use the Bitmanagement key software technology as a reference implementation for the standard.
The predecessor of this player is the blaxxun Contact 3D engine, which was the first VRML viewer to introduce DirectX 7 hardware acceleration support and to incorporate some advanced 3D features (particle systems, multi-texturing, NURBS, animation, etc.) and interactivity.
The Bitmanagement player incorporates features such as 2D/3D streaming, animation streaming, compressed scenes and standardized interfaces for digital rights management and encryption.
SoNG (Portals of Next Generation) was a European Commission project with Telefónica I+D participation that used the player developed by Bitmanagement; it was the first MPEG-4 player prototype with 2D/3D support. Bitmanagement now commercializes this player.
Characteristics:
• It can be installed as:
o An ActiveX plugin for Microsoft Internet Explorer.
o A Netscape plugin for Netscape 4.x.
o An ActiveX control embedded in any language that supports COM (Visual C++, Visual Basic).
o An ActiveX control embedded in Java 2 via JNI.
Bitmanagement assures that this player has been tested with content generated by GPAC (ENST) and IBM tools.

4.2.6 Octaga Professional
Octaga commercializes a player for 3D MPEG-4 advanced content: Octaga Professional.
Characteristics:
• It can play MP4 files generated with the GPAC creation tools.
• It can be installed as:
o A stand-alone application.
o A plugin that can be inserted into an HTML page for the Internet Explorer, Firefox and Opera browsers.

4.2.7 Digimax: MAXPEG Player
Digimax commercializes a 2D/3D player compatible with MPEG-4 (BIFS).
Characteristics:
• Portable: C++ code portable to different platforms (STB and mobile).
• It can play MP4 files generated by its own tool: MAXPEG Author.

4.2.8 COSMOS
COSMOS (COllaborative System based on MPEG-4 Objects and Streams) is a framework for developing applications in collaborative virtual environments (CVE).
Completely developed in Java, it allows maintaining a 3D virtual environment in which 3D objects can be exchanged and manipulated in real time. It allows a change to a BIFS node to be sent by broadcast/multicast to all interested participants, thereby updating all the involved scenes.

4.3 UI adaptation based on XML
4.3.1 UI adaptation based on XML transformation
Most of the approaches to UI adaptation are based on XML [22] and its transformation technologies. In [23] and [24] there are two very interesting tutorials concerning these techniques.
Most of the applications rely on the following assumption: considering the user interface as a tree, this tree can be transformed (adapted) into a different tree by recombining the set of leaves it is composed of (a minimal sketch of this kind of transformation follows below).
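As an illustration of this tree-transformation idea (not the architecture of [25]), the sketch below uses Python with the lxml library, assumed to be available, to apply a device-specific XSLT stylesheet to an abstract UI tree; the element names and the stylesheet are invented for the example.

    from lxml import etree

    # An abstract UI tree (source) and a device-specific transformation (XSLT).
    ABSTRACT_UI = """<ui><button label="Play"/><button label="Stop"/></ui>"""

    TO_MOBILE_XSLT = """<xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="/ui">
        <menu><xsl:apply-templates select="button"/></menu>
      </xsl:template>
      <xsl:template match="button">
        <item><xsl:value-of select="@label"/></item>
      </xsl:template>
    </xsl:stylesheet>"""

    # Recombine the leaves of the abstract tree into a mobile-oriented tree.
    transform = etree.XSLT(etree.XML(TO_MOBILE_XSLT))
    mobile_ui = transform(etree.XML(ABSTRACT_UI))
    print(etree.tostring(mobile_ui, pretty_print=True).decode())
    # yields a <menu> tree whose <item> children carry the original button labels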

The following figure shows how the authors of [25] present a possible architecture for carrying out this type of user interface adaptation.
In the paper, the reader can also find information about an authoring tool for developing such transformations.
Architecture and tool components of the system described in [24].

Another example of this approach is the AUIT [26] methodology which, based on XML transformations, proposes a four-layer architecture to adapt the user interface to different devices. This methodology has been improved over the last four years and several implementations are based on it.


AUIT architecture.
Several works based on XML transformations have also been carried out in the framework of the SEESCOA (Software Engineering for Embedded Systems using a Component-Oriented Approach) initiative [27].

4.3.2 Adaptation via XML publishing servers
Based on similar technologies, there are widely used frameworks that provide mechanisms to implement user interface adaptation for applications accessed over IP technology.
These frameworks act as web servers which are able to handle different types of devices (represented by different types or versions of web clients [28]) and implement a different behaviour for each of them. In this way, a web site or a pizza-ordering service can be accessed, browsed and visualized in very different ways (on a TV, PDA, mobile phone, ...).
One of these frameworks, whose use is quite widespread and which has survived and been successfully improved over the last decade, is Cocoon. This framework, based on XML transformation technologies, systematizes the adaptation process in a very significant way.
In [29] there is an example of the use of one of these frameworks to apply such techniques.
Another application based on Cocoon is the PALIO (Personalized Access to Local Information and services for tourists) service framework. The PALIO framework is being used in the development of location-aware information systems for tourists, and is capable of delivering fully adaptive information to a wide range of devices, including mobile ones.
including mobile ones.


PALIO example.
SiteMesh [30] is a web-page layout and decoration Java framework that allows device-oriented user interface adaptation based on XML transformation. It does not act as an XML publishing engine but is integrated into the web server.

4.3.3 Adaptation based on the definition and identification of the device
4.3.3.1 Composite Capabilities / Preference Profiles
Composite Capabilities/Preference Profiles (CC/PP) [31] is a W3C recommendation which, using the Semantic Web oriented language RDF [32], makes it possible to define the profiles and capabilities of a device in order to carry out the appropriate adaptation. This working group has been closed and its work has been transferred to the Device Independence working group [33].
One of the recent results of this group is the specification "Delivery Context: Interfaces (DCI) Accessing Static and Dynamic Properties" [34]. This document defines platform- and language-neutral interfaces that give web applications access to a hierarchy of dynamic properties representing device capabilities, configurations, user preferences and environmental conditions.


User interface adaptation: concepts involved according to the DCI group [33].
There is a well-documented implementation by Sun of the CC/PP specification. This implementation describes how to process CC/PP in Java (JSR-000188, [35]).

4.3.3.2 UAProf (OMA)
One of the outputs of CC/PP had a direct impact on the active Open Mobile Alliance (OMA) forum [36][37]. The result is UAProf, a concrete implementation of CC/PP developed by the OMA. UAProf is a framework for describing and transporting information about the capabilities of a device. This information may include hardware characteristics (e.g. screen size, type of keyboard, etc.) and software characteristics (e.g. browser manufacturer, markup languages supported, etc.). The final purpose is that origin servers, gateways and proxies use this information to customize content for the user. The current version of this specification is UAProf 2.0.

One of the applications that employs this technology can be found in [38].


Architecture defined for the Web UI adaptation in [37].
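As a hypothetical illustration of how a server could exploit capability information of the kind UAProf transports, the sketch below picks a content variant from a simplified profile; the profile fields shown and the variant thresholds are assumptions made for this example rather than the UAProf schema.

    # A simplified device profile of the kind a UAProf-aware server might receive.
    profile = {
        "ScreenSize": (240, 320),
        "ColorCapable": True,
        "CcppAccept": ["image/jpeg", "text/html"],
    }

    # Content variants ordered from richest to most basic (illustrative values).
    VARIANTS = [
        {"name": "desktop", "min_width": 800, "mime": "text/html"},
        {"name": "pda",     "min_width": 240, "mime": "text/html"},
        {"name": "basic",   "min_width": 0,   "mime": "text/html"},
    ]

    def choose_variant(profile):
        """Pick the richest content variant the device can display."""
        width, _ = profile["ScreenSize"]
        for variant in VARIANTS:
            if width >= variant["min_width"] and variant["mime"] in profile["CcppAccept"]:
                return variant["name"]
        return "basic"

    print(choose_variant(profile))   # 'pda' for a 240x320 screen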

4.3.3.3 Device Description Repository
The Device Description Repository is a concept proposed by the World Wide Web Consortium (W3C) Device Description Working Group (DDWG). The proposed repository would contain information about Web-enabled devices (particularly mobile devices) so that content could be adapted to suit them. The information would include screen dimensions, input mechanisms, supported colours, known limitations, special capabilities, etc.
The idea of implementing a Device Description Repository was recently discussed at an international workshop held by the DDWG in Madrid, Spain, in July 2006. Using such an approach in CANTATA to include mobile devices in the demonstrators could therefore be interesting.

4.3.4 XML based UI adaptation

A software application is said to be device independent when its functions work in the same way on different types of device. This generally means that it is written in a meta-language that can be read on any platform.

XML (eXtensible Markup Language) appears to be a good approach for creating device-oriented interfaces. XML is a platform-neutral language for organizing and exchanging complex information. It is lightweight, easy to use and increasingly available in today's applications. In addition, XML provides a facility to define tags and the structural relationships between them, which makes it a powerful and useful language for creating a uniform information format for complex multimedia content and documents.

XML also supports XSL style sheets, which allow customized presentations to be created for different devices and users.
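The combination of XML content with per-device XSL style sheets can be sketched as follows. The example assumes the third-party lxml package for XSLT processing; the content schema and the two stylesheets are purely illustrative.

```python
# A minimal sketch of the XML + XSL idea: the same XML content is rendered
# with a different stylesheet per device class. Assumes the third-party lxml
# package; element names and stylesheets are invented for the example.
from lxml import etree

CONTENT = etree.XML(
    "<catalog><item>News clip</item><item>Concert</item></catalog>")

FULL_PAGE = etree.XSLT(etree.XML("""
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/catalog">
    <html><body><ul>
      <xsl:for-each select="item"><li><xsl:value-of select="."/></li></xsl:for-each>
    </ul></body></html>
  </xsl:template>
</xsl:stylesheet>"""))

MOBILE_PAGE = etree.XSLT(etree.XML("""
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/catalog">
    <list><xsl:for-each select="item">
      <entry><xsl:value-of select="."/></entry>
    </xsl:for-each></list>
  </xsl:template>
</xsl:stylesheet>"""))

STYLESHEETS = {"desktop": FULL_PAGE, "mobile": MOBILE_PAGE}

def render(device_class: str) -> str:
    """Apply the stylesheet selected for the device class to the same content."""
    return str(STYLESHEETS[device_class](CONTENT))

print(render("mobile"))
```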

XML-based user interface descriptions are becoming much more visible, for example the XML User Interface Language (XUL) or TERESA XML. These approaches offer specific characteristics and different functionalities.

The approaches presented in this section have a common feature: adaptation is achieved by defining the interface without including its final presentation. Consequently, these approaches require the final device either to be compliant with them or to have a dedicated renderer developed for each device.

4.3.4.1 UIML (User Interface Markup Language)

UIML [39] is an XML-based markup language for defining interfaces. UIML allows an interface to be defined by concatenating the definitions of the different elements that compose it.

There are renderers for different technologies and platforms (J2EE, Qt, HTML, C++, VoiceXML) that transform the UIML-expressed interface into the appropriate output.
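The following Python sketch illustrates the principle behind such renderers: one abstract interface description can be turned into different concrete outputs by platform-specific renderers. The data model and the two renderers are simplifications invented for the example; they do not reproduce the actual UIML vocabulary.

```python
# Hedged illustration of the renderer idea: one abstract interface description,
# several platform-specific renderers. Names are invented for the example.

INTERFACE = {
    "part": "login",
    "children": [
        {"part": "title",  "class": "Label",  "text": "Sign in"},
        {"part": "user",   "class": "Input",  "text": "User name"},
        {"part": "submit", "class": "Button", "text": "OK"},
    ],
}

def render_html(ui: dict) -> str:
    """Render the abstract parts as simple HTML elements."""
    lines = []
    for child in ui["children"]:
        if child["class"] == "Input":
            lines.append(f'<input placeholder="{child["text"]}"/>')
        else:
            tag = {"Label": "h1", "Button": "button"}[child["class"]]
            lines.append(f"<{tag}>{child['text']}</{tag}>")
    return "\n".join(lines)

def render_voice_prompts(ui: dict) -> list:
    """Render the same parts as spoken prompts, e.g. for a voice front end."""
    return [f'{child["class"]}: {child["text"]}' for child in ui["children"]]

print(render_html(INTERFACE))
print(render_voice_prompts(INTERFACE))
```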

4.3.4.2 AUIML

AUIML is similar to UIML but more abstract. AUIML does not include UI appearance features, in order to be 100% independent of the platform and implementation technology. According to the definition by IBM (which provides a toolkit), "AUIML captures relative positioning information of user interface components and delegates their display to a platform-specific renderer. Depending on the platform or device being used, the renderer decides the best way to present the user interface to the user and receive user input."

4.3.4.3 XIML (eXtensible Interface Markup Language)

This initiative [40] has a similar philosophy to the previous ones, but it appears not to be very active.

Weather forecast application using XIML.

4.3.4.4 XUL

The XML User Interface Language (XUL) is Mozilla's XML-based language for describing window layout. XUL separates the client application definition and programmatic logic from its graphical presentation and language-specific text labels.

A user interface (UI) can be described as a set of structured interface elements (such as windows, menu bars, buttons, etc.) along with a predefined set of properties. XUL focuses on window-based graphical user interfaces, so it might not be applicable to the interfaces of small mobile devices, for example.

4.3.4.5 TERESA XML

Teresa is a project of the HCI Group of ISTI-C.N.R., supported by the European Cameleon IST project, with the aim of designing and developing concrete user interfaces adapted to specific platforms [41]. The Teresa XML language is composed of two parts: an XML description of the CTT (ConcurTaskTrees [42]) notation and a language for describing user interfaces.

This XML-based language describes the organization of the Abstract Interaction Objects (AIOs) that compose the interface. The user interface dialog is also described with this language.

A user interface (UI) is a structured set of one or more presentation elements. Each presentation element is characterized by a structure, which describes the static organization of the UI and the relationships among the various presentation elements. Teresa XML is used in the TERESA tool, which supports the generation of task models, abstract UIs and running UIs.
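The following sketch gives a rough flavor of the TERESA pipeline: a small task model in the spirit of ConcurTaskTrees is walked to derive one Abstract Interaction Object per leaf task, which a later, platform-specific step would turn into concrete widgets. Class names, the operator notation and the AIO labels are simplified stand-ins, not the actual Teresa XML or CTT notation.

```python
# A loose, illustrative sketch of deriving AIOs from a CTT-like task model.
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    kind: str = "interaction"            # e.g. "interaction" or "application"
    operator: str | None = None          # temporal operator towards the next sibling, e.g. ">>"
    children: list[Task] = field(default_factory=list)

access_video = Task("AccessVideo", kind="abstraction", children=[
    Task("EnterQuery", operator=">>"),                   # enabling: the query precedes the results
    Task("ShowResults", kind="application", operator=">>"),
    Task("PlayClip"),
])

def leaf_aios(task: Task) -> list[str]:
    """Collect one Abstract Interaction Object per leaf task."""
    if not task.children:
        role = "only_output" if task.kind == "application" else "interactive_control"
        return [f"{task.name} -> {role}"]
    return [aio for child in task.children for aio in leaf_aios(child)]

for aio in leaf_aios(access_video):
    print(aio)   # e.g. "EnterQuery -> interactive_control"
```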


4.3.4.6 USIXML

UsiXML (which stands for USer Interface eXtensible Markup Language) is an XML-compliant markup language that allows the description of a user interface (UI) for multiple contexts of use, such as Character User Interfaces (CUIs), Graphical User Interfaces (GUIs), Auditory User Interfaces (AUIs) and Multimodal User Interfaces (MUIs).

UsiXML consists of a User Interface Description Language (UIDL), a declarative language that captures the essence of what a UI is or should be, independently of physical characteristics.

UsiXML supports device independence: a UI can be described in a way that remains independent of the interaction devices, such as mouse, screen, keyboard or voice recognition system. If needed, a reference to a particular device can be added to the description.

(Information taken from www.usixml.org)

4.3.4.7 AAIML [43]

The Alternate Abstract Interface Markup Language (AAIML) is an initiative of the V2 technical committee of the National Committee for Information Technology Standards (NCITS).

This standard aims to allow people with disabilities to remotely control a large set of electronic devices (for example copy machines or elevators) from their personal device (such as a personal mobile phone).

An abstract user interface is transmitted by the targeted device to the user, with particular input and output mechanisms that are appropriate for this user. The concept of a "Universal Remote Console" (URC) is introduced. The XML-based AAIML language is used to convey an abstract user interface description from the target device to the URC. On the URC, this abstract description must be mapped to a concrete description available on the platform.
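The URC concept can be sketched as follows: the target device publishes an abstract description of its controls, and the user's personal device renders each abstract element with whatever interaction modality suits the user. The element roles, labels and rendering rules below are invented for the illustration and do not follow the AAIML syntax.

```python
# A hedged sketch of the URC idea: an abstract UI published by a target device
# (here, an elevator) is mapped to the modality of the user's personal device.

ABSTRACT_UI = [
    {"id": "floor", "role": "choice",  "label": "Select floor", "options": ["1", "2", "3"]},
    {"id": "call",  "role": "trigger", "label": "Call elevator"},
]

def render_for(modality: str, elements: list) -> list:
    """Map each abstract element to a concrete interaction for the chosen modality."""
    rendered = []
    for el in elements:
        if modality == "speech":
            prompt = f'Say "{el["label"]}"'
            if el["role"] == "choice":
                prompt += f' then one of {el["options"]}'
            rendered.append(prompt)
        else:  # a simple visual remote-control layout
            widget = "picker" if el["role"] == "choice" else "button"
            rendered.append(f'{widget}: {el["label"]}')
    return rendered

print(render_for("speech", ABSTRACT_UI))
print(render_for("visual", ABSTRACT_UI))
```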


A Compaq iPAQ handheld computer (running Java/Swing on Linux) controlling a TV simulation on a PC via an 802.11b wireless connection and Jini/Java technology.

4.3.4.8 XForms and RIML

The W3C XForms specification is intended as the next generation of forms for the Web. Although its focus is on gathering input provided by the user, it also offers some information display facilities. Despite its specialized scope, XForms provides many of the features necessary for a more general abstract language. Indeed, XForms separates three aspects of a form interface:

• The data model used by the target.
• The presentation of the data model to the user.
• The processing model.

In XForms, the data model can be used by specialized interfaces. In fact, XForms allows resources such as labels to be substituted according to the delivery context.

The Renderer Independent Markup Language (RIML) is based on emerging standards. The current draft of XHTML 2.0 is used for content such as paragraphs, tables, images, hyperlinks, etc. For form-based interaction, XForms elements have been included.

RIML stresses the separation of content definition (i.e. what is to be presented) from the description of dynamic adaptations, which can be performed on the content in order to match the varying capabilities of devices.

4.3.4.9 MPEG-21

ISO/IEC is defining the MPEG-21 framework, which is intended to support the transparent use of multimedia resources across a wide range of networks and devices.

One aspect of the requirements for MPEG-21 is Digital Item Adaptation, which is based on a Usage Environment Description. It proposes the description of capabilities for at least the terminal, network, delivery, user and natural environment, and notes the desirability of remaining compatible with other recommendations such as CC/PP and UAProf (see 4.3.3.1 and 4.3.3.2).

(Information taken from www.w3.org)

4.4 Device ontology

In 2001, FIPA proposed a device ontology [51]. This ontology describes the software and hardware properties of devices as well as the services they offer. Thanks to this ontology, device profiles can be built and used by agents. Knowledge of this ontology allows agents that receive the profile of a specific device to determine whether its properties or services allow them to achieve their objectives. The FIPA device ontology could be used in a CC/PP profile (see 4.3.3.1).

For some examples see:
http://www.fipa.org/specs/fipa00091/PC00091A.html#_Toc511707116
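The following sketch suggests how an agent might exploit a device profile built from such an ontology: it compares the advertised hardware and software properties of a device against the requirements of a task before deciding whether the device can be used. The property and requirement names are illustrative and do not reproduce the FIPA vocabulary.

```python
# A hedged sketch of an agent checking a device profile against task
# requirements. All property names and values are invented for the example.

DEVICE_PROFILE = {
    "hw": {"memory_kb": 16384, "screen": (176, 208)},
    "sw": {"supported_formats": {"image/jpeg", "video/3gpp"}},
}

TASK_REQUIREMENTS = {
    "min_memory_kb": 8192,
    "needed_format": "video/3gpp",
    "min_screen_width": 160,
}

def can_achieve(profile: dict, req: dict) -> bool:
    """True if the device's advertised capabilities satisfy the task."""
    return (profile["hw"]["memory_kb"] >= req["min_memory_kb"]
            and req["needed_format"] in profile["sw"]["supported_formats"]
            and profile["hw"]["screen"][0] >= req["min_screen_width"])

print(can_achieve(DEVICE_PROFILE, TASK_REQUIREMENTS))   # True
```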

4.5 Agent-based user interface adaptation

MATE

MATE is a prototype of a human-computer interface based on a society of reactive agents and on a language for the spatial description of tasks. Implemented as a text editor, this tool aims at showing that a piece of software (of an office automation type) can be built using the advantages of the agent paradigm and the power of scripting languages in order to make its interface more personalizable, more extendable and more intuitive for non-expert users [52].


5 State-of-the-art system architecture

5.1 DLNA

The Digital Living Network Alliance (DLNA) is a cross-industry organization of leading consumer electronics, computing industry and mobile device companies that share a vision of a wired and wireless network of interoperable consumer electronics (CE), personal computers (PC) and mobile devices in the home and on the road, enabling a seamless environment for sharing and growing new digital media and content services.

DLNA is focused on delivering interoperability guidelines based on open industry standards to complete the cross-industry digital convergence. DLNA has published a common set of industry design guidelines that allow manufacturers to participate in a growing marketplace of networked devices, leading to more innovation, simplicity and value for consumers. The DLNA Networked Device Interoperability Guidelines are use-case driven and specify the interoperable building blocks that are available to build platforms and software infrastructure.

The DLNA Networked Device Interoperability Guidelines refer to standards from established, open industry standards organizations and provide CE, PC and mobile device manufacturers with the information needed to build compelling, interoperable digital home platforms, devices and applications.


This figure shows the technology ingredients covered by the DLNA Networked Device Interoperability Guidelines.

The digital home consists of a network of CE, PC and mobile devices that cooperate transparently, delivering simple, seamless interoperability that enhances and enriches user experiences. It is the communications and control backbone for the home network and is based on IP networking, UPnP and Internet Engineering Task Force (IETF) technologies.

Information taken from http://www.dlna.org/en/industry/pressroom/DLNA_white_paper.pdf

5.2 mTag

Several new approaches on the market focus on smart tags, which enable not only a new way to point at and select a desired source of information but also let the end user initiate data access and direct the desired content to a terminal.

An example of this kind of approach is the mTag architecture. Focusing on smart environments and on offering a user interface through a distributed, event-driven architecture for discovering location-specific mobile Web services, mTag describes an architecture in which service discovery is initiated by touching a fixed RFID reader with a passive mobile RFID tag attached, e.g., to a phone; as a result, information about the available services is pushed to the user's preferred device [mTag].
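The event flow described above can be sketched as follows: a touch event identifies a location, the services registered for that location are looked up, and a notification is pushed to the user's preferred device. All identifiers and the push mechanism below are stand-ins; the real mTag architecture is described in [mTag].

```python
# A hedged sketch of the touch-triggered service discovery flow. The service
# registry, user identifiers and the push channel are invented for the example.

SERVICES_BY_LOCATION = {
    "bus_stop_17": ["timetable", "next_bus_alert"],
    "lobby_screen": ["visitor_guide"],
}

USER_PREFERRED_DEVICE = {"user42": "phone://user42"}

def push(device_uri: str, message: str) -> None:
    # Stand-in for a real push channel to the user's preferred device.
    print(f"push to {device_uri}: {message}")

def on_touch(user_id: str, location_id: str) -> None:
    """Handle a tag/reader touch event: look up services and notify the user."""
    services = SERVICES_BY_LOCATION.get(location_id, [])
    push(USER_PREFERRED_DEVICE[user_id],
         f"services at {location_id}: {', '.join(services) or 'none'}")

on_touch("user42", "bus_stop_17")
```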


As stated by the mTag project: "The principal advantage of the proposed architecture is that it can be realized with today's off-the-shelf commercial products. We presented a proposal for an Internet based deployment and two case studies, where prototype implementations were empirically evaluated in the true environment of use. The case studies showed that the service was found as an easy way to access location based mobile web services.

Users were satisfied with the possibility to fully control the information pushed to their devices, in comparison to the automatic location based information delivery of the comparative Bluetooth based service in the second case study." [mTag]

[mTag]: Korhonen J, Ojala T, Klemola M & Väänänen P (2006) mTag – Architecture for discovering location specific mobile web services using RFID and its evaluation with two case studies. Proc. International Conference on Internet and Web Applications and Services, Guadeloupe.

5.3 Content retrieval and device management

The delivery system will be managed like any other network system, but the devices present special challenges. The crucial insight is that content delivery is first and foremost a data management problem at multiple levels. Content delivery systems must be built around a set of database-related requirements: queryable metadata, secure and transactional distribution of data between databases, and an unbreakable linkage between content and its metadata. Additionally, the distributed system must be able to keep its application and configuration data under control to ensure proper functionality of the system and autonomic behavior from the end user's and device's point of view, without much need for user intervention.

This chapter describes a simple content distribution technique that enables a user to easily select content from the vast libraries that are available, download it, view it, and be charged for it.


The architecture of the presented system is based on a communicating network of database servers that manage all the data of the system. The next figure illustrates the different components at a conceptual level.

The system has the following components:

• The conceptual centerpiece of the system is formed by the Rendering Devices, which accept different types of content from multiple media sources.
• The Content Libraries contain digital content and the associated metadata.
• The Preference Server contains user-specific data related to content and usage of the system. Identity, authentication and saved queries are stored in the preference server.
• The Ontology Server maintains common ontology data that is shareable across the other components of the system. This data makes the content machine-searchable.
• The Configuration Management Server manages the configuration of the system and its devices.

The user's network terminal device (typically a PC or a set-top box) interacts with all the above host components using data synchronization over a protocol like HTTP. It can download new components to upgrade itself. It can download result sets for further local analysis. And of course, it can download content. It can also use the system to back up preferences, configurations, user data, and media that no longer fit on the device.


At the core of the presented approach is the Solid BoostEngine, a small-footprint relational database manager that provides all the typical functionality of a modern data manager, including the SQL language for defining schemas and queries, transactions, multi-user capabilities, support for programmability (procedures, triggers, events) and automatic data recovery. Applications and devices communicate with the data manager using the standard ODBC (Open Database Connectivity) and JDBC (Java Database Connectivity) application programming interfaces (APIs).

New advanced databases offer new ways to manage the required content and critical information based on application and user interface requirements. Solid BoostEngine has two separate storage methods: one for typical alphanumeric data, and a second mechanism optimized for the storage and retrieval of Binary Large Objects (BLOBs). In Solid, digital content can be handled within the database as efficiently as if the data were stored in operating system files. This provides relational database functionality for media content, a solution with many benefits (a minimal sketch follows the list below):

• The same API is used for accessing and distributing both alphanumeric and content data, which simplifies application design.
• Access to content and metadata can be combined in the same query, ensuring that property-rights data always accompanies content data.
• All data can be treated transactionally, meaning that changes to content and changes to metadata can be tightly linked.
• The DBMS protects all data in the system with a unified access control mechanism.
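The sketch below illustrates the "content and metadata in one database" idea using SQLite as a stand-in for the embedded relational engine (the actual system would access Solid BoostEngine through ODBC or JDBC). Table and column names are invented for the example.

```python
# Minimal sketch: content BLOBs and their metadata kept in one relational
# store and returned by a single query. SQLite stands in for Solid BoostEngine.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE content (
        content_id   INTEGER PRIMARY KEY,
        title        TEXT,
        rights_owner TEXT,   -- metadata kept in the same row as the content
        media        BLOB    -- the clip itself, stored as a BLOB
    );
""")
db.execute("INSERT INTO content VALUES (?, ?, ?, ?)",
           (1, "World tour news clip", "ExampleLabel", sqlite3.Binary(b"\x00\x01\x02")))

# One query returns metadata and content together, so the rights information
# always accompanies the media it describes.
row = db.execute("SELECT title, rights_owner, media FROM content "
                 "WHERE content_id = 1").fetchone()
print(row[0], "/", row[1], "/", len(row[2]), "bytes")
```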

The data distribution component of the Solid Platform is the Solid SmartFlow Option. It links together a set of loosely coupled, cooperative databases that share data with one another under strict integrity and security rules. Key aspects of the architecture include the following (a simplified illustration follows the list):

• A hierarchical relationship of master and replica databases.
• A publish/subscribe mechanism for distributing data from a master database to one or more replica databases.
• A transaction propagation mechanism for forwarding local changes from a replica database to its master.
• Transactional and recoverable message queuing for data transfer between databases.
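The master/replica pattern listed above can be illustrated with a small in-memory analogy: replicas subscribe to publications, the master refreshes them, and local changes are propagated back to the master and onwards to the other subscribers. This is only an analogy for the SmartFlow mechanisms, not their actual protocol or API.

```python
# In-memory analogy of a publish/subscribe master/replica arrangement.
# Publication names and data are invented for the example.

class Master:
    def __init__(self):
        self.rows = {}           # publication name -> list of rows
        self.subscribers = {}    # publication name -> list of replicas

    def publish(self, name, rows):
        self.rows[name] = list(rows)

    def subscribe(self, name, replica):
        self.subscribers.setdefault(name, []).append(replica)
        replica.refresh(name, self.rows.get(name, []))

    def propagate(self, name, new_row):
        # A change forwarded from a replica is applied and redistributed.
        self.rows.setdefault(name, []).append(new_row)
        for replica in self.subscribers.get(name, []):
            replica.refresh(name, self.rows[name])

class Replica:
    def __init__(self, label):
        self.label, self.local = label, {}

    def refresh(self, name, rows):
        self.local[name] = list(rows)
        print(f"{self.label} now has {name}: {self.local[name]}")

master, phone = Master(), Replica("phone")
master.publish("CONTENT_OF_REPLICA", ["clip_1"])
master.subscribe("CONTENT_OF_REPLICA", phone)
master.propagate("CONTENT_OF_REPLICA", "clip_2")
```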

The content delivery network will be a very large system with numerous different components under the control of a variety of entities. Such a system must be designed for manageability from the ground up. Recent developments in Autonomic Computing show promise in this area. Autonomic systems are self-configuring, self-healing, self-optimizing and self-protecting, so that they effectively take care of themselves without much need for user intervention. The delivery system will be managed like any other network system, but the devices present special challenges.


Device management includes at least the following tasks:

• Managing user identification and authentication.
• Automatically installing and upgrading software on local devices.
• Maintaining valid software configurations without requiring user interaction.
• Backing up and/or deleting unused software and content from devices.
• Transferring user preferences from one device to another.

The configuration manager holds data relating to the system configuration. This includes applications that may be needed by terminals and rendering devices. The configuration management data can be divided into the following components:

• Version "header information".
• Application binaries (Java classes and resources) of the new version.
• SQL scripts needed to create or upgrade the database schemas.
• State information about each of the managed nodes.
• Log information for troubleshooting purposes.

All system configuration management operations are performed by preparing the required configuration as a publication in the master and then distributing it to the managed terminals and rendering devices through data synchronization. After refreshing the local copy of the management data, the managed device may run some installation procedures (e.g. executing schema-upgrade SQL scripts in the target database) to complete the task.
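A device-side view of this procedure is sketched below: the managed device compares the version header of a refreshed publication with its installed version and, when newer, executes the bundled schema-upgrade scripts. SQLite stands in for the local database, and the publication fields are assumptions made for the example.

```python
# Hedged sketch of applying a refreshed configuration publication on a device.
import sqlite3

def apply_configuration(db: sqlite3.Connection, installed_version: int, publication: dict) -> int:
    """Bring the local database in line with a refreshed configuration publication."""
    if publication["version"] <= installed_version:
        return installed_version                     # already up to date
    for script in publication["sql_scripts"]:        # e.g. schema-upgrade scripts
        db.executescript(script)
    # A real device would also install the application binaries and report its
    # new state back to the configuration manager for troubleshooting.
    return publication["version"]

device_db = sqlite3.connect(":memory:")
publication = {
    "version": 2,
    "sql_scripts": ["CREATE TABLE IF NOT EXISTS preferences (key TEXT, value TEXT);"],
}
print("installed version:", apply_configuration(device_db, 1, publication))   # 2
```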

Centralizing configuration data in this way solves the important problem of knowing the state of any managed node at any point in time. The configuration manager can move that state to a new consistent state by asking the device to subscribe to a new publication or to refresh an old one.

The rendering device: in order to provide the media service to the end user, the rendering device acquires applications and content data from the four components mentioned above. Within the database of this device, data may be organized as shown in Figure 7.


The diagram shows that the rendering device operates on data that it obtains from a number of sources. The data is organized into logical databases, each of which may be synchronized with the source (master database) of the data. Much of this data is downloaded or pushed to the device as needed.

The sequence of steps needed to query video content from a content library and deliver it to a rendering device has been outlined earlier in this document in the section on System Functionality. Figure 8 below shows how the various information resources contribute to resolving a user's query.


Queries can take advantage of any or all of the metadata associated with the media in order to focus down on the desired content. Figure 8 shows the use of two types of metadata: enumerated and free text. The user queries against both of them.

Users may retain their queries for reuse. In our use case, Amy wants to find recent video news clips that she has not yet seen about her favorite rock band's world tour. She may wish to re-execute this query every few days to find recent news. Each query is made up of a single row in the CONTENT_QUERY table, which is linked to one or more rows in the enumerated and free-text tables, each of which represents a condition that must be met with regard to the content.

The matchmaking procedure finds clips where the metadata and query items match, and produces rows in a QUERY_MATCH table. This table has a separate entry for each piece of content whose metadata matches the query criteria. In this example the criteria are: Amy's favorite band, news clips, not yet seen. In the real world, the query may also interact with Amy's preferences about which news sources she prefers and how much she is willing to pay for this kind of content.

The packaging procedure goes through the QUERY_MATCH table and creates rows in the SEGMENT_ASSIGNMENT table for all content that matches the query and that has not yet been assigned to the rendering device. This step protects Amy from inadvertently downloading the same content twice. Amy will interact with this list, either directly or through matching to her preferences, to determine what she will actually download. Rows in this table will be used to parameterize Amy's content publication so that it defines the content of current interest to her.
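The matchmaking and packaging steps can be made concrete with a small SQL sketch; SQLite stands in for the rendering device's local database, the table names follow the text above, and the column names are assumptions.

```python
# Sketch of matchmaking (query vs. metadata) and packaging (not yet assigned).
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE content            (content_id INTEGER, band TEXT, kind TEXT);
    CREATE TABLE content_query      (query_id INTEGER, band TEXT, kind TEXT);
    CREATE TABLE query_match        (query_id INTEGER, content_id INTEGER);
    CREATE TABLE segment_assignment (content_id INTEGER);
    INSERT INTO content VALUES (1, 'FavouriteBand', 'news'),
                               (2, 'OtherBand',     'news'),
                               (3, 'FavouriteBand', 'concert');
    INSERT INTO content_query VALUES (7, 'FavouriteBand', 'news');
""")

# Matchmaking: content whose metadata satisfies the stored query conditions.
db.execute("""
    INSERT INTO query_match (query_id, content_id)
    SELECT q.query_id, c.content_id
    FROM content_query q JOIN content c ON c.band = q.band AND c.kind = q.kind
""")

# Packaging: matched content that has not yet been assigned to this device.
db.execute("""
    INSERT INTO segment_assignment (content_id)
    SELECT m.content_id FROM query_match m
    WHERE m.content_id NOT IN (SELECT content_id FROM segment_assignment)
""")

print(db.execute("SELECT content_id FROM segment_assignment").fetchall())   # [(1,)]
```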



At this point, the rendering device is able to obtain content by forwarding a refresh request to the content library, asking it to refresh the data of the CONTENT_OF_REPLICA (replica ID) publication. It is here that the content assigned to a replica is downloaded to the device or terminal.

Because of the vast quantity of digital content, providing users with an easy way to locate content of interest to them is key to the usability of the system. Technically, this comes down to giving users an intuitive way to create queries against content metadata stores. It must be easy for both the naïve and the skilled user to define a query over a range of media servers. Queries must provide powerful and flexible search functions, including ways to select by the content of the media. Searches must be efficient, i.e. fast to execute. Users of the system must be able to retain queries for re-execution against new media or other media servers.


6 References

[1] M. Boliek, C. Christopoulos, and E. Majani, "JPEG2000 Part I Final Draft International Standard," ISO/IEC JTC1/SC29/WG1, Report, 25 September 2000.
[2] J. Editors, "JPEG-2000 image coding system - Part 11: Wireless JPEG-2000 - Committee Draft," ISO/IEC/SC29/WG1 (JPEG), CD, 2005.
[3] H. M. Radha, M. v. d. Schaar, and Y. Chen, "The MPEG-4 Fine-grained Scalable Video Coding for Multimedia Streaming over IP," IEEE Transactions on Multimedia, vol. 3, pp. 53-68, 2001.
[4] W. Li, "Streaming Video Profile in MPEG-4," IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, pp. 301-317, 2001.
[5] C. Brislawn and P. Schelkens, "JPEG 2000 Part 12: Extensions for Three-Dimensional and Floating Point Data - Scope and Requirements document, draft version 1," ISO/IEC JTC1/SC29/WG1, Sydney, Australia, Report WG1N2378, November 12-16, 2001.
[6] ISO/IEC, "JPEG 2000 image coding system – Part 11: Wireless JPEG 2000," ISO/IEC JTC1/SC29/WG11, N3386, 2004.
[7] S.-J. Choi and J. W. Woods, "Motion-compensated 3-D subband coding of video," IEEE Transactions on Image Processing, vol. 8, pp. 155-167, 1999.
[8] S. Han and B. Girod, "SNR Scalable Coding with Leaky Prediction," ITU-T Q.6/SG16, VCEG-N53, 2001.
[9] H. C. Huang, C.-N. Wang, and T. Chiang, "A Robust Fine Granularity Scalability Using Trellis Based Predictive Leak," IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, pp. 372-385, 2002.
[10] F. Wu, S. Li, and Y.-Q. Zhang, "A Framework for Efficient Progressive Fine Granularity Scalable Video Coding," IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, pp. 332-344, 2001.
[11] Y. He, R. Yan, F. Wu, and S. Li, "H.26L-based fine granularity scalable video coding," ISO/IEC JTC1/SC29/WG1, M7788, December 2001.
[12] F. Wu, S. Li, R. Yan, X. Sun, and Y.-Q. Zhang, "Efficient and Universal Scalable Video Coding," presented at the IEEE International Conference on Image Processing (ICIP), Rochester, NY, USA, 2002.
[13] J.-R. Ohm, "Three-dimensional subband coding with motion compensation," IEEE Transactions on Image Processing, vol. 3, pp. 559-571, 1994.
[14] B. Pesquet-Popescu and V. Bottreau, "Three Dimensional Lifting Schemes for Motion Compensated Video Compression," presented at the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Salt Lake City, Utah, USA, 2001.
[15] A. Secker and D. Taubman, "Motion-Compensated Highly Scalable Video Compression using Adaptive 3D Wavelet Transform Based on Lifting," presented at the IEEE International Conference on Image Processing (ICIP), Thessaloniki, Greece, 2001.
[16] A. Secker and D. Taubman, "Lifting-Based Invertible Motion Adaptive Transform (LIMAT) Framework for Highly Scalable Video Compression," IEEE Transactions on Image Processing, vol. 12, pp. 1530-1542, 2003.
[17] P. Chen and J. W. Woods, "Bidirectional MC-EZBC with Lifting Implementation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, pp. 1183-1194, 2004.



[18] J. W. Woods and J.-R. Ohm, "Special issue on subband/wavelet interframe video coding," Signal Processing: Image Communication, vol. 19, 2004.
[19] D. S. Turaga, M. v. d. Schaar, Y. Andreopoulos, A. Munteanu, and P. Schelkens, "Unconstrained Motion Compensated Temporal Filtering (UMCTF) for Efficient and Flexible Interframe Wavelet Video Coding," Signal Processing: Image Communication, to appear.
[20] T. Wiegand and G. Sullivan, "Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification," ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6, 2003.
[21] SMPTE 421M: VC-1 Compressed Video Bitstream Format and Decoding Process. http://www.microsoft.com/windows/windowsmedia/forpros/events/NAB2005/VC-1.aspx
[22] http://www.w3.org/XML/
[23] Transformation with XSL. http://www.adobe.com/designcenter/indesign/articles/indcs2at_xsl/indcs2at_xsl.pdf
[24] XML Transformation Flow Processing. http://www.mulberrytech.com/Extreme/Proceedings/typesetpdf/2001/Euzenat01/EML2001Euzenat01.pdf
[25] Grundy, J. and Yang, B., "An environment for developing adaptive, multi-device user interfaces," in Proceedings of the Fourth Australasian User Interface Conference on User Interfaces 2003 - Volume 18 (Adelaide, Australia), R. Biddle and B. Thomas, Eds., ACM International Conference Proceeding Series, vol. 36, Australian Computer Society, Darlinghurst, Australia, pp. 47-56, 2003.
[26] Grundy, J. and Zou, W., "AUIT: Adaptable User Interface Technology, with Extended Java Server Pages," in Seffah, A. and Javahery, H. (eds.), Multiple User Interfaces: Cross-platform Applications and Context-aware Interfaces, pages 149-167, Wiley, 2004.
[27] SEESCOA. http://www.cs.kuleuven.ac.be/cwis/research/distrinet/projects/SEESCOA/
[28] Complete list of web browsers (including mobile browsers or micro-browsers). http://en.wikipedia.org/wiki/List_of_web_browsers
[29] TWEEP – Design and implementation of a multilingual Web server with adapted interfaces to PC and Television. http://www.vicomtech.es/ingles/html/proyectos/index_proyecto46.html
[30] Sitemesh: web-page layout and decoration framework. http://today.java.net/pub/a/today/2004/03/11/sitemesh.html
[31] CC/PP Information Page. http://www.w3.org/Mobile/CCPP/
[32] RDF (Resource Description Framework). http://www.w3.org/RDF/
[33] Device Independence activity of the W3C. http://www.w3.org/2001/di/
[34] Delivery Context. http://www.w3.org/TR/2005/WD-DPF-20051111/
[35] JSR 188. http://jcp.org/aboutJava/communityprocess/final/jsr188/index.html
[36] http://www.openmobilealliance.org/
[37] White Paper on UAProf Best Practices Guide. http://www.openmobilealliance.org/docs/OMA-WP-UAProf_Best_Practices_Guide-20060718-A.pdf
[38] Example of Web UI adaptation. http://users.tkk.fi/~majakobs/thesis/WebUIAdaptation.pdf
[39] UIML. http://www.uiml.org/
[40] XIML (eXtensible Interface Markup Language).



[41] Paternò, F. and Santoro, C., "One model, many interfaces," in Ch. Kolski and J. Vanderdonckt (Eds.), Proceedings of the 4th International Conference on Computer-Aided Design of User Interfaces CADUI'2002 (Valenciennes, 15-17 May 2002), pages 143-154, Dordrecht, Kluwer Academic Publishers, 2002.
[42] Paternò, F., Mancini, C., Meniconi, S., "ConcurTaskTrees: A Diagrammatic Notation for Specifying Task Models."
[43] Zimmermann, G., Vanderheiden, G., Gilman, A., "Prototype Implementations for a Universal Remote Console Specification," in CHI'2002, Minneapolis, MN, 2002, pp. 510-511.
[44] LASeR and SAF. http://www.mpeg-laser.org
[45] ISO/IEC 14496-1: Systems. http://www.iso.org/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=38559
[46] ISO/IEC 14496-2: Visual. http://www.iso.org/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=39259
[47] ISO/IEC 14496-3: Audio. http://www.iso.org/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=42739
[48] ISO/IEC 14496-10: Advanced Video Coding. http://www.iso.org/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=43058
[49] ISO/IEC 14496-20: Lightweight Application Scene Representation (LASeR) and Simple Aggregation Format (SAF). http://www.iso.org/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=41650
[50] ISO/IEC 14496-11: Scene description and application engine. http://www.iso.org/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=38560
[51] http://www.fipa.org/specs/fipa00091/PC00091A.html
[52] Siléo, C. and Hutzler, G., "MATE: un éditeur de texte basé sur une société d'agents réactifs," RSTI/hors série, JFSMA 2003.
[53] Video Impression. http://www.arcsoft.com/products/videoimpression/
[54] PhotoBase Deluxe. http://www.arcsoft.com/products/mobiledevicesolution/photo.asp
[55] IRIS. http://www.iris.tv/indexFlash.htm
[56] 3rdi. http://www.3rdisecure.tv/domestic_products.asp
[57] DLink DCS-2120 Wireless Internet Camera with 3G Mobile Video Support. http://www.dlink.com/products/?pid=500&sec=0
[58] http://www.neiongfx.com/neion-video-surveillance-mobile.html
[59] Nioo Visio. http://visiowave.com/
[60] 3rdeye. http://www.3rdeye.ro/index.php?mod=aplic
[61] http://www.dlna.org
