
Lecture Notes in Computer Science 5740

Commenced Publication in 1973

Founding and Former Series Editors:

Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board

David Hutchison

Lancaster University, UK

Takeo Kanade

Carnegie Mellon University, Pittsburgh, PA, USA

Josef Kittler

University of Surrey, Guildford, UK

Jon M. Kleinberg

Cornell University, Ithaca, NY, USA

Alfred Kobsa

University of California, Irvine, CA, USA

Friedemann Mattern

ETH Zurich, Switzerland

John C. Mitchell

Stanford University, CA, USA

Moni Naor

Weizmann Institute of Science, Rehovot, Israel

Oscar Nierstrasz

University of Bern, Switzerland

C. Pandu Rangan

Indian Institute of Technology, Madras, India

Bernhard Steffen

University of Dortmund, Germany

Madhu Sudan

Microsoft Research, Cambridge, MA, USA

Demetri Terzopoulos

University of California, Los Angeles, CA, USA

Doug Tygar

University of California, Berkeley, CA, USA

Gerhard Weikum

Max-Planck Institute of Computer Science, Saarbruecken, Germany


Abdelkader Hameurlain, Josef Küng, and Roland Wagner (Eds.)

Transactions on Large-Scale Data- and Knowledge-Centered Systems I



Volume Editors

Abdelkader Hameurlain

Paul Sabatier University

Institut de Recherche en Informatique de Toulouse (IRIT)

118, route de Narbonne

31062 Toulouse Cedex, France

E-mail: hameur@irit.fr

Josef Küng

Roland Wagner

University of Linz, FAW

Altenbergerstraße 69

4040 Linz, Austria

E-mail: {jkueng,rrwagner}@faw.at

Library of Congress Control Number: 2009932361

CR Subject Classification (1998): H.2, H.2.4, H.2.7, C.2.4, I.2.4, I.2.6

ISSN 0302-9743

ISBN-10 3-642-03721-6 Springer Berlin Heidelberg New York

ISBN-13 978-3-642-03721-4 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is

concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,

reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication

or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,

in its current version, and permission for use must always be obtained from Springer. Violations are liable

to prosecution under the German Copyright Law.

springer.com

© Springer-Verlag Berlin Heidelberg 2009

Printed in Germany

Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India

Printed on acid-free paper SPIN: 12738045 06/3180 543210


Preface

Data management, knowledge discovery, and knowledge processing are core and hot

topics in computer science. They are widely accepted as enabling technologies for

modern enterprises, enhancing their performance and their decision making processes.

Since the 1990s the Internet has been the outstanding driving force for application

development in all domains.

An increase in the demand for resource sharing (e.g., computing resources, services,

metadata, data sources) across different sites connected through networks has

led to the evolution of data- and knowledge-management systems from centralized systems to decentralized systems that enable large-scale distributed applications providing

high scalability. Current decentralized systems still focus on data and knowledge

as their main resource characterized by:

heterogeneity of nodes, data, and knowledge

autonomy of data and knowledge sources and services

large-scale data volumes, high numbers of data sources, users, computing resources

dynamicity of nodes

These characteristics point to:

(i) limitations of methods and techniques developed for centralized systems

(ii) requirements to extend or design new approaches and methods enhancing

efficiency, dynamicity, and scalability

(iii) development of large-scale experimental platforms and relevant benchmarks to

evaluate and validate scaling

The feasibility of these systems relies basically on P2P (peer-to-peer) techniques and agent systems, which support scaling and decentralized control. Synergy between

Grids, P2P systems and agent technologies is the key to data- and knowledge-centered

systems in large-scale environments.

The objective of the international journal on Large-Scale Data- and Knowledge-Centered

Systems is to provide an opportunity to disseminate original research contributions and to

serve as a high-quality communication platform for researchers and practitioners. The journal

contains sound peer-reviewed papers (research, state of the art, and technical) of high quality.

Topics of interest include, but are not limited to:

data storage and management

data integration and metadata management

data stream systems

data/web semantics and ontologies

knowledge engineering and processing

sensor data and sensor networks

dynamic data placement issues

flexible and adaptive query processing



query processing and optimization

data warehousing

cost models

resource discovery

resource management, reservation, and scheduling

locating data sources/resources and scalability

workload adaptability in heterogeneous environments

transaction management

replicated copy control and caching

data privacy and security

data mining and knowledge discovery

mobile data management

data grid systems

P2P systems

web services

autonomic data management

large-scale distributed applications and experiences

performance evaluation and benchmarking.

The first edition of this new journal consists of journal versions of talks invited to the DEXA 2009 conferences and further invited contributions by well-known scientists in the field. The content therefore covers a wide range of topics.

The second edition of this journal will appear in spring 2010 under the title Data Warehousing and Knowledge Discovery (guest editors: Mukesh K. Mohania (IBM, India), Torben Bach Pedersen (Aalborg University, Denmark), and A Min Tjoa (Technical University of Vienna, Austria)).

We are happy that Springer has given us the opportunity to publish this journal and

are looking forward to supporting the community with new findings in the area of large-scale data- and knowledge-centered systems. In particular we would like to thank Alfred

Hofmann and Ursula Barth from Springer for their valuable support. Last, but not least,

we would like to thank Gabriela Wagner for her organizational work.

June 2009 Abdelkader Hameurlain

Josef Küng

Roland Wagner


Editorial Board

Hamideh Afsarmanesh University of Amsterdam, The Netherlands

Francesco Buccafurri Università Mediterranea di Reggio Calabria, Italy

Qiming Chen HP-Lab, USA

Tommaso Di Noia Politecnico di Bari, Italy

Georg Gottlob Oxford University, UK

Anastasios Gounaris Aristotle University of Thessaloniki, Greece

Theo Härder Technical University of Kaiserslautern, Germany

Zoé Lacroix Arizona State University, USA

Sanjay Kumar Madria University of Missouri-Rolla, USA

Vladimir Marik Technical University of Prague, Czech Republic

Dennis McLeod University of Southern California, USA

Mukesh Mohania IBM India, India

Tetsuya Murai Hokkaido University, Japan

Gultekin Ozsoyoglu Case Western Reserve University, USA

Oscar Pastor Polytechnic University of Valencia, Spain

Torben Bach Pedersen Aalborg University, Denmark

Günther Pernul University of Regensburg, Germany

Colette Rolland Université Paris 1 Panthéon-Sorbonne, CRI, France

Makoto Takizawa Seikei University, Tokyo, Japan

David Taniar Monash University, Australia

Yannis Vassiliou National Technical University of Athens, Greece

Yu Zheng Microsoft Research Asia, China


Table of Contents

Modeling and Management of Information Supporting Functional

Dimension of Collaborative Networks ............................... 1

Hamideh Afsarmanesh, Ekaterina Ermilova,

Simon S. Msanjila, and Luis M. Camarinha-Matos

A Universal Metamodel and Its Dictionary .......................... 38

Paolo Atzeni, Giorgio Gianforme, and Paolo Cappellari

Data Mining Using Graphics Processing Units ....................... 63

Christian Böhm, Robert Noll, Claudia Plant,

Bianca Wackersreuther, and Andrew Zherdin

Context-Aware Data and IT Services Collaboration in E-Business ...... 91

Khouloud Boukadi, Chirine Ghedira, Zakaria Maamar,

Djamal Benslimane, and Lucien Vincent

Facilitating Controlled Tests of Website Design Changes Using

Aspect-Oriented Software Development and Software Product Lines .... 116

Javier Cámara and Alfred Kobsa

Frontiers of Structured Business Process Modeling ................... 136

Dirk Draheim

Information Systems for Federated Biobanks ........................ 156

Johann Eder, Claus Dabringer, Michaela Schicho, and Konrad Stark

Exploring Trust, Security and Privacy in Digital Business ............. 191

Simone Fischer-Huebner, Steven Furnell, and

Costas Lambrinoudakis

Evolution of Query Optimization Methods .......................... 211

Abdelkader Hameurlain and Franck Morvan

Holonic Rationale and Bio-inspiration on Design of Complex Emergent

and Evolvable Systems ........................................... 243

Paulo Leitao

Self-Adaptation for Robustness and Cooperation in Holonic

Multi-Agent Systems ............................................. 267

Paulo Leitao, Paul Valckenaers, and Emmanuel Adam

Context Oriented Information Integration ........................... 289

Mukesh Mohania, Manish Bhide, Prasan Roy,

Venkatesan T. Chakaravarthy, and Himanshu Gupta



Data Sharing in DHT Based P2P Systems .......................... 327

Claudia Roncancio, María del Pilar Villamil, Cyril Labbé, and

Patricia Serrano-Alvarado

Reverse k Nearest Neighbor and Reverse Farthest Neighbor Search on

Spatial Networks ................................................. 353

Quoc Thai Tran, David Taniar, and Maytham Safar

Author Index .................................................. 373


Modeling and Management of Information Supporting

Functional Dimension of Collaborative Networks

Hamideh Afsarmanesh 1 , Ekaterina Ermilova 1 , Simon S. Msanjila 1 ,

and Luis M. Camarinha-Matos 2

1 Informatics Institute, University of Amsterdam, Science Park 107,

1098 XG, Amsterdam, The Netherlands

{h.afsarmanesh,e.ermilova,s.s.msanjila}@uva.nl

2 Faculty of Sciences and Technology, New University of Lisbon,

Quinta da Torre, 2829-516, Monte de Caparica, Portugal

cam@uninova.pt

Abstract. Fluent creation of opportunity-based short-term Collaborative Networks (CNs) among organizations or individuals requires the availability of a variety of up-to-date information. A pre-established, properly administered strategic-alliance Collaborative Network (CN) can act as the breeding environment for the creation and operation of opportunity-based CNs, effectively addressing the complexity, dynamism, and scalability of their actors and domains. Administration of these environments, however, requires an effective set of functionalities founded on strong information management. The paper introduces the main challenges of CNs and their information management, and focuses on the Virtual organizations Breeding Environment (VBE), which represents a specific form of strategic alliance. It then addresses the functionalities needed for effective administration and management of VBEs, and exemplifies the information management challenges for three of their subsystems, handling the ontology, the profiles and competencies, and the rational trust.

Keywords: Information management for Collaborative Networks (CNs), virtual

organizations breeding environments (VBEs), Information management in

VBEs, Ontology management, competency information management, rational

trust information management.

1 Introduction

The emergence of collaborative networks, as collections of geographically dispersed autonomous actors that collaborate through computer networks, has enabled both organizations and individuals to effectively achieve common goals that go far beyond the ability of any single actor, and to provide cost-effective solutions and value-creating functionalities, services, and products. The paradigm of “Collaborative Networks (CN)”, defined during the last decade, represents a wide variety of networks of organizations as well as communities of individuals, each with distinctive characteristics and features. While the taxonomy of existing CNs, as presented later


(in Section 2.1) indicates their categorical differences, some of their main characteristics are briefly introduced below.

Different collaborative networks manifest a wide diversity of structural forms, durations, behavioral patterns, and interaction forms. From the process-oriented chain structures observed in supply chains, to those centralized around

dominant entities, and the project-oriented federated networks, there exists a wide

range of collaborative network structures [1; 2; 3; 4]. Every structure differently influences

the visibility level of each actor in the network, the intensity of its activities,

co-working and involvement in decision making.

Other important variable elements of networks are their life-cycle phases and durations. Goal-oriented networks are shorter-term and typically triggered by collaboration opportunities that arise in the market/society, as represented by the

case of VOs (virtual organizations) established for a timely response to singular

opportunities. Long-term networks on the other hand are strategic alliances / associations

with the main purpose of enhancing the chances of their members to get involved

in future opportunity-triggered collaboration networks, and increasing the

visibility of their actors, thus serving as breeding environments for goal-oriented

networks. As examples for long-term networks, the cases of industry clusters or industrial

districts and sector-based alliances can be mentioned.

In terms of the types of interaction among the actors involved in collaborative networks,

although there is not a consensus among researchers, some working definitions

are provided [5] for four main classes of interactions that are enumerated as: networking,

coordinated networking, cooperation and collaboration. There is an intuitive

notion of what collaboration represents, but this concept is often confused with cooperation. Some researchers even use the two terms interchangeably. The ambiguities around these terms grow when other related terms are also considered, such as networking, communication, and coordination [6; 7]. Therefore, it

is relevant and important that for the CN research, the concepts behind these interaction

terms are formalized, especially for the purpose of defining a reference model for

collaborative networks, as later addressed in this paper (in Section 3). In an attempt to

clarify these various concepts, based on [5], the following working definitions are

proposed for these four classes of interactions, where in fact every concept defined

below is itself a sub-class of the concept(s) defined above it:

Networking – involves communication and information exchange among involved

parties for mutual benefit. It shall be noted that this term has a broad use in multiple

contexts and often with different meanings. In the collaborative networks area of research, when “enterprise network” or “enterprise networking” is referred to, the intended meaning is probably “collaborative network of enterprises”.

Coordinated Networking – in addition to the above, it involves complementarity

of goals of different parties, and aligning / altering activities so that more efficient

results can be achieved. Coordination, that is, the act of working together harmoniously,

is one of the main components for collaboration.

Cooperation – involves not only information exchange and alignments of activities,

but also sharing of some resources towards achieving compatible goals.

Cooperation may be achieved by division of some labor (not extensive) among

participants.



Collaboration – in addition to the above, it involves joint goals/responsibilities

with specific process(es) in which parties share information, resources and capabilities

and jointly plan, implement, and evaluate activities to achieve their

common goals. It in turn implies sharing risks, losses and rewards. If desired by

its involved parties, the collaboration can also give the image of a joint identity to

the outside.
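To make the stated sub-class relationship among these four interaction classes concrete, the following minimal Python sketch (purely illustrative and not part of the original text; all class and method names are hypothetical) models each maturity level as a class inheriting the capabilities of the level above it:

# Hypothetical sketch: the four interaction classes as an inheritance chain,
# mirroring the statement that each concept is a sub-class of the one above it.

class Networking:
    """Communication and information exchange for mutual benefit."""
    def exchange_information(self, message):
        print(f"exchanging information: {message}")

class CoordinatedNetworking(Networking):
    """Adds complementarity of goals and alignment of activities."""
    def align_activities(self, activity):
        print(f"aligning activity: {activity}")

class Cooperation(CoordinatedNetworking):
    """Adds sharing of some resources towards compatible goals."""
    def share_resource(self, resource):
        print(f"sharing resource: {resource}")

class Collaboration(Cooperation):
    """Adds joint goals, joint planning/execution, and shared risks and rewards."""
    def pursue_joint_goal(self, goal):
        print(f"jointly planning and executing towards: {goal}")

# A Collaboration instance can do everything the lower maturity levels can.
c = Collaboration()
c.exchange_information("project status")
c.align_activities("production schedule")
c.share_resource("testing facility")
c.pursue_joint_goal("deliver the joint product")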

In practice, collaboration typically involves mutual engagement of participants to

solve a problem together, which also implies reaching mutual trust that takes time,

effort, and dedication. As addressed above, different forms of interaction are suitable

for different structural forms of CNs. For example the long term strategic alliances are

cooperative environments, since they primarily comprise actors with compatible

and/or complementary goals towards which they align their activities. The shorter

term goal-oriented networks however require intense co-working among their actors

to reach their jointly-established common goals that represent the reason for their

existence, and are therefore collaborative environments. On the other hand, most of

the current social networks just show a networking level of interaction.

But from a different perspective, different forms of the above mentioned interactions

can also be seen as different levels of “collaboration maturity”. Namely, the

interaction among actors in the network may strengthen in time, from simple networking

interaction to intense collaboration. This implies gradual increase in the level of

co-working as well as the risk taking, commitment, invested resources, etc., by the

involved participants. Therefore, operational CNs represent a variety of interactions

and inter-relationships among their heterogeneous and autonomous actors, which in

turn increases the complexity of this paradigm.

It shall be noted that in this paper and most other literature in the field, the term “collaborative networks” or “CNs” represents the generic name referring to all varieties of such networks.

1.1 Managing the Information in CNs

On managing the information in CNs, even if all information is semantically and

syntactically homogeneous, a main generic challenge is related to assuring the availability

of strategic information within the network, required for proper coordination

and decision making. This can be handled through enforcement of a push/pull mechanism

and establishment of proper mapping strategies and components, between the

information managed at different sites belonging to actors in the network and all those

systems (or sub-systems) that support different functionalities of the CN and its activities

during its life cycle. Therefore, it is necessary that from autonomous actors

involved in the CNs, various types of distributed information are collected. This information

shall then be processed, organized, and accessible within the network, both

for navigation by different CN stakeholders and for processing by different software

systems running at the CN. However, although the information about actors evolves in time - which is typical of dynamic systems such as CNs - and therefore needs to be kept up to date, there is no need for a continuous flow of all the information from each legacy system to the CN; such a flow would generate a major overload on the information management systems at the CN. Rather, for effective CN operation and management,



partial information only needs to be pulled/pushed between the legacy systems and the CN at certain intervals. The need for access to information also varies depending on the

purpose for which it is requested. These variations in turn pose a second generic information

management challenge that is related to the classification, assessment, and

provision of the required information based on intended use cases in CNs. Both of

the above generic information challenges in CNs are addressed in the paper, and exemplified

for three key CN functionalities, namely common ontology engineering,

competency management, and trust management.
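As an illustration of the push/pull idea described above, the following minimal Python sketch (hypothetical; the adapter, store, and field names are assumptions, not an actual CN system) pulls only a partial, schema-mapped subset of each actor's legacy information into a CN-side store at fixed intervals, instead of streaming all information continuously:

# Hypothetical sketch of periodic partial push/pull from legacy systems to the CN
# information management system (all names below are illustrative).
import time
from typing import Callable, Dict, List

class LegacySystemAdapter:
    """Wraps one actor's legacy system behind a mapping to the CN's common schema."""
    def __init__(self, actor_id: str, fetch: Callable[[], Dict], mapping: Dict[str, str]):
        self.actor_id = actor_id
        self._fetch = fetch          # returns the actor's local (heterogeneous) data
        self._mapping = mapping      # local field name -> CN field name

    def pull_partial(self, fields: List[str]) -> Dict:
        """Pull only the requested fields, renamed to the CN's common schema."""
        local = self._fetch()
        return {self._mapping[f]: local[f] for f in fields if f in local}

class CNInformationStore:
    """Very small stand-in for the CN-side information management system."""
    def __init__(self):
        self.actor_info: Dict[str, Dict] = {}

    def update(self, actor_id: str, partial: Dict):
        self.actor_info.setdefault(actor_id, {}).update(partial)

def periodic_sync(adapters: List[LegacySystemAdapter], store: CNInformationStore,
                  fields: List[str], interval_s: float, rounds: int):
    """Pull partial information from every legacy system at some interval,
    instead of continuously streaming all information."""
    for _ in range(rounds):
        for adapter in adapters:
            store.update(adapter.actor_id, adapter.pull_partial(fields))
        time.sleep(interval_s)

# Example usage with a single fake legacy system:
adapter = LegacySystemAdapter(
    actor_id="org-1",
    fetch=lambda: {"cap": "laser welding", "staff": 120, "internal_notes": "..."},
    mapping={"cap": "competency", "staff": "headcount"},
)
store = CNInformationStore()
periodic_sync([adapter], store, fields=["cap", "staff"], interval_s=0.01, rounds=1)
print(store.actor_info)   # {'org-1': {'competency': 'laser welding', 'headcount': 120}}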

A third generic information management challenge is related to modeling the variety

and complexity of the information that needs to be processed by different functionalities,

which support the management and operation of the CNs. While some of

these functionalities deal with the information that is known and stored within the site

of network’s actors (e.g. data required for actors’ competency management), the information

required for some other functionalities of CNs may be unknown, incomplete,

or imprecise, for which soft computing approaches, such as causal analysis and

reasoning (addressed in Section 7.2) or other techniques introduced in computational

intelligence, shall be applied to generate the needed information (e.g. data needed for

trust management). The issue of modeling the information needed to be handled in

CNs is addressed in detail in the paper and also exemplified through the three example

functionalities mentioned above.

There are, however, a number of other generic challenges related to the management

of the CN information, and it can be expected that more challenges will be identified

in time as the need for other functional components unfolds in the research on

supporting the management and operation of CNs. Among other identified generic

challenges, we can mention: ensuring the consistency among the locally managed

semantically and syntactically heterogeneous information at each organization’s legacy

systems and the information managed by the management system of the CNs as

well as their availability for access by the authorized CN stakeholders (e.g. individuals

or organizations) when necessary. At the heart of this challenge lies the establishment

of needed interoperation infrastructure, as well as a federated information

management system supporting the inter-linking of autonomous information management

systems. Furthermore, challenges related to update mechanisms among

autonomous nodes are relevant. Nevertheless, these generic challenges are in fact

common to many other application environments and are not specific to the CN’s

information management area, and fall outside the scope of this paper.

The remaining sections are organized as follows. Section 2 addresses collaborative networks, presenting a taxonomy for collaborative networks and describing the main requirements for establishing CNs, while emphasizing their information management and modeling aspects. Section 3 then addresses the ARCON reference model for collaborative networks, focusing only on its endogenous elements. In Section 4, the paper narrows down to the details of the functional dimension of the endogenous elements and exemplifies it for one specific kind of CN, i.e. the management system of the VBE strategic alliance. Specific examples of modeling and

management of information are then provided in Sections 5, 6, and 7 for three subsystems

of a VBE management system, addressing the VBE ontology engineering,

management of profiles and competencies in VBEs, and assessment and management

of trust in VBEs. Section 8 concludes the paper.



2 Establishing Collaborative Networks

Successful creation and management of inter-organizational and inter-personal collaborative

networks are challenging. Cooperation and collaboration in CNs, although

having the potential of bringing considerable benefits, or even representing a survival

mechanism to the involved participants, are difficult processes (as explained in

Section 2.2), which quite often fail [8; 9]. Therefore, there are a number of requirements

that need to be satisfied to increase their chances for success. Clearly the severity

of each requirement depends on the type and specificities of the CN. In other

words the nature, goal, and vision of each case determine its critical points and requirements.

For instance for the Virtual Laboratory type of CNs, the a priori setting

up of the common collaboration infrastructure and maintaining this infrastructure

afterwards pose some of their main challenges. However, for another type of CN, focused for instance on international decision making on environmental issues such as global warming, setting up and maintaining the collaboration infrastructure is not too critical; instead, the provision of mediation mechanisms and tools to support building trust among the CN actors and reaching agreements on the definition of common CN policies poses the main challenges. It is therefore important to briefly discuss the various types of CNs before addressing their requirements, as done below with the taxonomy of CNs.

2.1 Taxonomy and Working Definitions for Several Types of CN

A first CN taxonomy is defined in [10], addressing the large diversity of manifestations of collaborative networks in different application domains. Also a set of working

definitions for the terms addressed in Fig. 1 are provided in [5]. A few of these definitions

that are necessary for the later sections of this paper are quoted below from [5];

namely the definitions of collaborative networks (CN), virtual organizations breeding

environments (VBE), virtual organizations (VO), etc.

“A collaborative network (CN) is a network consisting of a variety of actors (e.g.

organizations and people) that are largely autonomous, geographically distributed,

and heterogeneous in terms of their operating environment, culture, social

capital and goals, but that collaborate to better achieve common or compatible

goals, and whose interactions are supported by computer network.”

“Virtual Organization (VO) – represents an alliance comprising a set of (legally)

independent organizations that share their resources and skills, to achieve their

common mission / goal, but that is not limited to an alliance of profit enterprises.

A virtual enterprise is therefore, a particular case of virtual organization.”

“Dynamic Virtual Organization – typically refers to a VO that is established in a

short time in order to respond to a competitive market opportunity, and has a short

life cycle, dissolving when the short-term purpose of the VO is accomplished.”

“Long-term strategic network or breeding environments – a strategic alliance

established with the purpose of being prepared for participation in collaboration

opportunities, and where in fact not collaboration but cooperation is practiced

among their members. In other words, they are alliances aimed at offering the

conditions and environment to support rapid and fluid configuration of collaboration

networks, when opportunities arise.”



[Figure: Fig. 1. Taxonomy of Collaborative Networks – the diagram shows the main classes (Collaborative Network (CN), Collaborative Networked Organization (CNO), long-term strategic network, goal-oriented network, VO Breeding Environment (VBE), Professional Virtual Community (PVC), Virtual Organization (VO), Dynamic VO, Virtual Enterprise (VE), Virtual Team (VT), extended enterprise, ad-hoc collaboration, continuous activity driven net, grasping opportunity driven net) together with examples (industry cluster, industrial district, business ecosystem, Community of Active Senior Professionals (CASP), disaster rescue net, collaborative virtual lab, inter-continental enterprise alliance, supply chain, dynamic supply chain, collaborative transportation network, virtual government, disperse manufacturing), and indicates that goal-oriented networks are created within breeding environments.]

“VO Breeding Environments (VBE) – represents “strategic” alliance of organizations

(VBE members) and related supporting institutions (e.g. firms providing

accounting, training, etc.), adhering to a base long-term cooperation agreement

and adopting common operating principles and infrastructures, with the main

goal of increasing both their chances and preparedness of collaboration in potential

VOs”.

“Professional Virtual Communities (PVC) – an alliance of professional individuals, which provides an environment to facilitate the agile and fluid formation of Virtual

Teams (VTs), similar to what a VBE aims to provide for the VOs.”

“Virtual Team (VT) is similar to a VO but formed by individuals, not organizations,

as such a virtual team is a temporary group of professionals that work together

towards a common goal such as realizing a consultancy job, a joint project,

etc., and that use computer networks as their main interaction environment.”
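The taxonomy quoted above can be illustrated by a small, hypothetical Python class hierarchy (a sketch only; the class names simply mirror the working definitions from [5] and imply no particular implementation):

# Hypothetical sketch of the CN taxonomy as a class hierarchy (illustrative only).
class CollaborativeNetwork:
    """Network of largely autonomous, distributed, heterogeneous actors."""
    def __init__(self, name, actors):
        self.name = name
        self.actors = actors

class LongTermStrategicNetwork(CollaborativeNetwork):
    """Breeding environment preparing members for future collaboration opportunities."""

class VBE(LongTermStrategicNetwork):
    """Strategic alliance of organizations and supporting institutions."""

class PVC(LongTermStrategicNetwork):
    """Alliance of professional individuals; breeding environment for Virtual Teams."""

class GoalOrientedNetwork(CollaborativeNetwork):
    """Shorter-term network triggered by a concrete collaboration opportunity."""

class VirtualOrganization(GoalOrientedNetwork):
    """Alliance of (legally) independent organizations sharing resources and skills."""

class VirtualTeam(GoalOrientedNetwork):
    """Like a VO but formed by individuals rather than organizations."""

# Example: a VO created within a VBE of manufacturing SMEs.
vbe = VBE("regional-manufacturing-vbe", actors=["org-A", "org-B", "org-C"])
vo = VirtualOrganization("opportunity-2009-07", actors=["org-A", "org-C"])
print(isinstance(vo, CollaborativeNetwork))   # True: every VO is a CN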

2.2 Base Requirements for Establishing CNs

A generic set of requirements, including: (1) definition of common goal and vision,

(2) performing a set of initiating actions, and (3) establishing common collaboration

space, represents the base pre-conditions to the setting up of the CNs. Furthermore,

after the CN is initiated the environment needs to properly operate, for which its coordination

and management, as well as reaching the needed agreements among its actors for performing the needed tasks, represent another set of challenges, including: (1) performing coordination, support, and management of activities, and (2) achieving agreements and contracts. The following five sub-sections briefly address these main

basic requirements as identified and addressed within the CN area of research, while

emphasizing their information management challenges in italic.

2.2.1 Defining a Common Goal and Vision

Collaboration requires the pre-existence of a motivating common goal and vision to

represent the joint/common purpose for establishment of the collaboration. In spite of



all difficulties involved in the process of cooperation / collaboration, the motivating

factor for establishing the CNs is the expectation of being able to reach results that

could not be reached by the involved actors working alone. Therefore, the common goal and vision of the CN represent its existential purpose and the motivation for attracting actors to the required cooperation/collaboration processes [5].

At present, the information related to the common goal and vision of the CNs is

typically stored in textual format and made available to the public through proper interfaces.

Establishing a well-conceived vision however needs involvement of all actors in

the network. To properly participate in formulating the vision, the actors need to be

well informed, which in turn requires the availability of up-to-date information regarding

many aspects of the network. Both the management of required information

for visioning as well as the assurance of its effective accessibility to all actors within

the network are challenging, as later addressed through the development of ontology

for CNs.

2.2.2 Performing a Set of Initiating Actions

There are a number of initiating actions that need to be taken, as a pre-condition to

establishing CNs. These actions are typically taken by the founder(s) of the CN and

may include [11; 12]: identifying interested parties and bringing them together; defining

the scope of the collaboration and its desired outcomes; defining the structure of the

collaboration in terms of leadership, roles, responsibilities; setting the plan of actions

in terms of access to resources, task scheduling and milestones, decision-making plan;

defining policies, e.g. for handling disagreements / conflicts, accountability, rewards

and recognition, ownership of generated assets, intellectual property rights; defining

the evaluation / assessment measures, mechanisms and process; and identifying the

risks and planning contingency measures.

Typically, most information related to the initiating actions is strategic and considered proprietary, to be accessed only by the CN’s administration. The classification of

information in CNs to ensure its confidentiality and privacy, while guaranteeing

enough access to the level required by each CN stakeholder is a challenging task for

the information management system of the network’s administration.

2.2.3 Substantiating a Common Collaboration Space

Establishing CNs requires the pre-establishment of their common collaboration

space. In this context, we define the term collaboration space as a generic term to

address all needed elements, principles, infrastructure, etc. that together provide the

needed environment for CN actors to be able to cooperate/collaborate with each other.

Establishment of such spaces is needed to enable and facilitate the collaboration process.

Typically it addresses the following challenges:

- Common concepts and terminology (e.g. common meta-data defined for

databases or an ontology, etc., specifying the collaboration environment and

purpose) [13].

- Common communication infrastructure and protocols for interaction and

data/information sharing and exchange (e.g. the internet, GRID, open or commercial

tools and protocols for communication and information exchange,

document management systems for information sharing, etc.) [14]



- Common working and sharing principles, value system, and policies (e.g. procedures

for cooperation/collaboration and sharing different resources, assessment

of collaboration preparedness, measurement of the alignment between

value systems, etc.) [15; 16; 17]. The CN related principles and policies are

typically modeled and stored by its administration and made available to all its

stakeholders.

- Common set of base trustworthiness criteria (e.g. identification, modeling and

specification of periodic required measurements related to some common aspects

of each actor that shall fall above certain threshold for all stakeholders, in

order to ensure that all joining actors as well as the existing stakeholders possess a minimum acceptable trust level) [18]; a minimal sketch of such threshold checking follows this list. It is necessary to model, specify, store, and

manage entities and concepts related to trust establishment and their measurements

related to different actors.

- Harmonization/adaptation of heterogeneities among stakeholders due to external

factors such as those related to actors from different regions involved in

virtual collaboration networks, e.g. differences in time, language, laws/regulations,

and socio-cultural aspects [19]. Some of these heterogeneities affect the

sharing and exchange of information among the actors in the network, for which

proper mappings and/or adaptors shall be developed and applied.
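As a minimal sketch of the base-trustworthiness idea mentioned in the list above (the criteria, threshold values, and function names below are purely illustrative assumptions), each actor's periodic measurements can be checked against agreed thresholds:

# Hypothetical sketch: every common trust criterion must stay above an agreed threshold.
BASE_TRUST_CRITERIA = {
    # criterion -> minimum acceptable (threshold) value agreed at the CN level
    "on_time_delivery_rate": 0.80,
    "financial_stability_score": 0.60,
    "past_collaboration_success_rate": 0.70,
}

def meets_base_trust_level(measurements: dict, criteria: dict = BASE_TRUST_CRITERIA) -> bool:
    """Return True only if every criterion is measured and above its threshold."""
    return all(measurements.get(c, 0.0) >= threshold for c, threshold in criteria.items())

# Example usage for a joining applicant:
applicant = {"on_time_delivery_rate": 0.92,
             "financial_stability_score": 0.55,     # below threshold
             "past_collaboration_success_rate": 0.75}
print(meets_base_trust_level(applicant))   # False: one criterion is below its threshold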

There are certain other specific characteristics of CNs that need to be supported by

their common collaboration space. For example some CNs may require simultaneous

or synchronous collaboration, while others depend on asynchronous collaboration.

Although remote/virtual collaboration is the most relevant case in collaborative networks,

which may involve both synchronous and asynchronous interactions, some

CNs may require the co-location of their actors [20].

2.2.4 Substantiating Coordination, Supporting, and Management of Activities

A well-defined approach is needed for the coordination of CN activities, and consequently the establishment of mechanisms, tools, and systems is required for common coordination, support, and management of activities in the CN. A wide range of

approaches can be considered for coordination of the CNs, among which one would

be selected for each CN, based on its common goal and vision. Furthermore, depending

on the selected coordination approach for each CN, management of its activities

requires different supporting mechanisms and tools. For instance, on one side of the

spectrum, for voluntary involvement of biodiversity scientists in addressing a topic of

public interest in a community, a self-organized strategic alliance may be established.

For this CN, a federated structure/coordination approach can be employed, where all

actors have equal rights on decision making as well as suggesting ideas for the next

steps/plans for the management of the CN that will be voted in this community. On

the other end of the spectrum however, for car manufacturing, a goal-oriented CN

may be established for which a fully centralized coordination approach can be applied,

using a star-like management approach where most activities of the CN actors

are fully guided and measured by one entity in the network, with almost no involvement

from others.

In practice all current goal-oriented CNs typically fall somewhere in between

these two extreme cases. For the long term strategic CNs, the current trend is towards



establishing different levels of roles and involvements for actors in leading and

decision making at the CN level, with a centralized management approach that

primarily aims to support CN actors with their activities and to guide them towards

better performance.

Short-term goal-oriented CNs on the other hand vary in their coordination approach

and management. For instance in the product/services industry, typically these

CNs are to the extent possible centralized in their coordination, and are managed in

the style of a single organization. But for example looking into CNs in research, we

see a different picture. For example in EC-funded research projects, usually the coordination

of the consortium organized for the project is assumed by one or a few actors

that represent the CN to the outside, but internally the management of activities is far

more decentralized, and the decision making is in many cases done in federated manners

and through voting.

Nevertheless, and no matter which coordination approach is adopted, in order for

CNs to operate successfully, their management requires a number of supporting tools

and systems, which shall be determined and provided in advance of the establishment

of the CNs. This subject is further addressed in Section 3 of this paper, where the

functional dimension of the CNs and specifically the main required functionality for

management of the long term strategic alliances are enumerated and exemplified.

As one example, in almost all CNs, the involved actors need to know about each other’s capabilities, capacities, resources, etc., which is referred to as the competency of

the involved actors in [21]. In breeding environments, either VBEs or PVCs, for instance,

such competency information constitutes the base for the partner search by the

broker/planner, who needs to match partners’ competencies against the characterization

of an emerged opportunity in order to select the best-fit partners. Similarly, as an

antecedent to any collaboration, some level of trust must pre-exist among the involved

actors in the CN and needs to be gradually strengthened depending on the

purpose of the cooperation/collaboration. Therefore, as addressed in [18], as a part of

the CN management system, rational measurement of the performance and achievements

of CN actors can be applied to determine the trustworthiness of its members

from different perspectives.

Considering these and other functionalities needed for effective management of the CNs, the classification, storage, and manipulation of their related information, e.g. for competencies of actors and their trust-related criteria, need to be effectively supported, which is challenging.
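To illustrate the broker/planner's partner search mentioned above, the following hypothetical Python sketch (the member data, scoring scheme, and names are illustrative assumptions, not the matching approach of any particular VBE system) ranks members by how many of an opportunity's required competencies they cover:

# Hypothetical sketch of matching member competencies against an emerged opportunity.
def match_partners(members: dict, required: set, top_k: int = 3):
    """members maps member id -> set of registered competencies.
    Returns the top_k members ranked by how much of the required set they cover."""
    scored = []
    for member_id, competencies in members.items():
        covered = competencies & required
        if covered:
            scored.append((len(covered) / len(required), member_id, covered))
    scored.sort(reverse=True)          # best coverage first
    return scored[:top_k]

members = {
    "org-A": {"injection molding", "assembly", "logistics"},
    "org-B": {"assembly", "quality control"},
    "org-C": {"design", "injection molding"},
}
required = {"injection molding", "assembly", "quality control"}
for score, member_id, covered in match_partners(members, required):
    print(f"{member_id}: coverage={score:.2f}, covers={sorted(covered)}")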

2.2.5 Achieving Agreements and Contracts among Actors

Successful operation of the CN requires reaching common agreements/contracts

among its actors [12; 22]. At the point of joining the CN, actors must agree on its

common goals and to follow its vision during the collaboration process, towards the

achievement of the common goal. They must also agree with the established common

collaboration space for the CN, including the common terminology, communication

infrastructure, and its working and sharing principles. Additionally, through the common

collaboration space, a shared understanding of the problem at hand, as well as

the nature/form of sharing and collaboration at the CN level should be achieved.

Further on, clear agreements should be reached among the actors on the distribution

of tasks and responsibilities, extent of commitments, sharing of resources, and the

distribution of both the rewards and the losses and liabilities. Some details in relation



to these challenges are addressed in [23; 24]. The ownership and sharing of resources

shall be dealt with, whether it relates to resources brought in by CN actors or resources

acquired by the coalition for the purpose of performing the tasks.

Successful collaboration depends on sharing the responsibilities by its actors. It is

as important to have clear assignment of responsibilities during the process of achieving

the CN goals, as afterwards in relation to liabilities for the achieved results. The

level of commitment of actors shall be also clearly defined, e.g. if all actors are collectively

responsible for all results, or otherwise. Similarly, division of gains and

losses shall be agreed by the CN actors. Here, depending on the type of CN, its value

system, and the area in which it operates, a benefit/loss model shall be defined and

applied. Such a model shall address the perception of “exchanged value” in the CN

and the expectations and commitment of its members. For instance, when it comes to

the creation of intellectual property at the CN, its creation in most cases is not linearly

related to the proportion of resources invested by each actor. Therefore, a fair way of

determining the individual contribution to the results of the CN shall be achieved and

applied to the benefit/loss model for the CN.

Due to their relevance and importance for the successful operation of the CNs, detailed

information about all agreements and contracts established with its actors is stored and preserved by the CN administration. Furthermore, some CNs with advanced management systems model and store these agreements and contracts within a system so that they can be semi-automatically enforced; for example, such a system can issue automatic warnings when a CN actor has not fulfilled, or has violated, some time-bound terms of its agreement/contract.

Organizing, processing, and interfacing the variety of information to different

stakeholders, required to support both reaching agreement as well as enforcing them,

is quite challenging.
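As a sketch of the semi-automatic agreement enforcement mentioned above (a minimal, hypothetical example; the data structure and warning format are assumptions), time-bound terms can be stored per actor and checked for overdue, unfulfilled obligations:

# Hypothetical sketch: automatic warnings for overdue, unfulfilled agreement terms.
from datetime import date
from typing import List, NamedTuple

class AgreementTerm(NamedTuple):
    actor_id: str
    description: str
    due: date
    fulfilled: bool

def overdue_warnings(terms: List[AgreementTerm], today: date) -> List[str]:
    """Return one warning message per unfulfilled term whose deadline has passed."""
    return [f"WARNING: {t.actor_id} has not fulfilled '{t.description}' (due {t.due})"
            for t in terms if not t.fulfilled and t.due < today]

terms = [
    AgreementTerm("org-A", "deliver competency profile update", date(2009, 5, 31), False),
    AgreementTerm("org-B", "pay annual membership fee", date(2009, 6, 15), True),
]
for warning in overdue_warnings(terms, today=date(2009, 6, 20)):
    print(warning)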

2.3 Relation between the Strategic-Alliance CNs and the Goal-Oriented CNs

Scarcity of resources / capacities owned by actors is at the heart of the motivation for

collaboration. For instance, large organizations typically hesitate to collaborate with

others, when and if they own sufficient resources and skills to fully respond to emerging

opportunities. On the other hand, due to the lack of needed resources and skills,

SMEs in different sectors increasingly tend towards collaboration and joining their

efforts. Therefore, a main motivation for establishment of CNs is to create larger

resources and skills set, in order to compete with others and to survive in turbulent

markets. Even in nature, we can easily find natural alliances among many different species (e.g. bees, ants, etc.), which form communities and collaborate to compete for increasing both their resources and their power, which is needed for their survival [25]. Therefore, in today’s market/society, we like to call scarcity of resources (e.g. capabilities/capacities) “the mother of collaborative networks”.

Nevertheless, even though a bigger pool of resources and skills is generated through collaboration among individuals or organizations to face variable needs as market conditions evolve, these pools are still limited and therefore should be dealt with through careful, effective planning. But unlike the case of a single actor that is self-concerned in its decision making, e.g. whether or not to approach an opportunity, in the case of collaborative networks decision making on this issue is quite challenging and is usually addressed by a stakeholder acting as the broker/planner of goal-oriented



CNs. Further to this decision, there are a large number of other challenges involved in

the creation phase of goal-oriented CNs, e.g. selecting the best-fit partners for an

emerged opportunity; namely finding the best potential actors, through effective

matching of their limited resources and skills against the required characteristics of

the emerged opportunity. Other challenges include the setting up of the common infrastructure,

etc., as addressed in the previous section. Additionally, building trust, which is a fundamental requirement for any collaboration, is a long-term process that cannot be

satisfied if potential participants have no prior knowledge of each other. Many of

these challenges either become serious inhibitors to the mere establishment of goal-oriented CNs by their broker/planner, or constitute a serious cause of their failures

in the later stages of CN’s life cycle [11].

As one solution approach, both research and practice have shown that creation/foundation

of goal-oriented short term CNs, to respond to emerging opportunities,

can both greatly benefit from the pre-existence of a strategic alliance/association

of actors, and become both cost and time effective. A line of research in the CN discipline

is therefore focused on these long-term alliances, starting with the investigation

of the existing networks that act as such associations – the so-called 1st generation strategic alliances – but focused specifically on expanding their roles and operations in the market/society, thus modeling and developing the next generation of such associations – the so-called 2nd generation VBEs [26].

Research on strategic alliances of organizations and individuals on one hand focuses

on providing the support environment and functionalities, tools and systems that

are required to improve the qualification and positioning of this type of CNs in the

market/society in accordance with its own goal, vision, and value system. Besides defining

the common goal/vision, performing the needed initiating actions, and establishing

the common collaboration space, the alliance plays the main role in coordinating

activities of the association, and achieving agreements among its actors, towards their

successful establishment of goal-oriented CNs. Therefore, a part of the research in

this area focuses on establishing a strong management system for these types of CN

[27], introducing the fundamental functionality and information models needed for

their effective operation and evolution. These functionalities, also addressed later in

this paper, address the management of information needed for effective day-to-day

administration of activities in breeding environments, e.g. the engineering of CN

ontology, management of actors’ competencies and profiles, and specification of the

criteria and management of information related to measurement of the level of trust in

actors in the alliance. Furthermore, a number of specific subsystems are needed in this

environment to support the creation of goal-oriented short term CNs, including the

search for opportunities in the market/society, matching opportunities against the

competencies (resources, capacities, skills, etc.) available in the alliance, and reaching

agreement/negotiation among the potential partners.

On the other hand, this area of research focuses on measuring and improving the

properties and fitness of the involved actors in the strategic alliance as a part of the

goals of these breeding environments, aiming to further prepare and enable them for

participation in future potential goal-oriented CNs.

As addressed later in Section 4.1, the effective management of strategic alliance

type of CNs heavily depends on building and maintaining strong information management

systems to support their daily activities and the variety of functionalities that

they provide to their stakeholders.



3 Collaborative Networks Reference Model

Recent advances in the definition of the CN taxonomy as well as the reference modeling

of the CNs are addressed in [5; 28], and fall outside the scope of this paper. However,

this section aims to provide brief introduction to the ARCON reference model

defined for the CNs. The reference model in turn provides the base for developing the

CN ontology, as well as modeling some of the base information needed to be handled

in the CNs.

This section further focuses on the endogenous perspective of the CNs, while Section

4 focuses specifically on the functional dimension of the CNs. Then Sections 5,

6, and 7 narrow down on the management of information for several elements of the

functional dimension.

3.1 ARCON Reference Model for Collaborative Networks

Reference modeling of CNs primarily aims at facilitating the co-working and co-development among its different stakeholders from multiple disciplines. It supports the

reusability and portability of its defined concepts, thus providing a model that can be

instantiated to capture all potential CNs. Furthermore, it shall provide insight into the

modeling tools/theories appropriate for different CN components, and the base for

design and building of the architectural specifications of CN components.

Inspired by the modeling frameworks introduced earlier in the literature related to

collaboration and networking [3; 4; 29; 30] and considering the complexity of CNs

[11; 10; 31], the ARCON (A Reference model for Collaborative Networks) modeling

framework is developed addressing their wide variety of aspects, features, and constituting

elements. The reference modeling framework of ARCON aims at simplicity,

comprehensiveness and neutrality. With these aims, it first divides the CN’s complexity

into a number of perspectives that comprehensively and systematically cover

all relevant aspects of the CNs.

At the highest level of abstraction, the three perspectives of environment characteristics,

life cycle, and modeling intent are identified and defined for the ARCON

framework, respectively constituting the X, Y, and Z axes of the diagrammatic representation

of the ARCON reference model.

First, the life cycle perspective captures the five main stages of the CNs’ life cycle,

namely the creation, operation, evolution, metamorphosis, and dissolution stages.

Second, the environment characteristics perspective further consists of two subspaces:

the “Endogenous Elements subspace” capturing the characteristics of the internal

elements of CNs, and the “Exogenous Interactions subspace” capturing the characteristics

of the external interactions of the CNs with its logical surrounding. Third, the

modeling intent perspective captures different intents for the modeling of CN features,

and specifically addressing three possible modeling stages of general representation,

specific modeling, and implementation modeling.

All three perspectives and their elements are addressed in detail in [1]. To enhance the understanding of the content of this paper, below we briefly address only

the environment characteristics perspective, and then focus on the endogenous subspace.

For more details on the life cycle and modeling intent perspectives, as well as



the description of elements of the exogenous subspace, please refer to the above mentioned

publication.

3.1.1 Environment Characteristics Perspective – Endogenous Elements

Subspace

To comprehensively represent its environment characteristics, the reference model for

CNs shall include both its Endogenous elements, as well as its Exogenous Interactions

[1]. Here we focus on the endogenous elements of the CN. For more details on any of these issues, the reader is referred to the ARCON reference model publication mentioned above.

Abstraction and classification of CN’s endogenous elements is challenging due to

the large number of their distinct and varied entities, concepts, functionality, rules and

regulations, etc. For instance, every CN participant can play a number of roles and

have different relationships with other CN participants. Furthermore, there are certain

rules of behavior that either constitute the norms in the society/market or are set internally to the CN, and which shall be obeyed by the CN participants. Needless to say, in every CN there is a set of activities and functionalities needed for its operation and management

that also need to be abstracted in its reference model. The Endogenous Elements

subspace of ARCON aims at the abstraction of the internal characteristics of

CNs. To better characterize this diverse set of internal aspects of CNs, four orthogonal dimensions are proposed and defined, namely the structural, componential, functional, and behavioral dimensions (an illustrative sketch follows this list):

• E1 - Structural dimension. Addressing the composition of CN’s constituting

elements, namely the actors (primary or support), roles (administrator, advisor,

broker, planner, etc.), relationships (trusting, cooperation, supervision, collaboration,

etc.), and network topology (self and potentially sub-network) etc.

• E2 - Componential dimension. Addressing the individual tangible/intangible CN

elements, namely domain specific devices, ICT resources (hardware, software,

networks), human resources, collected information, knowledge (profile/competency data, ontologies, bag of assets, etc.), and its accumulated assets (data, tools, etc.), etc.

• E3 - Functional dimension. Addressing the “base functions / operations” that run

to support the network, time-sequenced flows of executable operations (e.g. processes

for the management of the CN, processes to support the participation and activities

of members in the CN), and methodologies and procedures running at the

CN (network set up procedure, applicant’s acceptance, CN dissolution and inheritance

handling, etc.) etc.

• E4 - Behavioral dimension. Addressing the principles, policies, and governance

rules that either drive or constrain the behavior of the CN and its members over

time, namely principles of governance, collaboration and rules of conduct (prescriptive

or obligatory), contracts and agreements, and constraints and conditions

(confidentiality, conflict resolution policies, etc.) etc.
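The four endogenous dimensions can be illustrated with a small, hypothetical data-model sketch in Python (the field names and example values are assumptions drawn from the examples listed above; ARCON itself prescribes no particular implementation):

# Hypothetical sketch: the four endogenous dimensions as simple data containers.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class StructuralDimension:            # E1
    participants: List[str] = field(default_factory=list)
    roles: Dict[str, str] = field(default_factory=dict)        # actor -> role
    relationships: List[tuple] = field(default_factory=list)   # (actor, actor, kind)

@dataclass
class ComponentialDimension:          # E2
    ict_resources: List[str] = field(default_factory=list)
    human_resources: List[str] = field(default_factory=list)
    knowledge_resources: List[str] = field(default_factory=list)

@dataclass
class FunctionalDimension:            # E3
    processes: List[str] = field(default_factory=list)
    procedures: List[str] = field(default_factory=list)

@dataclass
class BehavioralDimension:            # E4
    governance_principles: List[str] = field(default_factory=list)
    contracts: List[str] = field(default_factory=list)

@dataclass
class EndogenousElements:
    structural: StructuralDimension
    componential: ComponentialDimension
    functional: FunctionalDimension
    behavioral: BehavioralDimension

endo = EndogenousElements(
    StructuralDimension(participants=["org-A", "org-B"],
                        roles={"org-A": "administrator", "org-B": "member"},
                        relationships=[("org-A", "org-B", "supervision")]),
    ComponentialDimension(ict_resources=["VBE portal"],
                          knowledge_resources=["ontology", "profile/competency data"]),
    FunctionalDimension(processes=["membership management"],
                        procedures=["applicant acceptance"]),
    BehavioralDimension(governance_principles=["rules of conduct"],
                        contracts=["base cooperation agreement"]),
)
print(endo.functional.procedures)   # ['applicant acceptance']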

A diagrammatic representation of the cross between the life-cycle perspective and the Endogenous Elements, exemplifying some elements of each dimension, is given in Fig. 2.



[Figure: Fig. 2. Crossing CN life cycle and the Endogenous Elements perspective [1] – the diagram crosses the CN life-cycle stages (creation, operation, evolution, metamorphosis, dissolution) with the Endogenous Elements dimensions (E1 Structural: participants, relationships, roles, network topology; E2 Componential: hardware/software, human, information/knowledge, and ontology resources; E3 Functional: processes, auxiliary processes, procedures, methodologies; E4 Behavioral: prescriptive and obligatory behavior, constraints and conditions, contracts and agreements), alongside the Exogenous Interactions abstractions and the modeling intent levels (general representation, specific modeling, implementation modeling).]

The remainder of this paper focuses only on the functional dimension of the CN and specifically addresses in more detail the functionality required for effective management of long-term strategic alliances. In order to exemplify the involved complexity, it then further focuses on the management of information required to support three specific functionalities, namely ontology engineering, profile and competency management, and trust management, within the functional dimension of this type of CN. Specifically, the collection, modeling, and processing of the information needed for these functionalities, which constitute three sub-systems of the management system for this type of CN, are addressed.

4 Functional Dimension of Collaborative Networks

The detailed elements of the ARCON reference model that comprehensively represent the functional dimension of CNs are addressed in [1], where instantiations of the functional dimension of CNs for both the long-term strategic alliances and the shorter-term goal-oriented networks are also presented. This section specifically focuses on long-term strategic alliances of organizations or individuals. It first addresses the set of functionalities that are necessary both for the management of the daily operation of strategic alliances and for supporting their members with their participation and activities in this type of CN.

Managing a variety of heterogeneous and distributed information is required within strategic alliances, such as the VBEs and PVCs, to support their operation stage, as characterized in the functional dimension of these two types of CNs. For such networks to succeed, their administration needs to collect a wide variety of information, partially from their involved actors and partially from the network environment itself, classify and organize this information to fit the needs of their supporting sub-systems, and continuously keep it updated and complete to the extent possible [32].



Current research indicates that while the emergence of CNs delivers many exciting promises to improve the chances of success of their actors in the current turbulent market/society, it poses many challenges related to supporting their functional dimension. This in turn results in challenges for the capturing, modeling, and management of the information within these networks.

Some of the main required functions to support both the management and the daily

operation of the strategic alliances include: (i) engineering of network ontology, (ii)

classification and management of the profile and competency of actors in the network,

(iii) establishing and managing rational trust in the network, (iv) matching partners' capabilities/capacities against the requirements of a collaboration opportunity, (v) reaching

agreements (negotiation) for collaboration, and (vi) collection of the assets (data, software

tools, lessons learned, best practices, etc.) gathered and generated in the network,

and management of components in such bag of assets, among others [11; 27]. A brief

description of a set of main functionalities is provided in the next sub-section.

4.1 Functionalities and Sub-systems Supporting Strategic Alliances

Research and development on digital networks, particularly the Internet, addresses challenges related to the online search of information and the sharing of expertise and knowledge between organizations and individuals, irrespective of their geographical locations. This in turn paves the way for collaborative problem solving and co-creation of services and products, which go far beyond the traditional inter-organizational or inter-personal co-working boundaries and geographical constraints, raising challenging questions about how to manage information to support the cooperation of organizations and individuals in CNs.

A set of functionalities is required to support the operation stage of strategic alliances. In particular, supporting the daily management of CN activities and actors, and the agile formation of goal-oriented CNs to address emerging opportunities, are challenging. Furthermore, these functionalities handle a variety of information and thus need effective management of the gathered information, considering the geographical distribution and the heterogeneous nature of the CN actors, e.g. in their applied technologies, organizational culture, etc.

Due to the specificities of the functionalities required for management of strategic

alliances, developing one large management system for this purpose is difficult to

realize and maintain. A distributed architecture is therefore typically considered for

their development. Applying the service-orientation approach, a number of interoperable independent sub-systems can be developed and applied, which in turn requires support for the management of their collaboration-related information in the strategic alliances. As an example of such a development and the required functionality, as addressed

in [27], a so-called VBE management system (VMS) is designed and implemented

constituting a number of inter-operable subsystems. These sub-systems either

directly support the daily management of the VBE operation [33], or are developed to

assist the opportunity-broker and the VO-planner with effective configuration and

formation of the VOs in the VBE environment [34]. In Fig. 3, eight specific functionalities

address the subsystems supporting the management of daily operation of

VBEs, while four specific functionalities, appearing inside the VO creation box, address

the subsystems supporting different aspects related to the creation of VOs.



Focusing on their information management aspects, each of these sub-systems provides a set of services for the explicit access, retrieval, and manipulation of information for different specific purposes. These subsystems interoperate through

exchanging their data, and together provide an integrated management system as

shown in Fig. 3. For each subsystem illustrated in this figure, a brief description is

provided below, while more details can be found in the above two references.

Fig. 3. VMS and its constituent subsystems (the figure shows the subsystems supporting the daily VBE operation, i.e. MSMS, ODMS, PCMS, TrustMan, DSS, VIMS, BAMS, and SIMS, the VO-creation tools CO-Finder, COC-Plan, PSS, and WizAN, and the data exchanged among them; the main users/editors of the data in these systems/tools are the VBE member, the VBE administrator, the broker, and the support institution manager)

Membership Structure Management Systems (MSMS): Collection and analysis of

the applicants’ information as a means to ascertain their suitability in the VBE has

proved particularly difficult. This subsystem provides services which support the

integration, accreditation, disintegration, rewarding, and categorization of members

within the VBE.

Ontology Discovery Management Systems (ODMS): In order to systematize all

VBE-related concepts, a generic/unified VBE ontology needs to be developed and

managed. The ODMS system provides services for the manipulation of VBE ontologies,

which is required for the successful operation of the VBE and its VMS as

further addressed in Section 5.

Profile and Competency Management Systems (PCMS): In VBEs, several functionalities

need to access and process the information related to members’ profiles and

competencies. PCMS provides services that support the creation, submission, and


maintenance of profiles and detailed competency related elements of the involved

VBE organizations, as well as categorizing collective VBE competencies, and organizing

competencies of VOs registered within the VBE, as further addressed in

Section 6.

Trust Management System (TrustMan): Supporting the VBE stakeholders, including the VBE administration and members, with handling tasks related to the analysis and assessment of the rational trust level of other organizations is of great importance for the successful management and operation of VBEs, for example for the selection of best-fit VO partners, as further addressed in Section 7.

Decision Support Systems (DSS): The decision making process in a VBE needs to

involve a number of actors whose interests may even be contradictory. The DSS

has three components that support the following operations related to decision-making within a VBE, namely: warning of an organization's lack of performance, warning related to the VBE's competency gap, and warning of an organization's low level of trust.

VO information management system (VIMS): It supports the VBE administrator and

other stakeholders with management of information related to the creation stage of

the VOs within the VBE, storing summary records related to measurement of performance

during the VO’s operation stage, and recording and managing of information

and knowledge gathered from the dissolved VOs, which constitute means

to handle and access inheritance information.

Bag of assets management system (BAMS): It provides services for management and

provision of fundamental VBE information, such as the guidelines, bylaws, value

systems guidelines, incentives information, rules and regulations, etc. It also supports

the VBE members with publishing and sharing some of their “assets” of

common interest with other VBE members, e.g. valuable data, software tools, lessons

learned etc.

Support institution management system (SIMS): The support institutions in VBEs

are of two kinds. The first kind refers to those organizations that join the VBE to

provide/market their services to VBE members. These services include advanced

assisting tools to enhance VBE Members’ readiness to collaborate in VOs. They

can also provide services to assist the VBE members with their daily operation,

e.g. accounting and tax, training, etc. The second kind refers to organizations that

join the VBE to assist it with reaching its goals e.g. ministries, sector associations,

chamber of commerce, environmental organizations, etc. SIMS supports the management

of the information related to activities of support institutions inside the

VBEs.

Collaboration Opportunity Identification and Characterization (coFinder): This

tool assists the opportunity broker to identify and characterize a new Collaboration

Opportunity (CO) in the market/society that will trigger the formation of a new

VO within the VBE. A collaboration opportunity might be external, initiated by a

customer and brokered by a VBE member that is acting as a broker. Some opportunities

might also be generated internally, as part of the VBE’s development

strategy.

CO characterization and VO's rough planning (COC-plan): This tool supports the planner of the VO with developing a detailed characterization of the resources and capacities needed for the CO, as well as with the formation of a rough structure for the potential VO, thereby identifying the types of competencies and capacities required from the organizations that will form the VO.

Partners search and suggestion (PSS): This tool assists the VO planner with the

search for and proposal of one or more suitable sets of partners for VO configurations.

The tool also supports an analysis of different potential VO configurations

in order to select the optimal formation.

Contract negotiation wizard (WizAN): This tool supports the VO coordinator to

involve the selected VO partners in the negotiating process, agreeing on and

committing to their participation in the VO. The VO is launched once the needed

agreements have been reached, contracts established, and electronically signed.

Regarding the management of information in CNs, a summary of the main related challenges is presented in Section 1.1, where several requirements are addressed in relation to different aspects and components of the CNs. The next three sections focus on and provide details of the information management aspects of three of the above functionalities and address their subsystems, namely the ontology engineering, the management of profiles and competencies, and the management of trust in this type of CN.

5 VBE-Ontology Specification and Management

Ontologies are increasingly applied in different areas of research and development; for example, they are effectively used in artificial intelligence, the semantic web, software engineering, biomedical informatics, and library science, among many others, as the means for representing knowledge about their environments. Therefore, a wide variety

of tasks related to processing information/knowledge is supported through the

specification of ontologies. As examples for these tasks we can mention: natural language

processing, knowledge management, geographic information retrieval, etc.

[35]. This section introduces an ontology developed for VBEs that aims to address a

number of challenging requirements for modeling and management of VBE information.

It first presents the challenges being addressed (Section 5.1), and then Sections 5.2 and 5.3 present two specific ontology-based solutions.

5.1 Challenges for VBE Information Modeling and Management

The second generation VBEs must handle a wide variety of types of information related to both their constituents and their required daily operations and activities.

Therefore, these networks must handle and maintain a broad set of concepts and entities

to support processing of a large set of functionalities.

Among others, complexity, dynamism, and scalability requirements can be identified

as characteristics describing the VBEs and their following aspects: (i) autonomous

geographically distributed stakeholders, (ii) wide range of running management

functionalities and support systems, and (iii) diverse domains of activities and application

environments. The analysis of several 1st generation VBEs in different domains has shown that the development of an ontology for VBEs can address the following main requirements:



Establishing common understanding in VBEs. Common understanding of the general

as well as domain-related VBE concepts is the base requirement for modelling

and management of information/knowledge in different VBE functionalities. To facilitate

interoperability and smooth collaboration, all VBE stakeholders must use the

same definition and have the same understanding of different aspects and concepts

applied in the VBE, including: VBE policies, membership regulations, working/sharing

principles, VBE competencies, performance measurement criteria, etc.

There is still a lack of consensus on the common and coherent definitions and terminology

addressing the generic VBE structure and operations [36]. Therefore, identification

and specification of common generic VBE terminology, as well as development

of a common semantic subspace for VBE information/knowledge is challenging.

VBE instantiation in different domains. New VBEs are being created and operated in a wide range of domains and application environments, from e.g. the provision of healthcare services and the design and manufacturing of products, to the management of natural disasters, biodiversity, and scientific virtual laboratory experimentation in physics or biomedicine, among others. Clearly, each domain/application environment

has its own features, culture, terminology, etc. that shall be considered and supported

by the VBEs’ management systems. During the VBE’s creation stage, parameterization

of its management system with both the generic VBE characteristics as well as

with the specific domain-related and application-related characteristics is required.

Furthermore at the creation stage of the VBE, several databases need to be created to

support the storage and manipulation of the information/knowledge handled by different

sub-systems. Design and development of these databases shall be achieved

together with the experts from the domain, requiring knowledge about complex application

domains. Therefore, development of approaches for speeding up and facilitating

instantiation and adaptation of VBEs to different domains / areas of activity is

challenging.

Supporting dynamism and scalability in VBEs. Frequent changes in the market and

society, such as the emergence of new types of customer demands or new technological

trends, drive VBEs to work in a very dynamic manner. Supporting the dynamic aspects of VBEs requires that the VBE management system is enabled with functionalities that support the human actors with the necessary changes in the environment. The VBE's

information is therefore required to be processed dynamically by semi-automated

reusable software tools. As such, the variety of VBE information and knowledge must

be categorized and formally specified. However, there is still a lack of such formal

representations and categorizations. Therefore formal modelling and specification of

VBE information, as well as development of semi-automated approaches for speeding

up the VBE information processing is challenging.

Responding to the above challenges through provision of innovative approaches,

models, mechanisms, and tools represents the main motivation for the research addressed

in this section. The following conceptual and developmental approaches together

address the above three challenges:

• Conceptual approach - unified ontology: The unified ontology for VBEs, which is further referred to as the VBE-ontology, is described as follows [13]:



VBE-ontology is a form of unified and formal conceptual specification of the

heterogeneous knowledge in VBE environments to be easily accessed by and

communicated between human and application systems, for the purpose of

VBE knowledge modelling, collection, processing, analysis, and evolution.

Specifically, the development of the unified VBE-ontology supports responding to the challenge of common understanding as follows: (i) it supports representing the definitions of all VBE concepts and the relationships among concepts within a unified ontology that establishes the common semantic subspace for the VBE knowledge; (ii) it introduces linguistic annotations such as synonyms and abbreviations to address the problem of varied names for concepts; (iii) through sharing the VBE-ontology within and among VBEs, it supports reusing common concepts and terminology. In relation to the challenge of VBE instantiation, the VBE-ontology addresses it as follows: (1) the ontological representation of VBE knowledge is semi-automatically convertible/transferable to database schemas [37], supporting the semi-automated development of the needed VBE databases during the VBE creation stage; (2) pre-defined domain concepts within the VBE-ontology support the semi-automated parameterization of generic VBE management tools, e.g. PCMS, TrustMan, etc. In relation to the challenge of VBE dynamism and scalability, the developed VBE-ontology responds to it in the following way: (a) the formal representation of the knowledge in the VBE-ontology facilitates semi-automated processing of this knowledge by software tools; (b) the ontology itself can be used to support semi-automated knowledge discovery from text-corpora [38].
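To make the ontology-to-schema conversion mentioned in item (1) above more concrete, a minimal sketch is given below of how OWL classes and their datatype properties could be turned into relational table definitions. It assumes the open-source rdflib library and an RDF/XML serialized sub-ontology file named vbe_core.owl; both the file name and the generated DDL style are illustrative assumptions, not the actual ODMS conversion procedure of [37].

from rdflib import Graph, RDF, RDFS, OWL

# Hypothetical sub-ontology file (RDF/XML serialization assumed).
ONTOLOGY_FILE = "vbe_core.owl"

def ontology_to_ddl(path: str) -> str:
    """Derive a simplistic relational schema: one table per OWL class,
    one TEXT column per datatype property whose domain is that class."""
    g = Graph()
    g.parse(path, format="xml")

    statements = []
    for cls in g.subjects(RDF.type, OWL.Class):
        # Use the local name of the class URI as the table name.
        table = cls.split("#")[-1].split("/")[-1]
        columns = ["id INTEGER PRIMARY KEY"]
        for prop in g.subjects(RDF.type, OWL.DatatypeProperty):
            if (prop, RDFS.domain, cls) in g:
                col = prop.split("#")[-1].split("/")[-1]
                columns.append(f"{col} TEXT")
        statements.append(
            f"CREATE TABLE {table} (\n  " + ",\n  ".join(columns) + "\n);"
        )
    return "\n\n".join(statements)

if __name__ == "__main__":
    print(ontology_to_ddl(ONTOLOGY_FILE))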

• Developmental approach - ontology discovery and management system: In

order to benefit from the VBE-ontology specification, a number of ontology engineering

and management functionalities are developed on top of the VBE-ontology [39].

Namely, the ontology engineering functionalities support discovery and evolution of

the VBE-ontology itself, while the ontology management functionalities support VBE stakeholders in learning about VBE concepts, preserve the consistency of the VBE databases and domain parameters with the VBE-ontology, and perform semi-automated information discovery. These needed functionalities are specified and

developed within one system, called Ontology Discovery and Management System

(ODMS) [39]. The ODMS plays a special role in the functional dimension of the CN

reference model (as addressed in section 4). Unlike other information management

sub-systems addressed in section 4.1, e.g. profile and competency management, trust

management, etc., ODMS aims at managing not only the actual information/data of the VBE, but also the ontological representation of its conceptual aspects, namely the meta-data. Precisely, this sub-system aims to support the mapping of

information handled in other VMS sub-systems to its generic meta-models. This mapping

supports consistency between the portions of information accumulated by different

VMS sub-systems and their models. It also supports preserving semantics of the

information, which is the first step for development of semi-automated and intelligent

approaches for information management.

The remainder of this section describes the above two approaches in more detail.


5.2 VBE-Ontology


To define the scope of the VBE-ontology, first the VBE information and knowledge

are characterised and categorised. The two following main characteristics of the VBE

information / knowledge are used to categorise them:

• Reusable VBE information at different levels of abstraction: In order to respond

to the challenge of common understanding, the VBE-ontology is primarily

addressed in three levels of abstraction, called here “concept-reusability levels” that

refer to reusability of the VBE information at core, domain, and application levels

(see Fig. 4). The core level constitutes the concepts that are generic for all VBEs, for

example concepts such as “VBE member”, “Virtual Organization”, “VBE competency”,

etc. The domain level has a variety of “exemplars” – one for each specific

domain or business area. Each domain level constitutes the concepts that are common

only for those VBEs that are operating in that domain or sector. Domain level

concepts constitute population of the core concepts into a specific VBE domain environment.

For example the core “VBE competency” concept can be populated with

“Metalworking competency” or “Tourism competency” depending on the domain.

The application level includes a larger number of exemplars – one for each specific

VBE application within each domain. Each application level constitutes the concepts

that are common to that specific VBE and cannot be reused by other VBEs. Application

level concepts mainly constitute population of the domain level concepts into one

specific VBE application environment. The levels of reusability also include one very

high level called meta level. This level represents a set of high level meta-properties,

such as “definition”, “synonym”, and “abbreviation”, used for specification of all

concepts from the other three levels.

• Reusable VBE information in different work areas: In order to respond to the

challenge of VBE creation in different domains and the challenge of VBE dynamism

and scalability, the concepts used by different VBE management functionalities, as

addressed in the functional dimension of the CN reference model, should be addressed

in the VBE-ontology. Therefore, the VBE-ontology supports both the development of the databases for the functionality-related data of the VBE, and the semi-automated processing of the information related to these functionalities. Additionally, addressing these

concepts in the VBE-ontology responds to the challenge of common understanding

for these functionalities.

Following the approach addressed in [40] for the AIAI enterprise ontology, ten different

“work areas” are identified for VBEs and their management (see Fig. 4). Each

work area focuses on a set of interrelated concepts that are typically associated with a specific VBE document repository and/or a specific VBE management functionality, such as the Membership Management functionality, the management of the Bag of Assets repository, the Profile and Competency management, and the Trust management, as

addressed in section 4.1. These work areas are complementary and each of them shares some concepts with some other work areas. In addition, while extensive attention was spent on the design of these ten work areas, it is clear that in the future more work areas can be defined and added to the VBE-ontology. Additionally, each of the ten work areas can be further split into smaller work areas depending on the details they need to capture. For example, the Competency work area can be separated from the Profile work area within the Profile and Competency work area.

The introduced structure of the VBE-ontology represents (a) embedding of the

“horizontal” reusability levels and (b) intersection of them with the “vertical” work

areas. The horizontal reusability levels are embedded in each other hierarchically.

Namely, the core level includes the meta level, as illustrated in Fig. 4. Furthermore,

every domain level includes the core level. Finally, every application level may include

a set of domain levels (i.e. those related to this VBE’s domains of activity). The

work areas are presented vertically, and thus intersect with the core level, domain

level, and application levels, but not with the meta level, which consists of the meta-data applicable to all other levels. The cells resulting from the intersection of the reusability levels and the work areas are further called sub-ontologies. The structure of the VBE-ontology is also illustrated in Fig. 4. In particular, this figure addresses how the intersection of the horizontal core level and the vertical VBE profile and competency work area results in the core level profile and competency sub-ontology.

The idea behind the sub-ontologies is to apply the divide and rule principle to the VBE-ontology in order to simplify coping with its large size and wide variety of aspects. Furthermore, sub-ontologies represent the minimal physical units of the VBE-ontology, i.e. physical ontology files on a computer, while the VBE-ontology itself shall be compiled out of its physical sub-ontologies according to its logical structure. Sub-ontologies also help to cope with the evolution of different VBE information. Every time a new piece of information needs to be introduced in the VBE-ontology, only the relevant new sub-ontology for that information needs to be specified within the VBE-ontology. Typically, when a new VBE is established, it does not need to adapt the entire VBE-ontology. Rather, it should build its own "application VBE-ontology" out of the related sub-ontologies, as out of "construction bricks". Thus, the design of the VBE-ontology also provides solutions to the technical question about the differences in the information accumulated by different VBE applications.
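As a small illustration of the "construction bricks" idea, the sketch below compiles an application VBE-ontology by merging a selection of sub-ontology files into one graph. It again assumes rdflib and RDF/XML files; the file names are hypothetical placeholders, since in practice the sub-ontology registry of the ODMS (section 5.3) would supply such a list.

from rdflib import Graph

# Hypothetical sub-ontology files chosen for one specific VBE application.
SUB_ONTOLOGIES = [
    "core_profile_competency.owl",
    "metalworking_domain.owl",
    "my_vbe_application.owl",
]

def build_application_ontology(paths: list[str]) -> Graph:
    """Compile an application VBE-ontology out of its sub-ontology 'bricks'."""
    app_ontology = Graph()
    for path in paths:
        sub = Graph()
        sub.parse(path, format="xml")  # assuming RDF/XML serialization
        app_ontology += sub            # graph union: merge all triples
    return app_ontology

if __name__ == "__main__":
    g = build_application_ontology(SUB_ONTOLOGIES)
    print(f"Application VBE-ontology contains {len(g)} triples")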

Fig. 4. Structure of the VBE-ontology consisting of sub-ontologies (the figure crosses the abstraction levels Meta, Core, Domain, and Application with the ten work areas VBE-self, VBE actor/participant, Virtual organization, VBE profile and competency, VBE history, VBE bag of assets, VBE governance, VBE value systems, VBE management system, and VBE trust; as an example, the intersection of the core level with the "profile and competency" work area forms the core level "profile and competency" sub-ontology)



A partial screenshot of the developed sub-ontology for the core level of the profile and competency information is shown in Fig. 5, and is further addressed in section 6.

Fig. 5. Partial screenshot of the VBE profile and competency sub-ontology (at the core level)

5.3 ODMS Subsystem Functionalities

The ODMS (Ontology Discovery and Management System) functionalities aim to

assist the main information management processes and operations that take place

through the entire life-cycle of a VBE. They include both ontology engineering functionalities

that are needed for maintaining the VBE-ontology itself, and ontology

management functionalities that are needed to support VBE information management.

The five specified functionalities for ODMS include:

• Sub-ontology registry: In order to maintain the sub-ontologies of the VBE-ontology, this functionality, rooted in [41; 42], aims at uploading, registering,

organizing, and monitoring the collection of sub-ontologies within an application

VBE-ontology. Particularly, it aims at grouping and re-organizing sub-ontologies for

further management, partitioning, integration, mapping, and versioning.

• Sub-ontology modification: This functionality aims at manual construction and

modification of sub-ontologies. Particularly it has an interface through which users

can perform operations of introducing new concepts and adding definitions, synonyms,

abbreviations, properties, associations and inter-relationships for the existing

concepts. The concepts in sub-ontologies are both represented in a textual format as

well as visualized through graphs or diagrams.

• Sub-ontology navigation: This functionality aims at familiarising VBE members

with the VBE terminology and concepts, and thus addressing the challenge of common

understanding. In order to view the terminology, the VBE members first select a



specific sub-ontology from the registry. The concepts in sub-ontologies are also both

represented in a textual format as well as visualized through graphs or diagrams.

• Repository evolution: This functionality supports establishment and monitoring of

consistency between VBE database schemas (as well as content in some cases) and their

related sub-ontologies, and thus addresses the challenge of VBE instantiation in different

domains. In response to this challenge, the VBE databases shall be developed semi-automatically, guided by the VBE-ontology. Several approaches for the conversion of sub-ontologies into database schemas suggest the creation of a map between an ontology and a database schema [37]. This map later supports monitoring the consistency between the ontology and the database schema. Specifically, this functionality aims to indicate if the

database schemas need to be updated after changes to the VBE-ontology.

• Information discovery: This functionality, rooted in [38], aims at the semi-automated discovery of information from text-corpora, based on the VBE-ontology, which addresses the challenge of VBE dynamism and scalability. Particularly, the information discovery functionality supports the discovery of relevant information about the VBE member organizations in order to augment the current VBE repositories. The text-corpora used by this functionality can include semi-structured sources (e.g. HTML pages) or unstructured sources (e.g. brochures). These are typically provided by the VBE member organizations.
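As a rough, simplified illustration of ontology-based information discovery, the sketch below scans a small text corpus for occurrences of concept labels and synonyms taken from the ontology. The concept list and the documents are invented placeholders, and the actual functionality rooted in [38] is considerably more sophisticated than this keyword matching.

import re

# Hypothetical concept labels and synonyms extracted from the VBE-ontology.
CONCEPT_SYNONYMS = {
    "competency": ["competency", "competence", "capability"],
    "virtual organization": ["virtual organization", "VO"],
}

# Hypothetical text corpus, e.g. scraped member brochures or HTML pages.
CORPUS = {
    "member_a_brochure.txt": "Our core competence is precision metalworking ...",
    "member_b_homepage.html": "We have coordinated several VO projects ...",
}

def discover(corpus: dict[str, str]) -> dict[str, list[str]]:
    """Return, per document, the ontology concepts whose labels/synonyms occur in it."""
    hits: dict[str, list[str]] = {}
    for doc_name, text in corpus.items():
        found = []
        for concept, labels in CONCEPT_SYNONYMS.items():
            pattern = r"\b(" + "|".join(map(re.escape, labels)) + r")\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                found.append(concept)
        hits[doc_name] = found
    return hits

if __name__ == "__main__":
    for doc, concepts in discover(CORPUS).items():
        print(doc, "->", concepts)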

6 Profile and Competency Modeling and Management

To support both the proper cooperation among the VBE members and the fluid configuration

and creation of VOs in the 2nd generation VBEs, it is necessary that all

VBE members are characterised by their uniformly formatted “profiles”. This requirement

is especially severe in the case of the medium- to large-size VBEs, where

the VBE administration and coaches have less of a chance to get to know directly

each VBE member organization. As such, profiles shall contain the most important

characteristics of VBE members (e.g. their legal status, size, area of activity, annual

revenue, etc.) that are necessary for performing fundamental VBE activities, such as

search for and suggestion of best-fit VO partners, VBE performance measurement,

VBE trust management, etc.

The VBE “competencies” represent a specific part of the VBE member organizations’

profiles that is aimed to be used directly for VO creation activities. Competency

information about organizations is exactly what the VO broker and/or planner needs

to retrieve, in order to determine what an organization can offer for a new VO.

This section first addresses the specific tasks in VBEs that require handling of profiles

and competencies, and then it presents the two complementary solution approaches

developed for solving these tasks.

6.1 Tasks Requiring Profile and Competency Modeling and Management

In the 2nd generation VBEs, characteristic information about all VBE members should

be collected and managed in order to support the following four tasks [43].



• Creation of awareness about potentials inside the VBE. In order to successfully

cooperate in the VBE and further successfully collaborate in VOs, the VBE members

need to familiarize themselves with each other. In small-size VBEs, e.g. with fewer than 30 members, VBE members may typically have the chance to get to know each other directly.

However, this becomes increasingly more difficult and even impossible in the geographically

dispersed medium-size and large-size VBEs (e.g. with 100-200 members).

Thus uniformly organizing the VBE members’ information, e.g. to represent the

members’ contact data, industry sector, vision, role in the VBE, etc., is a critical instrument

supporting awareness of the VBE members about each other.

• Configuration of new VOs. The information about the VBE member organizations needs to be accessed by both the human individuals and the software tools assisting the VO broker / VO planner in order to suggest the configuration of VOs

with best-fit partners. Therefore the information about members’ qualification, resources,

etc. that can be offered to a new VO needs to be structured and represented in

a uniform format.

• Evaluation of members by the VBE administration. At the stage of evaluating

the member applicants and also during the VBE members’ participation in the VBE,

the VBE administration needs to evaluate the members’ suitability for the VBE. The

members’ information is also needed for automated assessment of members’ collaboration

readiness, trustworthiness, and their performance, supported by software tools.

• Introduction / advertising of the VBE in the market / society. Another reason for the collection and management of VBE members' information is to introduce / advertise the VBE to the outside market / society. Therefore, summarized information about the registered VBE members can be used to promote the VBE towards potential new customers and thereby towards new collaboration opportunities.

Collection of the members’ information in a unified format especially supports

harmonising/adapting heterogeneities among VBE members, which represents

one requirement for substantiation of a common collaboration space for CNs, as addressed

in section 2.2.3.

The profile of a VBE member organization represents a separate, uniformly formatted information unit, and is defined as follows:

The VBE member organization’s profile consists of the set of determining characteristics

(e.g. name, address, capabilities, etc.) about each organization, collected

in order to facilitate the semi-automated involvement of each organization

in some specific line of activities / operations in the VBE that are directly or indirectly

aimed at VO creation.

An important part of the profile information represents the organization’s competency,

which is defined as follows:

Organizations’ competencies in VBEs represent up-to-date information about

their capabilities, capacities, costs, as well as conspicuities, illustrating the accuracy

of their provided information, all aimed at qualifying organizations for

VBE participation, and mostly oriented towards their VO involvement.

The remainder of this section addresses two solution approaches, a conceptual one and a developmental one, that together address the above tasks.



6.2 Profile and Competency Models

The main principle used for definition of the unified profile structure is identification

of the major groups of the organization’s information. Following are the identified

categories of profile information:

1. VBE-independent information includes those organization’s characteristics that

are independent of the involvement of the organization in any collaborative and cooperative

consortia.

2. VBE-dependent information includes those organization’s characteristics that

are dependent on the involvement of the organization in collaborative and cooperative

consortia within the VBEs, VOs, or other types of CNs.

3. Evidence documents are required to represent the indication / proof of validity of

the profile information provided by the organizations related to the two previous categories

of information. A piece of evidence can either be an on-line document or some web-accessible information, e.g. the organization's brochures, web-site, etc. The above-mentioned four tasks can then be addressed by the profile model, as follows:

• Creation of awareness about potentials inside the VBE: addressed through basic information about name, foundation date, location, size, area of activity, and a general textual description of the organization.
• Configuration of new VOs: handled through name, size, contact information, competency information (addressed below), and financial information.
• Evaluation of members by the VBE administration: addressed through records about past activities of organizations, including past collaboration/cooperation activities, as well as produced products/services and applied practices.
• Introduction / advertising of the VBE in the market / society: addressed through aggregation of characteristics such as locations, competencies, and the past history of its achievements.

The resulting profile model is presented in Fig. 6.
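To make the three categories of profile information more tangible, the following minimal sketch groups a member's profile into VBE-independent information, VBE-dependent information, and evidence documents. All field names and example values are illustrative assumptions, not the actual schema behind the profile model of Fig. 6.

from dataclasses import dataclass, field

@dataclass
class EvidenceDocument:
    """Proof of validity for provided profile data, e.g. a brochure or web page."""
    description: str
    url: str

@dataclass
class MemberProfile:
    # VBE-independent information: valid regardless of any CN involvement.
    name: str
    legal_status: str
    foundation_date: str
    size: int                      # e.g. number of employees
    area_of_activity: str
    annual_revenue: float
    # VBE-dependent information: related to involvement in the VBE/VOs.
    role_in_vbe: str               # e.g. "member", "broker", "VO planner"
    past_vo_participations: list[str] = field(default_factory=list)
    competencies: list[str] = field(default_factory=list)
    # Evidence documents backing up the information above.
    evidence: list[EvidenceDocument] = field(default_factory=list)

# Example instantiation with invented data.
member = MemberProfile(
    name="Example Metalworks Ltd.", legal_status="Ltd.", foundation_date="1998-04-01",
    size=120, area_of_activity="precision metalworking", annual_revenue=8.5e6,
    role_in_vbe="member", competencies=["CNC milling"],
    evidence=[EvidenceDocument("company brochure", "https://example.org/brochure.pdf")],
)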

The main objective of the competency model for VBE member organizations,

which is called “4C-model of competency”, is the “promotion of the VBE member

organizations towards their participation in future VOs”. The main technical challenge

for the competency modelling is the unification of existing organizations' competency models, e.g. as addressed by [44; 45]. Although these competency models were developed for other purposes than the 2nd generation VBE, some of their aspects can be applied to VBE members' competencies [21]. However, the main principle for the specification of the competency model is to organize the different competency-related aspects. These are further needed to search for VBE members that best fit the requirements of an emerging collaboration opportunity. The resulting 4C competency model

is unified and has a compound structure. The primary emphasis of this model goes to

the four following components, which are identified through our experimental study

as necessary and sufficient:

1. Capabilities represent the capabilities of organizations, e.g. their processes and activities. When collective business processes are modelled for a new VO, the VO planner has to search for specific processes or activities that can be performed by different potential organizations, in order to instantiate the model.



Fig. 6. Model of the VBE member’s profile

2. Capacities represent the free availability of the resources needed to perform each capability. Specific capacities of organizations are needed to fulfil the quantitative values of capabilities, e.g. the number of production units per day. If the capacity of members for

a specific capability in the VBE is not sufficient to fulfil market opportunities, another

member (or a group of members) with the same capability may be invited to the VBE.

3. Costs represent the costs of provision of products/services in relation to each

capability. They are needed to estimate whether the invitation of a specific group of members to a VO would exceed the planned VO budget.

4. Conspicuities represent the means for validating the information provided by the VBE members about their capabilities, capacities, and costs. The conspicuities in

VBEs mainly include certified or witnessed documents, such as certifications, licenses,

recommendation letters, etc.

An illustration of the generic 4C-model of competency, applicable to all varieties of VBEs, is given in Fig. 7.
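The 4C structure can be summarized in code roughly as follows; the field names and example values are again illustrative assumptions, intended only to show how capabilities, capacities, costs, and conspicuities hang together per capability.

from dataclasses import dataclass, field

@dataclass
class Conspicuity:
    """Evidence validating declared capabilities/capacities/costs, e.g. a certificate."""
    kind: str        # e.g. "certification", "license", "recommendation letter"
    reference: str   # e.g. a document identifier or URL

@dataclass
class Capability:
    name: str                         # a process or activity the organization can perform
    capacity_per_day: float           # free capacity available for this capability
    cost_per_unit: float              # cost of providing one unit of the product/service
    conspicuities: list[Conspicuity] = field(default_factory=list)

@dataclass
class Competency4C:
    organization: str
    capabilities: list[Capability] = field(default_factory=list)

# Example with invented data.
competency = Competency4C(
    organization="Example Metalworks Ltd.",
    capabilities=[Capability("CNC milling", capacity_per_day=200, cost_per_unit=12.5,
                             conspicuities=[Conspicuity("certification", "ISO-9001")])],
)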

6.3 PCMS Subsystem Functionalities

Based on the objective and identified requirements, PCMS (Profile and Competency

Management System) supports the following four main functionalities.

• Model customization: This functionality aims at management of profile and

competency models within a specific VBE application. The idea for this functionality

is to support the customization of the VBE for a specific domain of activity or a specific

application environment. Prior to performing the profile and competency management

at the VBE creation stage, the profile and competency models need to be

specified and customized.



Fig. 7. Generic 4C-model of competency

• Data submission: This functionality supports uploading of profile and competency

knowledge from each member organization. The approach for incremental submission

of data is developed for the PCMS. This approach especially supports the uploading of large amounts of data. To support the dynamism and scalability of PCMS, the advanced ODMS mechanism for ontology-based information discovery is applied.

• Data navigation: This functionality needs to be extensive in the PCMS. It supports

different ways for retrieval and viewing of the profile and competency knowledge

accumulated in the VBE. The navigation scope addresses both the single profile information and the collective profile information of the entire VBE. Structuring

of the knowledge in the PCMS’s user interface mimics the VBE profile and competency

sub-ontology.

• Data analysis: PCMS shall collect the competency data and analyze it in order to evolve the VBE's collection of competencies for addressing more opportunities in the market and society. A number of analysis functions are specified for the PCMS, including data validation, retrieval and search, gap analysis, and development of new competencies.
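As a toy illustration of one of these analysis functions, the sketch below computes a competency gap as the set of competencies required by a collaboration opportunity that no registered member currently offers. The data are invented and this is not the actual PCMS gap-analysis algorithm.

def competency_gap(required: set[str], members: dict[str, set[str]]) -> set[str]:
    """Return the competencies required by an opportunity that no VBE member offers."""
    available = set().union(*members.values()) if members else set()
    return required - available

# Invented example data: competencies required by a collaboration opportunity
# versus competencies registered per member in a hypothetical PCMS.
required = {"CNC milling", "anodizing", "laser cutting"}
members = {
    "Example Metalworks Ltd.": {"CNC milling", "laser cutting"},
    "Example Coatings BV": {"powder coating"},
}

print(competency_gap(required, members))   # -> {'anodizing'}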

7 Modeling and Management of Trust in VBEs

Traditionally, trust among organizations involved in collaborative networks was established

both "bi-laterally" and "subjectively" based on reputation. However, in large networks with geographical dispersion, such as many VBEs, trust issues are sensitive at the network level and need to be reasoned about / justified, for example when applied to the selection of best-fit organizations among several competitors [18]. Thus, in VBEs the analysis of inter-organizational trust is a functionality supported by the VBE administration, which needs to apply fact-based data, such as the current

organizations’ standing and performance data, for its assessment. Thus, a variety of

strategic information related to trust aspects must be collected from VBE actors (applying

pull/push mechanisms), then modeled and classified, and stored prior to assessing the level of trust in VBE organizations.

Furthermore, in order to identify the common set of base trust-related criteria for organizations in the VBE, as briefly addressed in section 2.2.3, the relevant elements for each specific VBE must be determined. These trust criteria, together with their respective weights, constitute the threshold for the assessment of an organization's level of trust in VBEs.

In the past, manual ad-hoc approaches were applied to the manipulation and processing of organizations' trust-related information. This section addresses the development of the Trust Management (TrustMan) subsystem at the VBE and describes its services supporting the rational assessment of the level of trust in organizations.

7.1 Requirements for Managing Trust-Related Information

Objectives for establishing trust may change with time, which means the information

required to support the analysis of the trust level of organizations will also vary with

time. As addressed in [18], a main aim of trust management in VBEs is to support the creation of trust among VBE member organizations. The introduced approach to support inter-organizational trust management applies the information related to an organization's standing as well as its past performance data, in order to determine its rational trust level in the VBE. Thus, organizations' activities within the VBE and their participation in configured VOs are relevant to be assessed.

Four main information management requirements are identified which need to be addressed

for supporting management of trust-related information in VBEs, as follows:

Requirement 1 – Characterization of a wide variety of dynamic trust-related information:

The information required to support the establishment of trust among organizations

is dynamic, since depending on the specific objective(s) for which the trust must

be established, the needed information may change with time, and these changes

cannot be predicted. Therefore, characterization of relevant trust-related information

needed to support the creation of trust among organizations, for every trust objective,

is challenging.

Requirement 2 - Classification of information related to different trust perspectives:

As stated earlier, the analysis of trust in organizations within VBEs shall rely on fact-based data and needs to be performed rationally. For this purpose, some measurable

criteria need to be identified and classified. The identification and classification of a

comprehensive set of trust criteria for organizations is challenging, especially when

considering different perspectives of trust.

Requirement 3 - Processing of trust-related information to support trust measurement:

In the introduced approach, the trust in organizations is measured rationally

using fact-based data. For this purpose, formal mechanisms must be developed using

a set of relevant trust criteria. In addition to measuring trustworthiness of organizations,

the applied mechanisms should support fact-based reasoning about the results



based on the standing and performance of the organizations. The development of such

trust-related information processing mechanisms is challenging.

Requirement 4 - Provision of services for analysis and measurement of trust: Measurement

of trust in organizations involves the computation of fact-based data using

complex mechanisms that may need to be performed in distributed and heterogeneous

environments. Development of services to manage and process trust-related information

for facilitating the analysis of trust is challenging.

7.2 Approaches for Managing Trust-Related Information of Organizations

Below we propose some approaches to address the four requirements presented above. We address the establishment of a pool of generic trust criteria, the identification and modeling of trust elements, the formulation of mechanisms for analyzing inter-organizational trust, and the design of the TrustMan system.

7.2.1 Approach for Establishment of a Pool of Generic Concepts and Elements

Solutions such as specialized models, tools or mechanisms developed to support the

management of trust within “application specific VBEs” or within “domain specific

VBEs” are difficult to replicate, adapt and reuse in different environments. Therefore,

there is a need to develop a generic pool of concepts and elements that can be customized

for every specific VBE.

Generic set of trust criteria for organizations: In the introduced approach a large set

of trust criteria for VBE organizations is identified and characterized through applying

the HICI methodology and mechanisms (Hierarchical analysis, Impact analysis

and Causal Influence analysis) [46]. The identified trust elements for organizations

are classified in a generalization hierarchy as shown in Fig. 8. Trust objectives and

five identified trust perspectives are generic, cover all possible VBEs, and do not

change with time. A set of trust requirements and trust criteria can be identified at the VBE dynamically and changes with time. Nevertheless, a base generic set of trust requirements and trust criteria has so far been established that can be expanded/customized depending on the VBE. Fig. 8 presents an example set of trust criteria for the economical perspective.
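Read from Fig. 8, the hierarchy of perspective, requirement, and criteria can be captured in a simple nested mapping such as the one sketched below; only the economical branch is filled in, using the criteria shown in the figure, and the representation itself is merely an illustrative convenience.

# Trust element hierarchy for the economical perspective, as exemplified in Fig. 8.
# The other four perspectives (technological, structural, social, managerial) would
# be populated analogously for a concrete VBE.
TRUST_CRITERIA = {
    "economical": {
        "capital": ["cash capital", "physical capital", "material capital"],
        "financial stability": ["cash in", "cash out", "net gains", "operational cost"],
        "VO financial stability": ["VO cash in", "VO cash out", "VO net gains"],
        "financial standards": ["auditing standards", "auditing frequency"],
    },
}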

7.2.2 Approach for Identification, Analysis and Modeling of Trust Elements

To properly organize and inter-relate all trust elements, an innovative approach was

required. The HICI approach proposed in [46] comprises three stages, each one focusing

on a specific task related to the identification, classification and interrelation of

trust criteria related to organizations.

The first stage, called the Hierarchical analysis stage, focuses on the identification of the types of trust elements and on classifying them through a generalization hierarchy based on their level of measurability. This classification enables understanding what values can be measured for the trust-related elements, which in turn supports the decision on what attributes need to be included in the database schema. A general set of trust criteria is presented in [18] and exemplified in Fig. 8.


Fig. 8. An example set of trust criteria for organizations (the figure refines the objective of creating trust among organizations into the technological, structural, economical, social, and managerial trust perspectives; for the economical perspective, the trust requirements Capital, Financial stability, VO financial stability, and Financial standards are refined into criteria such as cash/physical/material capital, cash in, cash out, net gains, operational cost, VO cash in/out, VO net gains, auditing standards, and auditing frequency)

The second stage, called the Impact analysis stage, focuses on the analysis of the impacts of changes in the values of trust criteria on the trust level of organizations. This enables understanding the nature and frequency of change of the values of trust criteria, in order to support the decision regarding the frequency of updates of the trust-related information.

The third stage, called the Causal Influence analysis stage, focuses on the analysis of the causal relations between different trust criteria as well as between the trust criteria and other VBE environment factors, such as the known factors within the VBE and the intermediate factors, which are defined to link the causal relations among all trust criteria and known factors. The results of the causal influence analysis are applied to the formulation of mechanisms for assessing the level of trust in each organization, as addressed below.

7.2.3 Approach for Intensive Modeling of Trust Elements' Interrelationships to Formulate Mechanisms for Assessing the Trust Level of Organizations

Considering the need for assessing the trust level of every organization in the VBE, a

wide range of trust criteria may be considered for evaluating organizations’ trustworthiness.

In the introduced approach, trust is characterized as a multi-objective, multi-perspective, and multi-criteria subject. As such, trust is not a single concept that can be applied to all cases of trust-based decision-making [47], and its measurement for each case depends on the purpose of establishing the trust relationship, the preferences of the VBE actor who constitutes the trustor in the case, and the availability of trust-related information from the VBE actor who constitutes the trustee in the case [18]. In this



respect, the trust level of an organization can be measured rationally in terms of the quantitative values available for the related trust criteria, e.g. rooted in past performance. Therefore, through analytic modeling, formal mechanisms can be deduced for the rational measurement of organizations' trust level [18, 48]. These mechanisms are then formalized into mathematical equations resulting from the causal influence analysis and the interrelationships between the trust criteria, the known factors within the VBE, and the intermediate factors that are defined to link those causal relations. A causal model, as

inspired from the discipline of systems engineering, supports the analysis of causal

influence inter-relationships among measurable factors (trust criteria, known factors

and intermediate factors) while it also supports modeling the nature of influences

qualitatively [49]. For example, as shown in Fig. 9 while the factors “cash capital”

and “capital” are measured quantitatively, the influence of the cash capital on the

capital is qualitatively modeled as positive.

Fig. 9. A causal model of trust criteria associated with the economical perspective

Furthermore, applying techniques from systems engineering, the formulation of mathematical equations based on causal models is thoroughly addressed in [50]. To exemplify the formulation of equations based on the results of the analysis and modeling of causal influences, below we present the equations for the two intermediate factors capital (CA) and financial acceptance (FA) (see Fig. 9):

CA = CC + PC + MC   and   FA = SC / RS

where CC represents cash capital, PC represents physical capital, MC represents material capital, SC represents standards complied, and RS represents required standards.
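As a small sanity check of these two equations, the snippet below computes the intermediate factors CA and FA from the underlying trust criteria; the numeric values are invented purely for illustration.

def capital(cash_capital: float, physical_capital: float, material_capital: float) -> float:
    """CA = CC + PC + MC"""
    return cash_capital + physical_capital + material_capital

def financial_acceptance(standards_complied: float, required_standards: float) -> float:
    """FA = SC / RS"""
    return standards_complied / required_standards

# Invented example values for one organization.
ca = capital(cash_capital=1.2e6, physical_capital=0.8e6, material_capital=0.5e6)
fa = financial_acceptance(standards_complied=4, required_standards=5)
print(ca, fa)   # -> 2500000.0 0.8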

7.2.4 Approach for the Development of the TrustMan Subsystem – Focused on Database Aspects

The TrustMan system is developed to support a number of different users in the VBE with dissimilar roles and rights, which means that different services and user interfaces are required for each user. As part of the system analysis, all potential users of the TrustMan system were identified and classified into groups depending on their roles and rights in the VBE. Then, for each user group, a set of functional requirements to be supported by the TrustMan system was identified. The classified user groups of the TrustMan system include: the VBE administrator, the VO planner, the VBE member, the VBE membership applicant, the trust expert, and the VBE guest. The identified user requirements and their specified services for the TrustMan system are addressed in [51].

Moreover, to enhance interoperability with other sub-systems in the VBE, the design of the TrustMan system adopts the service-oriented architecture and, specifically, the web service standards. In particular, the design of the TrustMan system adapts the layering approach for classifying services. A well-designed architecture of the TrustMan system, based on the concepts of service-oriented architecture, is addressed in [36].

Focusing here only on the information management aspects of the TrustMan system,

one important issue is related to the development of the schemas for the implementation

of its required database.

In order to enhance the interoperability and sharing of data that is managed by the

TrustMan system with both the existing/legacy databases at different organizations as

well as with other sub-systems of the VBE management system, the relational approach

is adopted for the TrustMan database. More specifically, three schemas are developed to support the following: (1) general information related to trust elements, (2) general information about organizations, and (3) specific trust-related data of organizations. These are further described below.

1. General information related to trust elements - This information constitutes a

list and a set of descriptions of trust elements, namely of different trust perspectives,

trust requirements, and trust criteria.

2. General information about organizations - This refers to the information that

is necessary to accurately describe each physical or virtual organization. For

physical organizations, this information may constitute the name, legal registration

details, address, and so on. For virtual organizations, this information

may constitute, among others, the VO coordinator details, launching and dissolving

dates, involved partners, and the customers.

3. Specific trust-related data for organizations - This information constitutes the values of the trust criteria for each organization. It primarily represents the organization's performance data, expressed in terms of the different

trust criteria, and is used as the main input data for the services that assess the

level of trust in each organization.
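As a purely illustrative sketch of how these three groups of information might be laid out relationally (table and column names are our assumption; the actual schemas of the TrustMan database are addressed in [51]):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- (1) general information related to trust elements
CREATE TABLE trust_element (
    element_id  INTEGER PRIMARY KEY,
    perspective TEXT,   -- e.g. economical, technological, social
    requirement TEXT,
    criterion   TEXT,
    description TEXT
);

-- (2) general information about organizations (physical or virtual)
CREATE TABLE organization (
    org_id             INTEGER PRIMARY KEY,
    name               TEXT,
    is_virtual         INTEGER,  -- 0 = physical organization, 1 = virtual organization
    legal_registration TEXT,     -- physical organizations
    address            TEXT,
    vo_coordinator     TEXT,     -- virtual organizations
    launch_date        TEXT,
    dissolution_date   TEXT
);

-- (3) specific trust-related data: values of trust criteria per organization
CREATE TABLE trust_criterion_value (
    org_id     INTEGER REFERENCES organization(org_id),
    element_id INTEGER REFERENCES trust_element(element_id),
    value      REAL,
    PRIMARY KEY (org_id, element_id)
);
""")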

8 Conclusion

A main challenging criterion for the success of collaborative networks is the effective

management of the wide variety of information that needs to be handled inside

the CNs to support their functional dimension. The paper argues that, for the efficient creation of dynamic opportunity-based collaborative networks, such as virtual organizations and virtual teams, complete and up-to-date information on a wide variety of aspects is necessary. Research and practice have indicated that the pre-establishment of supporting long-term strategic alliances can provide the needed environment for the creation of cost- and time-effective VOs and VTs. While some manifestations of such strategic alliances already exist, their 2nd generation needs



a much stronger management system, providing functionalities on top of enabling

information management systems. This management system is shown to model,

organize, and store partly the information gathered from the CN actors, and partly

the information generated within the CN itself.

The paper first addressed the main challenges of CNs, together with their requirements for information management. It then focused on strategic alliances, and specifically on the management of VBEs, in order to introduce the complexity of the functionality they require. Specific examples of information management challenges were then addressed through the specification of

three subsystems of the VBE management system, namely the subsystems handling

the engineering of VBE Ontology, the profile and competency management in VBEs,

and assessment and management of the rational trust in VBEs. As illustrated by these

examples, collaborative networks raise quite complex challenges, requiring modeling

and management of large amounts of heterogeneous and incomplete information,

which calls for a combination of approaches such as distributed/federated databases, ontology engineering, computational intelligence, and qualitative modeling.

References

1. Afsarmanesh, H., Camarinha-Matos, L.M.: The ARCON modeling framework. In: Collaborative

networks reference modeling, pp. 67–82. Springer, New York (2008)

2. Afsarmanesh, H., Camarinha-Matos, L.M.: Towards a semi-typology for virtual organization

breeding environments. In: COA 2007 – 8th IFAC Symposium on Cost-Oriented

Automation, Habana, Cuba, vol. 8, part 1, pp. 22(1–12) (2007)

3. Camarinha-Matos, L.M., Afsarmanesh, H.: A comprehensive modeling framework for collaborative

networked organizations. The Journal of Intelligent Manufacturing 18(5), 527–

615 (2007)

4. Katzy, B., Zang, C., Loh, H.: Reference models for virtual organizations. In: Virtual organizations

– Systems and practices, pp. 45–58. Springer, Heidelberg (2005)

5. Camarinha-Matos, L.M., Afsarmanesh, H.: Collaboration forms. In: Collaborative networks

reference modeling, pp. 51–66. Springer, New York (2008)

6. Himmelman, A.T.: On coalitions and the transformation of power relations: collaborative

betterment and collaborative empowerment. American journal of community psychology

29(2), 277–284 (2001)

7. Pollard, D.: Will that be coordination, cooperation or collaboration? Blog (March 25,

2005), http://blogs.salon.com/0002007/2005/03/25.html#a1090

8. Bamford, J., Ernst, D., Fubini, D.G.: Launching a World-Class Joint Venture. Harvard

Business Review 82(2), 90–100 (2004)

9. Blomqvist, K., Hurmelinna, P., Seppänen, R.: Playing the collaboration game right – balancing

trust and contracting. Technovation 25(5), 497–504 (2005)

10. Camarinha-Matos, L.M., Afsarmanesh, H.: Collaborative networks: A new scientific discipline.

J. Intelligent Manufacturing 16(4-5), 439–452 (2005)

11. Afsarmanesh, H., Camarinha-Matos, L.M.: On the classification and management of virtual

organization breeding environments. The International Journal of Information Technology

and Management – IJITM 8(3), 234–259 (2009)

12. Giesen, G.: Creating collaboration: A process that works! Greg Giesen & Associates

(2002)



13. Ermilova, E., Afsarmanesh, H.: A unified ontology for VO Breeding Environments. In:

Proceedings of DHMS 2008 - IEEE International Conference on Distributed Human-

Machine Systems, Athens, Greece, pp. 176–181. Czech Technical University Publishing

House (2008) ISBN: 978-80-01-04027-0

14. Rabelo, R.: Advanced collaborative business ICT infrastructure. In: Methods and Tools for

collaborative networked organizations, pp. 337–370. Springer, New York (2008)

15. Abreu, A., Macedo, P., Camarinha-Matos, L.M.: Towards a methodology to measure the

alignment of value systems in collaborative Networks. In: Azevedo, A. (ed.) Innovation in

Manufacturing Networks, pp. 37–46. Springer, New York (2008)

16. Romero, D., Galeano, N., Molina, A.: VO breeding Environments Value Systems, Business

Models and Governance Rules. In: Methods and Tools for collaborative networked

organizations, pp. 69–90. Springer, New York (2008)

17. Rosas, J., Camarinha-Matos, L.M.: Modeling collaboration preparedness assesment. In:

Collaborative networks reference modeling, pp. 227–252. Springer, New York (2008)

18. Msanjila, S.S., Afsarmanesh, H.: Trust Analysis and Assessment in Virtual Organizations

Breeding Environments. The International Journal of Production Research 46(5), 1253–

1295 (2008)

19. Romero, D., Galeano, N., Molina, A.: A conceptual model for Virtual Breeding Environments

Value Systems. In: Accepted for publication in Proceedings of PRO-VE 2007 - 8th

IFIP Working Conference on Virtual Enterprises. Springer, Heidelberg (2007)

20. Winkler, R.: Keywords and Definitions Around “Collaboration”. SAP Design Guild, 5th

edn. (2002)

21. Ermilova, E., Afsarmanesh, H.: Competency modeling targeted on promotion of organizations

towards VO involvement. In: The proceedings of PRO-VE 2008 – 9th IFIP Working

Conference on Virtual Enterprises, Poznan, Poland, pp. 3–14. Springer, Boston (2008)

22. Brna, P.: Models of collaboration. In: Proceedings of BCS 1998 - XVIII Congresso Nacional

da Sociedade Brasileira de Computação, Belo Horizonte, Brazil (1998)

23. Oliveira, A.I., Camarinha-Matos, L.M.: Agreement negotiation wizard. In: Methods and

Tools for collaborative networked organizations, pp. 191–218. Springer, New York

(2008)

24. Wolff, T.: Collaborative Solutions – True Collaboration as the Most Productive Form of

Exchange. In: Collaborative Solutions Newsletter. Tom Wolff & Associates (2005)

25. Kangas, S.: Spectrum Five: Competition vs. Cooperation. The long FAQ on Liberalism

(2005),

http://www.huppi.com/kangaroo/

LiberalFAQ.htm#Backspectrumfive

26. Afsarmanesh, H., Camarinha-Matos, L.M., Ermilova, E.: VBE reference framework. In:

Methods and Tools for collaborative networked organizations, pp. 35–68. Springer, New

York (2008)

27. Afsarmanesh, H., Msanjila, S.S., Ermilova, E., Wiesner, S., Woelfel, W., Seifert, M.: VBE

management system. In: Methods and Tools for collaborative networked organizations, pp.

119–154. Springer, New York (2008)

28. Afsarmanesh, H., Camarinha-Matos, L.M.: Related work on reference modeling for collaborative

networks. In: Collaborative networks reference modeling, pp. 15–28. Springer,

New York (2008)

29. Tolle, M., Bernus, P., Vesterager, J.: Reference models for virtual enterprises. In: Camarinha-Matos,

L.M. (ed.) Collaborative business ecosystems and virtual enterprises, Kluwer

Academic Publishers, Boston (2002)



30. Zachman, J.A.: A Framework for Information Systems Architecture. IBM Systems Journal

26(3) (1987)

31. Camarinha-Matos, L.M., Afsarmanesh, H.: Emerging behavior in complex collaborative

networks. In: Collaborative Networked Organizations - A research agenda for emerging

business models, ch. 6.2. Kluwer Academic Publishers, Dordrecht (2004)

32. Shuman, J., Twombly, J.: Collaborative Network Management: An Emerging Role for Alliance

Management. In: White Paper Series - Collaborative Business, vol. 6. The Rhythm

of Business, Inc. (2008)

33. Afsarmanesh, H., Camarinha-Matos, L.M., Msanjila, S.S.: On Management of 2nd Generation

Virtual Organizations Breeding Environments. The Journal of Annual Reviews in

Control (in press, 2009)

34. Camarinha-Matos, L.M., Afsarmanesh, H.: A framework for Virtual Organization creation

in a breeding environment. Int. Journal Annual Reviews in Control 31, 119–135

(2007)

35. Nieto, M.A.M.: An Overview of Ontologies, Technical report, Conacyt Projects No.

35804−A and G33009−A (2003)

36. Ollus, M.: Towards structuring the research on virtual organizations. In: Virtual Organizations:

Systems and Practices. Springer Science, Berlin (2005)

37. Guevara-Masis, V., Afsarmanesh, H., Hetzberger, L.O.: Ontology-based automatic data

structure generation for collaborative networks. In: Proceedings of 5th PRO-VE 2004 –

Virtual Enterprises and Collaborative Networks, pp. 163–174. Kluwer Academic Publishers,

Dordrecht (2004)

38. Anjewierden, A., Wielinga, B.J., Hoog, R., Kabel, S.: Task and domain ontologies for

knowledge mapping in operational processes. Metis deliverable 2003/4.2. University of

Amsterdam (2003)

39. Afsarmanesh, H., Ermilova, E.: Management of Ontology in VO Breeding Environments

Domain. To appear in International Journal of Services and Operations Management –

IJSOM, special issue on Modelling and Management of Knowledge in Collaborative Networks

(2009)

40. Uschold, M., King, M., Moralee, S., Zorgios, Y.: The Enterprise Ontology. The Knowledge

Engineering Review 13(1), 31–89 (1998)

41. Ding, Y., Fensel, D.: Ontology Library Systems: The key to successful Ontology Re-use.

In: Proceedings of the First Semantic Web Working Symposium (2001)

42. Simoes, D., Ferreira, H., Soares, A.L.: Ontology Engineering in Virtual Breeding Environments.

In: Proceedings of PRO-VE 2007 conference, pp. 137–146 (2007)

43. Ermilova, E., Afsarmanesh, H.: Modeling and management of Profiles and Competencies

in VBEs. J. of Intelligent Manufacturing (2007)

44. Javidan, M.: Core Competence: What does it mean in practice? Long Range planning

31(1), 60–71 (1998)

45. Molina, A., Flores, M.: A Virtual Enterprise in Mexico: From Concepts to Practice. Journal

of Intelligent and Robotics Systems 26, 289–302 (1999)

46. Msanjila, S.S., Afsarmanesh, H.: On Architectural Design of TrustMan System Applying

HICI Analysis Results. The case of technological perspective in VBEs. The International

Journal of Software 3(4), 17–30 (2008)

47. Castelfranchi, C., Falcone, R.: Trust Is Much More than Subjective Probability: Mental

Components and Sources of Trust. In: Proceedings of the 33rd Hawaii International Conference

on System Sciences (2000)



48. Msanjila, S.S., Afsarmanesh, H.: Modeling Trust Relationships in Collaborative Networked

Organizations. The International Journal of Technology Transfer and Commercialisation;

Special issue: Data protection, Trust and Technology 6(1), 40–55 (2007)

49. Pearl, J.: Graphs, causality, and structural equation models. The Journal of Sociological

Methods and Research 27(2), 226–264 (1998)

50. Byrne, B.M.: Structural equation modeling with EQS: Basic concepts, Applications, and
Programming, 2nd edn. Routledge/Academic (2006)

51. Msanjila, S.S., Afsarmanesh, H.: On development of TrustMan system assisting configuration

of temporary consortiums. The International Journal of Production Research; Special

issue: Virtual Enterprises – Methods and Approaches for Coalition Formation 47(17)

(2009)


A Universal Metamodel and Its Dictionary

Paolo Atzeni 1 , Giorgio Gianforme 2 , and Paolo Cappellari 3

1 Università Roma Tre, Italy

atzeni@dia.uniroma3.it

2 Università Roma Tre, Italy

giorgio.gianforme@gmail.com

3 University of Alberta, Canada

paolo.cappellari@gmail.com

Abstract. We discuss a universal metamodel aimed at the representation

of schemas in a way that is at the same time model-independent (in

the sense that it allows for a uniform representation of different data models)

and model-aware (in the sense that it is possible to say to whether

a schema is allowed for a data model). This metamodel can be the basis

for the definition of a complete model-management system. Here we illustrate

the details of the metamodel and the structure of a dictionary for

its representation. Exemplifications of a concrete use of the dictionary are

provided, by means of the representations of the main data models, such

as relational, object-relational or XSD-based. Moreover, we demonstrate

how set operators can be redefined with respect to our dictionary and easily

applied to it. Finally, we show how such a dictionary can be exploited

to automatically produce detailed descriptions of schema and data models,

in a textual (i.e. XML) or visual (i.e. UML class diagram) way.

1 Introduction

Metadata is descriptive information about data and applications. Metadata is

used to specify how data is represented, stored, and transformed, or may describe

interfaces and behavior of software components.

The use of metadata for data processing was reported as early as fifty years

ago [22]. Since then, metadata-related tasks and applications have become truly

pervasive and metadata management plays a major role in today’s information

systems. In fact, the majority of information system problems involve the design,

integration, and maintenance of complex application artifacts, such as application

programs, databases, web sites, workflow scripts, object diagrams, and

user interfaces. These artifacts are represented by means of formal descriptions,

called schemas or models, and, consequently, metadata. Indeed, to solve these

problems we have to deal with metadata, but it is well known that applications

performing metadata manipulation are complex and hard to build, because of heterogeneity

and impedance mismatch. Heterogeneity arises because data sources

are independently developed by different people and for different purposes and

subsequently need to be integrated. The data sources may use different data

models, different schemas, and different value encodings. Impedance mismatch




arises because the logical schemas required by applications are different from

the physical ones exposed by data sources. The manipulation includes designing

mappings (which describe how two schemas are related to each other) between

the schemas, generating a schema from another schema along with a mapping

between them, modifying a schema or mapping, interpreting a mapping, and

generating code from a mapping.

In the past, these difficulties have always been tackled in practical settings by

means of ad-hoc solutions, for example by writing a program for each specific application.

This is clearly very expensive, as it is laborious and hard to maintain.

In order to simplify such manipulation, Bernstein et al. [11,12,23] proposed the

idea of a model management system. Its goal is to factor out the similarities of

the metadata problems studied in the literature and develop a set of high-level

operators that can be utilized in various scenarios. Within such a system, we

can treat schemas and mappings as abstractions that can be manipulated by

operators that are meant to be generic in the sense that a single implementation

of them is applicable to all of the data models. Incidentally, let us remark

that in this paper we use the terms “schema” and “data model” as common

in the database literature, though some model-management literature follows a

different terminology (and uses “model” instead of “schema” and “metamodel”

instead of “data model”).

The availability of a uniform and generic description of data models is a

prerequisite for designing a model management system. In this paper we discuss

a “universal metamodel” (called the supermodel ), defined by means of metadata

and designed to properly represent “any” possible data model, together with the

structure of a dictionary for storing such metadata.

There are many proposals for dictionary structure in the literature. The use of

dictionaries to handle metadata has been popular since the early database systems

of the 1970’s, initially in systems that were external to those handling the database

(see Allen et al. [1] for an early survey). With the advent of relational systems in

the 1980’s, it became possible to have dictionaries be part of the database itself,

within the same model. Today, all DBMSs have such a component. Extensive discussion

was also carried out in even more general frameworks, with proposals for

various kinds of dictionaries, describing various features of systems (see for example

[9,19,21]) within the context of industrial CASE tools and research proposals.

More recently, a number of metadata repositories have been developed [26]. They

generally use relational databases for handling the information of interest. There

are other significant recent efforts towards the description of multiple models, including

the Model Driven Architecture (MDA) and, within it, the Common Warehouse

Metamodel (CWM) [27], and Microsoft Repository [10]; in contrast to our

approach, these do not distinguish metalevels, as the various models of interest

are all specializations of a most general one, UML based.

The description of models in terms of the (meta-)constructs of a metamodel

was proposed by Atzeni and Torlone [8]. But it used a sophisticated graph

language, which was hard to implement. The other papers that followed the

same or similar approaches [14,15,16,28] also used specific structures.



We know of no literature that describes a dictionary that exposes schemas in

both model-specific and model-independent ways, together with a description of

models. Only portions of similar dictionaries have been proposed. None of them

offer the rich interrelated structure we have here.

The contributions of this paper and its organization are the following. In

Section 2 we briefly recall the metamodel approach we follow (based on the initial

idea by Atzeni and Torlone [8]). In Section 3 we illustrate the organization of

the dictionary we use to store our schemas and models, refining the presentation

given in a previous conference paper (Atzeni et al. [3]). In Section 4 we illustrate

a specific supermodel used to generalize a large set of models, some of which

are also commented upon. Then, in Section 5 we discuss how some interesting

operations on schemas can be specified and implemented on the basis of our

approach. Section 6 is devoted to the illustration of generic reporting and visualization

tools built out of the principles and structure of our dictionary. Finally,

in Section 7 we summarize our results.

2 Towards a Universal Metamodel

In this section we summarize the overall approach towards a model-independent

and model-aware representation of data models, based on an initial idea by

Atzeni and Torlone [8].

The first step toward a uniform solution is the adoption of a general model to

properly represent many different data models (e.g. entity-relationship, object-oriented,

relational, object-relational, XML). The proposed general model is

based on the idea of construct: a construct represents a “structural” concept

of a data model. We identify a construct for each "structural" concept of every data model considered and, hence, a data model is completely represented by the

set of its constructs. Let us consider two popular data models, entity-relationship

(ER) and object-oriented (OO). Indeed, each of them is not “a model,” but “a

family of models,” as there are many different proposals for each of them: OO

with or without keys, binary and n-ary ER models, OO and ER with or without

inheritance, and so on. “Structural” concepts for these data models are, for example,

entity, attribute of entity, and binary relationship for the ER and class,

field, and reference for the OO. Moreover, constructs have a name, may have

properties and are related to one another.

A UML class diagram of this construct-based representation of a simple ER

model with entities, attributes of entities and binary relationships is depicted in

Figure 1. Construct Entity has no attribute and no reference; construct AttributeOfEntity

has a boolean property to specify whether an attribute is part of the

identifier of the entity it belongs to and a property type to specify the data type

of the attribute itself; construct BinaryRelationship has two references toward

the entities involved in the relationship and several properties to specify role,

minimum and maximum cardinalities of the involved entities, and whether the

first entity is externally identified by the relationship itself.



Fig. 1. A simple entity-relationship model

Fig. 2. A simple object-oriented model

With similar considerations about a simple OO model with classes, simple

fields (i.e. with standard type) and reference fields (i.e. a reference from a class

to another) we obtain the UML class diagram of Figure 2. Construct Class has

no attribute and no reference; construct Field is similar to AttributeOfEntity but

it does not have boolean attributes, assuming that we do not want to manage

explicit identifiers of objects; construct ReferenceField has two references toward

the class owner of the reference and the class pointed by the reference itself.

In this way, we have uniform representations of models (in terms of constructs)

but these representations are not general. This is unfeasible as the number of

(variants of) models grows because it implies a corresponding rise in the number

of constructs. To overcome this limit, we exploit an observation of Hull and

King [20], drawn on later by Atzeni and Torlone [7]: most known models have

constructs that can be classified according to a rather small set of generic (i.e.

model independent) metaconstructs: lexical, abstract, aggregation, generalization,

and function. Recalling our example, entities and classes play the same

role (or, in other terms, “they have the same meaning”), and so we can define

a generic metaconstruct, called Abstract, to represent both these concepts; the

same happens for attributes of entities and of relationships and fields of classes,

representable by means of a metaconstruct called Lexical. Conversely, relationships

and references do not have the same meaning and hence one metaconstruct

is not enough to properly represent both concepts (hence BinaryAggregationOfAbstracts

and AbstractAttribute are both included).

Hence, each model is defined by its constructs and the metaconstructs they

refer to. This representation is clearly at the same time model-independent (in



the sense that it allows for a uniform representation of different data models)

and model-aware (in the sense that it is possible to say whether a schema is

allowed for a data model). An even more important notion is that of supermodel

(also called universal metamodel in the literature [13,24]): it is a model that has

a construct for each metaconstruct, in the most general version. Therefore, each

model can be seen as a specialization of the supermodel, except for renaming of

constructs.

A conceptual view of the essentials of this idea is shown in Figure 3: the supermodel

portion is predefined, but can be extended (and we will present our recent

extension later in this paper), whereas models are defined by specifying their

respective constructs, each of which refers to a construct of the supermodel (SM-

Construct) and so to a metaconstruct. It is important to observe that our approach

is independent of the specific supermodel that is adopted, as new metaconstructs

and so SM-Constructs can be added. This allows us to show simplified examples

for the set of constructs, without losing the generality of the approach.

In this scenario, a schema for a certain model is a set of instances of constructs

allowed in that model. Let us consider the simple ER schema depicted in

Figure 4. Its construct-based representation would include two instances of Entity

(i.e. Employee and Project), one instance of BinaryRelationship (i.e. Membership)

and four instances of AttributeOfEntity (i.e. EN, Name, Code, and

Name). The model-independent representation (i.e. based on metaconstructs)

would include two instances of Abstract, one instance of BinaryAggregationOfAbstracts

and four instances of Lexical. For each of these instances we have

to specify values for its attributes and references, meaningful for the model. So

for example, the instance of Lexical corresponding to EN would refer to the

instance of Abstract of employee through its abstractOID reference and would

have a ‘true’ value for its isIdentifier attribute. This example is illustrated in

Figure 5, where we omit irrelevant properties, represent references only by

means of arrows, and represent links between constructs and their instances by

means of dashed arrows. In the same way, we can state that a database for a

certain schema is a set of instances of constructs of that schema.
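To make this concrete, the following sketch (Python is used only for illustration; the class and field names are our assumption, not an interface prescribed by the approach) expresses the schema of Figure 4 as instances of the three metaconstructs:

from dataclasses import dataclass

@dataclass
class Abstract:
    oid: str
    name: str

@dataclass
class Lexical:
    oid: str
    name: str
    type: str
    is_identifier: bool
    abstract_oid: str   # reference to the Abstract the attribute belongs to

@dataclass
class BinaryAggregationOfAbstracts:
    oid: str
    name: str
    abstract1_oid: str
    abstract2_oid: str

# The ER schema of Figure 4 as metaconstruct instances.
abstracts = [Abstract("e1", "Employee"), Abstract("e2", "Project")]
lexicals = [
    Lexical("a1", "EN",   "int",    True,  "e1"),
    Lexical("a2", "Name", "string", False, "e1"),
    Lexical("a3", "Code", "int",    True,  "e2"),
    Lexical("a4", "Name", "string", False, "e2"),
]
aggregations = [BinaryAggregationOfAbstracts("r1", "Membership", "e1", "e2")]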

Fig. 3. A simplified conceptual view of models and constructs

Fig. 4. A simple entity-relationship schema



Fig. 5. A construct based representation of the schema of Figure 4

As a second example, let us consider the simple OO schema depicted in Figure 6.

Its construct-based representation would include two instances of Class (i.e. Employee

and Department), one instance of ReferenceField (i.e. Membership) and

five instances of Field (i.e. EmpNo, Name, Salary, DeptNo, and DeptName). Alternatively,

the model independent representation (i.e. based on metaconstructs)

would include two instances of Abstract, one instance of AbstractAttribute and five

instances of Lexical.

On the other side, it is possible to use the same approach based on “concepts of

interest” in order to obtain a high-level description of the supermodel (i.e. of the

whole set of metaconstructs). From this point of view the concepts of interest are

three: construct, construct property and construct reference. In this way we have

a full description of the supermodel, with constructs, properties and references,

as follows. Each construct has a name and a boolean attribute (isLexical) that



Fig. 6. A simple object-oriented schema

Fig. 7. A description of the supermodel

specifies whether its instances have actual, elementary values associated with them

(for example, this property would be true for AttributeOfAbstract and false for

Abstract). Each property belongs to a construct and has a name and a type.

Each reference relates two constructs and has a name. A UML class diagram of

this representation is presented in Figure 7.

3 A Multilevel Dictionary

The conceptual approach to the description of models and schemas presented in

Section 2, despite being very useful to introduce the approach, is not effective

to actually store data and metadata. Therefore, we have developed a relational

implementation of the idea, leading to a multilevel dictionary organized in four

parts, which can be characterized along two coordinates: the first corresponding

to whether they describe models or schemas and the second depending on

whether they refer to specific models or to the supermodel. This is represented

in Figure 8.

                          model specific        model generic

model descriptions        metamodels            meta-supermodel
(the "metalevel")         (mM)                  (mSM)

schema descriptions       models                supermodel
                          (M)                   (SM)

Fig. 8. The four parts of the dictionary



The various portions of the dictionary correspond to various UML class diagrams

of Section 2. In the rest of this section, we comment on them in detail.

The meta-supermodel part of the dictionary describes the supermodel, that is,

the set of constructs used for building schemas of various models. It is composed

of three relations (whose names begin with MSM to recall that we are in the

meta-supermodel portion), one for each “class” of the diagram of Figure 7. Every

relation has one OID column and one column for each attribute and reference

of the corresponding “class” of such a diagram. The relations of this part of the

dictionary, with some of the data, are depicted in Figure 9. It is worth noting

that these relations are rather small, because of the limited number of constructs

in our supermodel.
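In relational terms, a minimal sketch of these three relations might look as follows (the column types and the underscores in the table names are our assumption; the paper itself only fixes the relation and column names shown in Figure 9):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE MSM_Construct (
    OID       TEXT PRIMARY KEY,
    Name      TEXT,
    IsLexical INTEGER                                  -- boolean
);
CREATE TABLE MSM_Property (
    OID       TEXT PRIMARY KEY,
    Name      TEXT,
    Construct TEXT REFERENCES MSM_Construct(OID),
    Type      TEXT
);
CREATE TABLE MSM_Reference (
    OID         TEXT PRIMARY KEY,
    Name        TEXT,
    Construct   TEXT REFERENCES MSM_Construct(OID),
    ConstructTo TEXT REFERENCES MSM_Construct(OID)
);
""")

# A few of the rows shown in Figure 9.
conn.executemany("INSERT INTO MSM_Construct VALUES (?, ?, ?)",
                 [("mc1", "Abstract", 0), ("mc2", "Lexical", 1),
                  ("mc4", "BinaryAggregationOfAbstracts", 0)])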

The metamodels part of the dictionary describes the individual models, that

is, the set of specific constructs allowed in the various models, each one corresponding

to a construct of the supermodel. It has the same structure as the

meta-supermodel part with two differences: first, each relation has an extra column

containing a reference towards the corresponding element of the supermodel

(i.e. of the meta-supermodel part of the dictionary); second, there is an extra

relation to store the names of the specific models and an extra column in the

Construct relation referring to this extra relation. The relations of this part of

the dictionary, with some of the data, are depicted in Figure 10.

We refer to these first two parts as the “metalevel” of the dictionary, as it

contains the description of the structure of the lower level, whose content describes

schemas. The lower level is also composed of two parts, one referring

to the supermodel constructs (therefore called the SM part) and the other to

model-specific constructs (the M part). The structure of the schema level is, in

MSM Construct
OID   Name                           IsLexical
mc1   Abstract                       false
mc2   Lexical                        true
mc3   Aggregation                    false
mc4   BinaryAggregationOfAbstracts   false
mc5   AbstractAttribute              false
...   ...                            ...

MSM Property
OID   Name           Construct   Type
mp1   Name           mc1         string
mp2   Name           mc2         string
mp3   IsIdentifier   mc2         bool
mp4   IsOptional     mc2         bool
mp5   Type           mc2         string
...   ...            ...         ...

MSM Reference
OID   Name          Construct   ConstructTo
mr1   Abstract      mc2         mc1
mr2   Aggregation   mc2         mc3
mr3   Abstract1     mc4         mc1
mr4   Abstract2     mc4         mc1
mr5   Abstract      mc5         mc1
mr6   AbstractTo    mc5         mc1
...   ...           ...         ...

Fig. 9. The mSM part of the dictionary



MM Model
OID   Name
m1    ER
m2    OODB

MM Construct
OID   Name                 Model   MSM-Constr.   IsLexical
co1   Entity               m1      mc1           false
co2   AttributeOfEntity    m1      mc2           true
co3   BinaryRelationship   m1      mc4           false
co4   Class                m2      mc1           false
co5   Field                m2      mc2           true
co6   ReferenceField       m2      mc5           false

MM Property
OID   Name      Constr.   Type     MSM-Prop.
pr1   Name      co1       string   mp1
pr2   Name      co2       string   mp2
pr3   IsKey     co2       bool     mp3
pr4   Name      co3       string   ...
pr5   IsOpt.1   co3       bool     ...
...   ...       ...       ...      ...
pr6   Name      co4       string   mp1
pr7   Name      co5       string   mp2
...   ...       ...       ...      ...

MM Reference
OID    Name      Constr.   Constr.To   MSM-Ref.
ref1   Entity    co2       co1         mr1
ref2   Entity    co3       co1         mr3
ref3   Entity    co3       co1         mr4
ref4   Class     co5       co4         mr1
ref5   Class     co6       co4         mr5
ref6   ClassTo   co6       co4         mr6

Fig. 10. The mM part of the dictionary

our system, automatically generated out of the content of the metalevel: so, we

can say that the dictionary is self-generating out of a small core. In detail, in the

model part there is one relation for each row of MM Construct relation. Hence

each of these relations corresponds to a construct and has, besides an OID column,

one column for each property and reference specified for that construct in

relations MM Property and MM Reference, respectively. Moreover, there

is a relation schema to store the name of the schemas stored in the dictionary

and each relation has an extra column referring to it. Hence, in practice, there

is a set of relations for each specific model, with one relation for each construct

allowed in the model. This portion of the dictionary is depicted in Figure 11,

where we show the data for the schemas of Figures 4 and 6.
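The self-generating nature of this level can be sketched as follows: a small routine (hypothetical, and assuming the mM relations are stored as SQL tables named MM_Construct, MM_Property, and MM_Reference with the column names of Figure 10, normalized to valid identifiers) derives one CREATE TABLE statement per construct of a model:

def generate_schema_level_ddl(conn):
    """Sketch: one relation per row of MM_Construct, with one column per
    property and per reference, plus the OID and Schema columns."""
    statements = []
    for construct_oid, construct_name in conn.execute(
            "SELECT OID, Name FROM MM_Construct"):
        columns = ["OID TEXT PRIMARY KEY"]
        for (prop_name,) in conn.execute(
                "SELECT Name FROM MM_Property WHERE Construct = ?",
                (construct_oid,)):
            columns.append(f"{prop_name} TEXT")
        for (ref_name,) in conn.execute(
                "SELECT Name FROM MM_Reference WHERE Construct = ?",
                (construct_oid,)):
            columns.append(f"{ref_name} TEXT")   # holds the OID of the referenced occurrence
        columns.append("Schema TEXT")            # extra column referring to the Schema relation
        statements.append(
            f"CREATE TABLE {construct_name} ({', '.join(columns)});")
    return statements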

Analogously, in the supermodel part there is one relation for each row of

MSM Construct relation; hence each one of these relations corresponds to a

metaconstruct (or a construct of the supermodel) and has, besides an OID column,

one column for each property and reference specified for that metaconstruct in relations

MSM Property and MSM Reference, respectively. Again, there is a

relation schema to store the name of the schemas stored in the dictionary and each

relation has an extra column referring to it. Moreover, the Schema relation has an

extra column referring to the specific model each schema belongs to. This portion

of the dictionary is depicted in Figure 12, where we show the data for the schemas

of Figures 4 and 6, and hence we show the same data presented in Figure 11. It

is worth noting that Abstract contains the same data as ER-Entity and OO-

Class taken together. Similarly, AttributeOfAbstract contains data in ER-

AttributeOfEntity and OO-Field.


Schema
OID   Name
s1    ER Schema
s2    OO Schema

ER-Entity
OID   Name       Schema
e1    Employee   s1
e2    Project    s1

ER-AttributeOfEntity
OID   Entity   Name   Type     isKey   Schema
a1    e1       EN     int      true    s1
a2    e1       Name   string   false   s1
a3    e2       Code   int      true    s1
a4    e2       Name   string   false   s1

ER-BinaryRelationship
OID   Name         IsOptional1   IsFunctional1   ...   Entity1   Entity2   Schema
r1    Membership   false         false           ...   e1        e2        s1

OO-Class
OID   Name         Schema
cl1   Employee     s2
cl2   Department   s2

OO-ReferenceField
OID    Name         Class   ClassTo   Schema
ref1   Membership   cl1     cl2       s2

OO-Field
OID   Class   Name       Type     Schema
f1    cl1     EmpNo      int      s2
f2    cl1     Name       string   s2
f3    cl1     Salary     int      s2
f4    cl2     DeptNo     int      s2
f5    cl2     DeptName   string   s2

Fig. 11. The dictionary for schemas of specific models

Schema
OID   Name        Model
s1    ER Schema   m1
s2    OO Schema   m2

Abstract
OID   Name         Schema
e1    Employee     s1
e2    Project      s1
cl1   Employee     s2
cl2   Department   s2

Lexical
OID   Abstract   Name       Type     IsIdentifier   Schema
a1    e1         EN         int      true           s1
a2    e1         Name       string   false          s1
a3    e2         Code       int      true           s1
a4    e2         Name       string   false          s1
f1    cl1        EmpNo      int      ?              s2
f2    cl1        Name       string   ?              s2
f3    cl1        Salary     int      ?              s2
f4    cl2        DeptNo     int      ?              s2
f5    cl2        DeptName   string   ?              s2

AbstractAttribute
OID    Name         Abstract   AbstractTo   Schema
ref1   Membership   cl1        cl2          s2

BinaryRelationship
OID   Name         IsOptional1   IsFunctional1   ...   Entity1   Entity2   Schema
r1    Membership   false         false           ...   e1        e2        s1

Fig. 12. A portion of the SM part of the dictionary



4 A Significant Supermodel with Models of Interest

As we said, our approach is fully extensible: it is possible to add new metaconstructs

to represent new data models, as well as to refine and increase the precision of the existing representations of models. The supermodel we have mainly experimented

with so far is a supermodel for database models and covers a reasonable family

of them. If models were more detailed (as is the case for a fully-fledged XSD

model) then the supermodel would be more complex. Moreover, other supermodels

can be used in different contexts: we have had preliminary experiences

with Semantic Web models [5,6,18], with the management of annotations [25],

and with adaptive systems [17]. In this section we discuss in detail our actual

supermodel. We describe all the metaconstructs of the supermodel, describing

which concepts they represent, and how they can be used to properly represent

several well known data models.

A complete description of all the metaconstructs follows:

Abstract - Any autonomous concept of the scenario.

Aggregation - A collection of elements with heterogeneous components. It

makes no sense without its components.

StructOfAttributes - A structured element of an Aggregation, an Abstract, or another StructOfAttributes. It may be optional (isOptional)

and/or admit null values (isNullable). It could be multivalued or not (isSet).

AbstractAttribute - A reference towards an Abstract that could admit null

values (isNullable). The reference may originate from an Abstract, an Aggregation, or a StructOfAttributes.

Generalization - It is a “structural” construct stating that an Abstract is a

root of a hierarchy, possibly total (isTotal).

ChildOfGeneralization - Another “structural” construct, related to the previous

one (it cannot be used without Generalization). It is used to specify that an Abstract is a leaf of a hierarchy.

Nest - It is a "structural" construct used to specify a nesting relationship between

StructOfAttributes.

BinaryAggregationOfAbstracts - Any binary correspondence between (two)

Abstracts. It is possible to specify optionality (isOptional1/2 ) and functionality

(isFunctional1/2 ) of the involved Abstracts as well as their role

(role1/2 ) or whether one of the Abstracts is identified in some way by such

a correspondence (isIdentified).

AggregationOfAbstracts - Any n-ary correspondence between two or more

Abstracts.

ComponentOfAggregationOfAbstracts - It states that an Abstract is one

of those involved in an AggregationOfAbstracts (and hence cannot be used

without AggregationOfAbstracts). It is possible to specify optionality (isOptional1/2

) and functionality (isFunctional1/2 ) of the involved Abstract as

well as whether the Abstract is identified in some way by such a correspondence

(isIdentified).



Lexical - Any lexical value useful to specify features of Abstract, Aggregation,

StructOfAttributes, AggregationOfAbstracts, or BinaryAggregationOfAbstracts.

It is a typed attribute (type) that could admit null values, be optional,

or be an identifier of the object it refers to (the latter is not applicable to Lexical of StructOfAttributes, BinaryAggregationOfAbstracts, and AggregationOfAbstracts).

ForeignKey - It is a “structural” construct stating the existence of some kind

of referential integrity constraints between Abstract, Aggregation and/or

StructOfAttributes, in every possible combination.

ComponentOfForeignKey - Another “structural” construct, related to the

previous one (it cannot be used without ForeignKey). It is used to specify which Lexical attributes are involved (i.e. referring and referred) in a

referential integrity constraint.

A UML class diagram of these (meta)constructs is presented in Figure 13.

Fig. 13. The Supermodel

We summarize constructs and (families of) models in Figure 14, where we

show a matrix, whose rows correspond to the constructs and columns to the

families we have experimented with.

In the cells, we use the specific name used for the construct in the family (for

example, Abstract is called Entity in the ER model). The various models within



Fig. 14. Constructs and models

a family differ from one another (i) on the basis of the presence or absence of

specific constructs and (ii) on the basis of details of (constraints on) them. To

give an example for (i) let us recall that versions of the ER model could have

generalizations, or not have them, and the OR model could have structured

columns or just simple ones. For (ii) we can just mention again the various

restrictions on relationships in the binary ER model (general vs. one-to-many),

which can be specified by means of constraints on the properties. It is also

worth mentioning that a given construct can be used in different ways (again,

on the basis of conditions on the properties) in different families: for example, a

structured attribute could be multivalued, or not, on the basis of the value of a

property isSet.

The remainder of this section is devoted to a detailed description of the various

models.

4.1 Relational

We consider a relational model with tables composed of columns of a specified

type; each column could allow null values or be part of the primary key of the

table. Moreover we can specify foreign keys between tables involving one or more

columns. Figure 15 shows a UML class diagram of the constructs allowed in the

relational model with the following correspondences:

Table - Aggregation.

Column - Lexical. We can specify the data type of the column (type) and

whether it is part of the primary key (isIdentifier) or it allows null values

(isNullable). It has a reference toward an Aggregation.



Fig. 15. The Relational model

Foreign Key - ForeignKey and ComponentOfForeignKey. With the first construct

(referencing two Aggregations) we specify the existence of a foreign key

between two tables; with the second construct (referencing one ForeignKey

and two Lexicals) we specify the columns involved in a foreign key.

4.2 Binary ER

We consider a binary ER model with entities and relationships together with

their attributes and generalizations (total or not). Each attribute could be optional

or part of the identifier of an entity. For each relationship we specify

minimum and maximum cardinality and whether an entity is externally identified

by it. Figure 16 shows a UML class diagram of the constructs allowed in the

model with the following correspondences:

Entity - Abstract.

Attribute of Entity - Lexical. We can specify the data type of the attribute

(type) and whether it is part of the identifier (isIdentifier) or it is optional

(isOptional). It refers to an Abstract.

Relationship - BinaryAggregationOfAbstracts. We can specify minimum (0 or

1 with the property isOptional) and maximum (1 or N with the property isFunctional) cardinality of the involved entities (referenced by the construct).

Moreover we can specify the role (role) of the involved entities and whether

the first entity is externally identified by the relationship (IsIdentified).

Attribute of Relationship - Lexical. We can specify the data type of the attribute

(type) and whether it is optional (isOptional). It refers to a BinaryAggregationOfAbstracts.

Generalization - Generalization and ChildOfGeneralization. With the first construct

(referencing an Abstract) we specify the existence of a generalization



Fig. 16. The binary ER model

rooted in the referenced Entity; with the second construct (referencing one

Generalization and one Abstract) we specify the children of the generalization.

We can specify whether the generalization is total or not (isTotal).

4.3 N-Ary ER

We consider an n-ary ER model with the same features of the aforementioned

binary ER. Figure 17 shows a UML class diagram of the constructs allowed in the

model with the following correspondences (we omit details already explained):

Entity - Abstract.

Attribute of Entity - Lexical.

Relationship - AggregationOfAbstracts and ComponentOfAggregationOfAbstracts.

With the first construct we specify the existence of a relationship;

with the second construct (referencing an AggregationOfAbstracts and an

Abstract) we specify the entities involved in such relationship. We can specify

minimum (0 or 1 with the property isOptional) and maximum (1 or N with

the property isFunctional) cardinality of the involved entities. Moreover we

can specify whether an entity is externally identified by the relationship

(IsIdentified).



Fig. 17. The n-ary ER model

Attribute of Relationship - Lexical. It refers to an AggregationOfAbstracts.

Generalization - Generalization and ChildOfGeneralization.

4.4 Object-Oriented

We consider an Object-Oriented model with classes, simple and reference fields.

We can also specify generalizations of classes. Figure 18 shows a UML class diagram

of the constructs allowed in the model with the following correspondences

(we omit details already explained):

Class - Abstract.

Field - Lexical.

Reference Field - AbstractAttribute. It has two references toward the referencing

Abstract and the referenced one.

Generalization - Generalization and ChildOfGeneralization.

4.5 Object-Relational

We consider a simplified version of the Object-Relational model. We merge the

constructs of our Relational and OO model, where we have typed-tables rather



Fig. 18. The OO model

than classes. Moreover we consider structured columns of tables (typed or not)

that can be nested. Reference columns must be toward a typed table but can be

part of a table (typed or not) or of a structured column. Foreign keys can involve

also typed tables and structured columns. Finally, we can specify generalizations

that can involve only typed tables. Figure 19 shows a UML class diagram of the

constructs allowed in the model with the following correspondences (we omit

details already explained):

Table - Aggregation.

Typed Table - Abstract.

Structured Column - StructOfAttributes and Nest. The structured column,

represented by a StructOfAttributes can allow null values or not (isNullable)

and can be part of a simple table or of a typed table (this is specified by its

references toward Abstract and Aggregation). We can specify nesting relationships

between structured columns by means of Nest, that has two references

toward the top StructOfAttributes and the nested one.

Column - Lexical. It can be part of (i.e. refer to) a simple table, a typed table

or a structured column.

Reference Column - AbstractAttribute. It may be part of a table (typed or

not) and of a structured column (specified by a reference) and must refer to

a typed table (i.e. it has a reference toward an Abstract).



Fig. 19. The Object-Relational model

Foreign Key - ForeignKey and ComponentOfForeignKey. With the first construct

(referencing two tables, typed or not, and a structured column) we

specify the existence of a foreign key between tables (typed or not) and

structured column; with the second construct (referencing one ForeignKey

and two Lexicals) we specify the columns involved in a foreign key.

Generalization - Generalization and ChildOfGeneralization.

4.6 XSD as a Data Model

XSD is a very powerful technique for organizing documents and data, described

by a very long specification. We consider a simplified version of the XSD language.

We are only interested in documents that can be used to store large amounts of data. Indeed, we consider documents with at least one unbounded top element. Then we deal with elements that can be simple or complex (i.e.

structured). For these elements we can specify whether they are optional or

whether they can be null (nillable according to the syntax and terminology of

XSD). Simple elements could be part of the key of the element they belong to

and have an associated type. Moreover we allow the definition of foreign keys

(key and keyref according to XSD terminology). Clearly, this representation is

highly simplified but, as we said, it could be extended with other features if there

were interest in them.



Figure 20 shows a UML class diagram of the constructs allowed in the model

with the following correspondences (we omit details already explained):

Fig. 20. The XSD language

Root Element - Abstract.

Complex Element - StructOfAttributes and Nest. The first construct represents

structured elements that can be unbounded or not (isSet), can allow

null values or not (isNullable) and can be optional (isOptional). We can

specify nesting relationships between complex elements by means of Nest,

that has two references toward the top StructOfAttributes and the nested

one.

Simple Element - Lexical. It can be part of (i.e. refer to) a root element or a

complex one.

Foreign Key - ForeignKey and ComponentOfForeignKey.

5 Operators over Schema and Models

The model-independent and model-aware representation of data models and

schemas can be the basis for many fruitful applications. Our first major application

has been the development of a model-independent approach for schema and

data translation [4] (a generic implementation of the modelgen operator, according

to Bernstein’s model management [11]). We are currently working on additional

applications, towards a more general model management system [2], the

most interesting of which is related to set operators (i.e. union, difference, intersection).

In this section we discuss the redefinition of these operators against our

construct-based representation. Let us concentrate on models first. The starting

point is clearly the definition of an equality function between constructs. Two



constructs belonging to two models are equal if and only if they correspond to

the same metaconstruct, have the same properties with the same values, and, if

they have references, they have the same references with the same values (i.e. the

same number of references, towards constructs proved to be equal). Two main

observations are needed. First, we can refer to supermodel constructs without

loss of generality, as every construct of every specific model corresponds to a

(meta)construct of the supermodel, as we said in Section 2. Second, the definition

is recursive but well defined as well, since the graph of the supermodel

(i.e. with constructs as nodes and references between constructs as edges) is

acyclic; this implies that a partial order on the constructs can be found, and

all the equality checks between constructs can be performed by traversing the graph according to such a partial order.

The union of two models is trivial, as we have simply to include in the result

the constructs of both the involved models. For difference and intersection, we

need the aforementioned definition of equality between constructs. When one

of these operators is applied, for each construct of the first model, we look for

an equal construct in the second model. If the operator is the difference, the

result is composed of all the constructs of the first model that do not have an equal construct in the second model; if the operator is the intersection, the result is composed only of the constructs of the first model that have an equal construct in the second model.
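A compact sketch of these definitions follows (the types and function names are our assumption; structural equality over the metaconstruct name, the properties, and the already-compared references mirrors the recursive definition above, which is well founded because the construct graph is acyclic):

from dataclasses import dataclass

@dataclass(frozen=True)
class Construct:
    metaconstruct: str         # supermodel construct it corresponds to
    properties: tuple          # ((name, value), ...) pairs
    references: tuple = ()     # referenced Construct instances

def union(model1, model2):
    # Union: include the constructs of both models.
    return set(model1) | set(model2)

def difference(model1, model2):
    # Difference: constructs of the first model with no equal construct in the second.
    others = set(model2)
    return {c for c in model1 if c not in others}

def intersection(model1, model2):
    # Intersection: constructs of the first model with an equal construct in the second.
    others = set(model2)
    return {c for c in model1 if c in others}

Because Construct is a frozen dataclass, equality and hashing are structural (and recursive over the referenced constructs), so the set comprehensions implement exactly the "equal construct" test described above.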

A very similar approach can be followed for set operators on schemas, which

are usually called merge and diff [11], but which we can implement in terms of union

and difference, provided they are supported by a suitable notion of equivalence.

Some care is needed to consider details, but the basic idea is that the operators

can be implemented by executing the set operations on the constructs of the

various types, where the metalevel is used to see which are the involved types,

those that are used in the model at hand.

6 Reporting

In this section we focus on another interesting application of our approach, namely

the possibility of producing reports for models and schemas, again in a manner

that is both model-independent and model-aware. Reports can be rendered as

detailed textual documentations of data organization, in a readable and machine-processable

way, or as diagrams in a graphical user interface. Again, this is possible

because of the supermodel: we visualize supermodel constructs together with

their properties, and relate them to each other by means of their references.

In this way, we could obtain a “flat” report of a model, which does not distinguish

between types of references; so, for example, the references between a

ForeignKey and the two Aggregations involved in it would be represented in the same way as a reference from a Lexical towards an Abstract. This is clearly not satisfactory. The

core idea is to classify the references into two classes: strong and weak. Instances

of constructs related by means of a strong reference (e.g. an Abstract with its

Lexicals) are presented together, while those having a weak relationship (e.g.



a ForeignKey with the Aggregations involved in it) are presented in different

elements.

In rendering reports as text, we adopt the XML format. The main advantage

of XML reports is that they are both self-documenting and machine processable

if needed. Constructs and their instances can be presented according to a partial

order on the constructs that can be found since, as we already said in the previous

section, the graph of the supermodel (i.e.withconstructsasnodes and references

between constructs as edges) is acyclic.

As we said in Section 2, a schema (as well as a model) is completely represented

by the set of its constructs. Hence, a report for a schema would include a set of

construct elements. In order to produce a report for a schema S we can consider

its constructs following a total order, C1, C2, ..., Cn, for the supermodel constructs (obtained by serializing a partial order of them). For each construct Ci, we consider its occurrences in S, and for each of them not yet inserted in the report, we add a construct element named Ci with all its properties as XML attributes. Let us consider an occurrence oij of Ci. If oij is pointed to by any strong reference, we add a set of component elements nested in the corresponding construct element: the set has a component element for each occurrence of a construct with a strong reference toward oij. If oij has any weak reference towards another occurrence of a construct, we add a set of reference elements: each element of this set corresponds to a weak reference and has the OID and name properties of the pointed occurrence as XML attributes. As an example, the textual report of the ER schema of Figure 4 would be as follows:
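An illustrative rendering, following the element structure just described and the data of Figure 4 (the exact tag and attribute names are our assumption), might read:

<schema OID="s1" name="ER Schema">
  <Abstract OID="e1" name="Employee">
    <component construct="Lexical" OID="a1" name="EN" type="int" isIdentifier="true"/>
    <component construct="Lexical" OID="a2" name="Name" type="string" isIdentifier="false"/>
  </Abstract>
  <Abstract OID="e2" name="Project">
    ...
  </Abstract>
  <BinaryAggregationOfAbstracts OID="r1" name="Membership" isOptional1="false" isFunctional1="false">
    <reference name="Abstract1" OID="e1"/>
    <reference name="Abstract2" OID="e2"/>
  </BinaryAggregationOfAbstracts>
</schema>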




















As we already said, a second option for report rendering is through a visual

graph. A few examples, for different models, are shown in Figures 21, 22, and 23.

Fig. 21. An ER schema

Fig. 22. An OO schema



Fig. 23. An XML-Schema

The rationale is the same as for textual reports:

– visualization is model independent as it is defined for all schemas of all models

in the same way: strong references lead to embedding the “component”

construct within the “containing” one, whereas weak references lead to separate

graphical objects, connected by means of arrows;

– visualization is model aware, in two senses: first of all, as usual, the specific

features of each model are taken into account; second, and more important,

for each family of models it is possible to associate a specific shape with

each construct, thus following the usual representation for the model (see for

example the usual notation for relationships in the ER model in Figure 21).

An extra feature of the graphical visualization is the possibility of representing instances of schemas also by means of a "relational" representation that follows straightforwardly from our construct-based modeling.

7 Conclusions

We have shown how a metamodel approach can be the basis for a number of model-generic and model-aware techniques for the solution of interesting problems. We have presented the dictionary we use to store our schemas and models, and a specific supermodel (a data model that generalizes all models of interest modulo construct renaming). This is the basis for the specification and implementation of interesting high-level operations, such as schema translation as well as



set-theoretic union and difference. Another interesting application is the development

of generic visualization and reporting features.

Acknowledgement

We would like to thank Phil Bernstein for many useful discussions during the

preliminary development of this work.



Data Mining Using Graphics Processing Units

Christian Böhm 1 , Robert Noll 1 , Claudia Plant 2 ,

Bianca Wackersreuther 1 , and Andrew Zherdin 2

1 University of Munich, Germany

{boehm,noll,wackersreuther}@dbs.ifi.lmu.de

2 Technische Universität München, Germany

{plant,zherdin}@lrz.tum.de

Abstract. During the last few years, Graphics Processing Units (GPUs) have evolved from simple devices for display signal preparation into powerful coprocessors that not only support typical computer graphics tasks such as the rendering of 3D scenes but can also be used for general numeric and symbolic computation tasks such as simulation and optimization. As a major advantage, GPUs provide extremely high parallelism (with several hundred simple programmable processors) combined with high memory-transfer bandwidth at low cost. In this paper,

we propose several algorithms for computationally expensive data mining

tasks like similarity search and clustering which are designed for the

highly parallel environment of a GPU. We define a multidimensional index

structure which is particularly suited to support similarity queries

under the restricted programming model of a GPU, and define a similarity

join method. Moreover, we define highly parallel algorithms for

density-based and partitioning clustering. In an extensive experimental evaluation, we demonstrate the superiority of our algorithms running on the GPU over their conventional counterparts on the CPU.

1 Introduction

In recent years, Graphics Processing Units (GPUs) have evolved from simple devices for display signal preparation into powerful coprocessors supporting

the CPU in various ways. Graphics applications such as realistic 3D games are

computationally demanding and require a large number of complex algebraic operations

for each update of the display image. Therefore, today’s graphics hardware

contains a large number of programmable processors which are optimized

to cope with this high workload of vector, matrix, and symbolic computations

in a highly parallel way. In terms of peak performance, the graphics hardware

has outperformed state-of-the-art multi-core CPUs by a large margin.

The amount of scientific data is approximately doubling every year [26]. To

keep pace with the exponential data explosion, there is a great effort in many

research communities such as life sciences [20,22], mechanical simulation [27],

cryptographic computing [2], or machine learning [7] to use the computational

capabilities of GPUs even for purposes which are not at all related to computer




graphics. The corresponding research area is called General-Purpose computation on Graphics Processing Units (GP-GPU).

In this paper, we focus on exploiting the computational power of GPUs for

data mining. Data Mining consists of ’applying data analysis algorithms, that,

under acceptable efficiency limitations, produce a particular enumeration of patterns

over the data’ [9]. The exponential increase in data does not necessarily

come along with a correspondingly large gain in knowledge. The evolving research

area of data mining proposes techniques to support transforming the

raw data into useful knowledge. Data mining has a wide range of scientific and

commercial applications, for example in neuroscience, astronomy, biology, marketing,

and fraud detection. The basic data mining tasks include classification,

regression, clustering, outlier identification, as well as frequent itemset and association

rule mining. Classification and regression are called supervised data

mining tasks, because the aim is to learn a model for predicting a predefined

variable. The other techniques are called unsupervised, because the user does not

previously identify any of the variables to be learned. Instead, the algorithms

have to automatically identify any interesting regularities and patterns in the

data. Clustering is probably the most common unsupervised data mining task.

The goal of clustering is to find a natural grouping of a data set such that data

objects assigned to a common group, called a cluster, are as similar as possible and

objects assigned to different clusters differ as much as possible. Consider for

example the set of objects visualized in Figure 1. A natural grouping would be

assigning the objects to two different clusters. Two outliers not fitting well to

any of the clusters should be left unassigned. As for most data mining algorithms,

the definition of clustering requires specifying some notion of similarity among

objects. In most cases, the similarity is expressed in a vector space, called the

feature space. In Figure 1, we indicate the similarity among objects by representing

each object by a vector in two-dimensional space. Characterizing numerical properties (from a continuous space) are extracted from the objects and combined into a vector x ∈ R^d, where d is the dimensionality of the space and, equivalently, the number of properties which have been extracted. For instance, Figure 2 shows a feature transformation where the object is a certain kind of orchid. The phenotype of orchids can be characterized using the lengths and widths of the two petal and the three sepal leaves, the form (curvature) of the labellum, and the colors of the different compartments.

Fig. 1. Example for Clustering



Fig. 2. The Feature Transformation

In this example, 5 features are measured, and each object is thus transformed into a 5-dimensional vector space. To measure the similarity between two feature vectors, usually a distance function like the Euclidean metric is used. To search a large database for objects which are similar to a given query object (for instance, to search for a number k of nearest neighbors, or for those objects having a distance that does not exceed a threshold ɛ), usually multidimensional index structures are applied. By a hierarchical organization of the data set, the search is made efficient. The well-known indexing methods (e.g., the R-tree [13]) are designed and optimized for secondary storage (hard disks) or for main memory. For use on the GPU, specialized indexing methods are required because of the highly parallel but restricted programming environment. In this paper, we propose such an indexing method.

It has been shown that many data mining algorithms, including clustering, can be supported by a powerful database primitive: the similarity join [3]. This operator yields as its result all pairs of objects in the database having a distance of less than some predefined threshold ɛ. To show that the more complex basic operations of similarity search and data mining can also be supported by novel parallel algorithms specially designed for the GPU, we propose two algorithms for the similarity join, one being a nested block loop join, and one being an indexed loop join utilizing the aforementioned indexing structure.

Finally, to demonstrate that highly complex data mining tasks can be efficiently

implemented using novel parallel algorithms, we propose parallel versions

of two widespread clustering algorithms. We demonstrate how the density-based

clustering algorithm DBSCAN [8] can be effectively supported by the parallel

similarity join. In addition, we introduce a parallel version of K-means clustering

[21], which follows an algorithmic paradigm very different from density-based clustering. We demonstrate the superiority of our approaches over the corresponding sequential algorithms on the CPU.

All algorithms for the GPU have been implemented using NVIDIA’s Compute Unified Device Architecture (CUDA) technology [1]. Vendors of graphics hardware have recently anticipated the trend towards general-purpose computing on the GPU and developed libraries, pre-compilers, and application programming interfaces to support GP-GPU applications. CUDA offers a programming interface for the C programming language in which both the host program and the kernel functions are assembled in a single program [1]. The host program is the main program, executed on the CPU. In contrast, the so-called kernel functions are executed in a massively parallel fashion on the (hundreds of) processors in the GPU. An analogous technique is also offered by ATI under the brand names Close-to-Metal, Stream SDK, and Brook-GP.

The remainder of this paper is organized as follows: Section 2 reviews the

related work in GPU processing in general with particular focus on database

management and data mining. Section 3 explains the graphics hardware and

the CUDA programming model. Section 4 develops a multidimensional index structure for similarity queries on the GPU. Section 5 presents the non-indexed and indexed join on graphics hardware. Sections 6 and 7 are dedicated to

GPU-capable algorithms for density-based and partitioning clustering. Section 8

contains an extensive experimental evaluation of our techniques, and Section 9

summarizes the paper and provides directions for future research.

2 Related Work

In this section, we survey the related research in general purpose computations

using GPUs with particular focus on database management and data mining.

General-Purpose Computation on Graphics Processing Units. Theoretically, GPUs are capable of performing any computation that can be transformed to this model of parallelism and that allows for the specific architecture of the GPU. This

model has been exploited for multiple research areas. Liu et al. [20] present a

new approach to high performance molecular dynamics simulations on graphics

processing units by the use of CUDA to design and implement a new parallel

algorithm. Their results indicate a significant performance improvement on an

NVIDIA GeForce 8800 GTX graphics card over sequential processing on CPU.

Another paper on computations from the field of life sciences has been published

by Manavski and Valle [22]. The authors propose an extremely fast solution

of the Smith-Waterman algorithm, a procedure for searching for similarities in

protein and DNA databases, running on GPU and implemented in the CUDA

programming environment. Significant speedups are achieved on a workstation

running two GeForce 8800 GTX.

Another widespread application area that uses the processing power of the

GPU is mechanical simulation. One example is the work by Tascora et al. [27],

which presents a novel method for solving large cone complementarity problems by means of a fixed-point iteration algorithm, in the context of simulating the frictional contact dynamics of large systems of rigid bodies. Like the previously reviewed approaches from the field of life sciences, the algorithm is also implemented in

CUDA for a GeForce 8800 GTX to simulate the dynamics of complex systems.

To demonstrate the nearly boundless possibilities of performing computations

on the GPU, we introduce one more example, namely cryptographic computing

[2]. In this paper, the authors present a record-breaking performance for



the elliptic curve method (ECM) of integer factorization. The speedup takes

advantage of two NVIDIA GTX 295 graphics cards, using a new ECM implementation

relying on new parallel addition formulas and functions that are made

available by CUDA.

Database Management Using GPUs. Some papers propose techniques to

speed up relational database operations on GPU. In [14] some algorithms for the

relational join on an NVIDIA G80 GPU using CUDA are presented.

Two recent papers [19,4] address the topic of similarity join in feature space

which determines all pairs of objects from two different sets R and S fulfilling a

certain join predicate. The most common join predicate is the ɛ-join which determines

all pairs of objects having a distance of less than a predefined threshold

ɛ. The authors of [19] propose an algorithm based on the concept of space filling

curves, e.g. the z-order, for pruning of the search space, running on an NVIDIA

GeForce 8800 GTX using the CUDA toolkit. The z-order of a set of objects

can be determined very efficiently on GPU by highly parallelized sorting. Their

algorithm operates on a set of z-lists of different granularity for efficient pruning.

However, since all dimensions are treated equally, performance degrades in

higher dimensions. In addition, due to uniform space partitioning in all areas

of the data space, space filling curves are not suitable for clustered data. An

approach that overcomes that kind of problem is presented in [4]. Here the authors

parallelize the baseline technique underlying any join operation with an

arbitrary join predicate, namely the nested loop join (NLJ), a powerful database

primitive that can be used to support many applications including data mining.

All experiments are performed on NVIDIA 8500GT graphics processors by the

use of a CUDA-supported implementation.

Govindaraju et al. [10,11] demonstrate that important building blocks for

query processing in databases, e.g. sorting, conjunctive selections, aggregations,

and semi-linear queries, can be significantly sped up by the use of GPUs.

Data Mining Using GPUs. Recent approaches concerning data mining on the GPU include two papers on clustering that do not rely on CUDA. In [6] a clustering approach on an NVIDIA GeForce 6800 GT graphics card is presented that extends the basic idea of K-means by calculating the distances from a single input centroid to all objects at once, which can be done simultaneously on the GPU. Thus the authors are able to exploit the high computational power and pipelining of GPUs, especially for core operations like distance computations and comparisons. An additional efficient method designed for clustering data streams demonstrates the wide practical applicability of clustering on the GPU.
The paper [25] parallelizes the K-means algorithm for the GPU by using multi-pass rendering and multiple shader program constants. The implementation on NVIDIA 5900 and NVIDIA 8500 graphics processors achieves significant performance gains for various data sizes and cluster sizes. However, unlike CUDA-based approaches, the algorithms of both papers are not portable to different GPU models.



3 Architecture of the GPU

Graphics Processing Units (GPUs) of the newest generation are powerful coprocessors,

not only designed for games and other graphics-intensive applications,

but also for general-purpose computing (in this case, we call them GP-GPUs).

From the hardware perspective, a GPU consists of a number of multiprocessors,

each of which consists of a set of simple processors which operate in a SIMD

fashion, i.e. all processors of one multiprocessor execute in a synchronized way

the same arithmetic or logic operation at the same time, potentially operating

on different data. For instance, the GPU of the newest generation GT200 (e.g.

on the graphics card Geforce GTX280) has 30 multiprocessors, each consisting

of 8 SIMD-processors, summing to a total of 240 processors inside

one GPU. The computational power sums up to a peak performance of 933

GFLOP/s.

3.1 The Memory Model

Apart from some memory units with special purpose in the context of graphics

processing (e.g. texture memory), we have three important types of memory, as

visualized in Figure 3. The shared memory (SM) is a memory unit with fast

access (at the speed of register access, i.e. no delay). SM is shared among all

processors of a multiprocessor. It can be used for local variables but also to

exchange information between threads on different processors of the same multiprocessor.

It cannot be used for information which is shared among threads

on different multiprocessors. SM is fast but very limited in capacity (16 KBytes

per multiprocessor). The second kind of memory is the so-called device memory

(DM), which is the actual video RAM of the graphics card (also used for

frame buffers etc.). DM is physically located on the graphics card (but not inside

the GPU), is significantly larger than SM (typically up to some hundreds

of MBytes), but also significantly slower. In particular, memory accesses to DM

cause a typical latency delay of 400-600 clock cycles (on the GT200 GPU, corresponding to 300-500 ns). The bandwidth for transferring data between DM and GPU (141.7 GB/s on the GT200) is higher than that between CPU and main memory (about 10

GB/s on current CPUs). DM can be used to share information between threads

Fig. 3. Architecture of a GPU



on different multiprocessors. If some threads schedule memory accesses from contiguous

addresses, these accesses can be coalesced, i.e. taken together to improve

the access speed. A typical cooperation pattern for DM and SM is to copy the

required information from DM to SM simultaneously from different threads (if

possible, considering coalesced accesses), then to let each thread compute the

result on SM, and finally, to copy the result back to DM. The third kind of

memory considered here is the main memory which is not part of the graphics

card. The GPU has no access to the address space of the CPU. The CPU can

only write to or read from DM using specialized API functions. In this case, the

data packets have to be transferred via the Front Side Bus and the PCI-Express

Bus. The bandwidth of these bus systems is strictly limited, and therefore, these

special transfer operations are considerably more expensive than direct accesses

of the GPU to DM or direct accesses of the CPU to main memory.
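To make the cooperation pattern between DM and SM concrete, the following CUDA sketch (our illustration, assuming one-dimensional data, a tile size of 64, and a simple counting task; it is not code from the system described here) stages the data tile by tile from DM into SM with cooperative, coalesced loads, synchronizes the thread group, and then lets every thread of the group work on the staged tile before the results are written back to DM.

#define TILE 64  // assumed tile size; launch with 64 threads per block so that blockDim.x == TILE

// Each thread owns one query value q and counts the data values within distance eps of it.
__global__ void count_near(const float* data, int n, float eps, int* counts) {
    __shared__ float tile[TILE];                        // staging area in shared memory (SM)
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float q = (tid < n) ? data[tid] : 0.0f;
    int count = 0;
    for (int base = 0; base < n; base += TILE) {
        int i = base + threadIdx.x;
        tile[threadIdx.x] = (i < n) ? data[i] : 0.0f;   // cooperative, coalesced load DM -> SM
        __syncthreads();                                // wait until the tile is complete
        int limit = min(TILE, n - base);
        for (int j = 0; j < limit; ++j)
            if (fabsf(tile[j] - q) <= eps) ++count;     // every thread works on the SM copy
        __syncthreads();                                // finish before the next tile overwrites SM
    }
    if (tid < n) counts[tid] = count;                   // write the result back to DM
}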

3.2 The Programming Model

The basis of the programming model of GPUs is the thread. Threads are

lightweight processes which are easy to create and to synchronize. In contrast to

CPU processes, the generation and termination of GPU threads as well as context

switches between different threads do not cause any considerable overhead. In typical applications, thousands or even millions of threads are created,

for instance one thread per pixel in gaming applications. It is recommended

to create a number of threads which is even much higher than the number of

available SIMD-processors because context switches are also used to hide the

latency delay of memory accesses: in particular, an access to the DM may cause

a latency delay of 400-600 clock cycles, and during that time, a multiprocessor

may continue its work with other threads. The CUDA programming library [1]

contains API functions to create a large number of threads on the GPU, each of

which executes a function called kernel function. The kernel functions (which are

executed in parallel on the GPU) as well as the host program (which is executed

sequentially on the CPU) are defined in an extended syntax of the C programming

language. The kernel functions are restricted with respect to functionality

(e.g. no recursion).

On GPUs the threads do not even have an individual instruction pointer. An

instruction pointer is rather shared by several threads. For this purpose, threads

are grouped into so-called warps (typically 32 threads per warp). One warp is

processed simultaneously on the 8 processors of a single multiprocessor (SIMD)

using 4-fold pipelining (totalling 32 threads executed fully synchronously). If

not all threads in a warp follow the same execution path, the different execution

paths are executed in a serialized way. The number (8) of SIMD-processors per

multiprocessor as well as the concept of 4-fold pipelining is constant on all current

CUDA-capable GPUs.

Multiple warps are grouped into thread groups (TG). It is recommended [1]

to use multiples of 64 threads per TG. The different warps in a TG (as well

as different warps of different TGs) are executed independently. The threads in

one thread group use the same shared memory and may thus communicate and



share data via the SM. The threads in one thread group can be synchronized

(let all threads wait until all warps of the same group have reached that point

of execution). The latency delay of the DM can be hidden by scheduling other

warps of the same or a different thread group whenever one warp waits for an

access to DM. To allow switching between warps of different thread groups on

a multiprocessor, it is recommended that each thread uses only a small fraction

of the shared memory and registers of the multiprocessor [1].
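As a minimal illustration of this host/kernel split, the following hedged CUDA fragment launches one thread per data point in thread groups of 64 threads; the kernel body is only a placeholder and the helper name launch_example is ours, not part of CUDA.

// One thread per data point, organized in thread groups (blocks) of 64 threads.
__global__ void example_kernel(const float* data, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;    // global thread ID
    if (tid >= n) return;                               // surplus threads of the last group exit
    // ... per-point work on data[tid] would go here ...
}

void launch_example(const float* d_data, int n) {
    int threadsPerGroup = 64;                                    // a multiple of the warp size 32
    int groups = (n + threadsPerGroup - 1) / threadsPerGroup;    // enough groups for n threads
    example_kernel<<<groups, threadsPerGroup>>>(d_data, n);
    cudaDeviceSynchronize();                                     // wait for all threads to finish
}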

3.3 Atomic Operations

In order to synchronize parallel processes and to ensure the correctness of parallel

algorithms, CUDA offers atomic operations such as increment, decrement, or

exchange (to name, out of the large number of atomic operations, just those which will be needed by our algorithms). Most of the atomic operations work on integer

data types in Device Memory. However, the newest version of CUDA (Compute

Capability 1.3 of the GPU GT200) allows even atomic operations in SM. If, for

instance, some parallel processes share a list as a common resource with concurrent

reading and writing from/to the list, it may be necessary to (atomically)

increment a counter for the number of list entries (which is in most cases also

used as the pointer to the first free list element). Atomicity implies in this case

the following two requirements: If two or more threads increment the list counter,

then (1) the value of the counter after all concurrent increments must be equal to

the value before plus the number of concurrent increment operations. And, (2),

each of the concurrent threads must obtain a separate result of the increment

operation which indicates the index of the empty list element to which the thread

can write its information. Therefore, most atomic operations return a result after

their execution. For instance the operation atomicInc has two parameters, the

address of the counter to be incremented, and an optional threshold value which

must not be exceeded by the operation. The operation works as follows: The

counter value at the address is read, and incremented (provided that the threshold

is not exceeded). Finally, the old value of the counter (before incrementing) is

returned to the kernel method which invoked atomicInc. If two or more threads

(of the same or different thread groups) call some atomic operations simultaneously,

the result of these operations is that of an arbitrary sequentialization

of the concurrent operations. The operation atomicDec works in an analogous

way. The operation atomicCAS performs a Compare-and-Swap operation. It has

three parameters, an address, a compare value and a swap value. If the value

at the address equals the compare value, the value at the address is replaced by

the swap value. In every case, the old value at the address (before swapping) is

returned to the invoking kernel method.
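To make the role of the returned old value concrete, the following hedged CUDA fragment lets each thread that finds a qualifying element claim the next free slot of a common result list. For simplicity it uses atomicAdd, which likewise returns the previous counter value, instead of atomicInc with its threshold parameter; the filter condition and the capacity check are assumptions of the sketch.

// Hedged sketch: appending to a common list; the old counter value serves as the slot index.
__global__ void append_example(const int* input, int n, int* list, int* list_count, int capacity) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    if (input[tid] % 2 == 0) {                    // arbitrary filter condition for the example
        int slot = atomicAdd(list_count, 1);      // returns the old value = index of the claimed slot
        if (slot < capacity)                      // guard against overflowing the buffer
            list[slot] = input[tid];
    }
}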

4 An Index Structure for Similarity Queries on GPU

Many data mining algorithms for problems like classification, regression, clustering,

and outlier detection use similarity queries as a building block. In many



cases, these similarity queries even represent the largest part of the computational

effort of the data mining tasks, and, therefore, efficiency is of high importance

here. Similarity queries are defined as follows: Given is a database

D = {x1, . . . , xn} ⊆ R^d of a number n of vectors from a d-dimensional space, and a query object q ∈ R^d. We distinguish between two different kinds of similarity queries, range queries and nearest-neighbor queries:

Definition 1 (Range Query)
Let ɛ ∈ R^+_0 be a threshold value. The result of the range query is the set of the following objects:

Nɛ(q) = {x ∈ D : ||x − q|| ≤ ɛ},

where ||x − q|| is an arbitrary distance function between two feature vectors x and q, e.g. the Euclidean distance.

Definition 2 (Nearest Neighbor Query)
The result of a nearest neighbor query is the set:

NN(q) = {x ∈ D : ∀x′ ∈ D : ||x − q|| ≤ ||x′ − q||}.

Definition 2 can also be generalized to the k-nearest neighbor query (NN_k(q)), where a number k of nearest neighbors of the query object q is retrieved.
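For illustration, the range predicate of Definition 1 can be evaluated for a single pair of vectors by device functions such as the following hedged sketch; comparing squared distances against ɛ² to avoid the square root is our simplification, not part of the definition.

// Squared Euclidean distance and eps-range test for d-dimensional float vectors (sketch).
__device__ float squared_dist(const float* x, const float* q, int d) {
    float sum = 0.0f;
    for (int j = 0; j < d; ++j) {
        float diff = x[j] - q[j];
        sum += diff * diff;
    }
    return sum;
}

__device__ bool in_eps_range(const float* x, const float* q, int d, float eps) {
    // ||x - q|| <= eps  is equivalent to  ||x - q||^2 <= eps^2, which avoids the sqrt
    return squared_dist(x, q, d) <= eps * eps;
}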

The performance of similarity queries can be greatly improved if a multidimensional

index structure supporting the similarity search is available. Our

index structure needs to be traversed in parallel for many search objects using

the kernel function. Since kernel functions do not allow any recursion, and as

they need to have a small storage overhead for local variables etc., the index structure

must be kept very simple as well. To achieve a good compromise between

simplicity and selectivity of the index, we propose a data partitioning method

with a constant number of directory levels. The first level partitions the data set

D according to the first dimension of the data space, the second level according

to the second dimension, and so on. Therefore, before starting the actual

data mining method, some transformation technique should be applied which

guarantees a high selectivity in the first dimensions (e.g. Principal Component

Analysis, Fast Fourier Transform, Discrete Wavelet Transform, etc.). Figure 4

shows a simple, 2-dimensional example of a 2-level directory (plus the root node

which is considered as level-0), similar to [16,18]. The fanout of each node is 8.

In our experiments in Section 8, we used a 3-level directory with fanout 16.

Before starting the actual data mining task, our simple index structure must

be constructed in a bottom-up way by fractionated sorting of the data: First,

the data set is sorted according to the first dimension, and partitioned into the

specified number of quantile partitions. Then, each of the partitions is sorted

individually according to the second dimension, and so on. The boundaries are

stored using simple arrays which can be easily accessed in the subsequent kernel

functions. In principle, this index construction can already be done on the GPU,

because efficient sorting methods for the GPU have been proposed [10]. Since bottom-up index construction is typically not very costly compared to the data mining algorithm, our method performs this preprocessing step on the CPU.

Fig. 4. Index Structure for GPU
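A hedged, CPU-side sketch of this bottom-up construction for a two-level directory (as in the example of Figure 4) is given below; the row-major data layout, the use of the C library function qsort, and the plain boundary arrays are assumptions of the example, not the authors' actual implementation.

#include <stdlib.h>

#define D 8          /* dimensionality (assumed) */
#define FANOUT 16    /* partitions per directory level (as in Section 8) */

static int g_sort_dim;                       /* dimension used by the comparator */

static int cmp_points(const void* a, const void* b) {
    const float* pa = (const float*)a;
    const float* pb = (const float*)b;
    if (pa[g_sort_dim] < pb[g_sort_dim]) return -1;
    if (pa[g_sort_dim] > pb[g_sort_dim]) return 1;
    return 0;
}

/* Fractionated sorting: sort all of data[n][D] on dimension 0, cut it into FANOUT quantile
 * partitions, then sort each partition individually on dimension 1. The arrays bounds0 and
 * bounds1 receive the partition boundaries that are later copied to the GPU. */
void build_index(float* data, int n, float bounds0[FANOUT + 1],
                 float bounds1[FANOUT][FANOUT + 1]) {
    g_sort_dim = 0;
    qsort(data, n, D * sizeof(float), cmp_points);
    for (int p = 0; p <= FANOUT; ++p) {
        int idx = (p < FANOUT) ? (int)((long long)p * n / FANOUT) : n - 1;
        bounds0[p] = data[idx * D + 0];
    }
    for (int p = 0; p < FANOUT; ++p) {
        int lo = (int)((long long)p * n / FANOUT);
        int hi = (int)((long long)(p + 1) * n / FANOUT);
        g_sort_dim = 1;
        qsort(data + (size_t)lo * D, hi - lo, D * sizeof(float), cmp_points);
        for (int q = 0; q <= FANOUT; ++q) {
            int m = hi - lo;
            int off = (q < FANOUT) ? (int)((long long)q * m / FANOUT) : m - 1;
            bounds1[p][q] = data[(lo + off) * D + 1];
        }
    }
}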

When transferring the data set from the main memory into the device memory

in the initialization step of the data mining method, our new method additionally has to transfer the directory (i.e. the arrays in which the coordinates

of the page boundaries are stored). Compared to the complete data set, the

directory is always small.

The most important change in the kernel functions in our data mining methods

regards the determination of the ɛ-neighborhood of some given seed object

q, which is done by exploiting SIMD-parallelism inside a multiprocessor. In the

non-indexed version, this is done by a set of threads (inside a thread group) each

of which iterates over a different part of the (complete) data set. In the indexed

version, one of the threads iterates in a set of nested loops (one loop for each level

of the directory) over those nodes of the index structure which represent regions

of the data space which are intersected by the neighborhood-sphere of Nɛ(q). In

the innermost loop, we have one set of points (corresponding to a data page of

the index structure) which is processed by exploiting the SIMD-parallelism, like

in the non-indexed version.
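The pruning test applied at each directory level reduces to a one-dimensional interval check; a hedged device-function sketch (the boundary representation is an assumption) could be:

// Does the eps-neighborhood sphere around q intersect the slab [lo, hi] that a directory
// node covers in the dimension handled by the current index level?
__device__ bool slab_intersects(float q_coord, float lo, float hi, float eps) {
    return (q_coord >= lo - eps) && (q_coord <= hi + eps);
}

A node whose slab fails this test can be skipped together with its complete subtree, which is what makes the nested loops over the directory levels selective.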

5 The Similarity Join

The similarity join is a basic operation of a database system designed for similarity

search and data mining on feature vectors. In such applications, we are

given a database D of objects which are associated with a vector from a multidimensional

space, the feature space. The similarity join determines pairs of

objects which are similar to each other. The most widespread form is the ɛ-join

which determines those pairs from D×D which have a Euclidean distance of no

more than a user-defined radius ɛ:

Definition 3 (Similarity Join). Let D ⊆ R^d be a set of feature vectors of a d-dimensional vector space and ɛ ∈ R^+_0 be a threshold. Then the similarity join is the following set of pairs:

SimJoin(D, ɛ) = {(x, x′) ∈ (D × D) : ||x − x′|| ≤ ɛ}.



If x and x′ are elements of the same set, the join is a similarity self-join. Most algorithms, including the method proposed in this paper, can be generalized to the more general case of non-self-joins in a straightforward way. Algorithms

for a similarity join with nearest neighbor predicates have also been proposed.

The similarity join is a powerful building block for similarity search and data

mining. It has been shown that important data mining methods such as clustering

and classification can be based on the similarity join. Using a similarity join

instead of single similarity queries can accelerate data mining algorithms by a

high factor [3].

5.1 Similarity Join without Index Support

The baseline technique to process any join operation with an arbitrary join

predicate is the nested loop join (NLJ) which performs two nested loops, each

enumerating all points of the data set. For each pair of points, the distance is

calculated and compared to ɛ. The pseudocode of the sequential version of NLJ

is given in Figure 5.

algorithm sequentialNLJ(data set D)

for each q ∈ D do // outer loop
for each x ∈ D do // inner loop: search all points x which are similar to q

if dist(x, q) ≤ ɛ then

report (x, q) as a result pair or do some further processing on (x, q)

end

Fig. 5. Sequential Algorithm for the Nested Loop Join

It is easily possible to parallelize the NLJ, e.g. by creating an individual thread

for each iteration of the outer loop. The kernel function then contains the inner

loop, the distance calculation and the comparison. During the complete run of

the kernel function, the current point of the outer loop is constant, and we call

this point the query point q of the thread, because the thread operates like a

similarity query, in which all database points with a distance of no more than

ɛ from q are searched. The query point q is always held in a register of the

processor.

Our GPU allows a truly parallel execution of a number m of incarnations

of the outer loop, where m is the total number of ALUs of all multiprocessors

(i.e. the warp size 32 times the number of multiprocessors). Moreover, all the

different warps are processed in a quasi-parallel fashion, which makes it possible to operate

on one warp of threads (which is ready-to-run) while another warp is blocked

due to the latency delay of a DM access of one of its threads.

The threads are grouped into thread groups, which share the SM. In our

case, the SM is particularly used to physically store for each thread group the

current point x of the inner loop. Therefore, a kernel function first copies the

current point x from the DM into the SM, and then determines the distance

of x to the query point q. The threads of the same warp are running perfectly



simultaneously, i.e. if these threads are copying the same point from DM to SM,

this needs to be done only once (but all threads of the warp have to wait until

this relatively costly copy operation is performed). However, a thread group may

(and should) consist of multiple warps. To ensure that the copy operation is only

performed once per thread group, it is necessary to synchronize the threads of

the thread group before and after the copy operation using the API function

synchronize(). This API function blocks all threads in the same TG until all

other threads (of other warps) have reached the same point of execution. The

pseudocode for this algorithm is presented in Figure 6.

algorithm GPUsimpleNLJ(data set D) // host program executed on CPU

deviceMem float D ′ [][] := D[][]; // allocate memory in DM for the data set D

#threads := n; // number of points in D

#threadsPerGroup := 64;

startThreads (simpleNLJKernel, #threads, #threadsPerGroup); // one thread per point

waitForThreadsToFinish();

end.

kernel simpleNLJKernel (int threadID)

register float q[] := D ′ [threadID][]; // copy the point from DM into the register

// and use it as query point q

// index is determined by the threadID

for i := 0 ... n − 1 do // this used to be the inner loop in Figure 5

synchronizeThreadGroup();

shared float x[] := D ′ [i][]; // copy the current point x from DM to SM

synchronizeThreadGroup(); // Now all threads of the thread group can work with x

if dist(x, q) ≤ ɛ then

report (x, q) as a result pair using synchronized writing

or do some further processing on (x, q) directly in kernel

end.

Fig. 6. Parallel Algorithm for the Nested Loop Join on the GPU

If the data set does not fit into DM, a simple partitioning strategy can be

applied. It must be ensured that the potential join partners of an object are

within the same partition as the object itself. Therefore, partitions overlapping by 2 · ɛ can be created.
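One hedged way to realize this on the first (sorted) dimension is sketched below: the value range is cut into core intervals, each point is processed with the chunk whose core interval contains it, and each chunk loaded into DM is extended by ɛ on both sides so that all join partners of its core points are guaranteed to be present; the number of chunks is an assumption of the example.

/* Compute the core intervals and the (by eps) extended load intervals of num_chunks
 * partitions over the value range [min0, max0] of the first dimension. Neighboring
 * load intervals overlap by 2 * eps. */
void partition_bounds(float min0, float max0, float eps, int num_chunks,
                      float* core_lo, float* core_hi, float* load_lo, float* load_hi) {
    float width = (max0 - min0) / num_chunks;
    for (int i = 0; i < num_chunks; ++i) {
        core_lo[i] = min0 + i * width;       /* points assigned to chunk i (outer loop) */
        core_hi[i] = min0 + (i + 1) * width;
        load_lo[i] = core_lo[i] - eps;       /* candidate points loaded into device memory */
        load_hi[i] = core_hi[i] + eps;
    }
}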

5.2 An Indexed Parallel Similarity Join Algorithm on GPU

The performance of the NLJ can be greatly improved if an index structure is

available as proposed in Section 4. On sequential processing architectures, the

indexed NLJ leaves the outer loop unchanged. The inner loop is replaced by an

index-based search retrieving candidates that may be join partners of the current

object of the outer loop. The effort of finding these candidates and refining them

is often orders of magnitude smaller compared to the non-indexed NLJ.

When parallelizing the indexed NLJ for the GPU, we follow the same paradigm

as in the last section, to create an individual thread for each point of the outer

loop. It is beneficial to performance if points having a small distance to each

other are collected in the same warp and thread group, because for those points,

similar paths in the index structure are relevant.



After index construction, we not only have a directory in which the points are organized in a way that facilitates search; the points are also clustered in the array, i.e. points with neighboring addresses are likely to be close together in the data space (at least when projecting on the

first few dimensions). Both effects are exploited by our join algorithm displayed

in Figure 7.

algorithm GPUindexedJoin(data set D)

deviceMem index idx := makeIndexAndSortData(D); // changes ordering of data points

int #threads := |D|, #threadsPerGroup := 64;

for i := 1 ... (#threads/#threadsPerGroup) do
deviceMem float blockbounds[i][] := calcBlockBounds(D, i);

deviceMem float D ′ [][] := D[][];

startThreads (indexedJoinKernel, #threads, #threadsPerGroup); // one thread per data point

waitForThreadsToFinish ();

end.

algorithm indexedJoinKernel (int threadID, int blockID)

register float q[] := D ′ [threadID][]; // copy the point from DM into the register

shared float myblockbounds[] := blockbounds[blockID][];

for xi := 0 ... indexsize.x do

if IndexPageIntersectsBoundsDim1(idx,myblockbounds,xi) then

for yi := 0 ... indexsize.y do

if IndexPageIntersectsBoundsDim2(idx,myblockbounds,xi,yi) then

for zi := 0 ... indexsize.z do

if IndexPageIntersectsBoundsDim3(idx,myblockbounds,xi,yi,zi) then

for w := 0 ... IndexPageSize do

synchronizeThreadGroup();

shared float p[] := GetPointFromIndexPage(idx, D ′, xi, yi, zi, w);

synchronizeThreadGroup();

if dist(p, q) ≤ ɛ then

report (p, q) as a result pair using synchronized writing

end.

Fig. 7. Algorithm for Similarity Join on GPU with Index Support

Instead of performing an outer loop like in a sequential indexed NLJ, our

algorithm now generates a large number of threads: One thread for each iteration

of the outer loop (i.e. for each query point q). Since the points in the array

are clustered, the corresponding query points are close to each other, and the

join partners of all query points in a thread group are likely to reside in the

same branches of the index as well. Our kernel method now iterates over three

loops, each loop for one index level, and determines for each partition if the

point is inside the partition or at least no more distant from its boundary than ɛ. The corresponding subnode is accessed if the corresponding partition can contain join partners of the current point of the thread. Since the warps operate in a fully synchronized way, a node is accessed whenever at least one of the query points of the warp is close enough to (or inside) the

corresponding partition.

For both methods, indexed and non-indexed nested loop join on GPU, we

need to address the question of how the resulting pairs are processed. Often, for

example to support density-based clustering (cf. Section 6), it is sufficient to

return a counter with the number of join partners. If the application requires reporting the pairs themselves, this is easily possible using a buffer in DM which can

be copied to the CPU after the termination of all kernel threads. The result pairs

must be written into this buffer in a synchronized way to prevent two threads from writing simultaneously to the same buffer area. The CUDA API provides atomic

operations (such as atomic increment of a buffer pointer) to guarantee this kind

of synchronized writing. Buffer overflows are also handled by our similarity join

methods. If the buffer is full, all threads terminate and the work is resumed after

the buffer is emptied by the CPU.
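A hedged device-side sketch of this synchronized result writing with overflow detection is given below; the encoding of a pair as int2, the capacity parameter, and the overflow flag polled by the CPU are assumptions of the example.

// Claim the next free slot of the result buffer in DM and write the pair; if the buffer
// is full, set the overflow flag so that the CPU can empty the buffer and resume the join.
__device__ void report_pair(int a, int b, int2* buffer, unsigned int* fill,
                            unsigned int capacity, int* overflow) {
    unsigned int pos = atomicAdd(fill, 1u);     // synchronized claim of a buffer position
    if (pos < capacity)
        buffer[pos] = make_int2(a, b);          // write the result pair
    else
        *overflow = 1;                          // signal the CPU that the buffer ran full
}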

6 Similarity Join to Support Density-Based Clustering

As mentioned in Section 5, the similarity join is an important building block to

support a wide range of data mining tasks, including classification [24], outlier

detection [5], association rule mining [17], and clustering [8,12]. In this section,

we illustrate how to effectively support the density-based clustering algorithm

DBSCAN [8] with the similarity join on GPU.

6.1 Basic Definitions and Sequential DBSCAN

The idea of density-based clustering is that clusters are areas of high point

density, separated by areas of significantly lower point density. The point density

can be formalized using two parameters, called ɛ ∈ R^+ and MinPts ∈ N^+. The

central notion is the core object. A data object x is called a core object of a

cluster, if at least MinPts objects (including x itself) are in its ɛ-neighborhood

Nɛ(x), which corresponds to a sphere of radius ɛ. Formally:

Definition 4. (Core Object)

Let D be a set of n objects from R^d, ɛ ∈ R^+ and MinPts ∈ N^+. An object x ∈ D is a core object if and only if

|Nɛ(x)| ≥ MinPts, where Nɛ(x) = {x′ ∈ D : ||x′ − x|| ≤ ɛ}.

Note that this definition is equivalent to Definition 1. Two objects may be assigned

to a common cluster. In density-based clustering this is formalized by the notions of direct density reachability and density connectedness.

Definition 5. (Direct Density Reachability)

Let x, x′ ∈ D. x′ is called directly density reachable from x (in symbols: x ✁ x′) if and only if

1. x is a core object in D, and
2. x′ ∈ Nɛ(x).

If x and x′ are both core objects, then x ✁ x′ is equivalent to x ✄ x′. The density

connectedness is the transitive and symmetric closure of the direct density

reachability:


Data Mining Using Graphics Processing Units 77

Definition 6. (Density Connectedness)

Two objects x and x′ are called density connected (in symbols: x ⊲⊳ x′) if and only if there is a sequence of core objects (x1, . . . , xm) of arbitrary length m such that

x ✄ x1 ✄ . . . ✁ xm ✁ x′.

In density-based clustering, a cluster is defined as a maximal set of density

connected objects:

Definition 7. (Density-based Cluster)

A subset C ⊆ D is called a cluster if and only if the following two conditions hold:

1. Density connectedness: ∀x, x′ ∈ C : x ⊲⊳ x′.
2. Maximality: ∀x ∈ C, ∀x′ ∈ D \ C : ¬(x ⊲⊳ x′).

The algorithm DBSCAN [8] implements the cluster notion of Definition 7 using

a data structure called seed list S containing a set of seed objects for cluster

expansion. More precisely, the algorithm proceeds as follows (a compact sequential sketch in C is given after the listing):

1. Mark all objects as unprocessed.

2. Consider an arbitrary unprocessed object x ∈D.

3. If x is a core object, assign a new cluster ID C, and do step (4) for all

elements x ′ ∈ Nɛ(x) which do not yet have a cluster ID:

4. (a) mark the element x ′ with the cluster ID C and

(b) insert the object x ′ into the seed list S.

5. While S is not empty repeat step 6 for all elements s ∈ S:

6. If s is a core object, do step (7) for all elements x ′ ∈ Nɛ(s) which do not yet

have any cluster ID:

7. (a) mark the element x ′ with the cluster ID C and

(b) insert the object x ′ into the seed list S.

8. If there are still unprocessed objects in the database, continue with step (2).
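The following hedged C sketch mirrors this seed-list expansion; the row-major data layout, the brute-force neighborhood search, and the convention that cluster ID 0 means "no cluster ID yet" (so that noise keeps ID 0) are assumptions made only for the example.

#include <stdlib.h>

#define D 2   /* dimensionality, assumed for the sketch */

/* Brute-force eps-neighborhood: writes the indices of all points within eps of point p
 * (including p itself) into result and returns their number. */
static int range_query(const float* data, int n, int p, float eps, int* result) {
    int count = 0;
    for (int i = 0; i < n; ++i) {
        float s = 0.0f;
        for (int j = 0; j < D; ++j) {
            float diff = data[i * D + j] - data[p * D + j];
            s += diff * diff;
        }
        if (s <= eps * eps) result[count++] = i;
    }
    return count;
}

void dbscan(const float* data, int n, float eps, int min_pts, int* cluster_id) {
    int* seeds = (int*)malloc(n * sizeof(int));
    int* neigh = (int*)malloc(n * sizeof(int));
    int next_cluster = 0;
    for (int i = 0; i < n; ++i) cluster_id[i] = 0;                  /* step 1 */
    for (int x = 0; x < n; ++x) {                                   /* steps 2 and 8 */
        if (cluster_id[x] != 0) continue;
        int k = range_query(data, n, x, eps, neigh);
        if (k < min_pts) continue;                                  /* x is no core object */
        int c = ++next_cluster;                                     /* step 3: new cluster ID */
        cluster_id[x] = c;                                          /* x is labeled directly here */
        int seed_count = 0;
        for (int j = 0; j < k; ++j)                                 /* step 4 */
            if (cluster_id[neigh[j]] == 0) {
                cluster_id[neigh[j]] = c;
                seeds[seed_count++] = neigh[j];
            }
        while (seed_count > 0) {                                    /* step 5 */
            int s = seeds[--seed_count];
            int m = range_query(data, n, s, eps, neigh);
            if (m < min_pts) continue;                              /* step 6: s is no core object */
            for (int j = 0; j < m; ++j)                             /* step 7 */
                if (cluster_id[neigh[j]] == 0) {
                    cluster_id[neigh[j]] = c;
                    seeds[seed_count++] = neigh[j];
                }
        }
    }
    free(seeds);
    free(neigh);
}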

To illustrate the algorithmic paradigm, Figure 8 displays a snapshot of DBSCAN

during cluster expansion. The light grey cluster on the left side has been processed

already. The algorithm currently expands the dark grey cluster on the

right side. The seed list S currently contains one object, the object x. x is a core object since there are more than MinPts = 3 objects in its ɛ-neighborhood (|Nɛ(x)| = 6, including x itself). Two of these objects, x′ and x′′, have not been processed so far and are therefore inserted into S. This way, the cluster is iteratively

expanded until the seed list is empty. After that, the algorithm continues

with an arbitrary unprocessed object until all objects have been processed.

Since every object of the database is considered only once in Step 2 or 6

(exclusively), we have a complexity which is n times the complexity of Nɛ(x)

(which is linear in n if there is no index structure, and sublinear or even O(log n) in the presence of a multidimensional index structure). The result of DBSCAN

is determinate.


78 C. Böhm et al.

Fig. 8. Sequential Density-based Clustering

algorithm GPUdbscanNLJ(data set D) // host program executed on CPU

deviceMem float D ′ [][] := D[][]; // allocate memory in DM for the data set D

deviceMem int counter [n]; // allocate memory in DM for counter

#threads := n; // number of points in D

#threadsPerGroup := 64;

startThreads (GPUdbscanKernel, #threads, #threadsPerGroup); // one thread per point

waitForThreadsToFinish();

copy counter from DM to main memory ;

end.

kernel GPUdbscanKernel (int threadID)

register float q[] := D ′ [threadID][]; // copy the point from DM into the register

// and use it as query point q

// index is determined by the threadID

for i := 0 ... threadID do // option 1 OR

for i := 0 ... n − 1 do // option 2

synchronizeThreadGroup();

shared float x[] := D ′ [i][]; // copy the current point x from DM to SM

synchronizeThreadGroup(); // Now all threads of the thread group can work with x

if dist(x, q) ≤ ɛ then

atomicInc (counter[i]); atomicInc (counter[threadID]); // option 1 OR

inc counter[threadID]; // option 2

end.

Fig. 9. Parallel Algorithm for the Nested Loop Join to Support DBSCAN on GPU

6.2 GPU-Supported DBSCAN

To effectively support DBSCAN on GPU we first identify the two major stages

of the algorithm requiring most of the processing time:

1. Determination of the core object property.

2. Cluster expansion by computing the transitive closure of the direct density

reachability relation.

The first stage can be effectively supported by the similarity join. To check the

core object property, we need to count the number of objects which are within

the ɛ-neighborhood of each point. Basically, this can be implemented by a self

join. However, the algorithm for the self-join described in Section 5 needs to be modified to support this task. The classical self-join only counts

the total number of pairs of data objects with distance less or equal than ɛ.

For the core object property, we need a self-join with a counter associated to


Data Mining Using Graphics Processing Units 79

each object. Each time the algorithm detects a new pair fulfilling the join condition, the counters of both objects need to be incremented.

We propose two different variants to implement the self-join to support DBSCAN on the GPU, which are displayed in pseudocode in Figure 9. Modifications

over the basic algorithm for nested loop join (cf. Figure 6) are displayed in

darker color. As in the simple algorithm for nested loop join, for each point q

of the outer loop a separate thread with a unique threadID is created. Both

variants of the self-join for DBSCAN operate on an array counter which stores

the number of neighbors for each object. We have two options how to increment

the counters of the objects when a pair of objects (x, q) fulfills the join condition.

Option 1 is to first increment the counter of x and then the counter of q using

the atomic operation atomicInc() (cf. Section 3). The operation atomicInc()

involves synchronization of all threads. The atomic operations are required to

assure the correctness of the result, since it is possible that different threads try

to increment the counters of objects simultaneously.

In clustering, we typically have many core objects, which causes a large number of synchronized operations that limit parallelism. Therefore, we also implemented

option 2 which guarantees correctness without synchronized operations.

Whenever a pair of objects (x, q) fulfills the join condition, we only increment

the counter of point q. Point q is that point of the outer loop for which the thread

has been generated, which means q is exclusively associated with the threadID.

Therefore, the cell counter[threadID] can be safely incremented with the ordinary,

non-synchronized operation inc(). Since no other point is associated with

the same threadID as q, no collision can occur. However, note that in contrast

to option 1, for each point of the outer loop, the inner loop needs to consider

all other points. Otherwise results are missed. Recall that for the conventional

sequential nested loop join (cf. Figure 5) it is sufficient to consider in the inner

loop only those points which have not been processed so far. Already processed

points can be excluded because if they are join partners of the current point, this

has already been detected. The same holds for option 1. Because of parallelism,

we cannot state which objects have already been processed. However, it is still

sufficient when each object searches in the inner loop for join partners among

those objects which would appear later in the sequential processing order. This

is because all other objects are addressed by different threads. Option 2 requires

checking all objects since only one counter is incremented. With sequential processing,

option 2 would thus duplicate the workload. However, as our results

in Section 8 demonstrate, option 2 can pay off under certain conditions since

parallelism is not limited by synchronization.

After determination of the core object property, clusters can be expanded

starting from the core objects. This second stage of DBSCAN can also be effectively supported on the GPU. For cluster expansion, it is required to compute

the transitive closure of the direct density reachability relation. Recall that this is

closely connected to the core object property as all objects within the ɛ range of

a core object x are directly density reachable from x. To compute the transitive

closure, standard algorithms are available. The most well-known among them is



the algorithm of Floyd-Warshall. A highly parallel variant of the Floyd-Warshall

algorithm on GPU has been recently proposed [15], but this is beyond the scope

of this paper.

7 K-Means Clustering on GPU

7.1 The Algorithm K-Means

A well-established partitioning clustering method is the K-means clustering algorithm

[21]. K-means requires a metric distance function in vector space. In

addition, the user has to specify the number of desired clusters k as an input

parameter. Usually K-means starts with an arbitrary partitioning of the objects

into k clusters. After this initialization, the algorithm iteratively performs the

following two steps until convergence: (1) Update centers: For each cluster, compute

the mean vector of its assigned objects. (2) Re-assign objects: Assign each

object to its closest center. The algorithm converges as soon as no object changes

its cluster assignment during two subsequent iterations.

Figure 10 illustrates an example run of K-means for k = 3 clusters. Figure

10(a) shows the situation after random initialization. In the next step, every

data point is associated with the closest cluster center (cf. Figure 10(b)). The

resulting partitions represent the Voronoi cells generated by the centers. In the

following step of the algorithm, the center of each of the k clusters is updated, as

shown in Figure 10(c). Finally, assignment and update steps are repeated until

convergence.

In most cases, fast convergence can be observed. The optimization function of

K-means is well defined. The algorithm minimizes the sum of squared distances

of the objects to their cluster centers. However, K-means is only guaranteed

to converge towards a local minimum of the objective function. The quality of

the result strongly depends on the initialization. Finding the clustering with k clusters that minimizes the objective function is actually an NP-hard problem; for details see e.g. [23]. In practice, it is therefore recommended to run the algorithm

several times with different random initializations and keep the best result. For

large data sets, however, often only a very limited number of trials is feasible.

(a) Initialization (b) Assignment (c) Recalculation (d) Termination

Fig. 10. Sequential Partitioning Clustering by the K-means Algorithm

Parallelizing K-means on the GPU allows for a more comprehensive exploration of the search space of all potential clusterings and thus provides the potential to

obtain a good and reliable clustering even for very large data sets.

7.2 CUDA-K-Means

In K-means, most computing power is spent in step (2) of the algorithm, i.e.

re-assignment which involves distance computation and comparison. The number

of distance computations and comparisons in K-means is O(k · i · n), where

i denotes the number of iterations and n is the number of data points.

The CUDA-K-meansKernel. In K-means clustering, the cluster assignment

of each data point is determined by comparing the distances between that point

and each cluster center. This work is performed in parallel by the CUDA-K-meansKernel. The idea is that, instead of (sequentially) performing the cluster assignment of one single data point, we start many different cluster assignments at the same time for different data points. In detail, one single thread per data point is generated, all executing the CUDA-K-meansKernel. Every thread generated from the CUDA-K-meansKernel (cf. Figure 11) starts with the ID of a data point x which is going to be processed. Its main tasks are to determine the distance to the nearest center and the ID of the corresponding cluster.

algorithm CUDA-K-means(data set D, int k) // host program executed on CPU

deviceMem float D ′ [][] := D[][]; // allocate memory in DM for the data set D

#threads := |D|; // number of points in D

#threadsPerGroup := 64;

deviceMem float Centroids[][] := initCentroids(); // allocate memory in DM for the

// initial centroids

double actCost := ∞; // initial cost of the clustering

repeat

prevCost := actCost;

startThreads (CUDA-K-meansKernel, #threads, #threadsPerGroup); // one thread per point

waitForThreadsToFinish();

float minDist := minDistances[threadID]; // copy the distance to the nearest

// centroid from DM into MM

float cluster := clusters[threadID]; // copy the assigned cluster from DM into MM

actCost := calculateCosts(); // update cost of the clustering

deviceMem float Centroids[][] := calculateCentroids(); // copy updated centroids to DM

until |actCost − prevCost| < threshold // convergence

end.

kernel CUDA-K-meansKernel (int threadID)

register float x[] := D ′ [threadID][]; // copy the point from DM into the register

float minDist := ∞; // distance of x to the nearest centroid found so far
int cluster := null; // ID of the nearest centroid (cluster)

for i := 1 ... k do // process each cluster

register float c[] := Centroids[i][]; // copy the current centroid from DM into the register

double dist := distance(x,c);

if dist < minDist then

minDist := dist;

cluster := i;

report(minDist, cluster); // report assigned cluster and distance using synchronized writing

end.

Fig. 11. Parallel Algorithm for K-means on the GPU



A thread starts by reading the coordinates of its data point x into the registers. The distance of x to the closest center is initialized with ∞ and the assigned cluster is initially set to null. Then a loop iterates over all centers c1, c2, ..., ck and considers them as potential clusters for x. This is done by all threads in the thread group, allowing a maximum degree of intra-group parallelism. Finally, the cluster whose center has the minimum distance to the data point x is reported together with the corresponding distance value using synchronized writing.

The Main Program for CPU. Apart from initialization and data transfer

from main memory (MM) to DM, the main program consists of a loop starting

the CUDA-K-meansKernel on the GPU until the clustering converges. After the

parallel operations are completed by all threads of the group, the following steps

are executed in each cycle of the loop:

1. Copy the distance of the processed point x to the nearest center from DM into MM.

2. Copy the cluster to which x is assigned from DM into MM.

3. Update centers.

4. Copy updated centers to DM.

A pseudocode of these procedures is illustrated in Figure 11.
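To make the mapping from the pseudocode in Figure 11 to actual GPU code more concrete, here is a minimal sketch of the assignment step written with Numba's CUDA backend in Python. It is not the authors' C/CUDA implementation; the kernel name, the helper assign_step, and the 64-thread block size mirror Figure 11 but are otherwise illustrative.

# Minimal sketch of the CUDA-K-meansKernel assignment step (one thread per point),
# written with Numba's CUDA backend instead of the authors' C/CUDA code.
import numpy as np
from numba import cuda

@cuda.jit
def kmeans_assign_kernel(points, centroids, clusters, min_dists):
    tid = cuda.grid(1)                       # global thread ID = index of the data point
    if tid < points.shape[0]:
        best_dist = 1e30                     # "infinity": distance to the nearest centroid so far
        best_cluster = -1
        for c in range(centroids.shape[0]):  # loop over all k centroids
            dist = 0.0
            for d in range(points.shape[1]):
                diff = points[tid, d] - centroids[c, d]
                dist += diff * diff
            if dist < best_dist:             # keep the nearest centroid
                best_dist = dist
                best_cluster = c
        clusters[tid] = best_cluster         # each thread writes only to its own slot,
        min_dists[tid] = best_dist           # so no synchronization is needed here

def assign_step(points, centroids, threads_per_block=64):
    """Host side of one assignment step: copy to device memory, launch, copy back."""
    d_points = cuda.to_device(points)
    d_centroids = cuda.to_device(centroids)
    d_clusters = cuda.device_array(points.shape[0], dtype=np.int32)
    d_min_dists = cuda.device_array(points.shape[0], dtype=np.float32)
    blocks = (points.shape[0] + threads_per_block - 1) // threads_per_block
    kmeans_assign_kernel[blocks, threads_per_block](d_points, d_centroids,
                                                    d_clusters, d_min_dists)
    return d_clusters.copy_to_host(), d_min_dists.copy_to_host()

The centroid recalculation and the cost update of the host loop in Figure 11 would then be computed on the CPU from the returned arrays.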

8 Experimental Evaluation

To evaluate the performance of data mining on the GPU, we performed various

experiments on synthetic data sets. The implementation for all variants is written

in C and all experiments are performed on a workstation with Intel Core 2 Duo

CPU E4500 2.2 GHz and 2 GB RAM which is supplied with a Gainward NVIDIA

GeForce GTX280 GPU (240 SIMD-processors) with 1GB GDDR3 SDRAM.

8.1 Evaluation of Similarity Join on the GPU

The performance of the similarity join on the GPU is validated by comparing four different variants for executing the similarity join (a sequential baseline of the nested loop ɛ-join is sketched below):

1. Nested loop join (NLJ) on the CPU

2. NLJ on the CPU with index support (as described in Section 4)

3. NLJ on the GPU

4. NLJ on the GPU with index support (as described in Section 4)
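All four variants compute the same ɛ-similarity join. As a point of reference, the following is a minimal sequential sketch (in Python, not the authors' C implementation) of what the nested loop join computes, namely all pairs of points within Euclidean distance ɛ; the function name and the toy data are illustrative only.

import numpy as np

def nested_loop_eps_join(R, S, eps):
    """Sequential nested loop join: report all pairs (i, j) with ||R[i] - S[j]|| <= eps."""
    result = []
    for i in range(R.shape[0]):            # outer loop over the first data set
        for j in range(S.shape[0]):        # inner loop over the second data set
            dist = np.sqrt(np.sum((R[i] - S[j]) ** 2))
            if dist <= eps:
                result.append((i, j))
    return result

# Example: a small 8-dimensional data set joined with itself (self-join)
if __name__ == "__main__":
    data = np.random.rand(100, 8).astype(np.float32)
    pairs = nested_loop_eps_join(data, data, eps=0.125)
    print(len(pairs), "join pairs")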

For each version we determine the speedup factor by the ratio of CPU runtime

and GPU runtime. For this purpose we generated three 8-dimensional synthetic

data sets of various sizes (up to 10 million (m) points) with different data distributions,

as summarized in Table 1. Data set DS1 contains uniformly distributed

data. DS2 consists of five Gaussian clusters which are randomly distributed in

feature space (see Figure 12(a)). Similar to DS2, DS3 is also composed of five

Gaussian clusters, but the clusters are correlated. An illustration of data set DS3 is given in Figure 12(b). The threshold ɛ was selected to obtain a join result where each point was combined with one or two join partners on average.

Fig. 12. Illustration of the data sets DS2 and DS3: (a) Random Clusters, (b) Linear Clusters

Table 1. Data Sets for the Evaluation of the Similarity Join on the GPU

Name   Size               Distribution
DS1    3m – 10m points    uniform distribution
DS2    250k – 1m points   normal distribution, gaussian clusters
DS3    250k – 1m points   normal distribution, gaussian clusters

Evaluation of the Size of the Data Sets. Figure 13 displays the runtime in

seconds and the corresponding speedup factors of NLJ on the CPU with/without

index support and NLJ on the GPU with/without index support in logarithmic

scale for all three data sets DS1, DS2 and DS3. The time needed for data transfer

from CPU to the GPU and back as well as the (negligible) index construction

time has been included. The tests on data set DS1 were performed with a join

selectivity of ɛ =0.125, and ɛ =0.588 on DS2 and DS3 respectively.

NLJ on the GPU with index support performs best in all experiments, independent of the data distribution or the size of the data set. Note that, due to massive parallelization, NLJ on the GPU without index support outperforms the CPU without index by a large factor (e.g., 120 on 1m points of normally distributed data with gaussian clusters). The GPU algorithm with index support outperforms the corresponding CPU algorithm (with index) by a factor of 25 on data set DS2. Remark that, for example, the overall improvement of the indexed GPU algorithm on data set DS2 over the non-indexed CPU version is more than 6,000. These results demonstrate the potential of boosting the performance of database operations by designing specialized index structures and algorithms for the GPU.

Evaluation of the Join Selectivity. In these experiments we test the impact

of the parameter ɛ on the performance of NLJ on GPU with index support

and use the indexed implementation of NLJ on the CPU as benchmark. All

experiments are performed on data set DS2 with a fixed size of 500k data points.

The parameter ɛ is evaluated in a range from 0.125 to 0.333.

Figure 14(a) shows that the runtime of NLJ on the GPU with index support increases for larger ɛ values. However, the GPU version outperforms the CPU implementation by a large factor (cf. Figure 14(b)) that is roughly proportional to the value of ɛ. In this evaluation the speedup ranges from 20 for a join selectivity of 0.125 to almost 60 for ɛ = 0.333.


Fig. 13. Evaluation of the NLJ on CPU and GPU with and without Index Support w.r.t. the Size of Different Data Sets. Panels: (a) Runtime on Data Set DS1, (b) Speedup on Data Set DS1, (c) Runtime on Data Set DS2, (d) Speedup on Data Set DS2, (e) Runtime on Data Set DS3, (f) Speedup on Data Set DS3.

Evaluation of the Dimensionality. These experiments provide an evaluation

with respect to the dimensionality of the data. As in the experiments for the

evaluation of the join selectivity, we use again the indexed implementations both

on CPU and GPU and perform all tests on data set DS2 with a fixed number

of 500k data objects. The dimensionality is evaluated in a range from 8 to 32.

We also performed these experiments with two different settings for the join

selectivity, namely ɛ =0.588 and ɛ =1.429.

Figure 15 illustrates that NLJ on GPU outperforms the benchmark method

on CPU by factors of about 20 for ɛ =0.588 to approximately 70 for ɛ =1.429.

This order of magnitude is relatively independent of the data dimensionality.

As in our implementation the dimensionality is already known at compile time, optimization techniques of the compiler have an impact on the performance of the CPU version, as can be seen especially in Figure 15(c). However, the dimensionality also affects the implementation on the GPU, because higher-dimensional data come along with a higher demand for shared memory. This overhead affects the number of threads that can be executed in parallel on the GPU.

Fig. 14. Impact of the Join Selectivity on the NLJ on GPU with Index Support. Panels: (a) Runtime on Data Set DS2, (b) Speedup on Data Set DS2.

8.2 Evaluation of GPU-Supported DBSCAN

As described in Section 6.2, we suggest two different variants to implement the self-join to support DBSCAN on the GPU, whose characteristics are briefly reviewed in the following:

Fig. 15. Impact of the Dimensionality on the NLJ on GPU with Index Support. Panels (a)–(d): runtime and speedup on data set DS2 for ɛ = 0.588 and ɛ = 1.429.

Fig. 16. Evaluation of two versions for the self-join on GPU w.r.t. the join selectivity (runtime over ɛ for the synchronized and the non-synchronized variant)

1. The increment of the counters for a pair of objects (x, q) that fulfills the join condition is done by an atomic operation that involves synchronization of all threads.

2. The increment of the counters is performed without synchronization but with duplicated workload instead (see the sketch below).
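To illustrate the difference between the two variants, the following Numba CUDA sketch (not the authors' implementation) contrasts an atomic-increment counter with the non-synchronized alternative in which every thread redundantly counts the neighbors of its own point; kernel and variable names are invented for the example.

import numpy as np
from numba import cuda

@cuda.jit
def selfjoin_count_atomic(points, eps2, counters):
    # Variant 1: every matching pair increments both counters atomically.
    i = cuda.grid(1)
    if i < points.shape[0]:
        for j in range(i + 1, points.shape[0]):
            dist2 = 0.0
            for d in range(points.shape[1]):
                diff = points[i, d] - points[j, d]
                dist2 += diff * diff
            if dist2 <= eps2:
                cuda.atomic.add(counters, i, 1)   # synchronized writes
                cuda.atomic.add(counters, j, 1)

@cuda.jit
def selfjoin_count_duplicated(points, eps2, counters):
    # Variant 2: thread i counts all neighbors of point i itself. The work is
    # duplicated (every pair is examined twice), but each thread writes only
    # to counters[i], so no synchronization is required.
    i = cuda.grid(1)
    if i < points.shape[0]:
        count = 0
        for j in range(points.shape[0]):
            if j != i:
                dist2 = 0.0
                for d in range(points.shape[1]):
                    diff = points[i, d] - points[j, d]
                    dist2 += diff * diff
                if dist2 <= eps2:
                    count += 1
        counters[i] = count

Both kernels would be launched with one thread per point, e.g. selfjoin_count_atomic[blocks, 64](d_points, eps * eps, d_counters).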

We evaluate both options on a synthetic data set with 500k points generated as specified for DS1 in Table 1. Figure 16 displays the runtime of both options. For ɛ ≤ 0.6, the runtimes are in the same order of magnitude, with the synchronized variant 1 being slightly more efficient. Beyond this point, the non-synchronized variant 2 clearly outperforms variant 1 since parallelism is not limited by synchronization.

8.3 Evaluation of CUDA-K-Means

To analyze the efficiency of K-means clustering on the GPU, we present experiments with respect to different data set sizes, numbers of clusters, and dimensionalities of the data. As benchmark we apply a single-threaded implementation of K-means on the CPU to determine the speedup of the GPU implementation of K-means. As the number of iterations may vary in each run of the experiments, all results are normalized to 50 iterations for both the GPU and the CPU implementation of K-means. All experiments are performed on synthetic data sets as described in detail in each of the following settings.

Evaluation of the Size of the Data Set. For these experiments we created 8-dimensional synthetic data sets of different sizes, ranging from 32k to 2m data points. The data sets consist of different numbers of random clusters, generated as specified for DS1 in Table 1.

Figure 17 displays the runtime in seconds on a logarithmic scale and the corresponding speedup factors of CUDA-K-means and the benchmark method on the CPU for different numbers of clusters. The time needed for data transfer from CPU to GPU and back has been included. The corresponding speedup factors are given in Figure 17(d). Once again, these experiments support the evidence that data mining approaches on the GPU outperform classic CPU versions by significant factors. Whereas a speedup of approximately 10 to 100 can be achieved for relatively small numbers of clusters, we obtain a speedup of about 1000 for 256 clusters, which even increases with the number of data objects.

Fig. 17. Evaluation of CUDA-K-means w.r.t. the Size of the Data Set. Panels: (a) Runtime for 32 clusters, (b) Runtime for 64 clusters, (c) Runtime for 256 clusters, (d) Speedup for 32, 64 and 256 clusters.

Evaluation of the Impact of the Number of Clusters. We performed several experiments to validate CUDA-K-means with respect to the number of clusters K. Figure 18 shows the runtime in seconds of CUDA-K-means compared with the implementation of K-means on the CPU on 8-dimensional synthetic data sets that contain different numbers of clusters, ranging from 32 to 256, again together with the corresponding speedup factors in Figure 18(d).

The experimental evaluation of K on a data set that consists of 32k points results in a maximum performance benefit of more than 800 compared to the benchmark implementation. For 2m points the speedup ranges from nearly 100 up to even more than 1,000 for a data set that comprises 256 clusters. In this case the calculation on the GPU takes approximately 5 seconds, compared to almost 3 hours on the CPU. Therefore, we conclude that, due to massive parallelization, CUDA-K-means outperforms the CPU by large factors, which even grow with K and the number of data objects n.

Evaluation of the Dimensionality. These experiments provide an evaluation with respect to the dimensionality of the data. We perform all tests on synthetic data consisting of 16k data objects. The dimensionality of the test data sets varies in a range from 4 to 256. Figure 19(b) illustrates that CUDA-K-means outperforms the benchmark method K-means on the CPU by factors of 230 for 128-dimensional data to almost 500 for 8-dimensional data. On both the GPU and the CPU, the dimensionality affects possible compiler optimization techniques, like loop unrolling, as already shown in the experiments for the evaluation of the similarity join on the GPU.

Fig. 18. Evaluation of CUDA-K-means w.r.t. the number of clusters K. Panels: (a) Runtime for 32k points, (b) Runtime for 500k points, (c) Runtime for 2m points, (d) Speedup for 32k, 500k and 2m points.

In summary, the results of this section demonstrate the high potential of

boosting performance of complex data mining techniques by designing specialized

index structures and algorithms for the GPU.

Fig. 19. Impact of the Dimensionality of the Data Set on CUDA-K-means. Panels: (a) Runtime, (b) Speedup.


9 Conclusions

In this paper, we demonstrated how Graphics Processing Units (GPUs) can effectively support highly complex data mining tasks. In particular, we focused on clustering. With the aim of finding a natural grouping of an unknown data set, clustering certainly is among the most widespread data mining tasks with countless applications in various domains. We selected two well-known clustering algorithms, the density-based algorithm DBSCAN and the iterative algorithm K-means, and proposed algorithms illustrating how to effectively support clustering on the GPU. Our proposed algorithms are tailored to the special environment of the GPU, which is most importantly characterized by extreme parallelism at low cost. A single GPU consists of a large number of processors. As building blocks for the effective support of DBSCAN, we proposed a parallel version of the similarity join and an index structure for efficient similarity search. Going beyond the primary scope of this paper, these building blocks are applicable to support a wide range of data mining tasks, including outlier detection, association rule mining, and classification. To illustrate that not only local density-based clustering can be efficiently performed on the GPU, we additionally proposed a parallelized version of K-means clustering. Our extensive experimental evaluation emphasizes the potential of the GPU for high-performance data mining. In our ongoing work, we develop further algorithms to support more specialized data mining tasks on the GPU, including, for example, subspace and correlation clustering and medical image processing.

References

1. NVIDIA CUDA Compute Unified Device Architecture - Programming Guide

(2007)

2. Bernstein, D.J., Chen, T.-R., Cheng, C.-M., Lange, T., Yang, B.-Y.: Ecm on graphics

cards. In: Joux, A. (ed.) EUROCRYPT 2009. LNCS, vol. 5479, pp. 483–501.

Springer, Heidelberg (2009)

3. Böhm, C., Braunmüller, B., Breunig, M.M., Kriegel, H.-P.: High performance clustering

based on the similarity join. In: CIKM, pp. 298–305 (2000)

4. Böhm, C., Noll, R., Plant, C., Zherdin, A.: Index-supported similarity join on graphics

processors. In: BTW, pp. 57–66 (2009)

5. Breunig, M.M., Kriegel, H.-P., Ng, R.T., Sander, J.: Lof: Identifying density-based

local outliers. In: SIGMOD Conference, pp. 93–104 (2000)

6. Cao, F., Tung, A.K.H., Zhou, A.: Scalable clustering using graphics processors. In:

Yu, J.X., Kitsuregawa, M., Leong, H.-V. (eds.) WAIM 2006. LNCS, vol. 4016, pp.

372–384. Springer, Heidelberg (2006)

7. Catanzaro, B.C., Sundaram, N., Keutzer, K.: Fast support vector machine training

and classification on graphics processors. In: ICML, pp. 104–111 (2008)

8. Ester, M., Kriegel, H.-P., Sander, J., Xu, X.: A density-based algorithm for discovering

clusters in large spatial databases with noise. In: KDD, pp. 226–231 (1996)

9. Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P.: Knowledge discovery and data

mining: Towards a unifying framework. In: KDD, pp. 82–88 (1996)



10. Govindaraju, N.K., Gray, J., Kumar, R., Manocha, D.: Gputerasort: high performance

graphics co-processor sorting for large database management. In: SIGMOD

Conference, pp. 325–336 (2006)

11. Govindaraju, N.K., Lloyd, B., Wang, W., Lin, M.C., Manocha, D.: Fast computation

of database operations using graphics processors. In: SIGMOD Conference,

pp. 215–226 (2004)

12. Guha, S., Rastogi, R., Shim, K.: Cure: An efficient clustering algorithm for large

databases. In: SIGMOD Conference, pp. 73–84 (1998)

13. Guttman, A.: R-trees: A dynamic index structure for spatial searching. In: SIG-

MOD Conference, pp. 47–57 (1984)

14. He, B., Yang, K., Fang, R., Lu, M., Govindaraju, N.K., Luo, Q., Sander, P.V.:

Relational joins on graphics processors. In: SIGMOD, pp. 511–524 (2008)

15. Katz, G.J., Kider, J.T.: All-pairs shortest-paths for large graphs on the gpu. In:

Graphics Hardware, pp. 47–55 (2008)

16. Kitsuregawa, M., Harada, L., Takagi, M.: Join strategies on kd-tree indexed relations.

In: ICDE, pp. 85–93 (1989)

17. Koperski, K., Han, J.: Discovery of spatial association rules in geographic information

databases. In: Egenhofer, M.J., Herring, J.R. (eds.) SSD 1995. LNCS, vol. 951,

pp. 47–66. Springer, Heidelberg (1995)

18. Leutenegger, S.T., Edgington, J.M., Lopez, M.A.: Str: A simple and efficient algorithm

for r-tree packing. In: ICDE, pp. 497–506 (1997)

19. Lieberman, M.D., Sankaranarayanan, J., Samet, H.: A fast similarity join algorithm

using graphics processing units. In: ICDE, pp. 1111–1120 (2008)

20. Liu, W., Schmidt, B., Voss, G., Müller-Wittig, W.: Molecular dynamics simulations

on commodity gpus with cuda. In: Aluru, S., Parashar, M., Badrinath, R.,

Prasanna, V.K. (eds.) HiPC 2007. LNCS, vol. 4873, pp. 185–196. Springer, Heidelberg

(2007)

21. Macqueen, J.B.: Some methods of classification and analysis of multivariate observations.

In: Fifth Berkeley Symposium on Mathematical Statistics and Probability,

pp. 281–297 (1967)

22. Manavski, S., Valle, G.: Cuda compatible gpu cards as efficient hardware accelerators

for smith-waterman sequence alignment. BMC Bioinformatics 9 (2008)

23. Meila, M.: The uniqueness of a good optimum for k-means. In: ICML, pp. 625–632

(2006)

24. Plant, C., Böhm, C., Tilg, B., Baumgartner, C.: Enhancing instance-based classification

with local density: a new algorithm for classifying unbalanced biomedical

data. Bioinformatics 22(8), 981–988 (2006)

25. Shalom, S.A.A., Dash, M., Tue, M.: Efficient k-means clustering using accelerated

graphics processors. In: Song, I.-Y., Eder, J., Nguyen, T.M. (eds.) DaWaK 2008.

LNCS, vol. 5182, pp. 166–175. Springer, Heidelberg (2008)

26. Szalay, A., Gray, J.: 2020 computing: Science in an exponential world. Nature 440,

413–414 (2006)

27. Tasora, A., Negrut, D., Anitescu, M.: Large-scale parallel multi-body dynamics

with frictional contact on the graphical processing unit. Proc. of Inst. Mech. Eng.

Journal of Multi-body Dynamics 222(4), 315–326


Context-Aware Data and IT Services Collaboration

in E-Business

Khouloud Boukadi 1 , Chirine Ghedira 2 , Zakaria Maamar 3 , Djamal Benslimane 2 ,

and Lucien Vincent 1

1 Ecole des Mines, Saint Etienne, France

2 University Lyon 1, France

3 Zayed University, Dubai, U.A.E.

{boukadi,Vincent}@emse.fr, {cghedira,dbenslim}@liris.cnrs.fr,

zakaria.maamar@zu.ac.ae

Abstract. This paper discusses the use of services in the design and development

of adaptable business processes, which should let organizations quickly

react to changes in regulations and needs. Two types of services are adopted

namely Data and Information Technology. A data service is primarily used to

hide the complexity of accessing distributed and heterogeneous data sources,

while an information technology service is primarily used to hide the complexity

of running requests that cross organizational boundaries. The combination of

both services takes place under the control of another service, which is denoted

by service domain. A service domain orchestrates and manages data and information

technology services in response to the events that arise and changes that

occur. This happens because service domains are sensitive to context. Policies

and aspect-oriented programming principles support the exercise of packaging

data and information technology services into service domains as well as making

service domains adapt to business changes.

Keywords: service, service adaptation, context, aspect-oriented programming,

policy.

1 Introduction

With the latest development of technologies for knowledge management on the one

hand, and techniques for project management on the other hand, both coupled with

the widespread use of the Internet, today’s enterprises are now under the pressure of

adjusting their know-how and enforcing their best practices. These enterprises have to

be more focused on their core competencies and hence, have to seek the support of

other peers through partnership to carry out their non-core competencies. The success

of this partnership depends on how business processes are designed as these processes

should be loosely coupled and capable of crossing organizational boundaries. Since the

inception of the Service-Oriented Architecture (SOA) paradigm along with its multiple

implementation technologies such as Jini services and Web services, the focus of

the industry community has been on providing tools that would allow seamless and

flexible application integration within and across organizational boundaries. Indeed,




SOA offers solutions to interoperability, adaptability, and scalability challenges that

today’s enterprises have to tackle. The objective, here, is to let enterprises collaborate

by putting their core services together, which leads to the creation of new applications

that should be responsive to changes in business requirements and regulations. Nevertheless,

looking at enterprise applications from a narrowed perspective, which

consists of services and processes only, has somehow overlooked the data that these

applications use in terms of input and output. Data identification and integration are

left at a later stage of the development cycle of these applications, which is not very

convenient when these data are spread over different sources including relational

databases, silos of data-centric homegrown or packaged applications, XML files, just

to cite a few [1]. As a result, data identification and integration turn out to be tedious for

SOA application developers: bits and pieces of data need to be retrieved/updated

from/in heterogeneous data sources with different interfaces and access methods. This

situation has undermined SOA benefits and forced the SOA community to recognize

the importance of adopting a data-oriented view of services. This has resulted in the emergence of the concept of data services.

In this paper, we look into ways of exposing IT applications and data sources as

services in compliance with SOA principles. To this end, we treat a service as either

an IT Service (ITS) or a Data Service (DS) and identify the necessary mechanisms

that would let ITSs and DSs first, work hand-in-hand during the integration exercise

of enterprise applications and second, engage in a controlled way in high-level functionalities

to be referred to as Service Domains (SDs). Basically, a SD orchestrates

ITSs and DSs in order to provide ready-to-use high-level functionalities to users. By

ready-to-use we mean service publication, selection, and combination of fine-grained

ITSs and DSs are already complete.

We define ITSs as active components that make changes in the environment using the update operations that empower them, whereas DSs are passive components that return data only (consultation) and thus do not impact the environment. In this paper we populate

this environment with specific elements, which we denote by Business Objects

(BOs). The dynamic nature of BOs (e.g., new BOs are made available, some cease to

exist without prior notice, etc.) and the business processes that are built upon these BOs,

requires that SDs should be flexible and sensitive to all changes in an enterprise’s requirements

and regulations. We enrich the specification of SDs with contextual details

such as execution status of each participating service (DS or ITS), type of failure along

with the corrective actions, etc. This enrichment is illustrated with the de facto standard

namely the Business Process Execution Language (BPEL) for service integration. BPEL

specifies a business process behavior through automated process integration both within

and between organizations. We complete the enrichment of BPEL with context in compliance

with Aspect-Oriented Programming (AOP) principles in terms of aspect injection,

activation, and execution and through a set of policies.

While context and policies are used separately in different SOA initiatives [2, 3],

we examine in this paper their role in designing and developing SDs. This role is

depicted using a multi-level architecture that supports the orchestration of DSs and

ITSs in response to changes detected in the context. Three levels are identified

namely executive, business, and resource. The role of policies and context in this

architecture is as follows:



• The trend towards context-aware, adaptive, and on-demand computing requires

that SDs should respond to changes in the environment. This could happen by letting

SDs sense the environment and take actions.

• Policies manage and control the participation of DSs and ITSs in SDs to guarantee

free-of-conflicts SDs. Conflicts could be related to non-sharable resources and semantic

mismatches. Several types of policies will be required so that the particularities

of DSs and ITSs are taken into account.

The rest of the paper is organized as follows. Section 2 defines some concepts and

introduces a running example. Section 3 presents the multi-level architecture for service

(ITSs and DSs) collaboration and outlines the specification of these services.

Section 4 introduces the adaptation of services based on aspect as well as the role of

policies first, in managing the participation of DSs and ITSs in SDs and second, in

controlling the aspect injection within SDs. Prior to concluding in Section 6, related

work is reported in Section 5.

2 Background

2.1 Definitions

IT Service. The term IT services is used very often nowadays, even though not always

with the same meaning. Existing definitions range from the very generic and all-inclusive to the very specific and restrictive. Some authors in [4, 5] define IT services

as a software application accessible to other applications over the Web. Another definition

is provided by [6], which says that an IT service is provided by an IT system or

the IT department (respectively an external IT service provider) to support business

processes. The characteristics of IT services can vary significantly. They can comprise

single software components as well as bundles of software components, infrastructure

elements, and additional services. These additional services are usually information

services, consulting services, training services, problem solving services, or modification

services. They are provided by operational processes (IT service processes)

within the IT department or the external service provider [7]. In this paper, we consider

that IT services should be responsible for representing and implementing business

processes in compliance with SOA principles. An IT service corresponds to a

functional representation of a real-life business activity having a meaningful effect to

end-users. Current practices suggest that IT services could be obtained by applying IT

SOA methods such as SAP NetWeaver and IBM WebSphere. These methods bundle

IT software and infrastructure and offer them as Web services with standardized and

well-defined interfaces. We refer to Web services that result out of the application of

such methods as enterprise IT Web services or simply IT services.

Data Service. According to a recent report from Forrester Research, “an information service

(i.e., data service) provides a simplified, integrated view of real-time, high-quality information

about a specific business entity, such as a customer or product. It can be

provided by middleware or packaged as an individual software component. The information

that it provides comes from a diverse set of information resources, including



operational systems, operational data stores, data warehouses, content repositories,

collaboration stores, and even streaming sources in advanced cases’’ [8]. Another

definition suggests that a data service is “a form of Web service, optimized for the

real-time data integration demands of SOA. Data services virtualize data to decouple

physical and logical locations and therefore avoid unnecessary data replication. Data

services abstract complex data structures and syntax. Data services federate disparate

data into useful composites. Data services also support data integration across both

SOA and non-SOA applications” [9]. Data services can be seen as a new class of services

that sits between service-based applications and enterprises’ data sources. By

doing so, the complexity of accessing these sources is minimized, which lets application

developers focus on the application logics of the solutions to develop. These data

sources are the basis of different business objects such as customer, order, and invoice.

Business object is “the representation of a thing active in the business domain, including

at least its business name and definition, attributes, behavior, relationships,

and constraints” (OMG’s Business Object Management Special Interest Group).

Another definition by the Business Object Management Architecture (jeffsutherland.com/oopsla97/marshall.html)

suggests that a business object could be defined

through the concepts of purpose, process, resource, and organization. A purpose is

about the rationale of a business. A process illustrates how this purpose is reached

through a series of dependent activities. A resource is a computing platform upon

which the activities of a process are executed. Finally, an organization manages resources

in terms of maintenance, access rights, etc.

Policies are primarily used to express the actions to take in response to the occurrence

of some events. According to [10], policies are “information which can be used to modify

the behavior of a system”. Another definition suggests that policies are “external,

dynamically modifiable rules and parameters that are input to a system so that this

latter can then adjust to administrative decisions and changes in the execution environment”

[5]. In the Web services field, policies are treated as rules and constraints that

specify and control the behavior of a Web service upon invocation or participation in

composition. For example, a policy determines when a Web service can be invoked,

what constraints are put on the inputs a Web service expects, how a Web service can be

substituted by another in case of failure, etc. According to [11], policies could be at two

levels. At the higher level, policies monitor the execution progress of a Web service. At

the lower level, policies address issues like how Web services communicate and what

information is needed to enable comprehensive data exchange.

Context “… is not simply the state of a predefined environment with a fixed set of

interaction resources. It is part of process of interacting with an ever-changing environment

composed of reconfigurable, migratory, distributed, and multi-scale resources”[12].

In the field of Web services, context facilitates the development and

deployment of context-aware Web services. Standard Web services descriptions are

then, enriched with context details and new frameworks to support this enrichment are

developed [13].

Aspect Oriented Programming. AOP is a paradigm that captures and modularizes

concerns that crosscut a software system into modules called Aspects. Aspects can be



integrated dynamically into a system using the dynamic weaving principle [14]. In

AOP, the unit of modularity is introduced using aspects that contain different code fragments

(known as advice) and location descriptions (known as pointcuts) that identify

where to plug the code fragment. These points, which can be selected using pointcuts,

are called join points. The most popular Aspect language is Java-based AspectJ [15].

2.2 Running Example/Motivating Scenario

Our running example is about a manufacturer of plush toys that gets extremely busy

with orders during Christmas time. When an order is received, the first step consists

of requesting from suppliers the different components that contribute to the production

of the plush toys as per an agreed time frame. When the necessary components

are received, the assembly operations begin. Finally, the manufacturer selects a logistic

company to deliver these products by the due date. In this scenario, the focus is on

the delivery service only.

Let us assume an inter-enterprise collaboration is established between the manufacturer

(service consumer) and a logistic enterprise (service provider). This latter delivers

parcels from the manufacturer’s warehouse to a specific location (Fig. 1 - step (i)).

If there are no previous interactions between these two bodies, the logistic enterprise

verifies the shipped merchandise. Upon verification approval, putting merchandise in

parcels service is immediately invoked. This one uses a data service known as parcel

service, which checks the number of parcels to deliver. Putting merchandise in parcels

service is followed by delivery price and computing delivery price data services.

The delivery price data service retrieves the delivery price that corresponds to the

manufacturer order based on the different enterprise business objects it has access to

such as toy (e.g., size of toy), customer (e.g., discount for regular customers), and

parcel (e.g., size of parcel).

Finally, the merchandise is transported to the specified location at the delivery due

date. The delivery service is considered as a SD that orchestrates four IT services and

two data services: picking merchandise, verifying merchandise, putting merchandise

in parcels, delivery price, computing delivery price, and delivering merchandise.

Fig.1 depicts a graph-based orchestration schema (for instance a BPEL process) of the

delivery service.

Fig. 1. The delivery service internal process



The inter-enterprise collaboration between the manufacturer and the logistic enterprise raises the importance of establishing dynamic (not pre-established) contacts, since different logistic enterprises exist. To this end, the orchestration schema of the SD should be aware of the contexts of both the manufacturer and the nature of the collaboration. Additional details can easily affect the progress of any type of collaboration. For example,

if the manufacturer is located outside the country, some security controls should

be added and price calculation should be reviewed. Thus, the SD orchestration

schema should be enhanced with contextual information that triggers changes in the

process components (ITSs and DSs) in a timely manner. In this case, environmental

context such as weather conditions (e.g., snow storm, heavy rain) may affect the IT

service "putting merchandise in parcels". Consequently, several actions should be

anticipated to avoid the deterioration of the merchandise, e.g., by using metal boxes instead of regular cardboard ones.

Besides, the participation of the different ITSs and DSs in the delivery SD should be managed and controlled in order to guarantee that the obtained SD is free-of-conflicts. ITSs and DSs belong to different IT departments, each with its own characteristics, rules, and constraints. As a result, effective mechanisms that would ensure and regulate the interaction of ITSs and DSs are required. Besides, these mechanisms should

also capture changes in context and guarantee the adaptation of service domains’

behaviors in order to accommodate the situation in which they are going to operate.

3 Multi-level Architecture for Service Collaboration

In this section, the multi-level architecture that supports the orchestration of DSs and

ITSs during the exercise of developing SDs is presented in terms of concepts, duties

per layer, and service specifications.

3.1 Service Domain Concept

The rationale of the service domain concept is to abstract at a higher-level the integration

of a large number of ITSs and DSs. A service domain is built upon (in fact, it

uses existing standards such as WSDL, SOAP, and UDDI) and enhances the Web

service concept. It does not define new application programming interfaces or standards,

but hides the complexity of this exercise by facilitating service deployment and

self-management. Fig. 2 illustrates the idea of developing inter-enterprise business

processes using several service domains.

A SD involves ITSs and DSs in two types of orchestration schemas: vertical and

horizontal. In the former, a SD controls a set of ITSs, which themselves control a set of DSs. In the latter, a SD controls both ITSs and DSs at the same time. These two schemas offer the possibility of applying different types of control over the services, whether data or IT. Businesses have to address different issues, so they should be given the opportunity to do so in different ways. The participation of DSs and ITSs in either type of orchestration is controlled through a set of policies, which makes it possible to guarantee that SDs are free-of-conflicts. More details on policy use are given in subsections 4.2

and 4.3.2.



Fig. 2. Inter-enterprise collaboration based on service domains

According to the running example, the delivery SD orchestrates four IT services

and two data services: picking merchandise, verifying merchandise, putting merchandise

in parcels, delivery price, computing delivery price, and delivering merchandise.

Keeping these ITSs and DSs in one place facilitates manageability and avoids extra

composition work on the client side as well as exposing non-significant services like

"Verifying merchandise" on the enterprise side.

Fig. 3. Multi-layer architecture



The multi-level architecture in Fig. 3 operates in a top-down way. It starts from the executive level and goes through the resource and business levels. These layers are

described in detail in the following.

3.2 Roles and Duties per Layer

Executive layer. In this layer, a SD consists of an Entry Module (EM), a Context

Manager Module (CMM), a Service Orchestration Module (SOM), and an Aspect

Activator Module (AAM). In Fig. 3, CMM, SOM, and AAM provide external interfaces

to the SD. The EM is SOAP-based to receive users’ requests and return responses.

In addition to these requests, the EM supports the administration of a SD.

For example, an administrator can send a register command to add a new ITS to a

given SD after signing up this ITS in the corresponding ITS registry. The register

command can also be used to add a new orchestration schema to the orchestration

schemas registry. When the EM receives a user’s request, it screens the orchestration

schemas registry to select a suitable orchestration schema for this request and identify

the best ITSs and DSs. The selection of this schema and other services takes into

account the customer context (detailed in Section 4.1). Afterwards, the selected orchestration

schema is delivered to the SOM, which is basically an orchestration engine

based on BPEL [16].

The SOM presents an external interface called the Execution Control Interface (ECI) that lets users obtain information about the status of a SD that is under execution. This interface is very useful in case of external collaboration as it ensures the monitoring of the SD's progress. This is a major difference between a SD and a regular Web service. In fact, with the ECI a SD follows the glass-box principle, which is the opposite of the black-box principle. In the glass box, the SD's way of doing is visible to the environment and mechanisms are provided to monitor the execution progress of the SD. In contrast, a Web service is seen as a black-box piece of functionality: it is described by its message interface and has no internal process structure that is visible to its environment.

Finally, the last external interface, known as the Context Detection Interface (CDI), is used by the CMM to detect and catch changes in context so that SD adaptability is guaranteed. This happens by selecting and injecting the right aspect with respect to the current context change. To this end, a SD uses the AAM to identify a suitable aspect for the current situation so that the AAM can inject this aspect into the BPEL process.
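To summarize the interplay of the four modules, the following Python sketch outlines the executive-layer control flow in a highly simplified form; all class and method names (ServiceDomain, select_schema, inject_aspect, etc.) are hypothetical and merely illustrate the roles described above, not an actual implementation.

class ServiceDomain:
    """Simplified sketch of the executive layer of a service domain (SD)."""

    def __init__(self, schema_registry, context_manager, orchestrator, aspect_activator):
        self.schema_registry = schema_registry      # orchestration schemas registry
        self.context_manager = context_manager      # CMM: detects context changes (CDI)
        self.orchestrator = orchestrator            # SOM: BPEL-based engine, exposes the ECI
        self.aspect_activator = aspect_activator    # AAM: selects and injects aspects

    def handle_request(self, request):
        # EM: screen the registry and pick a schema (and services) that fit the
        # customer context attached to the request.
        context = self.context_manager.current_context(request)
        schema = self.schema_registry.select_schema(request, context)
        process = self.orchestrator.deploy(schema)

        # CMM + AAM: while the process runs, react to context changes by
        # injecting a suitable aspect into the BPEL process.
        while not process.finished():
            change = self.context_manager.detect_change()
            if change is not None:
                aspect = self.aspect_activator.select_aspect(change)
                self.aspect_activator.inject(process, aspect)
            process.step()
        return process.result()

    def execution_status(self, process):
        # ECI: glass-box monitoring of the SD's execution progress.
        return self.orchestrator.status(process)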

Resource Layer. It consists of two sub-layers namely source and service instances.

• The sources sub-layer is populated with different registries namely orchestration

schemas, ITS, DS, context, and aspect. The orchestration

schemas registry consists of a set of abstract processes based on ITSs

and DSs, which are at a later stage implemented as executable processes.

In addition, this sub-layer includes a set of data sources that DSs use for

functioning. The processing of these requests is subject to access privileges

that could be set by different bodies in the enterprise such as security

administrators. The content of the data sources evolves over time

following data sources addition, withdrawal, or modification. According



to the running example, customer and inventory databases are examples

of data sources.

• The service instances sub-layer is populated with a set of service instances that originate from ITSs and DSs. On the one hand, DS instances collect data from BOs and not from data sources. This permits shielding DSs from semantic issues and changes in data sources, and preparing data from disparate BOs that are scattered across the enterprise. The data that a DS collects could have different recipients including SDs, ITSs, or other DSs. Basically, a DS crawls the business-objects level looking for the BOs it needs. A DS consults the states that the BOs are currently in to collect the data that are reported in these states. This collection depends on the access rights (either public or private) that are put on data; some data might not be available to DSs. According to the running example, a DS could be developed to track the status of the parcels included in the delivery process (i.e., number of used parcels). This DS would have to access the order BO and update the parcel BO. If the order BO now takes on the orderChecked state, the DS will know the parcels that are confirmed for inclusion in the delivery and hence interact with the right ITS to update the parcel BO. This is not the case if this BO were still in the orderUpdated state; some items might not have been confirmed yet. On the other hand, ITS instances implement business processes that characterize enterprises’ day-to-day activities. ITSs need the help of DSs and BOs for the data they produce and host, respectively. Update means, here, making a BO take on a new state, which is afterwards reflected on some data sources by updating their respective data. This is not the case with DSs, which consult BOs only.

Business Layer. It (1) tracks the BOs that are developed and deployed according to

the profile of the enterprise and (2) identifies the capabilities of each BO. In [17] we

suggest that BOs should exhibit a goal-driven behavior instead of just responding to

stimuli. The objective is to let BOs (i) screen the data sources that have the data they

need, (ii) resolve data conflicts in case they arise, and (iii) inform other BOs about

their capabilities of data source access and data mediation. As mentioned earlier,

order, parcel, and customer are examples of BOs. These BOs are related to each other,

e.g., an order is updated only upon inspection of customer record and order status.

The data sources that these BOs access could be customer and inventory databases.

3.3 Layer Dependencies

Executive, resource, and business layers are connected to each other through a set of

dependencies. We distinguish two types of dependencies: intra-layer dependencies

and inter-layers dependencies.

1. Intra-layer dependencies: Within the resource layer two types of intra-layer

dependencies are identified:



• Type and instance dependency: we differentiate between the ITS types that are published in the IT services registry, which are groups of similar (in terms of functionality) IT services, and the actual IT service instances that are made available for invocation. To illustrate the complexity of the dependencies that arise, we give hereafter a simple illustration of the number of IT service instances that could be obtained out of ITSs. Let us consider ITSt = {ITSt1, ..., ITStα}, a set of α IT service types, and ITSi = {ITSi1, ..., ITSiβ}, a set of β service instances that exist in the service instances sub-layer. The mapping of ITSt onto ITSi is surjective and one-to-many. Assuming each IT service type has N instantiations, β = N × α. The same dependency exists between DS types in the sources sub-layer and data service instances in the service instances sub-layer.

• Composition dependency involves orchestration schemas in the two sub-layers, namely source and service instance. Composition illustrates today’s business processes, which are generally developed using distributed and heterogeneous modules, for instance services. For an enterprise

business process, it is critical to identify the services that are required,

the data that these services require, make sure that these services collaborate,

and develop strategies in case conflicts occur.

2. Inter-layers dependencies

• Access dependency involves the service instances sub-layer and the

business layer. Current practices expose services directly to data sources,

which could hinder the reuse opportunities of these services. The opposite

is adopted here by making services interact with BOs for their data needs. For a service, it is critical to identify the BOs it needs, comply

with the access privileges of these BOs, and identify the next services

that it will interact with after completing an orchestration process. Access

dependency could be of different types namely on-demand, periodic,

or event-driven. In the on-demand case, services submit requests to

BOs when needed. In the periodic case, services submit requests to BOs

according to a certain agreed-upon plan. Finally, in the event-driven case

services submit requests to BOs in response to some event.

• Invocation dependency involves the business and source layers. An invocation

implements the technical mechanisms that allow a BO to access

the available data sources in terms of consultation or update (see Fig. 4).

A given BO includes a set of operations:

o A set of read methods, which provide various ways to retrieve and

return one or more instances of the data included in a data source.

o A set of write methods, responsible for updating (inserting, modifying,

deleting) one or more instances of the data included in the data sources.

o A set of navigation methods, responsible for traversing relationships

from one data source to one or more data of a second data source. For example,

the Customer BO can have two navigation methods getDelivOrder

and getElecOrder, to fetch for a given customer the delivery orders from a

delivery database and electronic orders from electronic order database.



Fig. 4. Invocation dependency between the business objects and the data sources
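As an illustration of the three kinds of BO operations listed above, the following Python sketch shows a Customer BO exposing read, write, and navigation methods. The navigation methods getDelivOrder and getElecOrder are the ones mentioned in the text; the connection objects and their query/update calls are hypothetical.

class CustomerBO:
    """Sketch of a Customer business object wrapping two data sources."""

    def __init__(self, delivery_db, electronic_order_db):
        self.delivery_db = delivery_db                  # e.g. a relational connection
        self.electronic_order_db = electronic_order_db  # e.g. another data source

    # read methods: retrieve one or more instances of the data
    def get_customer(self, customer_id):
        return self.delivery_db.query("customer", id=customer_id)

    # write methods: insert, modify, or delete instances of the data
    def update_customer(self, customer_id, **fields):
        return self.delivery_db.update("customer", id=customer_id, **fields)

    # navigation methods: traverse relationships across data sources
    def getDelivOrder(self, customer_id):
        # delivery orders of the customer, fetched from the delivery database
        return self.delivery_db.query("delivery_order", customer_id=customer_id)

    def getElecOrder(self, customer_id):
        # electronic orders of the customer, fetched from the electronic order database
        return self.electronic_order_db.query("electronic_order", customer_id=customer_id)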

3.4 Services Specifications

3.4.1 Data Service Specification

DSs come along with a good number of benefits that would smooth the development

of enterprise SDs:

Access unification: data related to a BO might be scattered across independent data

sources that could present three kinds of heterogeneities:

o Model heterogeneity: each data source has its own data model or data format

(relational tables, WSDL with XML schema, XML documents, flat files,

etc.).

o Interface heterogeneity: each type of data source has its own programming interface; JDBC/SQL for relational databases, REST/SOAP for Web services, file I/O calls, and custom APIs for packaged or homegrown data-centric applications (like BAPI for SAP).

The adoption of DSs relieves SOA application developers from having to directly

cope with the first two forms of heterogeneity. That is, in the field of Web services all

data sources are described using WSDL and invoked via REST or SOAP calls (which

means having the same interface), and all data are in XML form and described using

XML Schema (which means having the same data model).

Reuse and agility: the value-added of SOA to application development is reuse and

agility, but without flexibility at the data tier, this value-added could quickly erode.

Instead of relying on non-reusable proprietary codes to access and manipulate data in

monolithic application silos, DSs can be used and reused in multiple business

processes. This simplifies the development and maintenance of service-oriented applications

and introduces easy-to-use capabilities to use information in dynamic and

real-time processes.

To define DSs, we took into account the interactions that should take place between

the DSs and BOs. DSs are given access to BOs for consultation purposes. Each

DS represents a specific data-driven request whose satisfaction requires the participation

of several BOs. The following suggests an example of itemStatusOrder DS

whose role is to confirm the status of the items to include in a customer's order.



Table 1. DS service structure

In the above structure, the following arguments are used:

1. Input argument identifies the elements that need to be submitted to a DS. These

elements could be obtained from different parties such as users and other DSs.

2. Output argument identifies the elements that a DS returns after its processing is

complete.

3. Method argument identifies the actions that a DS implements in response to the

access requests it runs over the different BOs.
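Since the concrete content of Table 1 is not reproduced here, the following Python sketch only illustrates how the input, output, and method arguments of the itemStatusOrder DS could fit together; all field and helper names, and the BO interface it consults, are hypothetical.

from dataclasses import dataclass, field
from typing import List

@dataclass
class ItemStatusOrderDS:
    """Sketch of the itemStatusOrder data service: confirms the status of the
    items to include in a customer's order by consulting business objects."""

    # Input argument: elements submitted to the DS (e.g. by users or other DSs)
    order_id: str
    # Output argument: elements returned once processing is complete
    confirmed_items: List[str] = field(default_factory=list)

    # Method argument: the actions run over the different BOs (consultation only)
    def execute(self, order_bo):
        order_state = order_bo.current_state(self.order_id)   # consult the order BO
        if order_state == "orderChecked":
            self.confirmed_items = order_bo.confirmed_items(self.order_id)
        else:
            # items not confirmed yet (e.g. the BO is still in the orderUpdated state)
            self.confirmed_items = []
        return self.confirmed_items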

Because DSs could request sensitive data from BOs, we suggest in Section 4.2.1 that

appropriate policies should be developed so that data misuse cases are avoided. We

refer to these policies as privacy policies.

3.4.2 IT Service Specification

For the IT service specification, we follow the one proposed by Papazoglou and

Heuvel in [18] who specify an IT service as (1) a structural specification that defines

service types, messages, and port types, (2) a behavioral specification that defines service operations, effects, and side effects of service operations, and (3) a policy specification that defines the policy assertions and constraints on the service.

Table 2. IT service structure

Based on this specification, we propose in our work a set of policies as follows:

− Business policies correspond to policy specification.

− Behavior policy corresponds to structural specification.

− Privacy policies correspond to behavioral specification.

4 Services Collaboration

4.1 Context-Aware Orchestration

The concept of context appears in many disciplines as a meta-information that characterizes

the specific situation of an entity, to describe a group of conceptual entities,

partition a knowledge base into manageable sets or as a logical construct to facilitate

reasoning services [19]. The categorization of context is critical for the development

of adaptable applications. Context includes implicit and explicit inputs. For example,

user context can be deduced in an implicit way by the service provider, such as in a pervasive environment using physical or software sensors. Explicit context is determined precisely by the entities that the context involves. Nevertheless, despite the various attempts to suggest a context categorization, there is no proper categorization. Relevant information differs from one domain to another and depends on its effective use [20]. In this paper, we propose an OWL-based context categorization in Fig. 5. This categorization is dynamic as new sub-categories can be added at any time. Each context definition belongs to a certain category, which can be related to provider, customer, and collaboration.

Fig. 5. Ontology for categories of context



In the following, we explain the different concepts that constitute our ontology-based

model for context categorization:

− Provider-related context deals with the conditions under which providers can offer

their SDs. For example, performance attributes including some metrics to

measure a service quality: time, cost, QoS, and reputation. These attributes are

used to model the competition between providers.

− Customer-related context represents the set of available information and metadata

used by service providers to adapt their services. For example, a customer

profile permits characterizing a user.

− Collaboration-related context represents the context of the business opportunity.

We identify three sub-categories: location, time, and business domain. The location

and time represent the geographical location and the period of time within

which the business opportunity should be accomplished.

4.2 Policy Specification

As stated earlier, policies are primarily used to first, govern the collaboration between

ITSs, DSs, and SDs and second, reinforce specific aspects of this collaboration such

as when an ITS accepts to take part in a SD and when a DS rejects a data request from

an ITS because of risk of access right violation. Because of the variety of these aspects,

we decompose policies into different types and dissociate policies from the

business logics that services implement. Any change in a policy should “slightly’’

affect a service’s business logic and vice versa.

4.2.1 Types of Policies

Policies might be imposed by different types of initiators, such as the service itself, the service
provider, and the user who plans to use the service [21].

− Service-driven policy is defined by the individual organizations that offer services.

− Service-flow-driven policy is defined by the organizations offering a composite
Web service.

− Customer-driven policy is meant for the future consumers of services. Generally, a
user has various preferences in selecting a particular service, and these preferences
have to be taken into account during selection, composition, and execution. For
example, if two providers offer two services with the same functionality, the user
would like to consider the cheapest one.

Policies are used in different application domains such as telecommunications and learning,
to cite just a few, which supports the rationale for developing different types of
policies. In this paper, we suggest the following types, based on some of our previous
work [11, 22]:

− Business policy defines the constraints that restrict the completion of a business

process and determines how this process should be executed according to users’

requirements and organizations’ internal regulations. For example, a car loan
application needs to be processed within 48 hours, and a bank account should maintain
a minimum balance.

− Behavior policy supports the decisions that a service (ITS or DS) has to make
when it receives a request from an SD to be part of the orchestration schema that
is associated with this SD. In [22], we defined three behaviors, namely permission,
dispensation, and restriction, which we continue to use in this paper.
Additional details on these behaviors are given later.

− Privacy policy safeguards against cases of data misuse by different parties,
with a focus here on DSs and ITSs that interact with BOs. For example, an ITS
needs to have the necessary credentials to submit an update request to a BO.
The credentials of an ITS could be based on its history of submitting similar requests
and on its reputation level.

Fig. 6 illustrates how the three behaviors of a service (DS or ITS) are related to each
other based on the execution outcome of behavior policies [22]. In this figure, dispensation(P)
and dispensation(R) stand for dispensation related to permission and to restriction,
respectively. In addition, engagement(+) and engagement(-) stand for
positive and negative engagement in an SD, respectively.

• Permission: a service accepts to take part in a service domain upon validation of

its current commitments in other service domains.

• Restriction: a service does not wish to take part in a service domain for various

reasons such as inappropriate rewards or lack of computing resources.

• Dispensation means that a service breaks either a permission or a restriction of

engagement in a service domain. In the former case, the service refuses to engage

despite the positive permission that is granted. This could be due to the

unexpected breakdown of a resource upon which the service performance was
scheduled. In the latter case, the service does engage despite the restrictions that
are detected. The restrictions are overridden because of the priority level of the
business scenario that the service domain implements, which requires an immediate
handling of this scenario.

Fig. 6. Behaviors associated with a service (the decision flow combines the outcomes of Permission, Restriction, Dispensation(P), and Dispensation(R) into Engagement(+) or Engagement(-))
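Purely as an illustration, the decision flow of Fig. 6 could be captured by a small function like the following sketch (written in PHP; the function name, its boolean parameters, and the default case are assumptions, not part of the authors' framework):

<?php
// Illustrative sketch of the behavior decision of Fig. 6 (not the authors' code).
// $permission, $restriction, $dispensationP, and $dispensationR are assumed to be
// the boolean outcomes of the corresponding behavior policies.
function decideEngagement(bool $permission, bool $restriction,
                          bool $dispensationP, bool $dispensationR): string {
    if ($permission) {
        // Dispensation(P): the service refuses to engage despite the granted permission.
        return $dispensationP ? 'Engagement(-)' : 'Engagement(+)';
    }
    if ($restriction) {
        // Dispensation(R): the service engages despite the detected restriction.
        return $dispensationR ? 'Engagement(+)' : 'Engagement(-)';
    }
    // Neither policy fires: we assume no engagement (this case is not detailed in the paper).
    return 'Engagement(-)';
}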

In [11], Maamar et al. report that several types of policy specification languages exist.

The selection of a policy specification language is guided by some requirements that

need to be satisfied [23]: expressiveness to support the wide range of policy requirements

arising in the system being managed, simplicity to ease the policy definition

tasks for people with various levels of expertise, enforceability to ensure a mapping of

policy specification into concrete policies for various platforms, scalability to guarantee

adequate performance, and analyzability to allow reasoning about and over policies.

In this paper, we adopt WSPL. WSPL syntax is based on the OASIS
eXtensible Access Control Markup Language (XACML) standard (www.oasis-open.org/committees/download.php/2406/oasis-xacml-1.0.pdf).

Listing 1 presents a specification of a behavior policy, with a focus on privacy, in
WSPL. It shows an example of an ITS that checks the minimum age and income of a
person prior to approving a car loan application.

Listing 1. A behavior policy specification for an ITS

Listing 2 presents a specification of a business policy in WSPL. It shows an
example of a DS that checks the possibility of taking part in a service domain.

Listing 2. A business policy specification for a DS

In addition to the arguments that form WSPL-defined policies, we add further
arguments for the purpose of tracking the execution of these policies (a sketch of how
they might be grouped is given after the list). These arguments are as follows:

− Purpose: describes the rationale of developing a policy P.

− Monitoring authority: identifies the party that checks the applicability of a

policy P so that the outcomes of this policy are enforced. Examples of such
parties are the service provider and the policy developer.

− Scope (local or global): identifies the parties that are involved in the execution
of a policy P. “Local” means that the policy involves one specific service,
whereas “global” means that the policy involves several services.

− Side-effect: describes the policies that could be triggered following the

completion of policy P.

− Restriction: limits the applicability of a policy P according to different factors

such as time (e.g., business hours) and location (e.g., departments affected

by policy P performance).
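To make the role of these tracking arguments concrete, the following sketch shows how they might be grouped as metadata attached to a policy (illustrative only; the field names follow the list above and the values are invented):

<?php
// Illustrative only: possible metadata attached to a policy P for tracking purposes.
// Field names follow the arguments listed above; the values are invented examples.
$policyMetadata = array(
    'purpose'              => 'check the minimum age and income of a car loan applicant',
    'monitoring_authority' => 'service provider',
    'scope'                => 'local',   // 'local': one specific service; 'global': several services
    'side_effects'         => array('notify-credit-bureau policy'),
    'restrictions'         => array('time' => 'business hours', 'location' => 'loan department'),
);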

4.3 Service Domain Adaptability Using Aspects

In the following, we describe how we define and implement a context-adaptive Service
Domain using AOP.

4.3.1 Rationale of AOP

AOP is based on two arguments. First, AOP enables the separation of crosscutting concerns, which is
crucial to keep context information apart from the business logic. For example, in
the Delivery Service Domain, an aspect related to the calculation of extra fees could be
defined in case there is a change in the delivery date. Second, AOP promotes the

dynamic weaving principle. Aspects are activated and deactivated at runtime. Consequently,

a BPEL process can be dynamically altered upon request.

For the needs of SD adaptation, we suggest the following improvements to existing
AOP techniques: runtime activation of aspects in the BPEL process to enable
dynamic adaptation according to context changes, and aspect selection to enable
customer-specific contextualization of the Service Domain.

4.3.2 Using Policies to Express Contexts

Modeling context is a crucial issue that needs to be addressed to assist context-aware

applications. By context modeling we mean the language that will be used to define

both service and enterprise collaboration contexts. Since there is a diversity of contextual

information, we find several context modeling languages such as ConteXtML [24],

contextual schemas [25], CxBR (context-based reasoning) [26], and CxG (contextual

graphs) [27]. These languages provide the means for defining context in specific application

domains such as pervasive and mobile computing. All these representations have



strengths and weaknesses. As stated in [28], lack of generality is the most frequent

drawback: usually, each representation is suited for only a specific type of application

and expresses a narrow vision of the context. Consequently, they present little or no

support for defining context in Web service based collaboration scenarios. In this paper,
we model the different types of context based on policies. The relation between context and
policies is captured in the definitions below:

Definition 1. A service context Ctxt is a pair (Ctxt-name, P) where Ctxt-name corresponds
to the context name derived from the context ontology (Fig. 5) and P is the policy
related to the given context Ctxt. Let P-set = {P1, P2, …, Pn} denote the set of
policies and SCx = {Cx1, Cx2, …, Cxn} the set of context properties related to a particular
ITS or DS. We express the mapping between ITS or data service contexts and
policies with the mapping function MFs: SCx → P-set, which gives the policies related
to a given ITS or data service.

Definition 2. A customer context Custxt is a pair (Custxt-name, P) where Custxt-name
corresponds to the context name derived from the context ontology (Fig. 5) and P is the
policy related to the given context Custxt. Let P-set = {P1, P2, …, Pn} denote the set of
policies and CCx = {Cx1, Cx2, …, Cxn} the set of context properties related to a particular
customer. As in Definition 1, we define a mapping function which retrieves the
set of policies related to a given customer context: MFc: CCx → P-set.

In these definitions, context is described with policies. Consequently, to express context
we first need to express policies. We introduce the specification of context
(customer, collaboration, and service contexts) in WSPL. Introducing the context
concept in WSPL stems from the need to specify certain constraints that can depend
on the environment in which the customer, the service, and the business collaboration
are operational. For instance, a customer context that expresses a security requirement
can be specified as follows.

Listing 3. A customer context specified as a policy

4.3.3 Controlled Aspect Injection through Policies

We show how policies and AOP can work hand in hand. Policies related to customer
and collaboration contexts are used to control the aspect injection within an SD. An SD

provides a set of adaptation actions that are context dependent. We implement these

actions as a set of aspects in order not to create any invasive code in the functional

service implementation. An aspect includes a pointcut that matches a given ITS or



data service and one or more advices. These advices refer to the context dependent

adaptation actions of this service. Advices are defined as Java methods and pointcuts

are specified in XML format.

Our implementation approach for the controlled aspect injection through policies is
presented in Fig. 7. In this figure, the Aspect Activator Module, previously presented in
Fig. 3, includes the Aspect Manager Module (AMM), the Matching Module (MM),

and the Weaver Module (WM).

− The AMM is responsible for adding new aspects to a corresponding aspect registry.

In addition, the AMM can deal with a new advice implementation, which

could be added to this registry. The aspect registry contains the method names

of the different advices related to a given ITS or data service.

− The MM is the cornerstone of the proposed aspect injection approach. It receives

matching requests from the AMM and returns one or a list of matched

aspects.

− The WM is based on an AOP mechanism known as weaving. The WM performs

a run time weaving, which consists of injecting an advice implementation into

the core logic of an ITS or data service.

The control of an aspect injection into a DS or ITS proceeds as follows. Once a context-dependent
IT service or data service operation is reached, the Context Manager Module
sends the Aspect Activator Module the service’s ID and the ID of its context-dependent operation. Then, the
AMM identifies the set of aspects that can be executed for the ITS or DS based on the
information sent by the Context Manager Module (i.e., the service’s ID and the operation’s
ID) (actions 1 and 2). The set of aspects as well as the customer policies are
transmitted to the Matching Module, which returns the aspects that match the customer
and the collaboration policies. The Matching Module is based on a matching algorithm
and uses a domain ontology. Finally, the WM integrates the advice implementation
into the core logic of the service. By doing so, the service will execute the appropriate
aspect in response to the current context (customer and collaboration contexts).

Fig. 7. Controlled aspect injection through policies
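A minimal sketch of this control flow is given below, purely for illustration (in PHP; the class and method names, the registry layout, and the matching criterion are assumptions, not the modules' actual implementation):

<?php
// Illustrative sketch of the control flow described above (not the actual modules).
class AspectManagerModule {
    private $registry = array();   // "serviceId.operationId" => list of aspect names
    public function register($serviceId, $operationId, $aspect) {
        $this->registry["$serviceId.$operationId"][] = $aspect;
    }
    public function lookup($serviceId, $operationId) {
        $key = "$serviceId.$operationId";
        return isset($this->registry[$key]) ? $this->registry[$key] : array();
    }
}

class MatchingModule {
    // Trivial stand-in for the ontology-based matching algorithm.
    public function match($candidateAspects, $allowedByPolicies) {
        return array_values(array_intersect($candidateAspects, $allowedByPolicies));
    }
}

class WeaverModule {
    public function weave($serviceId, $aspect) {
        echo "weaving advice of '$aspect' into service '$serviceId'\n";
    }
}

// A context-dependent operation of a (hypothetical) delivery ITS is reached:
$amm = new AspectManagerModule();
$amm->register('DeliveryITS', 'computeFees', 'extraFeesOnDateChange');
$amm->register('DeliveryITS', 'computeFees', 'standardFees');

$mm = new MatchingModule();
$wm = new WeaverModule();

$candidates = $amm->lookup('DeliveryITS', 'computeFees');            // actions 1 and 2
$matched    = $mm->match($candidates, array('extraFeesOnDateChange'));
foreach ($matched as $aspect) {
    $wm->weave('DeliveryITS', $aspect);                              // run-time weaving
}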

For illustration purposes, consider a payment ITS which is aware of past interactions
with customers. For loyal customers, credit card payment is accepted, whereas a bank
transfer is required for new customers. Hence, the payment operation depends on the
customer context, i.e., whether the customer is loyal or new. The context-dependent
behaviors of the payment ITS are exposed as a set of aspects. Three of them are depicted in Listing 4.

Listing 4. The three aspects (Aspect 1, Aspect 2, and Aspect 3) related to the payment ITS

For example, the advice of Aspect 1 is expressed as a Java class, which is executed
instead of the operation captured by the pointcut (line 9). The join point, where the
advice is woven, is the payment operation (line 10). The pointcut is expressed as a
condition: if Customer = "Loyal" (i.e., "past interaction = Yes"), the advice uses the
credit card number in order to perform the customer payment. Consider now the
customer-related context described previously, which specifies a security requirement.
Based on this requirement, when executing the payment service, the
matching module will determine that only Aspect 3, with a secured transaction, should be
applied. This aspect is then transmitted to the weaver module in order to be injected into
the payment service.
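Purely as an illustration of the kind of context-dependent behaviors involved (in the paper the advices are Java methods with XML pointcuts, and the mapping of behaviors to the aspect numbers of Listing 4 is our assumption), the three payment behaviors could be sketched as follows:

<?php
// Illustrative sketch only; not the advices of Listing 4.
// Aspect 1: loyal customers ("past interaction = Yes") pay by credit card.
function payByCreditCard($cardNumber, $amount) {
    return "charged $amount to credit card $cardNumber";
}

// Aspect 2 (assumed): new customers are asked for a bank transfer.
function payByBankTransfer($customerName, $amount) {
    return "requested a bank transfer of $amount from $customerName";
}

// Aspect 3: the payment is performed over a secured transaction, matching the
// customer's security requirement expressed as a context policy (Listing 3).
function payOverSecuredTransaction($cardNumber, $amount) {
    return "charged $amount to credit card $cardNumber over a secured connection";
}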

5 Related Work

In this work, we identify two types of work related to our proposal: on the one hand,
proposals that come from the data engineering field and propose approaches for
data service modeling and development; and, on the other hand, proposals that focus
specifically on the adaptation of ITSs (Web services).

Data services & the SOA software industry. Data services have gained considerable

attention from SOA software industry leaders over the last three years. Many

products are currently offered or being developed to make the creation of Data services
easier than ever; to cite a few: AquaLogic by BEA Systems [29], Astoria by

Microsoft [30], MetaMatrix by RedHat [31], Composite Software [9], Xcalia [32],

and IBM [33]. The products offered here integrate the enterprise’s data sources and

provide a uniform access to data through Data services. As a representative example,



AquaLogic BEA’s data service is a collection of functions that all have a common

output schema, accept different sets of parameters, and are implemented via individual

XQuery expressions. In a simplified example, a Data Service exports a set of

functions returning Customer objects where one function takes as input the customer’s

last name, another one her city and state, and so on. AquaLogic exports these

Data services to SOA application developers as Data Web Services, where functions

become operations.

In Microsoft’s Astoria project, a data service (ADO.NET Data Services) is a REST-based
framework that allows releasing data via flexible data services and well-known
industry standards (JSON and Atom). As opposed to message-oriented frameworks
like SOAP-based services, REST-based services use basic HTTP requests (GET,
POST, PUT, and DELETE) to perform CRUD (Create, Read, Update, and Delete) operations.
Such query patterns allow navigating through data, following the
links established with the data schema. For example, /Customers('PKEY1')/Orders(1)/Employees
returns the employees that created sales order 1 for the customer with
a key of 'PKEY1'. (Source: http://msdn.microsoft.com/en-us/library/cc907912.aspx)

In addition, most commercial database products incorporate mechanisms to export
database functionalities as Data Web services. Representative examples are the IBM
Document Access Definition Extension (DADX) technology (DB2 XML Extender 1)
and the Native XML Web Services for Microsoft SQL Server 2005 [34]. DADX is

part of the IBM DB2 XML Extender, an XML/relational mapping layer, and facilitates

the development of Web services on top of relational databases that can, among

other things, execute SQL queries and retrieve relational data as XML.

Web services adaptation. Regarding the adaptation of Web services according to
context changes [35, 36], a lot of ongoing research has been reported. In the proposed
work, we focus especially on the adaptation of a process. Some research efforts

from the Workflow community address the need for adaptability. They focus on formal

methods to make the workflow process able to adapt to changes in the environment

conditions. For example, authors in [37] propose eFlow with several constructs

to achieve adaptability. The authors use parallel execution of multiple equivalent

services and the notion of generic service that can be replaced by a specific set of

services at runtime. However, adaptability remains insufficient and vendor specific.

Moreover, many adaptation triggers, like infrastructure changes, considered by workflow

adaptation are not relevant for Web services because services hide all implementation

details and only expose interfaces described in terms of types of exchanged

messages and message exchange patterns. In addition, authors in [38] extend existing

process modeling languages to add context sensitive regions (i.e., parts of the business

process that may have different behaviors depending on context). They also introduce

context change patterns as a means to identify the contextual situations (and especially

context change situations) that may have an impact on the behavior of a business

process. In addition, they propose a set of transformation rules that allow generating
a BPEL-based business process from a context-sensitive business process. However,
the context change patterns which regulate the context changes are specific to their running
example, with no emphasis on proposing more generic patterns.

1 Go online to http://www.306.ibm.com/software/data/db2/extenders/xmlext/



There are a few works using aspect-based adaptability in BPEL. In [39], the authors
present an aspect-oriented extension to BPEL, AO4BPEL, which allows
dynamically adaptable BPEL orchestrations. The authors combine business rules modeled

as Aspects with a BPEL orchestration engine. When implementing rules, the

choice of the pointcut depends only on the activities (invoke, reply or sequence).

Business rules in this work are very simple and do not express a pragmatic adaptability

constraint like context change in our case. Another work is proposed in [40] in

which the authors propose a policy-driven adaptation and dynamic specification of

Aspects to enable instance specific customization of the service composition. However,

they do not mention how they can present the aspect advices or how they will

consider the pointcuts.

6 Conclusion

In this paper, we presented a multi-level architecture that supports the design and
development of a high-level type of service known as Service Domain, which orchestrates
a set of related ITSs and DSs. The Service Domain enhances the Web service
concept to tackle the challenges that E-Business collaboration poses. In addition, to
address enterprise adaptability to context changes, we made the Service Domain sensitive
to context. We enhanced BPEL execution with AOP mechanisms. We have shown
that AOP enables crosscutting and context-sensitive logic to be factored out of the
service orchestration and modularized into aspects. Last but not least, we illustrated
the role of policies and context in a Service Domain. Different types of policies were
proposed and then used, first, to manage the participation of DSs and ITSs in SDs and,
second, to control aspect injection within an SD. In terms of future work, we plan to
complete the SD multi-level architecture and conduct a complete empirical study of
our approach.

References

1. Carey, M., et al.: Integrating enterprise information on demand with xQuery. XML Journal

2(6/7) (2003)

2. Yang, S.J.H., et al.: A new approach for context aware SOA. In: Proc. e-Technology, e-

Commerce and e-Service, EEE 2005, pp. 438–443 (2005)

3. Gorton, S., et al.: StPowla: SOA, Policies and Workflows, pp. 351–362 (2007)

4. Arsanjani, A.: Service-oriented modeling and architecture (2004),

http://www.ibm.com/developerworks/library/ws-soa-design1/

5. Erl, T.: Service-Oriented Architecture (SOA): Concepts, Technology, and Design, p. 792.

Prentice Hall, Englewood Cliffs (2005)

6. Huang, Y., et al.: A Service Management Framework for Service-Oriented Enterprises. In:

Proceedings of the IEEE International Conference on E-Commerce Technology (2004)

7. Braun, C., Winter, R.: Integration of IT Service Management into Enterprise Architecture.

In: Proc. The 22th Annual ACM Symposium on Applied Computing, SAC 2007 (2007)



8. Gilpin, M., Yuhanna, N.: Information-As-A-Service: What’s Behind This Hot New Trend?

(2007),

http://www.forrester.com/Research/Document/Excerpt/

0,7211,41913,00.html

9. Composite Software: SOA Data Services Solutions. Technical report (2008),

http://compositesoftware.com/solutions/soa.html

10. Lupu, E., Sloman, M.: Conflicts in Policy-Based Distributed Systems Management. IEEE

Transactions on Software Engineering 25(6) (1999)

11. Zakaria, M., et al.: Using policies to manage composite Web services. IT Professional 8(5)

(2006)

12. Coutaz, J., et al.: Context is key. Communications of the ACM 48(3) (2005)

13. Keidl, M., Kemper, A.: A Framework for Context-Aware Adaptable Web Services (Demonstration).

In: Bertino, E., Christodoulakis, S., Plexousakis, D., Christophides, V., Koubarakis,

M., Böhm, K., Ferrari, E. (eds.) EDBT 2004. LNCS, vol. 2992, pp. 826–829.

Springer, Heidelberg (2004)

14. AOP, Aspect-Oriented Software Development (2007), http://www.aosd.net

15. AspectJ, The AspectJ Programming Guide (2007),

http://dev.eclipse.org/viewcvs/indextech.cgi/~checkout~aspectj-home/doc/progguide/index.html

16. Andrews, T., et al.: Business Process Execution Language for Web Services (2003),

http://www.ibm.com/developerworks/library/specification/

ws-bpel/

17. Maamar, Z., Sutherland, J.: Toward Intelligent Business Objects. Communications of the

ACM 43(10)

18. Papazoglou, M.P., Heuvel, W.-J.v.d.: Service-oriented design and development methodology.

International Journal of Web Engineering and Technology (IJWET) 2(4), 412–442

(2006)

19. Benslimane, D., Arara, A., Falquet, G., Maamar, Z., Thiran, P., Gargouri, F.: Contextual

Ontologies: Motivations, Challenges, and Solutions. In: Yakhno, T., Neuhold, E.J. (eds.)

ADVIS 2006. LNCS, vol. 4243, pp. 168–176. Springer, Heidelberg (2006)

20. Mostefaoui, S.K., Mostefaoui, G.K.: Towards A Contextualisation of Service Discovery

and Composition for Pervasive Environments. In: Proc. the Workshop on Web-services

and Agent-based Engineering (2003)

21. Dan, A.: Use of WS-Agreement in Job Submission (September 2004)

22. Maamar, Z., et al.: Towards a context-based multi-type policy approach for Web services

composition. Data & Knowledge Engineering 62(2) (2007)

23. Damianou, N., Dulay, N., Lupu, E.C., Sloman, M.: The ponder policy specification language.

In: Sloman, M., Lobo, J., Lupu, E.C. (eds.) POLICY 2001. LNCS, vol. 1995, pp.

18–38. Springer, Heidelberg (2001)

24. Ryan, N.: ConteXtML: Exchanging contextual information between a Mobile Client and

the FieldNote Server,

http://www.cs.kent.ac.uk/projects/mobicomp/fnc/

ConteXtML.html

25. Turner, R.M.: Context-mediated behavior for intelligent agents. Human-Computer studies

48(3), 307–330 (1998)

26. Gonzales, A.J., Ahlers, R.: Context-based representation of intelligent behavior in training

simulations. International Transactions of the Society for Computer Simulation, 153–166

(1999)



27. Brezillon, P.: Context-based modeling of operators’ Practices by Contextual Graphs. In:

Proc. 14th Mini Euro Conference in Human Centered Processes (2003)

28. Bucur, O., et al.: What Is Context and How Can an Agent Learn to Find and Use it When

Making Decisions? In: Proc. International Workshop of Central and Eastern Europe on Multi-Agent
Systems, pp. 112–121 (2005)

29. Carey, M.: Data delivery in a service-oriented world: the BEA aquaLogic data services

platform. In: Proc. The 2006 ACM SIGMOD international conference on Management of

data (2006)

30. Microsoft Corporation: ADO.NET Data Services (also known as Project Astoria) (2007),

http://astoria.mslivelabs.com/

31. Hat, R.: MetaMatrix Enterprise Data Services Platform (2007),

http://www.redhat.com/jboss/platforms/dataservices/

32. Xcalia Inc.: Xcalia Data Access Services (2009), http://www.xcalia.com/products/xcalia-xdasdata-access-service-SDO-DAS-data-integration-through-web-services.jsp

33. Williams, K., Daniel, B.: SOA Web Services - Data Access Service. Java Developer’s

Journal (2006)

34. Microsoft, Native XML Web services for Microsoft SQL server (2005),

http://msdn2.microsoft.com/en-us/library/ms345123.aspx

35. Maamar, Z., et al.: Towards a context-based multi-type policy approach for Web services

composition. Data & Knowledge Engineering 62(2), 327–351 (2007)

36. Bettini, C., et al.: Distributed Context Monitoring for the Adaptation of Continuous Services.

World Wide Web 10(4), 503–528 (2007)

37. Casati, F., Ilnicki, S., Jin, L., Krishnamoorthy, V., Shan, M.-C.: Adaptive and Dynamic

Service Composition in eFlow. In: Wangler, B., Bergman, L.D. (eds.) CAiSE 2000.

LNCS, vol. 1789, p. 13. Springer, Heidelberg (2000)

38. Modafferi, S., et al.: A Methodology for Designing and Managing Context-Aware Workflows.

In: Mobile Information Systems II, pp. 91–106 (2005)

39. Charfi, A., Mezini, M.: AO4BPEL: An Aspect-oriented Extension to BPEL. World Wide

Web 10(3), 309–344 (2007)

40. Erradi, A., et al.: Towards a Policy-Driven Framework for Adaptive Web Services Composition.

In: Proceedings of the International Conference on Next Generation Web Services

Practices 2005, pp. 33–38 (2005)


Facilitating Controlled Tests of Website Design

Changes Using Aspect-Oriented Software

Development and Software Product Lines

Javier Cámara 1 and Alfred Kobsa 2

1 Department of Computer Science, University of Málaga

Campus de Teatinos, 29071. Málaga, Spain

jcamara@lcc.uma.es

2 Dept. of Informatics, University of California, Irvine

Bren School of Information and Computer Sciences, Irvine, CA 92697, USA

kobsa@uci.edu

Abstract. Controlled online experiments in which envisaged changes

to a website are first tested live with a small subset of site visitors have

proven to predict the effects of these changes quite accurately. However,

these experiments often require expensive infrastructure and are costly in

terms of development effort. This paper advocates a systematic approach

to the design and implementation of such experiments in order to overcome

the aforementioned drawbacks by making use of Aspect-Oriented

Software Development and Software Product Lines.

1 Introduction

During the past few years, e-commerce on the Internet has experienced a remarkable

growth. For online vendors like Amazon, Expedia and many others,

creating a user interface that maximizes sales is thereby crucially important. Different

studies [11,10] revealed that small changes at the user interface can cause

surprisingly large differences in the amount of purchases made, and even minor

difference in sales can make a big difference in the long run. Therefore, interface

modifications must not be taken lightly but should be carefully planned.

Experience has shown that it is very difficult for interface designers and marketing

experts to foresee how users react to small changes in websites. The behavioral

difference that users exhibit at Web pages with minimal differences in

structure or content quite often deviates considerably from all plausible predictions

that designers had initially made [22,30,27]. For this reason, several techniques

have been developed by industry that use actual user behavior to measure

the benefits of design modifications [17]. These techniques for controlled online

experiments on the Web can help to anticipate users’ reactions without putting

a company’s revenue at risk. This is achieved by implementing and studying the

effects of modifications on a tiny subset of users rather than testing new ideas

directly on the complete user base.

Although the theoretical foundations of such experiments have been well established,

and interesting practical lessons compiled in the literature [16], the




infrastructure required to implement such experiments is expensive in most cases

and does not support a systematic approach to experimental variation. Rather,

the support for each test is usually crafted for specific situations.

In this work, we advocate a systematic approach to the design and implementation
of such experiments based on Software Product Lines [7] and Aspect-Oriented
Software Development (AOSD) [12]. Section 2 provides an overview of

the different techniques involved in online tests, and Section 3 points out some of

their shortcomings. Section 4 describes our systematic approach to the problem,

giving a brief introduction to software product lines and AOSD. Section 5 introduces

a prototype tool that we developed to test the feasibility of our approach.

Section 6 compares our proposal to currently available solutions, and Section 7

presents some conclusions and future work.

2 Controlled Online Tests on the Web: An Overview

The underlying idea behind controlled online tests of a Web interface is to create

one or more different versions of it by incorporating new or modified features,

and to test each version by presenting it to a randomly selected subset of users
in order to analyze their reactions. User response is measured along an overall

evaluation criterion (OEC) or fitness function, which indicates the performance

of the different versions or variants. A simple yet common OEC in e-commerce is

the conversion rate, that is, the percentage of site visits that result in a purchase.

OECs may however also be very elaborate, and consider different factors of user

behavior.
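With the conversion rate as OEC, for instance, the value measured for a variant is simply

conversion rate = (number of visits that result in a purchase) / (total number of visits)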

Controlled online experiments can be classified into two major categories,

depending on the number of variables involved:

Fig. 1. Checkout screen: variants A (original, left) and B (modified, right) 1

1 © 2007 ACM, Inc. Included by permission.



– A/B, A/B/C, ..., A/../N split testing: These tests compare one or more

variations of a single site element or factor, such as a promotional offer. Site

developers can quickly see which variation of the factor is most persuasive

and yields the highest conversion rates. In the simplest case (A/B test), the

original version of the interface is served to 50% of the users (A or Control

Group), and the modified version is served to the other 50% (B or Treatment

Group 2 ). While A/B tests are simple to conduct, they are often not very

informative. For instance, consider Figure 1, which depicts the original version

and a variant of a checkout example taken from [11]. 3 This variant has

been obtained by modifying 9 different factors. While an A/B test tells us

which of two alternatives is better, it does not yield reliable information on

how combinations of the different factors influence the performance of the

variant.

– Multivariate testing: A multivariate test can be viewed as a combination

of many A/B tests, whereby all factors are systematically varied. Multivariate

testing extends the effectiveness of online tests by allowing the impact

of interactions between factors to be measured. A multivariate test can, e.g.,

reveal that two interface elements yield an unexpectedly high conversion rate

only when they occur together, or that an element that has a positive effect

on conversion loses this effect in the presence of other elements.

The execution of a test can be logically separated into two steps, namely (a)

the assignment of users to the test, and to one of the subgroups for each of the

interfaces to be tested, and (b) the subsequent selection and presentation of this

interface to the user. The implementation of online tests partly blurs the two

different steps.

The assignment of users to different subgroups is generally randomized, but

different methods exist such as:

– Pseudo-random assignment with caching: consists in the use of a

pseudo-random number generator coupled with some form of caching in order

to preserve consistency between sessions (i.e., a user should be assigned

to the same interface variant on successive visits to the site); and

– Hash and partitioning: assigns a unique user identifier that is either stored
in a database or in a cookie. The entire set of identifiers is then partitioned,
and each partition is assigned to a variant (see the sketch below). This second method is usually
preferred due to scalability problems with the first method.
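A minimal sketch of the hash-and-partition scheme, assuming a persistent user identifier is already available (e.g., from a cookie), could look as follows (illustrative only):

<?php
// Illustrative sketch: deterministic hash-and-partition user assignment.
// The same identifier is always mapped to the same variant across visits.
function assignVariant($userId, $variants) {
    $partition = abs(crc32($userId)) % count($variants);
    return $variants[$partition];
}

// Simplest A/B case (in practice the treatment group would be a much smaller fraction):
echo assignVariant('user-42', array('A', 'B'));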

Three implementation methods are being used for the selection and presentation

of the interface to the user:

2 In reality, the treatment group will only comprise a tiny fraction of the users of a

website, so as to keep losses low if the conversion rate of the treatment version should

turn out to be poorer than that of the existing version.

3 Eisenberg reports that Interface A resulted in 90% fewer purchases, probably because

potential buyers who had no promotion code were put off by the fact that others

could get lower prices.



– Traffic splitting: In order to generate the different variants, different implementations

are created and placed on different physical or virtual servers.

Then, by using a proxy or a load balancer which invokes the randomization

algorithm, a user’s traffic is diverted to the assigned variant.

– Server-side selection: All the logic which invokes the randomization algorithm

and produces the different variants for users is embedded in the code

of the site.

– Client-side selection: Assignment and generation of variants is achieved

through dynamic modification of each requested page at the client side using

JavaScript.

3 Problems with Current Online Test Design and

Implementation

The three implementation methods discussed above entail a number of disadvantages,

which are a function of the choices made at the architectural level and

not of the specific characteristics of an online experiment (such as the chosen

OEC or the interface features being modified):

– Traffic splitting: Although traffic splitting does not require any changes to

the code in order to produce the different user assignments to variants, the

implementation of this approach is relatively expensive. The website and

the code for the measurement of the OEC have to be replicated n times,

where n is the number of tested combinations of different factors (number of

possible variants). In addition to the complexity of creating each variant for

the test manually by modifying the original website’s code (impossible in the

case of multivariate tests involving several factors), there is also a problem
associated with the hardware required for the execution of the test. If physical
servers are used, a fleet of servers will be needed so that each of the variants
tested will be hosted on one of them. Likewise, if virtual servers are being
used, the amount of system resources required to accommodate the workload

will easily exceed the capacity of the physical server, requiring the use of

several servers and complicating the supporting infrastructure.

– Server-side selection: Extensive code modification is required if interface

selection and presentation is performed at the server side. Not only has

randomization and user assignment to be embedded in the code, but also a

branching logic has to be added in order to produce the different interfaces

corresponding to the different combinations of variants. In addition, the code

may become unnecessarily complex, particularly if different combinations of
factors are to be considered at the same time when tests are being run
concurrently. However, if these problems are solved, server-side selection is

a powerful alternative which allows deep modifications to the system and is

cheap in terms of supporting infrastructure.

– Client-side selection: Although client-side selection is to some extent easier

to implement than server-side selection, it suffers from the same shortcomings.

In addition, the features subject to experimentation are far more


limited (e.g., modifications which go beyond the mere interface are not possible,
JavaScript must be enabled in the client browser, execution is error-prone,
etc.).

Independent of the chosen form of implementation, substantial support for systematic

online experimentation at a framework level is urgently needed. The

framework will need to support the definition of the different factors and their possible

combinations at the test design stage, and their execution at runtime. Being

able to evolve a site safely by keeping track of each of the variants’ performance

as well as maintaining a record of the different experiments is very desirable when

contrasted with the execution of isolated tests on an ad-hoc basis.

4 A Systematic Approach to Online Test Design and

Implementation

To overcome the various limitations described in the previous section, we advocate

a systematic approach to the development of online experiments. For this

purpose, we rely on two different foundations: (i) software product lines provide

the means to properly model the variability inherent in the design of the experiments,

and (ii) aspect-oriented software development (AOSD) helps to reduce

the effort and cost of implementing the variants of the test by capturing variation

factors on aspects. The use of AOSD will also help in presenting variants to

users, as well as simplifying user assignment and data collection. By combining

these two foundations we aim at supplying developers with the necessary tools to

design tests in a systematic manner, enabling the partial automation of variant

generation and the complete automation of test deployment and execution.

4.1 Test Design Using Software Product Lines

Software Product Line models describe all requirements or features in the potential

variants of a system. In this work, we use a feature-based model similar to

the models employed by FODA [13] or FORM [14]. This model takes the form

of a lattice of parent-child relationships which is typically quite large. Single

systems or variants are then built by selecting a set of features from the model.

Product line models allow the definition of the directly reusable (DR) or mandatory

features which are common to all possible variants, and three types of discriminants

or variation points, namely:

– Single adaptors (SA): a set of mutually exclusive features from which

only one can be chosen when defining a particular system.

– Multiple adaptors (MA): a list of alternatives which are not mutually

exclusive. At least one must be chosen.

– Options (O): a single optional feature that may or may not be included in

a system definition.



F1(MA) The cart component must include a checkout screen.

– F1.1(SA) There must be an additional “Continue Shopping” button present.

• F1.1.1(DR) The button is placed on top of the screen.

• F1.1.2(DR) The button is placed at the bottom of the screen.

– F1.2(O) There must be an “Update” button placed under the quantity box.

– F1.3(SA) There must be a “Total” present.

• F1.3.1(DR) Text and amount of the “Total” appear in different boxes.

• F1.3.2(DR) Text and amount of the “Total” appear in the same box.

– F1.4(O) The screen must provide discount options to the user.

• F1.4.1(DR) There is a “Discount” box present, with amount in a box next

to it on top of the “Total” box.

• F1.4.2(DR) There is an “Enter Coupon Code” input box present on top of

“Shipping Method”.

• F1.4.3(DR) There must be a “Recalculate” button left of “Continue Shopping.”

Fig. 2. Feature model fragment corresponding to the checkout screen depicted

in Figure 1

In order to define the different interface variants that are present in an online test,

we specify all common interface features as DR features in a product line model.

Varying elements are modeled using discriminants. Different combinations of

interface features will result in different interface variants. An example for such

a feature model is given in Figure 2, which shows a fragment of a definition

of some of the commonalities and discriminants of the two interface variants

depicted in Figure 1.

Variants can be manually created by the test designer through the selection

of the desired interface features in the feature model, or automatically by generating

all the possible combinations of feature selections. Automatic generation

is especially interesting in the case of multivariate testing. However, it is worth

noting that not all combinations of feature selections need to be valid. For instance,

if we intend to generate a variant which includes F1.3.1 in our example,

that same selection cannot include F1.3.2 (single adaptor).

Likewise, if F1.4 is selected, it is mandatory to include F1.4.1-F1.4.3 in the

selection. These restrictions are introduced by the discriminants used in the

product line model. If restrictions are not satisfied, we have generated an invalid

variant that should not be presented to users. Therefore, generating all possible

feature combinations for a multivariate test is not enough for our purposes.

Fortunately, the feature model can be easily translated into a logical expression

by using features as atomic propositions and discriminants as logical connectors.

The logical expression of a feature model is the conjunction of the logical

expressions for each of the sub-graphs in the lattice and is achieved using logical

AND. If Gi and Gj are the logical expressions for two different sub-graphs, then

the logical expression for the lattice is:



Gi ∧ Gj

Parent-child dependency is expressed using a logical AND as well. If ai is a
parent requirement and aj is a child requirement such that the selection of aj is
dependent on ai, then ai ∧ aj. If ai also has other children ak ... az, then:

ai ∧ (ak ∧ ... ∧ az)

The logical expression for a single adaptor discriminant is exclusive OR. If ai
and aj are features such that ai is mutually exclusive to aj, then ai ⊕ aj. Multiple
adaptor discriminants correspond to logical OR. If ai and aj are features such
that at least one of them must be chosen, then ai ∨ aj. The logical expression for
an option discriminant is a bi-conditional 4. If ai is the parent of another feature
aj, then the relationship between the two features is ai ↔ aj.

Table 1 summarizes the relationships and logical definitions of the model. The
general expression for a product line model is G1 ∧ G2 ∧ ... ∧ Gn, where Gi is
ai R aj R ak R ... R an and R is one of ∧, ∨, ⊕, or ↔. The logical expression
for the checkout example feature model shown in Figure 2 is:

F1 ∧ ( F1.1 ∧ (F1.1.1 ⊕ F1.1.2) ∨
       F1.2 ∨
       F1.3 ∧ (F1.3.1 ⊕ F1.3.2) ∨
       F1.4 ↔ (F1.4.1 ∧ F1.4.2 ∧ F1.4.3) )

By instantiating all the feature variables in the expression to true if selected,

and false if unselected, we can generate the set of possible variants and then test

their validity using the algorithm described in [21]. A valid variant is one for

which the logical expression of the complete feature model evaluates to true.

Table 1. Feature model relations and equivalent formal definitions

Feature Model Relation    Formal Definition
Sub-graph                 Gi ∧ Gj
Dependency                ai ∧ aj
Single adaptor            ai ⊕ aj
Multiple adaptor          ai ∨ aj
Option                    ai ↔ aj
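As an illustration, the validity check for the checkout feature model could be sketched as follows (illustrative only; the helper below simply evaluates the logical expression given above and is not the algorithm of [21]):

<?php
// Illustrative sketch: a feature selection maps each feature to true (selected) or
// false (unselected); it is a valid variant iff the model's logical expression holds.
function isValidCheckoutVariant(array $f) {
    $xor = function ($a, $b) { return $a !== $b; };   // single adaptor
    $iff = function ($a, $b) { return $a === $b; };   // option

    return $f['F1'] && (
        ($f['F1.1'] && $xor($f['F1.1.1'], $f['F1.1.2'])) ||
        $f['F1.2'] ||
        ($f['F1.3'] && $xor($f['F1.3.1'], $f['F1.3.2'])) ||
        $iff($f['F1.4'], $f['F1.4.1'] && $f['F1.4.2'] && $f['F1.4.3'])
    );
}

// An example selection: "Continue Shopping" on top, update button, split total box,
// and all discount-related features selected.
$selection = array(
    'F1' => true,   'F1.1' => true,   'F1.1.1' => true,  'F1.1.2' => false,
    'F1.2' => true, 'F1.3' => true,   'F1.3.1' => true,  'F1.3.2' => false,
    'F1.4' => true, 'F1.4.1' => true, 'F1.4.2' => true,  'F1.4.3' => true,
);
var_dump(isValidCheckoutVariant($selection));   // bool(true)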

Manual selection can also benefit from this approach since the test administrator

can be guided in the process of feature selection by pointing out inconsistencies

in the resulting variant as features are selected or unselected. Figure 3

depicts the feature selections for variants A and B of our checkout example. In the

feature model, mandatory features are represented with black circles, whereas

options are represented with white circles. White triangles express alternatives
(single adaptors), and black triangles multiple adaptors.

4 ai ↔ aj is true when ai and aj have the same value.


Fig. 3. Feature selections for the generation of variants A and B from Figure 1 (the figure shows the feature tree of Fig. 2 twice, once for "Variant A (Original)" and once for "Variant B", with the features selected for each variant marked)

As regards automatic variant generation, we must bear in mind that full factorial
designs (i.e., testing every possible combination of interface features) provide
the greatest amount of information about the individual and joint impacts

of the different factors. However, obtaining a statistically meaningful number

of cases for this type of experiment takes time, and handling a huge number

of variants aggravates this situation. In our approach, the combinatorial explosion

in multivariate tests is dealt with by bounding the parts of the hierarchy

which descend from an unselected feature. This avoids the generation of all the

variations derived from that specific part of the product line.

In addition, our approach does not confine the test designer to a particular

selection strategy. It is possible to integrate any optimization method for reducing

the complexity of full factorial designs, such as for instance hill climbing

strategies like the Taguchi approach [28].

4.2 Case Study: Checkout Screen

Continuing with the checkout screen example described in Section 1, we introduce

a simplified implementation of the shopping cart in order to illustrate our

approach.

We define a class ‘shopping cart’ (Cart) that allows for the addition and

removal of different items (see Figure 4). This class contains a number of methods

that render the different elements in the cart at the interface level, such

as printTotalBox() or printDiscountBox(). These are private class methods

called from within the public method printCheckoutTable(), which is intended


to render the main body of our checkout screen. A user’s checkout is completed
when doCheckout() is invoked. On the other hand, the General class contains
auxiliary functionality, such as representing common elements of the site (e.g.,
headers, footers and menus).

Fig. 4. Classes involved in the shopping cart example: Cart (with addItem(), removeItem(), printCheckoutTable(), doCheckout(), and the private rendering methods printDiscountBox(), printTotalBox(), printCouponCodeBox(), printShippingMethodBox(), recalculateButton(), and continueShoppingButton()), Item (id, name, price), General (printHeader(), printBanner(), printMenuTop(), printMenuBottom()), and User (name, email, username, password)

4.3 Implementing Tests with Aspects

Aspect-Oriented Software Development (AOSD) is based on the idea that systems

are better programmed by separately specifying their different concerns

(areas of interest), using aspects and a description of their relations with the

rest of the system. Those specifications are then automatically woven (or composed)

into a working system. This weaving process can be performed at different

stages of the development, ranging from compile-time to run-time (dynamic

weaving) [26]. The dynamic approach (Dynamic AOP or d-AOP) implies that

the virtual machine or interpreter running the code must be aware of aspects

and control the weaving process. This represents a remarkable advantage over

static AOP approaches, considering that aspects can be applied and removed at

run-time, modifying application behaviour during the execution of the system

in a transparent way.

With conventional programming techniques, programmers have to explicitly

call methods available in other component interfaces in order to access their

functionality, whereas the AOSD approach offers implicit invocation mechanisms

for behavior in code whose writers were unaware of the additional concerns

(obliviousness). This implicit invocation is achieved by means of join points.

These are regions in the dynamic control flow of an application (method calls

or executions, exception handling, field setting, etc.) which can be intercepted

by an aspect-oriented program by using pointcuts (predicates which allow the

quantification of join points) to match with them. Once a join point has been

matched, the program can run the code corresponding to the new behavior



(advices) typically before, after, instead of, or around (before and after) the

matched join point.

In order to test and illustrate our approach, we use PHP [25], one of the predominant

programming languages in Web-based application development. It is

an easy to learn language specifically designed for the Web, and has excellent

scaling capabilities. Among the variety of AOSD options available for PHP, we

have selected phpAspect [4], which is to our knowledge the most mature implementation

so far, providing AspectJ5-like syntax and abstractions. Although

there are other popular languages and platforms available for Web application

development (Java Servlets, JSF, etc.), most of them provide similar abstractions

and mechanisms. In this sense, our proposal is technology-agnostic and

easily adaptable to other platforms.

Aspects are especially suited to overcome many of the issues described in

Section 3. They are used for different purposes in our approach that will be

described below.

Variant implementation. The different alternatives that have been used so

far for variant implementation have important disadvantages, which we discussed

in Section 3. These detriments include the need to produce different versions of

the system code either by replicating and modifying it across several servers, or

using branching logic on the server or client sides.

Using aspects instead of the traditional approaches offers the advantage that

the original source code does not need to be modified, since aspects can be

applied as needed, resulting in different variants. In our approach, each feature

described in the product line is associated to one or more aspects which modify

the original system in a particular way. Hence, when a set of features is selected,

the appropriate variant is obtained by weaving with the base code6 the set of

aspects associated to the selected features in the variant, modifying the original

implementation.

To illustrate how these variations are achieved, consider for instance the features

labeled F1.3.1 and F1.3.2 in Figure 2. These two features are mutually

exclusive and state that in the total box of the checkout screen, text and amount

should appear in different boxes rather than in the same box, respectively. In

the original implementation (Figure 1.A), text and amount appeared in different

boxes, and hence there is no need to modify the behavior if F1.3.1 is selected.

When F1.3.2 is selected though, we merely have to replace the behavior that

renders the total box (implemented in the method Cart.printTotalBox()).

We achieve this by associating an appropriate aspect to this feature.

In Listing 1, by defining a pointcut that intercepts the execution of the total

box rendering method, and applying an around-type advice, we are able to

replace the method through which this particular element is being rendered at

the interface.

Listing 1. Rendering code replacement aspect

aspect replaceTotalBox{
    // Intercept every execution of the total-box rendering method of the Cart class.
    pointcut render:exec(Cart::printTotalBox(*));
    around(): render{
        /* Alternative rendering code */
    }
}

5 AspectJ [9,15] is the de-facto standard in aspect-oriented programming languages.
6 That is, the code of the original system.

This approach to the generation of variants results in better code reusability
(especially in multivariate testing) as well as reduced costs and efforts, since
developers do not have to replicate or generate complete variant implementations.
In addition, this approach is safer and cleaner since the system logic does
not have to be temporarily (nor manually) modified, thus avoiding the resulting

risks in terms of security and reliability.

Finally, not only interface modifications such as the ones depicted in Figure 1,

but also backend modifications are easier to perform, since aspect technology allows

a behavior to be changed even if it is scattered throughout the system code.

The practical implications of using AOP for this purpose can be easily seen in

an example. Consider for instance Amazon’s recommendation algorithm, which

is invoked in many places throughout the website such as its general catalog

pages, its shopping cart, etc. Assume that Amazon’s development team wonders

whether an alternative algorithm that they developed would perform better than

the original. With traditional approaches they could modify the source code only

by (i) replicating the code on a different server and replacing all the calls 7 made

to the recommendation algorithm, or (ii) including a condition contingent on

the variant that is being executed in each call to the algorithm. Using aspects

instead enables us to write a simple statement (pointcut) to intercept every call

to the recommendation algorithm throughout the site, and replace it with the

call to the new algorithm.
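A sketch of such an interception, written in the phpAspect-like syntax of Listing 1, could look as follows (the Catalog::recommendItems() method and the alternative AltRecommender class are hypothetical, not part of the paper's case study):

aspect replaceRecommendationAlgorithm{
    // Hypothetical pointcut: every execution of the original recommendation method,
    // wherever it is invoked across the site.
    pointcut recommend:exec(Catalog::recommendItems(*));
    around(): recommend{
        /* Run the alternative recommendation algorithm instead of the original one. */
        AltRecommender::recommendItems();
    }
}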

Experimenting with variants may require going beyond mere behavior replacement

though. This means that any given variant may require for its implementation

the modification of data structures or method additions to some

classes. Consider for instance a test in which developers want to monitor how

customers react to discounts on products in a catalog. Assume that discounts

can be different for each product and that the site has not initially been designed

to include any information on discounts, i.e., this information needs to

be introduced somewhere in the code. To solve this problem we can use inter-type
declarations. Aspects can declare members (fields, methods, and constructors)

that are owned by other classes. These are called inter-type members.

As can be observed in Listing 2, we introduce an additional discount field in

our item class, and also a getDiscountedPrice() method which will be used

whenever the discounted price of an item is to be retrieved. Note that we need to

introduce a new method, because it should still be possible to retrieve the original,
non-discounted price.

Listing 2. Item discount inter-type declarations

aspect itemDiscount{
    private Item::$discount;

    public function Item::getDiscountedPrice(){
        return ($this->price - $this->discount);
    }
}

7 In the simplest case, only the algorithm's implementation would be replaced. However,
modifications to each of the calls may also be required, e.g., due to differences
in the signature with respect to the original algorithm's implementation.

Data Collection and User Interaction. The code in charge of measuring

and collecting data for the experiment can also be written as aspects in a concise

manner. Consider a new experiment with our checkout example in which we want

to calculate how much customers spend on average when they visit our site. To

this end, we need to add up the amount of money spent on each purchase. One

way to implement this functionality is again inter-type declarations.

Listing 3. Data collection aspect

aspect accountPurchase{
  private $dbtest;                                /* reference to the experiment database */
  pointcut commitTrans:exec(Cart::doCheckout(*));
  function Cart::accountPurchase(DBManager $db){  /* inter-type method added to Cart */
    $db->insert($this->getUserName(),
                $this->total);
  }
  around($this): commitTrans{
    if (proceed()){                               /* record the sale only if checkout succeeds */
      $this->accountPurchase($thisAspect->dbtest);
    }
  }
}

When the aspect in Listing 3 intercepts the method that completes a purchase

(Cart.doCheckout()), the associated advice inserts the sales amount into a

database that collects the results from the experiment (but only if the execution

of the intercepted method succeeds, which is represented by proceed() in the

advice). It is worth noting that while the database reference belongs to the

aspect, the method used to insert the data belongs to the Cart class.

Aspects permit the easy and consistent modification of the methods that collect, measure, and synthesize the OEC from the gathered data, which is then presented to the test administrator for analysis. Moreover, data collection

procedures do not need to be replicated across the different variants, since the

system will weave this functionality across all of them.



User Assignment. Rather than implementing user assignment in a proxy or

load balancer that routes requests to different servers, or including it in the implementation

of the base system, we experimented with two different alternatives

of aspect-based server-side selection:

– Dynamic aspect weaving: A user routing module acts as an entry point

to the base system. This module assigns the user to a particular variant by looking up which aspects have to be woven to produce the variant to which the current user has been assigned. The module then incorporates

these aspects dynamically upon each request received by the server, flexibly

producing variants in accordance with the user’s assignment. Although this

approach is elegant and minimizes storage requirements, it does not scale

well. Having to weave a set of aspects (even if they are only a few) on the base

system upon each request to the server is very demanding in computational

terms, and prone to errors in the process.

– Static aspect weaving: The different variants are computed offline, and

each of them is uploaded to the server. In this case the routing module

just forwards the user to the corresponding variant stored on the server

(the base system is treated just like another variant for the purpose of the

experiment). This method does not slow down the operation of the server

and is a much more robust approach to the problem. The only downside of

this alternative is that the code corresponding to the different variants has to

be stored temporarily on the server (although this is a minor inconvenience

since usually the amount of space required is negligible compared to the

average server storage capacity). Furthermore, this alternative is cheaper

than traffic splitting, since it does not require the use of a fleet of servers

nor the modification of the system's logic. This approach still allows one to spread the different variants across several servers in case of high traffic load (a minimal sketch of the routing decision follows this list).
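The routing decision itself can be kept very small. The following plain-Java sketch (ours, not from the paper; class and method names are hypothetical) assigns each user deterministically to one of the statically woven variants by hashing the user identifier, so that the same user is always forwarded to the same variant:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.List;

// Deterministic user-to-variant assignment for a set of pre-woven variants
// (the base system is simply one more entry in the list).
public class VariantRouter {
    private final List<String> variantUrls;

    public VariantRouter(List<String> variantUrls) {
        this.variantUrls = variantUrls;
    }

    // Returns the URL of the variant that should serve this user's requests.
    public String variantFor(String userId) {
        try {
            byte[] h = MessageDigest.getInstance("SHA-256")
                                    .digest(userId.getBytes(StandardCharsets.UTF_8));
            int bucket = Math.floorMod(((h[0] & 0xff) << 8) | (h[1] & 0xff),
                                       variantUrls.size());
            return variantUrls.get(bucket);
        } catch (Exception e) {
            return variantUrls.get(0);   // fall back to the base system
        }
    }
}

Because the assignment is a pure function of the user identifier, no per-user state has to be stored, and the variants can still be spread over several servers.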

5 Tool Support

The approach for online experiments on websites that we presented in this article

has been implemented in a prototype tool called WebLoom. It includes a graphical user interface to build and visualize feature models that can be used

as the structure upon which controlled experiments on a website can be defined.

In addition, the user can write aspect code which can be attached to the different

features. Once the feature model and associated code have been built,

the tool supports both automatic and manual variant generation, and is able to

deploy aspect code which lays out all the necessary infrastructure to perform

the designed test on a particular website. The prototype has been implemented

in Python, using the wxWidgets toolkit technology for the development of the

user interface. It both imports and exports simple feature models described in

an XML format specific to the tool.

The prototype tool’s graphical user interface is divided into three main working

areas:



Fig. 5. WebLoom displaying the product line model depicted in Figure 2

– Feature model. This is the main working area where the feature model

can be specified (see Figure 5). It includes a toolbar for the creation and

modification of discriminants and a code editor for associated modifications.

This area also allows the selection of features in order to generate variants.

– Variant management. Variants generated on the site model area can be

added or removed from the current test, renamed or inspected. A compilation

of the description of all features contained in a variant is automatically

presented to the user based on feature selections when the variant is selected

(Figure 6, bottom).

– Overall Estimation Criteria. One or more OEC to measure on the experiments can be defined in this section. Each OEC is labeled in

order to be identified later on, and the associated code for gathering and

processing data is directly defined by the test administrator.

In Figure 7, we can observe the interaction with our prototype tool. The user

enters a description of the potential modifications to be performed on the website,

in order to produce the different variants under WebLoom’s guidance. This

results in a basic feature model structure, which is then enriched with code associated

to the aforementioned modifications (aspects). Once the feature model

is complete, the user can freely select a number of features using the interface,



Fig. 6. Variant management screen in WebLoom

Fig. 7. Operation of WebLoom: (1) the designer specifies the feature model (1.a), adds feature code (1.b), defines variants 1..n by selecting features (1.c) and defines OECs (1.d); (2) aspect code generation yields the aspect code for variants 1..n and the data collection aspect code; (3) the weaver weaves this code with the system logic into the test implementation

and take snapshots of the current selections in order to generate variants. These

variants are automatically checked for validity before being incorporated into

the variant collection. Alternatively, the user can ask the tool to generate all the

valid variants for the current feature model and then remove the ones which are

not interesting for the experiment.

Once all necessary input has been received, the tool gathers the code for each

particular variant to be tested in the experiment, by collecting all the aspects

associated with the features that were selected for the variant. It then invokes



the weaver to produce the actual variant code for the designed test by weaving

the original system code with the collection of aspects produced by the tool.

6 Related Work

Software product lines and feature-oriented design and programming have already

been successfully applied in the development of Web applications, to significantly

boost productivity by exploiting commonalities and reusing as many

assets (including code) as possible. For instance, Trujillo et al. [29] present a case study of Feature Oriented Model Driven Development (FOMDD) on a product line of

portlets (Web portal components). In this work, the authors expressed variations

in portlet functionality as features, and synthesized portlet specifications

by composing them conveniently. Likewise, Pettersson and Jarzabek [24] present

an industrial case study in which their reuse technique XVCL was incrementally

applied to generate a Web architecture from the initial code base of a Web portal.

The authors describe the process that led to the development of the Web

Portal product line.

Likewise, aspect-oriented software development has been previously applied

to the development of Web applications. Valderas et al. [31] present an approach

for dealing with crosscutting concerns in Web applications from requirements to

design. Their approach aims at decoupling requirements that belong to different

concerns. These are separately modeled and specified using the task-based notation,

and later integrated into a unified requirements model that is the source of a

model-to-model and model-to-code generation process yielding Web application

prototypes that are built from task descriptions.

Although the aforementioned approaches meet their purpose of boosting productivity

by taking advantage of commonalities, and of easing maintenance by

properly encapsulating crosscutting concerns, they do not jointly exploit the advantages

of both approaches. Moreover, although they are situated in the context

of Web application development, they are not well suited to the specific characteristics

of online test design and implementation which have been described in

previous sections.

The idea of combining software product lines and aspect-oriented software development techniques already has some tradition in software engineering.

In fact, Lee et al. [18] present some guidelines on how feature-oriented analysis

and aspects can be combined. Likewise, Loughran and Rashid [19] propose

framed aspects as a technique and methodology that combines AOSD, frame

technology, and feature-oriented domain analysis in order to provide a framework

for implementing fine-grained variability. In [20], they extend this work

to support product line evolution using this technique. Other approaches such

as [32] aim at implementing variability, and the management and tracing of requirements

for implementation by integrating model-driven and aspect-oriented

software development. The AMPLE project [1] takes this approach one step further

along the software lifecycle and maintenance, aiming at traceability during

product line evolution. In the particular context of Web applications, Alférez and


132 J. Cámara and A. Kobsa

Suesaowaluk [8] introduce an aspect-oriented product line framework to support

the development of software product lines of Web applications. This framework

is similarly aimed at identifying, specifying, and managing variability from requirements

to implementation.

Although both the aforementioned approaches and our own proposal employ

software product lines and aspects, there is a key difference in the way these

elements are used. First, the earlier approaches are concerned with the general

process of system construction by identifying and reusing aspect-oriented components,

whereas our approach deals with the specific problem of online test design

and implementation, where different versions of a Web application with a limited

lifespan are generated to test user behavioral response. Hence, our framework is

intended to generate lightweight aspects which are used as a convenient means

for the transient modification of parts of the system. In this sense, it is worth noting

that system and test designs and implementations are completely independent

of each other, and that aspects are only involved as a means to generate system

variants, but not necessarily present in the original system design. In addition,

our approach provides automatic support for the generation of all valid variants

within the product line, and does not require the modification of the underlying

system which stays online throughout the whole online test process.

To the best of our knowledge, no research has so far been reported on treating online test design and implementation in a systematic manner. A number of consulting firms have already specialized in analyzing companies' Web presence [2,6,3]. These firms offer ad-hoc studies of Web retail sites with the goal

of achieving higher conversion rates. Some of them use proprietary technology

that is usually focused on the statistical aspects of the experiments, requiring

significant code refactoring for test implementation 8.

8 It is, however, not easy to thoroughly compare these techniques from an implementation point of view, since firms tend to be quite secretive about them.

Finally, SiteSpect [5] is a software package which takes a proxy-based approach

to online testing. When a Web client makes a request to the Web server, it is first

received by the software and then forwarded to the server (this is used to track user

behavior). Likewise, responses with content are also routed through the software,

which injects the HTML code modifications and forwards the modified responses

to the client. Although the manufacturers claim that it does not matter whether

content is generated dynamically or statically by the server since modifications are

performed by replacing pieces of the generated HTML code, we find this approach

adequate for trivial changes to a site only, and not very suitable for user data collection

and measurement. Moreover, no modifications can be applied to the logic

of the application. These shortcomings severely limit this method, which cannot go beyond simple visual changes to the site.

7 Concluding Remarks

In this paper, we presented a novel and systematic approach to the development

of controlled online tests for the effects of webpage variants on users, based on software product lines and aspect-oriented software development. We also

described how the drawbacks of traditional approaches, such as high costs and

development effort, can be overcome with our approach. We believe that its benefits

are especially valuable for the specific problem domain that we address. On

one hand, testing is performed on a regular basis for websites in order to continuously

improve their conversion rates. On the other hand, a very high percentage

of the tested modifications are usually discarded since they do not improve the

site performance. As a consequence, a lot of effort is lost in the process. We

believe that WebLoom will save Web developers time and effort by reducing

the amount of work they have to put into the design and implementation of

online tests.

Although there is a wide range of choices available for the implementation of

Web systems, our approach is technology-agnostic and most likely deployable to

different platforms and languages. However, we observed that in order to fully

exploit the benefits of this approach, one should first check whether a website's implementation adheres to the modularity principle. This is of special interest at the presentation layer, where user interface component placement, user interface style elements, event declarations, and application logic traditionally tend to be mixed up [23].

Regarding future work, a first perspective aims at enhancing our basic prototype

with additional WYSIWYG extensions for its graphical user interface.

Specifically, developers should be enabled to immediately see the effects that

code modifications and feature selections will have on the appearance of their

website. This is intended to help them deal with variant generation in a more

effective and intuitive manner. A second perspective is refining the variant validation

process so that variation points in feature models that are likely to cause

significant design variations can be identified, thus reducing the variability.

References

1. Ample project, http://www.ample-project.net/

2. Offermatica, http://www.offermatica.com/

3. Optimost, http://www.optimost.com/

4. phpAspect: Aspect oriented programming for PHP, http://phpaspect.org/

5. Sitespect, http://www.sitespect.com

6. Vertster, http://www.vertster.com/

7. Software product lines: practices and patterns. Addison-Wesley Longman Publishing

Co., Boston (2001)

8. Alférez, G.H., Suesaowaluk, P.: An aspect-oriented product line framework to support

the development of software product lines of web applications. In: SEARCC

2007: Proceedings of the 2nd South East Asia Regional Computer Conference

(2007)

9. Colyer, A., Clement, A., Harley, G., Webster, M.: Eclipse AspectJ: Aspect-Oriented

Programming with AspectJ and the Eclipse AspectJ Development Tools. Pearson

Education, Upper Saddle River (2005)

10. Eisenberg, B.: How to decrease sales by 90 percent,

http://www.clickz.com/1588161



11. Eisenberg, B.: How to increase conversion rate 1,000 percent,

http://www.clickz.com/showPage.html?page=1756031

12. Filman, R.E., Elrad, T., Clarke, S., Aksit, M. (eds.): Aspect-Oriented Software

Development. Addison-Wesley, Reading (2004)

13. Kang, K., Cohen, S., Hess, J., Novak, W., Peterson, S.: Feature-oriented domain

analysis (FODA) feasibility study. Technical Report CMU/SEI-90-TR-21, Software

Engineering Institute, Carnegie Mellon University (November 1990)

14. Kang, K.C., Kim, S., Lee, J., Kim, K., Shin, E., Huh, M.: FORM: A feature-oriented

reuse method with domain-specific reference architectures. Ann. Software

Eng. 5, 143–168 (1998)

15. Kiczales, G., Hilsdale, E., Hugunin, J., Kersten, M., Palm, J., Griswold, W.G.: An

Overview of AspectJ. In: Knudsen, J.L. (ed.) ECOOP 2001. LNCS, vol. 2072, pp.

327–353. Springer, Heidelberg (2001)

16. Kohavi, R., Henne, R.M., Sommerfield, D.: Practical Guide to Controlled Experiments

on the Web: Listen to your Customers not to the HIPPO. In: Berkhin, P.,

Caruana, R., Wu, X. (eds.) Proceedings of the 13th ACM SIGKDD International

Conference on Knowledge Discovery and Data Mining, San Jose, California, USA,

pp. 959–967. ACM, New York (2007)

17. Kohavi, R., Round, M.: Front Line Internet Analytics at Amazon.com (2004),

http://ai.stanford.edu/~ronnyk/emetricsAmazon.pdf

18. Lee, K., Kang, K.C., Kim, M., Park, S.: Combining feature-oriented analysis and

aspect-oriented programming for product line asset development. In: SPLC 2006:

Proceedings of the 10th International on Software Product Line Conference, Washington,

DC, USA, pp. 103–112. IEEE Computer Society, Los Alamitos (2006)

19. Loughran, N., Rashid, A.: Framed aspects: Supporting variability and configurability

for AOP. In: Bosch, J., Krueger, C. (eds.) ICSR 2004. LNCS, vol. 3107, pp.

127–140. Springer, Heidelberg (2004)

20. Loughran, N., Rashid, A., Zhang, W., Jarzabek, S.: Supporting product line evolution

with framed aspects. In: Lorenz, D.H., Coady, Y. (eds.) ACP4IS: Aspects,

Components, and Patterns for Infrastructure Software, March, pp. 22–26

21. Mannion, M., Cámara, J.: Theorem proving for product line model verification.

In: van der Linden, F.J. (ed.) PFE 2003. LNCS, vol. 3014, pp. 211–224. Springer,

Heidelberg (2004)

22. McGlaughlin, F., Alt, B., Usborne, N.: The power of small changes tested (2006),

http://www.marketingexperiments.com/improving-website-conversion/

power-small-change.html

23. Mikkonen, T., Taivalsaari, A.: Web applications – spaghetti code for the 21st century.

In: Dosch, W., Lee, R.Y., Tuma, P., Coupaye, T. (eds.) Proceedings of the

6th ACIS International Conference on Software Engineering Research, Management

and Applications, SERA 2008, Prague, Czech Republic, pp. 319–328. IEEE

Computer Society, Los Alamitos (2008)

24. Pettersson, U., Jarzabek, S.: Industrial experience with building a web portal product

line using a lightweight, reactive approach. In: Wermelinger, M., Gall, H. (eds.)

Proceedings of the 10th European Software Engineering Conference held jointly

with 13th ACM SIGSOFT International Symposium on Foundations of Software

Engineering, Lisbon, Portugal, pp. 326–335. ACM, New York (2005)

25. PHP: Hypertext preprocessor, http://www.php.net/

26. Popovici, A., Frei, A., Alonso, G.: A Proactive Middleware Platform for Mobile

Computing. In: Endler, M., Schmidt, D. (eds.) Middleware 2003. LNCS, vol. 2672.

Springer, Heidelberg (2003)



27. Roy, S.: 10 Factors to Test that Could Increase the Conversion Rate of your Landing

Pages (2007),

http://www.wilsonweb.com/conversion/suman-tra-landing-pages.htm

28. Taguchi, G.: The role of quality engineering (Taguchi Methods) in developing automatic

flexible manufacturing systems. In: Proceedings of the Japan/USA Flexible

Automation Symposium, Kyoto, Japan, July 9-13, pp. 883–886 (1990)

29. Trujillo, S., Batory, D.S., Díaz, O.: Feature oriented model driven development: A

case study for portlets. In: Proceedings of the 30th International Conference on

Software Engineering (ICSE 2007), Leipzig, Germany, pp. 44–53. IEEE Computer

Society, Los Alamitos (2007)

30. Usborne, N.: Design choices can cripple a website (2005),

http://alistapart.com/articles/designcancripple

31. Valderas, P., Pelechano, V., Rossi, G., Gordillo, S.E.: From crosscutting concerns

to web systems models. In: Benatallah, B., Casati, F., Georgakopoulos, D., Bartolini,

C., Sadiq, W., Godart, C. (eds.) WISE 2007. LNCS, vol. 4831, pp. 573–582.

Springer, Heidelberg (2007)

32. Voelter, M., Groher, I.: Product line implementation using aspect-oriented and

model-driven software development. In: SPLC 2007: Proceedings of the 11th International

Software Product Line Conference, Washington, DC, USA, pp. 233–242.

IEEE Computer Society, Los Alamitos (2007)


Frontiers of

Structured Business Process Modeling

Dirk Draheim

Central IT Services Department

University of Innsbruck

draheim@acm.org

Abstract. In this article we investigate to what extent a structured approach can be applied to business process modelling. We try to contribute to a better understanding of the driving forces behind business process specifications.

1 Introduction

Isn’t it compelling to apply the structured programming arguments to the field

of business process modelling? Our answer to this question is ‘no’.

The principle of structured programming emerged in the computer science

community. From today’s perspective, the discussion of structured programming

rather had the characteristics of a maturing process than the characteristics of

a debate, although there have also been some prominent sceptic comments on

the unrestricted validity of the structured programming principle. Structured

programming is a well-established design principle in the field of program design

as the third normal form is in the field of database design. It is common sense

that structured programming is better than unstructured programming – or

let’s say structurally unrestricted programming – and this is what is taught as

foundational knowledge in many standard curricula of many software engineering

study programmes. With respect to business process modelling, in practice, you

find huge business process models that are arbitrary nets. How comes? Is it

somehow due to some lack of knowledge transfer from the programming language

community to the information system community? For computer scientists, it

might be tempting to state that structured programming is a proven concept

and it is therefore necessary to eventually promote a structured business process

modelling discipline, however, care must be taken.

In this article, we want to contribute to the understanding of the extent to which a structured approach can be applied to business process modelling and of the extent to which such an approach is naive. We attempt to clarify that the arguments of structured programming are about the pragmatics of programming and that they often relied on evidence in the past. Consequently, our reasoning is at the level of the pragmatics of business process modelling. We try to avoid getting lost in superficial comparisons of modelling language constructs and instead try to understand the core problems of structuring business process specifications. As an example,




so to speak as a taster to our discussion, we take forward one of our arguments

here, which is subtle but important, i.e., that there are some diagrams expressing

behaviour that cannot be transformed into a structured diagram expressing the

same behaviour solely in terms of the same primitives as the original structurally

unrestricted diagram. These are all those diagrams that contain a loop which is exited via more than one exit point – a known result from the literature, encountered [1] by Corrado Böhm and Giuseppe Jacopini, proven for a special case [5] by Donald E. Knuth and Robert W. Floyd, and proven in general [6] by S. Rao Kosaraju.

2 Basic Definitions

In this Section we explain the notions of program, structured program, flowchart,

D-flowchart, structured flowchart, business process model and structured business

process model as used in this article. The section is rather on syntactical issues. You might want to skip this section and use it as a reference; however, you should at least glance over the formation rules of structured flowcharts defined in Fig. 1, which are also the basis for structured business process modelling.

In the course of this article, programs are imperative programs with go-to-statements, i.e., they consist of basic statements, sequences, case constructs, loops and go-to-statements. Structured programs are those programs that abstain from go-to-statements. In programs with go-to-statements, loops do not add to the expressive power; in the presence of go-to-statements, loops are syntactic sugar. Flowcharts correspond to programs. Flowcharts are directed graphs with nodes being basic activities, decision points or join points. A directed cycle in a flowchart can be interpreted as a loop or as the usage of a go-to-statement. In general flowcharts it is allowed to place join points arbitrarily, which makes it possible to create spaghetti structures, i.e., arbitrary jump structures, just as go-to-statements allow for the creation of spaghetti code.

It is a matter of taste whether to make decision and join points explicit nodes or not. If you strictly use decision and join points, the basic activities always have

exactly one incoming and one outgoing edge. In concrete modelling languages

like event-driven process chains, there are usually some more constraints, e.g.,

a constraint on decision points not to have more than one incoming edge or

a constraint on join points to have not more than one outgoing edge. If you

allow basic activities to have more than one incoming edge you do not need join

points any more. Similarly, you can get rid of a decision point by using several

outgoing edges by directly connecting the several branches of the decision point

as outgoing edges to a basic activity and labelling the several branches with

appropriate flow conditions. For example, in formcharts [3] we have chosen the

option not to use explicit decision and join points. The discussion of this article

is independent from the detail question of having explicit or implicit decision

and join points, because both concepts are interchangeable. Therefore, in this

article, we feel free to use both options.



2.1 D-Charts

It is possible to define formation rules for a restricted class of flowcharts that correspond to structured programs. In [6] these diagrams are called Dijkstra-flowcharts, or D-flowcharts for short, named after Edsger W. Dijkstra. Figure 1

summarizes the semi-formal formation rules for D-flowcharts.

Fig. 1. Semi-formal formation rules for structured flowcharts: (i) basic activity, (ii) sequence, (iii) case, (iv) do-while, (v) repeat-until

Actually, the original definition of D-flowcharts in [6] consists of the formation rules (i) to (iv), with one formation rule for each programming language construct of a minimal structured imperative programming language with basic statements, sequences, case-constructs and while-loops, with basic activities in the flowchart corresponding to basic statements in the programming language.

We have added a formation rule (v) for the representation of repeat-until-loops

and call flowcharts resulting from rules (i) to (v) structured flowcharts in the

sequel.
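To make the strict nesting explicit, the five formation rules can be read as the constructors of an abstract syntax. The following Java sketch is ours (it does not appear in the article); every structured flowchart is a term built from exactly these constructors:

// One constructor per formation rule; blocks are strictly nested by construction.
sealed interface StructuredFlowchart
        permits BasicActivity, Sequence, Case, DoWhile, RepeatUntil { }

record BasicActivity(String name) implements StructuredFlowchart { }            // rule (i)
record Sequence(StructuredFlowchart first,
                StructuredFlowchart second) implements StructuredFlowchart { }  // rule (ii)
record Case(String condition,
            StructuredFlowchart yesBranch,
            StructuredFlowchart noBranch) implements StructuredFlowchart { }    // rule (iii)
record DoWhile(String condition,
               StructuredFlowchart body) implements StructuredFlowchart { }     // rule (iv)
record RepeatUntil(StructuredFlowchart body,
                   String condition) implements StructuredFlowchart { }         // rule (v)

A flowchart that cannot be written as such a term – such as the one in Fig. 2 below – is not a structured flowchart.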

The flowchart in Fig. 2 is not a structured flowchart, i.e., it cannot be derived

from the formation rules in Fig. 1. The flowchart in Fig. 2 can be interpreted

as consisting of a repeat-until-loop exited via the α-decision point and followed

by further activities ‘C’ and ‘D’. In this case, the β-decision point can lead to

a branch that jumps into the repeat-until-loop in addition to the regular loop

entry point via activity 'A', which infringes the structured programming and structured modelling principle and gives rise to spaghetti structure. This way, the flowchart in Fig. 2 visualizes the program in Listing 1.

Fig. 2. Example flowchart that is not a D-flowchart

Listing 1

01 REPEAT
02 A;
03 B;
04 UNTIL alpha;
05 C;
06 IF beta THEN GOTO 03;
07 D;

The flowchart in Fig. 2 can also be interpreted as consisting of a while-loop exited

via the β-decision point, whereby the while-loop is surrounded by a preceding

activity ‘A’ and a succeeding activity ‘D’. In this case, the α-decision point can

lead to a branch that jumps out of the while-loop in addition to the regular loop

exit via the β-decision point, which again infringes the structured modelling

principle. This way, the flowchart in Fig. 2 visualizes the program in Listing 2

Listing 2

01 A;
02 REPEAT
03 B;
04 IF NOT alpha THEN GOTO 01
05 C;
06 UNTIL NOT beta;
07 D;

Flowcharts are visualizations of programs. In general, a flowchart can be interpreted ambiguously as the visualization of several different program texts because, for example, an edge from a decision point to a join point can be interpreted either as a go-to-statement or as the back branch from an

exit point of a repeat-until loop to the start of the loop. Structured flowcharts

are visualizations of structured programs. Loops in structured programs and

structured flowcharts enjoy the property that they have exactly one entry point

and exactly one exit point. Whereas the entry point and the exit point of a

repeat-until loop are different, the entry point and exit point of a while-loop are

the same, so that a while-loop in a structured flowchart has exactly one contact

point. That might be the reason that structured flowcharts that use only

while-loops instead of repeat-until loops appear more normalized. Similarly, in a



structured program and flowchart all case-constructs have exactly one entry point

and one exit point. In general, additional entry and exit points can be added

to loops and case constructs by the usage of go-to-statements in programs and

by the usage of arbitrary decision points in flowcharts. In structured flowcharts,

decision points are introduced as part of the loop constructs and part of the case

construct. In structured programs and flowcharts, loops and case-constructs are

strictly nested along the lines of the derivation of their abstract syntax tree.

Business process models extend flowcharts with further modelling elements

like a parallel split, parallel join or non-deterministic choice. Basically, we discuss

the issue of structuring business process models in terms of flowcharts in

this article, because flowcharts actually are business process model diagrams,

i.e., flowcharts form a subset of business process models. As the constructs in

the formation rules of Fig. 1 further business process modelling elements can

also be introduced in a structured manner with the result of having again only

such diagrams that are strictly nested in terms of their looping and branching

constructs. For example, in such definition the parallel split and the parallel

join would not be introduced separately but as belonging to a parallel modelling

construct.

2.2 A Notion of Equivalence for Business Processes

Bisimilarity has been defined formally in [15] as an equivalence relation for infinite

automaton behaviour, i.e., process algebra [12,13]. Bisimilarity expresses

that two processes are equal in terms of their observable behaviour. Observable

behaviour is the appropriate notion for the comparison of automatic processes.

The semantics of a process can also be understood as opportunities of one process

interacting with another process. Observable behaviour and experienced

opportunities are different viewpoints on the semantics of a process, however,

whichever viewpoint is chosen, it does not change the basic concept of bisimilarity.

Business processes can be fully automatic; however, business processes can

also be descriptions of human actions and therefore can also be rather a protocol

of possible steps undertaken by a human. We therefore choose to explain

bisimilarity in terms of opportunities of an actor, or, as a metaphor, from the

perspective of a player that uses the process description as a game board – which

neatly fits to the notions of simulation and bisimulation, i.e., bisimilarity.

In general, two processes are bisimilar if, starting from the start node, they reveal the same opportunities and each pair of same opportunities leads again to

bisimilar processes. More formally, bisimilarity is defined on labelled transition

systems as the existence of a bisimulation, which is a relationship that enjoys

the aforementioned property, i.e., nodes related by the bisimilarity lead via same

opportunities to nodes that are related again, i.e., recursively, by the bisimilarity.

In the non-structured models in this article the opportunities are edges leading

out of an activity and the two edges leading out of a decision point. For our

purposes, bisimilarity can be characterized by the rules in Fig. 3.
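For finite models, this characterization can be turned directly into a naive algorithm: start from the relation containing all pairs of states and repeatedly delete pairs that offer different opportunities or whose successors are no longer related, until a fixed point is reached. The following Java sketch is ours (not from the article) and works on labelled transition systems given as successor maps:

import java.util.*;

public class Bisimilarity {

    // lts.get(state).get(label) = set of successor states; states without
    // outgoing transitions should be present as keys with an empty map.
    public static Set<List<Integer>> greatestBisimulation(
            Map<Integer, Map<String, Set<Integer>>> lts) {
        Set<List<Integer>> rel = new HashSet<>();
        for (int p : lts.keySet())
            for (int q : lts.keySet())
                rel.add(List.of(p, q));            // start from all pairs
        boolean changed = true;
        while (changed) {                          // delete pairs until stable
            changed = false;
            for (Iterator<List<Integer>> it = rel.iterator(); it.hasNext(); ) {
                List<Integer> pair = it.next();
                if (!simulates(pair.get(0), pair.get(1), lts, rel)
                        || !simulates(pair.get(1), pair.get(0), lts, rel)) {
                    it.remove();
                    changed = true;
                }
            }
        }
        return rel;                                // p ~ q iff (p, q) is contained
    }

    // Every opportunity of p can be matched by q via the same label
    // such that the reached states are still related.
    private static boolean simulates(int p, int q,
            Map<Integer, Map<String, Set<Integer>>> lts, Set<List<Integer>> rel) {
        for (Map.Entry<String, Set<Integer>> move : lts.getOrDefault(p, Map.of()).entrySet()) {
            Set<Integer> qSucc = lts.getOrDefault(q, Map.of())
                                    .getOrDefault(move.getKey(), Set.of());
            for (int ps : move.getValue()) {
                boolean matched = false;
                for (int qs : qSucc)
                    if (rel.contains(List.of(ps, qs))) { matched = true; break; }
                if (!matched) return false;
            }
        }
        return true;
    }
}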



Fig. 3. Characterization of bisimilarity for business process models

3 Resolving Arbitrary Jump Structures

Have a look at Fig. 4. Like Fig. 2, it shows a business process model that is not a structured business process model. The business process described by the

business process model in Fig. 4 can also be described in the style of a program

text as we did in Listing 3. In this interpretation the business process model

consists of a while-loop followed by a further activity ‘B’, a decision point that

might branch back into the while-loop and eventually an activity ‘C’. Alternatively,

the business process can also be described by structured business process

models. Fig. 5 shows two examples of such structured business process models

and Listings 4 and 5 show the corresponding program text representations that

are visualized by the business process models in Fig. 5.

Listing 3

01 WHILE alpha DO
02 A;
03 B;
04 IF beta THEN GOTO 02;
05 C;

Fig. 4. Example business process model that is not structured


Fig. 5. Structured business process models (i) and (ii) that replace a non-structured one

The business process models in Figs. 4 and 5 resp. Listings 3, 4 and 5 describe

the same business process. They describe the same business process, because

they are bisimilar, i.e., in terms of their nodes, which are, basically, activities

and decision points, they describe the same observable behaviour resp. same

opportunities to act for an actor – we have explained the notion of equality and

the more precise approach of bisimilarity in more detail in Sect. 2.2.

The derivation of the business process models in Fig. 5 from the formation

rules given in Fig. 1 can be understood by the reader by having a look at its

abstract syntax tree, which appears at tree ψ in Fig. 6. The proof that the process

models in Figs. 4 and 5 are bisimilar is left to the reader as an exercise. The reader

is also invited to find structured business process models that are less complex

than the ones given in Fig. 5, whereas complexity is an informal concept that

depends heavily on the perception and opinion of the modeller. For example, the model (ii) in Fig. 5 results from an immediate simple attempt to reduce the complexity of the model (i) in Fig. 5 by eliminating the 'A'-activity which follows the α-decision point and connecting the succeeding 'yes'-branch of the α-decision point directly back with the 'A'-activity preceding the decision point, i.e., by reducing a while-loop-construct with a preceding statement to a repeat-until-construct. Note that the model in Fig. 5 has been gained from the model in

Fig. 4 by straightforwardly unfolding it behind the β-decision point as much as

necessary to yield a structured description of the business process. In what sense

the transformation from model (i) to model (ii) in Fig. 5 has lowered complexity

and whether it actually or rather superficially has lowered the complexity has to

be discussed in the sequel. In due course, we will also discuss another structured

business process model with auxiliary logic that is oriented towards identifying

repeat-until-loops in the original process descriptions.


Listing 4

01 WHILE alpha DO
02 A;
03 B;
04 WHILE beta DO BEGIN
05 A;
06 WHILE alpha DO
07 A;
08 B;
09 END;
10 C;

Listing 5

01 WHILE alpha DO
02 A;
03 B;
04 WHILE beta DO BEGIN
05 REPEAT
06 A;
07 UNTIL NOT alpha;
08 B;
09 END;
10 C;
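For illustration, Listing 5 corresponds exactly to a term over the hypothetical Java constructors sketched in Sect. 2.1, which makes its strictly nested block structure explicit (our annotation, not part of the original article):

StructuredFlowchart listing5 =
    new Sequence(
        new DoWhile("alpha", new BasicActivity("A")),                     // lines 01-02
        new Sequence(
            new BasicActivity("B"),                                       // line 03
            new Sequence(
                new DoWhile("beta",                                       // lines 04-09
                    new Sequence(
                        new RepeatUntil(new BasicActivity("A"),
                                        "not alpha"),                     // lines 05-07
                        new BasicActivity("B"))),                         // line 08
                new BasicActivity("C"))));                                // line 10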


The above remark on the vagueness of the notion of complexity is not just a

side-remark or disclaimer but is at the core of the discussion. If the complexity

of a model is a cognitive issue it would be a straightforward approach to let

people vote which of the models is more complex. If there is a sufficiently precise

method to test whether a person has understood the semantics of a process

specification, this method can be exploited in testing groups of people that have

been given different kinds of specifications of the same process and concluding

from the test results which of the process specifications should be considered as

more complex. Such an approach relies on the preciseness of the semantics and

eventually on the quality of the test method.

It is a real challenge to search for a definition of complexity of models or their

representations. What we expect is that less complexity has something to do

with better quality, and before we undertake efforts in defining the complexity of models we should first understand possibilities to measure the quality of models. The usual categories in which modellers and programmers often judge the complexity of models, like understandability or readability, are vague concepts

themselves. Other categories like maintainability or reusability are more telling

than understandability or readability but still vague. Of course, we can define

metrics for the complexity of diagrams. For example, it is possible to define



that the number of activity nodes used in a business process model increases

the complexity of a model. The problem with such metrics is that it follows

immediately that the model in Fig. 5 is more complex than the model in Fig. 4.

Actually, this is what we believe.

4 Immediate Arguments for and against Structure

We believe that the models in Fig. 5 are more complex than the model in Fig. 4. A structured approach to business process models would make us believe

that structured models are somehow better than non-structured models in the

same way that the structured programming approach believes that structured

programs are somehow better than non-structured programs. So either less complexity

must not be always better or the credo of the structured approach must

be loosened to a rule of thumb, i.e., the believe that structured models are in

general better than non-structured models, despite some exceptions like our current

example. An argument in favour of the structured approach could be that

our current example is simply too small, i.e., that the aforementioned exceptions

are made of small models or, to say it differently, that the arguments of a structured

approach become valid for models beyond a certain size. We do not think

so. We rather believe that our discussion scales, i.e., that the arguments that we will give in the sequel also work, or work even better, for larger models.

We want to approach these questions more systematically.

In order to do so, we need to answer why we do believe that the models in Fig. 5 are more complex than the model in Fig. 4. Of course, the immediate answer is simply because they are larger and therefore harder to grasp, i.e., a very direct cognitive argument. But there is another important argument why we believe this. The model in Fig. 4 shows an internal reuse that the models in Fig. 5 do not show. The crucial point is the reuse of the loop consisting of

the ‘A’-activity and the α-decision point in Fig. 4. We need to delve into this

important aspect and will actually do this later. First, we want to discuss the

dual question, which is of equal importance, i.e., we must also try to understand

or try answer the question, why modellers and programmers might find that the

models in Fig. 5 are less complex than the models in Fig. 4.

A standard answer to this latter question could typically be that the edge

from the β-decision point to the ‘A’-activity in Fig. 4 is an arbitrary jump, i.e.,

a spaghetti, whereas the diagrams in Fig. 5 do not show any arbitrary jumps

or spaghetti phenomena. But the question is whether this vague argument can

be made more precise. A structured diagram consists of strictly nested blocks.

All blocks of a structured diagram form a tree-like structure according to their

nesting, which corresponds also to the derivation tree in terms of the formation

rules of Fig. 1. The crucial point is that each block can be considered a semantic

capsule from the viewpoint of its context. This means that once the semantics of a block is understood by the analyst studying the model, the analyst can forget about the inner modelling elements of the block. This is not so for diagrams in general. This has been the argument of looking from the outside onto a block in case a modeller wants to know its semantics in order to understand the semantics
Frontiers of Structured Business Process Modeling 145

case a modeller want to know its semantics in order to understand the semantics

of the context where it is utilized. Also, the dual scenario can be convincing. If

an analyst is interested in understanding the semantics of a block he can do this

in terms of the inner elements of a block only. Once the analyst has identified the

block he can forget about its context to understand it. This is not so easy in a

non-structured language. When passing an element, in general you do not know

where you end up in following the several paths behind it. It is also possible

to subdivide a non-structured diagram into chunks that are smaller than the

original diagram and that make sense to understand as capsules. For example,

this can be done, if possible, by transforming the diagram into a structured one,

in which you will find regions of your original diagram. However, it is extra effort

to do this partition.

With the current set of modelling elements, i.e., those introduced by the formation rules in Fig. 1, all this can be seen particularly easily, because each block has exactly one entry point, i.e., one edge leading into it. Fortunately, standard building blocks found in process modelling would have one entry point in a structured approach. If you have, in general, also blocks with more than one entry point, it would make the discussion interesting. The above argument would not be completely infringed. Blocks still are capsules, with a semantics that can be understood locally with respect to their appearance in a strictly nested structure of blocks. The scenario itself remains neat and tidy; the difference lies in the fact that a block with more than one entry has a particularly complex semantics in a certain sense. The semantics of a block with more than one entry is manifold, e.g., the semantics of a block with two entries is threefold. Given that, in

general, we also have concurrency phenomena in a business process model, the

semantics of a block with two entry points, i.e., its behaviour or opportunities,

must be understood for the case that the block is entered via one or the other

entry point and for the case that the block is entered simultaneously. But this is

actually not a problem; it just means a more sophisticated semantics and more

documentation.

Despite a more complex semantics, a block with multiple entries still remains

an anchor in the process of understanding a business process model, because

it is possible, e.g., to understand the model from inside to outside following

strictly the tree-like nesting, which is a canonical way to understand the diagram,

i.e., a way that is always defined. It is also always possible to understand the

diagram sequentially from the start node to the end node in a controlled manner.

The case constructs make such sequential proceeding complex, because they

open alternative paths in a tree-like manner. The advantage of a structured

diagram with respect to case-constructs is that each of the alternative paths

that are spawned is again a block and it is therefore possible to understand

its semantics in isolation from the other paths. This is not so in a non-structured diagram, in general, where there might be arbitrary jumps between the alternative paths. Similarly, if analyzing a structured diagram in a sequential manner, you do not get into arbitrary loops and therefore have to deal with a minimized risk of losing track.



The discussion of the possibility to have blocks with more entry points immediately

reminds us of the discussion we have seen within the business process

community on multiple versus unique entry points for business processes in a

setting of hierarchical decomposition. The relationship between blocks in a flat structured language and sub-diagrams in a hierarchical approach, and how they play together in a structured approach, is an important strand of discussion that

we will come back to in due course. For the time being, we just want to point out

the relationship of the discussion we just had on blocks with multiple entries and

sub-diagrams with multiple entries. A counter-argument against sub-diagrams

with multiple entries would be that they are more complex. Opponents of the

argument would say, that it is not a real argument, because the complexity of

the semantics, i.e., its aforementioned manifoldness, must be described anyhow.

With sub-diagrams that may have no more than one entry point, you would

need to introduce a manifoldness of diagrams each with a single entry point.

We do not discuss here how to transform a given diagram with multiple entries into a manifoldness of diagrams – all we want to remark here is that it easily becomes complicated because of the necessity to appropriately handle the aforementioned

possibly existing concurrency phenomena. Eventually it turns out to

be a problem of transforming the diagram together with its context, i.e., transforming

a set of diagrams and sub-diagrams with possibly multiple entry points

into another set of diagrams and sub-diagrams with only unique entry points.

Defenders of diagrams with unique entry points would state that it is better to

have a manifoldness of such diagrams instead of having a diagram with multiple

entries, because, the manifoldness of diagrams documents better the complexity

of the semantics of the modelled scenario.

For a better comparison of the discussed models against the above statements

we have repainted the diagram from Fig. 4 and diagram (ii) from Fig. 5 with

the blocks they are made of and their abstract syntax trees resp. quasi-abstract

syntax tree in Fig. 6. The diagram of Fig. 4 appears to the left in Fig. 6 as

diagram Φ and diagram (ii) from Fig. 5 appears to the right as diagram Ψ.

According to that, the left abstract syntax tree φ in Fig. 6 corresponds to the

diagram from Fig. 4 and the right abstract syntax tree ψ corresponds to the

diagram (ii) from Fig. 5. Blocks are surrounded by dashed lines in Fig. 6.

If you proceed in understanding the model Φ in Fig. 6 you first have to understand a while-loop that encompasses the 'A'-activity – the block labelled with number '5' in model Φ. After that, you are not done with that part of the model. Later, after the β-decision point you are branched back to the 'A'-activity and you have to re-understand the loop it belongs to again, however, this time in a different manner, i.e., as a repeat-until loop – the block labelled with number '1' in model Φ. It is possible to argue that, in some sense, this makes the model Φ harder to read than model Ψ. To say it differently, it is possible to view model Ψ as an instruction manual on how to read the model Φ. Actually, model Ψ is a bloated version of model Φ. It contains some modelling elements of model Φ redundantly; however, it enjoys the property that each modelling element has to be understood only in the context of one block and


Fig. 6. Block-structured versus arbitrary business process model

its encompassing blocks. We can restate these arguments a bit more formally by analyzing the abstract syntax trees φ and ψ in Fig. 6. Blocks in Fig. 6 correspond to constructs that can be generated by the formation rules in Fig. 1. The abstract syntax tree ψ is an alternate presentation of the nesting of blocks in model Ψ. A node stands for a block and for the corresponding construct according to the formation rules. The graphical model Φ cannot be derived from the formation rules in Fig. 1. Therefore it does not possess an abstract syntax tree in which each node represents a unique graphical block and a construct at the same time. The tree φ shows the problem. You can match the region labelled '1' in model Φ as a block against the while-loop rule (iv) and you can subsequently match the region labelled '2' against the sequence rule (ii). But then you get stuck. You can form a further do-while loop with rule (iv) out of the β-decision point and block '2' as in model Ψ, but the resulting graphical model cannot be interpreted as a part of model Φ any more. This is because the edge from activity 'B' to the β-decision point graphically serves both as input branch to the decision point and as back branch to the decision point. This graphical problem is resolved in the abstract syntax tree φ by reusing the activity 'B' in the node that corresponds to node '5' in tree ψ in forming a sequence according to rule (ii), with the result that the tree φ is actually no tree any longer. Similarly, the reuse of the modelling elements in forming node '6' in the abstract syntax tree φ visualizes the double interpretation of this graphical region as both a do-while loop and a repeat-until

loop.


5 Structure for Text-Based versus Graphical Specifications

In Sect. 4 we have said that an argument for a structured business process

specification is that it is made of strictly nested blocks and that each identifiable

block forms a semantic capsule. In the argumentation we have looked at the

graphical presentation of the models only and now we will have a look also at

the textual representations.

This section needs a disclaimer. We are convinced that it is risky in the discussion

of quality of models to give arguments in terms of cognitive categories

like understandability, readability, cleanness, well-designedness, well-definedness. These categories tend to have an insufficient degree of definedness themselves, so that argumentations based on them easily suffer from a lack of falsifiability. Nevertheless, in this section, in order to abbreviate, we need to speak directly about the reading ease of specifications. The judgements are our very own opinion, an opinion that expresses our perception of certain specifications. The reader may have a different opinion and this would be interesting in its own right. At least, the expression of our own opinion may encourage the reader to judge the reading ease of certain specifications.

As we said in terms of complexity, we think that the model in Fig. 4 is easier to understand than the models in Fig. 5. We think it is easier to grasp. Somewhat paradoxically, we think the opposite about the respective text representations, at least at first sight, i.e., as long as we have not internalized too much all the different graphical models in the listings. This means that we think that the text representation of the model in Fig. 4, i.e., Listing 3, is definitely harder to understand than the text representations of both models in Fig. 5, i.e., Listings 4 and 5. How come? Maybe the following observation helps, i.e., that we also think that the graphical model in Fig. 4 is also easier to read than the model's textual representation in Listing 3 and also easier to read than the two other Listings 4 and 5. Why is Listing 3 so relatively hard to understand? We think, because there is no explicitly visible connection between the jumping-off point in line '04' and the jumping target in line '02'. Actually, the first thing we would recommend in order to understand Listing 3 better is to draw its visualization, i.e., the model in Fig. 4, or to concentrate and to visualize it in our mind. By

the way, we think that drawing some arrows in Listing 3 as we did in Fig. 7 also helps. The two arrows already help despite the fact that they make explicit only a part of the jump structure – one possible jump from line '01' to line '03' in case the α-condition becomes invalid must still be understood from the indentation of the text.

All this is said for such a small model consisting of a total of five lines. Imagine if you had to deal with a model consisting of several hundred lines with arbitrary

goto-statements all over the text. If it is true that the model in Fig. 4 is easier to

understand than the models in Fig. 5 and at the same time Listing 3 is harder

to understand than Listings 4 and 5 this may lead us to the assumption that

the understandability of graphically presented models follows other rules than

the understandability of textual representation. Reasons for this may be, on the



01 WHILE alpha DO
02 A;
03 B;
04 IF beta THEN GOTO 02;
05 C;

Fig. 7. Listing enriched with arrows for making jump structure explicit

one hand, the aforementioned lack of explicit visualizations of jumps, and on the other hand, the one-dimensional layout of textual representations. The reason why we have given all of these arguments in this section is not to promote visual modelling. The reason is that we see a chance that they might explain why the structured approach has been so easily adopted in

the field of programming.

The field of programming was and still is dominated by text-based specifications

– despite the fact that we have seen many initiatives from syntax-directed

editors over computer-aided software engineering to model-driven architecture. It is fair to remark that the crucial characteristic of mere textual specification in the discussion of this section, i.e., the lack of explicit visualization of jumps, or, to say it in a more general manner, of support for the understanding of jumps, is actually addressed in professional coding tools like integrated development environments with their maintenance of links, code analyzers and profiling tools.

The mere text-orientation of specification has been partly overcome by today’s

integrated development environments. Let us express once more that we are no

promoters of visual modelling or even visual programming. In [3] we have deemphasized

visual modelling. We strictly believe that visualizations add value,

in particular, if it is combined with visual meta-modelling [10,11]. But we also

believe that mere visual specification is no silver bullet, in particular, because

it does not scale. We believe in the future of a syntax-direct abstract platform

with visualization capabilities that overcomes the gap between modelling and

programming from the outset as proposed by the work on AP1 [8,9] of the Software

Engineering research group at the University of Auckland.

6 Structure and Decomposition

The models in Fig. 5 are unfolded versions of the model in Fig. 4. Some modelling

elements of the diagram in Fig. 5 occur redundantly in each model in Fig. 4.

Such unfolding violate the reuse principle. Let us concentrate on the comparison

of the model in Fig. 5 with model (i) in Fig. 5. The arguments are similar for

diagram (ii) in Fig. 5. The loop made of the α-decision point and the activity ‘A’

occurs twice in model (i). In the model in Fig. 5 this loop is reused by the jump

from the β-decision point albeit via an auxiliary entry point. It is important to

understand that reuse is not about the cost-savings of avoiding the repainting

modelling elements but about increasing maintainability.


150 D. Draheim

Imagine, in the lifecycle of the business process a change to the loop consisting

of the activity ‘A’ and the α-decision point becomes necessary. Such changes

could be the change of the condition to another one, the change of the activity ‘A’

to another one or the refinement of the loop, e.g., the insertion of a further activity

into it. Imagine that you encounter the necessity for changes by reviewing

the start of the business process. In analyzing the diagram, you know that the

loop structure is not only used at the beginning of the business process but also

later by a possible jump from the β-decision point to it. You will now further

analyze whether the necessary changes are only appropriate at the beginning of

the business process or also later when the loopisreusedfromotherpartsofthe

business process. In the latter case you are done. This is the point where you can

get into trouble with the other version of the business process specification as

diagram (i) in Fig. 5. You can more easily overlook that the loop is used twofold

in the diagram; this is particularly true for similar examples in larger or even

distributed models. So, you should have extra documentation for the several

occurrences of the loop in the process. Even in the case that the changes are

relevant only at the beginning of the process you would like to review this fact

and investigate whether the changes are relevant for other parts of the process.

It is fair to remark, that in the case that the changes to the loop in question

are only relevant to the beginning of the process, the diagram in Fig. 5 bears

the risk that this leads to an invalid model if the analyst oversees its reuse

from later stages in the process, whereas the model (i) in Fig. 5 does not bear

that risk. But we think this kind of weird fail-safeness can hardly be sold as

an advantage of model (i) in Fig. 5. Furthermore, it is also fair to remark, that

the documentation of multiple occurrences of a model part can be replaced

by appropriate tool-support or methodology like a pattern search feature or

hierarchical decomposition as we will discuss in due course. All this amounts to

say that maintainability of a model cannot be reduced to its presentation but

depends on a consistent combination of presentational issues, appropriate tool

support and defined maintenance policies and guidelines in the framework of a

mature change management process.

We now turn the reused loop consisting of the activity ‘A’ and the α-decision

point in Fig. 5 into an own sub-diagram in the sense of hierarchical decomposition,

give it a name – let us say ‘DoA’ – and replace the relevant regions in diagram (i)

in Fig. 5 by the respective, expandable sub-diagram activity. The result is shown

in Fig. 8. Now, it is possible to state that this solution combines the advantages

from both kinds of models in question, i.e., it consists of structured models at

all levels of the hierarchy and offers an explicit means of documentation of the

places of reuse. But a caution is necessary. First, the solution does not free the

analyst to actually have a look at all the places a diagram is used after he or she has

made a change to the model, i.e., an elaborated change policy is still needed. In the

small toy example, such checking is provoked, but in a tool you usually do not see

all sub-diagrams at once, but rather step through the levels of the hierarchy and

the sub-diagrams with links. Remember that the usual motivation to introduce

hierarchical decomposition and tool-support for hierarchical decomposition is the


DoA

+

Frontiers of Structured Business Process Modeling 151

DoA

n

B �

y

y

� A

A

C

DoA

+

Fig. 8. Example business process hierarchy

DoA

+

n

B �

y

DoA

Ado

+

y

� A

Ado

C

B

A DoA

+

Fig. 9. Example for a deeper business process hiarchy

desire to deal with the complexity of large and very large models. Second, the tool

should not only support the reuse-direction but should also support the inverse

use-direction, i.e., it should support the analyst with a report feature that lists all

places of reuse for a given sub-diagram.

Now let us turn to a comparative analysis of the complexity of the modelling

solution in Fig. 8 and the model in Fig. 5. The complexity of the top-level

diagram in the model hierarchy in Fig. 8 is not any more significantly higher

than the one of the model in Fig. 5. However, together with the sub-diagram,

the modelling solution in Fig. 8 again shows a certain complexity. It would be

B


152 D. Draheim

possible to neglect a reduction of complexity by the solution in Fig. 8 completely

with the hint that the disappearance of the edge representing the jump from the

β-decision point into the loop in Fig. 5 is bought by another complex construct

in Fig. 8, wit to the dashed line from the activity ‘DoA’ to the targeted subdiagram.

The jump itself can be still seen in Fig. 8, somehow, unchanged as

an edge from the β-decision point to the activity ‘A’. We do not think so. The

advantage of the diagram in Fig. 8 is that the semantic capsule made of the loop

in question is already made explicit as a named sub diagram, which means an

added documentation value.

Also, have a look at Fig. 9. Here the above explanations are even more substantive.

The top-level diagram is even less complex than the top-level diagram

in Fig. 8, because the activity ‘A’ now has moved to an own level of the hierarchy.

However, this comes at the price now, that the jump from the β-decision

point to the activity ‘A’ in Fig. 5 now re-appears in Fig. 9 as the concatenation

of the ‘yes’-branch in the top-level diagram, the dashed line leading from the

activity ‘Ado’ to the corresponding sub-diagram at the next level and the entry

edge of this sub-diagram.

7 On Business Domain-Oriented versus Documentation-

Oriented Modeling

In Sects. 3 through 6 we have discussed structured business process modelling

for those processes that actually have a structured process specification in terms

of a chosen fixed set of activities. In this Section we will learn about processes

that do not have a structured process specification in that sense. In the running

example of Sects. 3 through 6 the fixed set of activities was given by the activities

of the initial model in Fig. 4 and again we will explain the modelling challenge

addressed in this Section as a model transformation problem.

reject workpiece

due to defects

handle

workpiece

quality must

be improved

y quality

insurance

y

n n

amount exceeds

threshold

prepare

purchase

order

(i) dispose

(ii)

finish

deficient

workpiece

workpiece

submit

purchase

order

revision is

necessary

y

approve

purchase

order

y

n n

Fig. 10. Two example business processes without structured presentation with respect

to no other than their own primitives


Frontiers of Structured Business Process Modeling 153

y y

A � B


n n

C D

Fig. 11. Business process with cycle that is exited via two distinguishable paths

Consider the example business process models in Fig. 10. Each model contains

a loop with two exits to paths that lead to the end node without the opportunity to

come back to the originating loop before reaching the end state. It is known [1,5,6]

that the behaviours of such loops cannot be expressed in a structured manner, i.e.,

by a D-chart as defined in Fig. 1 solely in terms of the same primitive activities

as those occurring in the loop. Extra logic is needed to formulate an alternative,

structured specification. Fig. 11 shows this loop-pattern abstractly and we proceed

to discuss this issues with respect to this abstract model.

Assume that there is a need to model the behaviour of a business process in

terms of a certain fixed set of activities, i.e., the activities ‘A’ through ‘D’ in

Fig. 11. For example, assume that they are taken from an accepted terminology

of a concrete business domain. Other reasons could be that the activities stem

from existing contract or service level agreement documents. You can also assume

that they are simply the natural choice as primitives for the considered work to

be done. We do not delve here into the issue of natural choice and just take for

granted that it is the task to model the observed or desired behaviour in terms of

these activities. For example, we could imagine an appropriate notion of cohesion

of more basic activities that the primitives we are restricted to, or let’s say selfrestricted

to, adhere to. Actually, as it will turn out, for the conclusiveness of

our current argumentation there is no need for an explanation how a concrete

fixed set of activities arises. What we need for the conclusiveness of our current

argumentation is the demand on the activities, that they are only about actions

and objects that are relevant in the business process.

Fig. 12 shows a structured business process model that is intended to describe

the same process as the specification in 11. In a certain sense it fails. The extra

logic introduced in order to get the specification into a structured shape do

not belong to the business process that the specification aims to describe. The

model in Fig. 12 introduces some extra state, i.e., the Boolean variable δ, extra

activities to set this variable so that it gets the desired steering effect and an

extra δ-decision point. Furthermore, the original δ-decision point in the model of

Fig. 11 has been changed to a new β∧δ-decision point. Actually, the restriction of

the business process described by Fig. 11 onto those particles used in the model

in Fig. 11 is bisimilar to this process. The problem is that the model in Fig. 12 is a

hybrid. It is not only a business domain-oriented model any more, it now has also

some merely documentation-related parts. The extra logic and state only serve


154 D. Draheim

A �:=true A

y y

��� B


n


C

�:=false

Fig. 12. Resolution of business process cycles with multiple distinguishable exits by

the usage of auxiliary logic and state

the purpose to get the diagram into shape. It needs clarification of the semantics.

Obviously, it is not intended to change the business process. If the auxiliary

introduced state and logic would be also about the business process, this would

mean, for example, that in the workshop a mechanism is introduced, for example

a machine or a human actor that is henceforth responsible for tracking and

monitoring a piece of information δ. So, at least what we need is to explicitly

distinguish those elements in such a hybrid model. The question is whether the

extra complexity of a hybrid domain- and documentation-oriented modelling

approach is justified by the result of having a structured specification.

8 Conclusion

On a first impression, structured programs and flowcharts appear neat and programs

and flowcharts with arbitrary jumps appear obfuscated, muddle-headed,

spaghetti-like etc. But the question is not to identify a subset of diagrams and

programs that look particularly fine. The question is, given a behaviour that

needs description, whether it makes always sense to replace a description of this

behaviour by a new structured description. What efforts are needed to search for

a good alternative description? Is the resulting alternative structured description

as nice as the original non-structured description?

Furthermore, we need to gain more systematic insight into which metrics we

want to use to judge the quality of a description of a behaviour, because categories

like neatness or prettiness are not satisfactory for this purpose if we take

for serious that our domain of software development should be oriented rather

towards engineering [14,2] than oriented towards arts and our domain of business

management should be oriented rather towards science [4], though, admittedly,

both fields are currently still in the stage of pre-paradigmatic research [7]. All

these issues form the topic of investigation of this article.

For us, the definitely working theory of quality of business process models

would be strictly pecuniary, i.e., it would enable us to define a style guide for

D

n


Frontiers of Structured Business Process Modeling 155

business process modelling that eventually saves costs in system analysis and

software engineering projects. The better the cost-savings realized by the application

of such style-guide the better such theory. Because our ideal is pecuniary,

we deal merely with functionality. There is no cover, no aesthetics, no mystics.

This means there is no form in the sense of Louis H. Sullivan [16] – just function.

References

1. Böhm, C., Jacopini, G.: Flow Diagrams, Turing Machines and Languages With

Only Two Formation Rules. Communications of the ACM 3(5) (1966)

2. Buxton, J.N., Randell, B.: Software Engineering – Report on a Conference Sponsored

by the NATO Science Committee, Rome, October 1969. NATO Science Committee

(April 1970)

3. Draheim, D., Weber, G.: Form-Oriented Analysis – A New Methodology to Model

Form-Based Applications. Springer, Heidelberg (2004)

4. Gulick, L.: Management is a Science. Academy of Management Journal 1, 7–13

(1965)

5. Knuth, D.E., Floyd, R.W.: Notes on Avoiding ‘Go To’ Statements. Information

Processing Letters 1(1), 23–31, 177 (1971)

6. Rao Kosaraju, S.: Analysis of Structured Programs. In: Proceedings of the 5th

Annual ACM Symposium on Theory of Computing, pp. 240–252 (1973)

7. Kuhn, T.S.: The Structure of Scientific Revolutions. University of Chicago Press

(December 1996)

8. Lutteroth, C.: AP1 – A Platform for Model-based Software Engineering. In: Draheim,

D., Weber, G. (eds.) TEAA 2006. LNCS, vol. 4473, pp. 270–284. Springer,

Heidelberg (2007)

9. Lutteroth, C.: AP1 – A Platform for Model-based Software Engineering. PhD

thesis, University of Auckland (March 2008)

10. Himsl, M., Jabornig, D., Leithner, W., Draheim, D., Regner, P., Wiesinger, T.,

Küng, J.: A Concept of an Adaptive and Iterative Meta- and Instance Modeling

Process. In: Proceedings of DEXA 2007 - 18th International Conference on

Database and Expert Systems Applications. Springer, Heidelberg (2007)

11. Himsl, M., Jabornig, D., Leithner, W., Draheim, D., Regner, P., Wiesinger, T.,

Küng, J.: Intuitive Visualization-Oriented Metamodeling. In: Proceedings of DEXA

2009 - 20th International Conference on Database and Expert Systems Applications.

Springer, Heidelberg (2009)

12. Milner, R.: A Calculus of Communication Systems. LNCS, vol. 92. Springer, Heidelberg

(1980)

13. Milner, R.: Communication and Concurrency. Prentice-Hall, Englewood Cliffs

(1989)

14. Naur, P., Randell, B. (eds.): Software Engineering – Report on a Conference Sponsored

by the NATO Science Committee, Garmisch, October 1968. NATO Science

Committee (January 1969)

15. Park, D.: Concurrency and Automata on Infinite Sequences. In: Deussen, P. (ed.)

GI-TCS 1981. LNCS, vol. 104, pp. 167–183. Springer, Heidelberg (1981)

16. Sullivan, L.H.: The Tall Office Building Artistically Considered. Lippincott’s Magazine

57, 403–409 (1896)


Information Systems for Federated Biobanks ⋆

Johann Eder 1 , Claus Dabringer 1 , Michaela Schicho 1 , and Konrad Stark 2

1 Alps Adria University Klagenfurt, Department of Informatics Systems

{Johann.Eder,Claus.Dabringer,Michaela.Schicho}@uni-klu.ac.at

2 University of Vienna, Department of Knowledge and Business Engineering

Konrad.Stark@univie.ac.at

Abstract. Biobanks store and manage collections of biological material

(tissue, blood, cell cultures, etc.) and manage the medical and biological

data associated with this material. Biobanks are invaluable resources

for medical research. The diversity, heterogeneity and volatility of the

domain make information systems for biobanks a challenging application

domain. Information systems for biobanks are foremost integration

projects of heterogenous fast evolving sources.

The European project BBMRI (Biobanking and Biomolecular Resources

Research Infrastructure) has the mission to network European

biobanks, to improve resources for biomedical research, an thus contribute

to improve the prevention, diagnosis and treatment of diseases.

We present the challenges for interconnecting European biobanks and

harmonizing their data. We discuss some solutions for searching for biological

resources, for managing provenance and guaranteeing anonymity

of donors. Furthermore, we show how to support the exploitation of such

a resource in medical studies with specialized CSCW tools.

Keywords: biobanks, data quality and provenance, anonymity, heterogeneity,

federation, CSCW.

1 Introduction

Biobanks are collections of biological material (tissue, blood, cell cultures, etc.)

together with data describing this material and their donors and data derived

from this material. Biobanks are of eminent importance for medical research -

for discovering the processes in living cells, the causes and effects of diseases, the

interaction between genetic inheritance and life style factors, or the development

of therapies and drugs. Information systems are an integral part of any biobank

and efficient and effective IT support is mandatory for the viability of biobanks.

For an example: A medical researcher wants to find out why a certain liver

cancer generates a great number of metastasis in some patients and in others not.

This knowledge would help to improve the prognosis, the therapy, the selection

⋆ The work reported here was partially supported by the European Commission 7th

Framework program - project BBMRI and by the Austrian Ministry of Science and

Research within the program Gen-Au - project GATIB.

A. Hameurlain et al. (Eds.): Trans. on Large-Scale Data- & Knowl.-Cent. Syst. I, LNCS 5740, pp. 156–190, 2009.

c○ Springer-Verlag Berlin Heidelberg 2009


Information Systems for Federated Biobanks 157

of therapies and drugs for a particular patient, and help to develop better drugs.

For such a study the researcher needs besides biological material (cancer tissue)

an enormous amount of data: clinical records of the patients donating the tissue,

lab analysis, microscopic images of the diseased cells, information about the

life style of patients, genotype information (e.g. genetic variations), phenotype

information (e.g. gene expression profiles), etc. Gathering all these data in the

course of a single study would be highly inefficient and costly. A biobank is

supposed to deliver the data needed for this type of research and share the data

and material among researchers.

From the example above it is clear that information systems for biobanks

are huge applications. The challenge is to integrate data stemming from very

different autonomous sources. So biobanks are foremost integration and interoperability

projects. Another important issue is the dynamics of the field: new

insight leads to more differentiated diagnosis, new analysis methods allow the assessment

of additional measurements, or improve the accuracy of measurements.

So an information system for biobanks will be continuously evolving. And last

but not least, biobanks store very detailed personal information about donors.

To protect the privacy and anonymity of the donors is mandatory and misuse

of the stored information has to be precluded.

In recent years biobanks have been set up in various organizations, mainly

hospitals and medical and pharmaceutical research centers. Since the availability

of material and data is a scarce resource for medical research, the sharing of the

available material within the research community increased. This leads to desire

to organize the interoperation of biobanks in a better way.

The European project BBMRI (Biobanking and Biomolecular Resources Research

Infrastructure) has the mission to network European biobanks to improve

resources for biomedical research an thus contribute to improve the prevention,

diagnosis and treatment of diseases. BBMRI is organized in the framework of

European Strategy Forum on Research Infrastructures (ESFRI).

In this paper we give a broad overview of the requirements for IT systems for

biobanks, present the architecture of information systems supporting biobanks,

discuss possible integration strategies for connecting European biobanks and

discuss the challenges for this integration. Furthermore, we show how such an

infrastructure can be used and present a support system for medical research

using data from biobanks. The purpose of this paper is rather painting the whole

picture of challenges of data mangement for federated biobanks than presenting

detailed technical solutions. This paper is an extended version of [20]

2 What Are Biobanks?

A biobank, also known as a biorepository, can be seen as an interdisciplinary

research platform that collects, stores, processes and distributes biological materials

and the data associated with those materials.

In short: biobank = biological material + data. Typically, those biological

materials are human biospecimens such as tissue, blood or body fluids - and


158 J. Eder et al.

the data are the donor-related clinical information of that biological material.

Human biological samples in combination with donor-related clinical data are

essential resources for the identification and validation of biomarkers and the development

of new therapeutic approaches (drug discovery), especially in the development

of systems for biological approaches to study the disease mechanisms.

Further on, they are used to explore and understand the function and medical

relevance of human genes, their interaction with environmental factors and the

molecular causes of diseases [16]. Besides human-driven biobanks, a biobank can

also include samples from animals, cell and bacterial cultures, or even environmental

samples. Biobanks became a major issue in the field of genomics and

biotechnology, in recent years. According to the type of stored samples and the

medical-scientific domain biobanks can differ in many forms.

2.1 The Variety of Biobanks and Medical Studies

The development of biobanks results in a very heterogeneous concept. Each

biobank pursues its own strategy and specific demands on quality and annotation

of the collected samples. According to [39] we distinguish between three major

biobank types considering exclusively human-driven biobanks:

1. Population based biobanks. Population based cohorts are valuable for assessing

the natural occurrence and progression of common diseases. They contain

a huge number of biological samples from healthy or diseased donors, representative

for a concrete region or ethic cohort (population isolated) or from

the general population over a large period of time (longitudinal population).

Examples for large population based biobanks are the Icelandic DeCode

Biobank and UK Biobank.

2. Disease-oriented biobanks. Their mission is to obtain biomarkers of disease

through prospective and/or retrospective collections of e.g. tumour and nontumour

samples with derivatives such as DNA/RNA/proteins. The collected

samples are associated to clinical data and clinical trials [39]. This groups of

biobanks are typically pathology archives like the MUG Biobank Graz.

A special kind of biobanks are twin registries such as GenomeEUtwin biobank,

which contain approximately equal numbers of monozygotic (MZ) and dizygotic

(DZ) twins. With such biobanks the parallel dissection of effects of genetic variation

in a homogeneous environment (DZ twins) and of environmental effects

against an identical genetic background (MZ twins) is possible. [52] These registries

are also partially suited to distinguish between the genetic and non-genetic

basis of diseases.

In [18] Cambon-Thomsen shows, that biobanks can vary in size, access mode,

status of institution or scientific sector in which the samples were collected:

1. Medical and academic research. In medical genetic studies of disease usually

small case- or family-based repositories are involved. Population-based collections,

which are also usually small, have been used for academic research


Information Systems for Federated Biobanks 159

for long period of time. Some large epidemiological studies have also involved

the collection of a large number of samples.

2. Clinical case/control studies. The primarily use of collected samples in hospitals

is for informing diagnosis, for the clinical or therapeutic follow up as

well as for the discovery or validation of genetic and non-genetic risk factors.

Large numbers of tissue sections have been collected by pathology departments

over the years. Transplantations using cells, tissues or even organs

from unrelated donors also led to the development of tissue and cell banks.

3. Biotechnology domain. Within this domain collections of reference cell lines

(e.g cancer cell lines or antibody-producing cell lines) and stem cell lines of

various origin are obtained. They are mainly used in biotechnology research

and development.

4. Judiciary domain. Biobanks host large collections of different sources of biological

material, data and DNA fingerprints, which have very restricted

usage.

2.2 Collected Material and Stored Data

Biobanks are not something new in the world of medicine and biological research.

The systematic collection of human samples goes back to 19th century,

including formaldehyde-fixed, paraffin-embedded or frozen material [27]. Most

biobanks are developed in order to support a research program in a specific type

of disease or to collect samples from a particular group of donors. Due to the

large resource requirements, biobanks within an institution usually conglomerate

to reduce high costs. This merging typically results in the fact that biobanks

have many kinds of samples and many different types (also called domains) of

data.

Material / Sample Types. Samples can include any kind of tissue, fluid or

other material that can be obtained from an individual. Usually, biospecimens in

a biobank are blood and blood components (serum), solid tissues such as small

biopsies and so on. An important collection in biobanks are the so-called normal

samples. These are that kind of tissue samples which are free of diagnosed disease.

For instance in some cases of the medical research (e.g case/control studies)

it is an important issue that there exist corresponding normal samples to several

diseased diagnosed samples, which can be used as controls, in order to get

the bottom of specific diseases or gene mutations. The biological samples can

be collected in various ways. Samples may be taken in the course of diagnostic

investigations as well as during treatment of diseases. For instance, in biopsies

small human tissue specimen are obtained in order to determine the type of a

cancer. Surgical resections of tumours provide larger tissue samples, which may

be used to specify the type of disease treatment. Autopsies are another valuable

source of human tissues, where specimen maybetakenfromvariouslocations

which reflect the effects of a disease in different organs of a patient. The obtained

biological materials are special preserved to keep them durable over a long time.


160 J. Eder et al.

Sample Data. The stored data from a donor, which come along with the collected

sample can be very extensive and various. According to [1] this data

includes:

– General information (e.g. race, gender, age, ...)

– Lifestyle and environmental information (e.g. smoker - non smoker, living in

a big city with high environmental pollution or living in rural areas)

– History of present illnesses, treatments and responses (e.g prescribed drugs

and the reactions of adverse)

– Longitudinal information (e.g. a sequence of blood tests after tissue collection

in order to test the progress behavior of diseases)

– Clinical outcomes (e.g. success of the treatment: Is the donor still living?)

– Data from gene expression profiles, laboratory data,...

Technically, the types of data range from typical record keeping, over text and

various forms of images to gene vectors and 3-D chemical structures.

Ethical and Legal Issues. Donors of biological materials must be informed

about purpose and intended use of their samples. Typically, the donor signs an

informed consent [10] which allows the use of samples for research and obliged

the biobank institution to guarantee privacy of the donor. The usage of material

in studies usually requires approval by ethics boards. Since the identity of donors

must be protected, the relationship between a donor and its sample must not

be revealed. Technical solutions for guaranteeing privacy issues are discussed in

section 6.

2.3 Samples as a Scarce Resource

Biobanks prove to be a central key resource to increase the effectivity of medical

research. Besides the high costs with human biological samples, they are

available in a limited amount. E.g. Ones a piece of a liver-tissue is cut off and

is used for a study, this piece of tissue is expended. Therefore, it is important

to avoid redundant analysis and achieve the most efficient and effective use of

non-renewable biological material [17]. A common and synergetic usage of this

resource will enable lots of research projects especially in case of rare diseases

with very limited material available. In silico experiments [46] play an important

role in the context of biobanks. The aim is to answer as many research questions

as possible without access to the samples themselves. Therefore, already

acquired data of samples are stored in databases and shared among interested

researchers. So modern biobanks offer the possibility to decrease long-term costs

of research and development as well as effective data acquisition and usage.

3 Biobanks as Integration Project

In section 2 we already mentioned that biobanks may contain various types of

collections of biological materials. Depending on the type of biobank, its organizational

and research environment, human tissue, blood, serum, isolated


Information Systems for Federated Biobanks 161

RNA/DNA, cell lines or others can be archived. Apart from the organizational

challenges, an elaborated information system is required for capturing all relevant

information of samples, managing borrow and return activities and supporting

complex search enquiries.

3.1 Sample Management in Biobanks

An organizational entity, which is commissioned to establish a biobank and its

operations, requires suitable storage facilities (e.g. cryo-tanks, paraffin block storage

systems) as well as security measures for protecting samples from damage

and for preventing unauthorized access. If a biobank is built on the basis of

existing resources (material and data), a detailed evaluation is essential. The

collection process, the inventory and documentation of samples has to be assessed,

evaluated and optimized. The increasing number of biobanks all over the

world has drawn the attention of international organizations, encouraging the

standardization of processes, sample and data management of biobanks.

Standardization of Processes. Managing a biobank is a dynamic process. The

biological collections may grow continuously, additional collections may be integrated

and samples may be used in medical studies and research projects. Standard

operating procedures are required for the most relevant processes, defining

competencies, roles, control mechanisms and documentation protocols. E.g. The

comparison of two or more gene expression profiles computed by different institutes

is only applicable if all gene expressions were determined by the same

standardized process. The Organization for Economic Cooperation and Development

(OECD) released the definition of Biological Resource Centers (BRC)

which ”must meet the high standards of quality and expertise demanded by the

international community of scientists and industry for the delivery of biological

information and materials” [8]. BRCs are certified institutions providing high

quality biological material and information. The model of BRCs may assist the

consolidation process of biobanks defining quality management and quality assurance

measures [38]. Guidelines for implementing standards for BRCs may be

found in recent works such as [12,35].

Data and Data Sources. Clinical records and diagnoses are frequently available

as semi-structured data or even stored as plain-text in legacy medical information

system. The data available from diverse biobanks do not only include numeric

and alphabetic information but also complex images such as microphotographs of

pathology sections, pictures generated by medical imaging procedures as well as

graphical representations of the results of analytical diagnostic procedures [11].

It is a matter of very large volumes of data. Somewhere data or findings are even

archived only in printed version or stored in heterogenous formats.

3.2 Different Kinds of Heterogeneity

Since biobanks may involve many different datasources it is obvious that heterogeneity

is ever-present. Biobanks may comprise interfaces to sample management


162 J. Eder et al.

systems, labor information systems, research information systems, etc. The origin

of the heterogeneity lies in different data sources (clinical, laboratory systems,

etc), hospitals, research institutes and also in the evolution of involved disciplines.

Heterogeneity appearing in biobanks comes in various forms and thus can be divided

into two different classes. The first class of heterogeneity can be found between

different datasources. This kind of heterogeneity is mostly caused by the

independent development of the different datasources. Here we have to deal with

several different types of mismatches which all lead to heterogeneity between the

systems as shown in [43,32]. Typical mismatches can be found in:

– Attribute namings (e.g. disease vs DiseaseCode)

– Different attribute encodings (e.g. weight in kg vs lbm)

– Content of attributes (e.g. homonyms, synonyms, ...)

– Precision of attributes (e.g. sample size: small, medium, large vs mm 3 ,cm 3 )

– Different attribute granularity

– Different modeling of schemata

– Multilingualism

– Quality of the data stored

– Semi-structured data (incompleteness, plain-text,...)

The second class of heterogeneity is the heterogeneity within one datasource.

This kind of heterogeneity may not be recognized at a first glance. But as [39]

show the scientific value of biobank content increases with the amount of clinical

data linked to certain samples (see figure 1). The longer data will be kept in

biobanks the greater its scientific value is. On the other hand keeping data in

biobanks for a long time leads to heterogeneity because medical progress leads

to changes in database structures and the modeled domain. Modern biobanks

support time series analysis and use of material harvested over a long period of

time. A particular difficulty for these uses of material and data is the correct

Fig. 1. The correlation of scientific value of biobank content and availability of certain

content is shown. One can clearly see that the scientific value increases where the

availability of data decreases [39].


Information Systems for Federated Biobanks 163

representation and treatment of changes. Some exemplary changes that arise in

this context are:

– Changes in disease codes (e.g ICD-9 to ICD-10 [6] in the year 2000)

– Progress in biomolecular methods results in higher accuracy of measurements

– Extension of relevant knowledge (e.g. GeneOntology [4] is changed daily)

– Treatments and standard procedures change

– Quality of sample conservation increases, etc.

Furthermore, also the technical aspects within one biobank are volatile: data

structures, semantics of data, parameters collected, etc. When starting a biobank

project one must be aware of the above mentioned changes. Biobanks should be

augmented to represent these changes. This representation then can be used

to reason about the appropriateness for using a certain set of data or material

together in specific studies.

3.3 Evolution

Wherever possible biobanks should provide transformations to map data between

different versions. Using ontologies to annotate content of biobanks can

be quite useful. By providing mapping support between different ontologies the

longevity problem can be addressed. Further on, versioning and transformation

approaches can help to support the evolution of biobanks. Techniques from

temporal databases and temporal data warehouses can be used for the representation

of volatile data together with version mappings to transform all data

to a selected version [19,26,51,21,22]. This knowledge can be directly applied to

biobanks as well.

3.4 Provenance and Data Quality

From the perspective of medical research, the value of biological material is

tightly coupled with the amount, type and quality of associated data. Though,

medical studies usually require additional data that can not be directly provided

by biobanks or is not available at all. The process of collecting relevant data for

biospecimens is denoted as sample annotation and is usually done in context

of prospective studies based on specified patient cohorts. For instance, if the

family anamnesis is to be collected for a cohort of liver cancer patients, various

preprocessing and filtering steps are necessary. In some cases, different medical

institutes or even hospitals have to be contacted. Patients have to be identified

correctly in external information systems, and anamnesis data is extracted

and collected in predefined data structures or simple spreadsheets. The collected

data is combined with the data supplied by the biobank and constitutes the

basis for hypotheses, analyses and experiments. Thus, additional data is created

in context of studies: predispositions for diseases, gene expression profiles, survival

analyses, publications etc. The collected and created data represents an

added value for biospecimens, since it may be used in related or retrospective

studies. Therefore, if a biobank is committed to support collaborative research


164 J. Eder et al.

activities, an appropriate research platform is required. Generally, the aspects of

contextualization and provenance have to be considered by such a system.

Contextualization is the ability to link and annotate an object (sample) in different

contexts. A biospecimen may be used in various studies and projects and

may be assigned different annotations in each of it. Further, these annotations

have to be processable allowing efficient retrieval. Further, contextualization allows

to organize and interrelate contexts. That is, study data is accessible only

to selected groups and persons having predefined access rights. Related studies

and projects may be linked to each other, whereas collaboration and data

sharing is supported. The MUG biobank uses a service-oriented CSCW system

as an integrated research platform. More details about the system are given in

section 5.1.

Data provenance is used to document the origin of data and tracks all transformation

steps that are necessary to reaccess or reproduce the data. It may

be defined as the background knowledge that enables a piece of data to be interpreted

and used correctly within context [37]. Alternatively, data provenance

may be seen from a process-oriented perspective. Data may be transformed by

a sequence of processing steps which could be small operations (SQL joins, aggregations),

the result of tools (analysis services) or the product of a human

annotation. Thus, these transformations form a “construction plan” of a data

object, which could be reasonably used in various contexts. Traceable transformation

processes are useful for documentation purposes, for instance, for the

materials and methods section of publications. Generally, the data quality may

be improved due to the transparency of data access and transformation. Moreover,

relevant processes may be marked as standard or learning processes. That

is, new project participants may be introduced to research methodology or processes

by using dedicated processes. Another import aspect is the repeatability

of processes. If a data object is the result of an execution sequence of services,

and all input parameters are known, the entire transformation process may be

repeated. Further, processes may be restarted with slightly modified parameters

or intermediate results are preserved to optimize re-executions [14]. Stevens et al.

[46] point out an additional type of provenance: organizational provenance. Organizational

provenance comprises information about who has transformed which

data in which context. This kind of provenance is closely related to collaborative

work and is applicable to medical cooperation projects. As data provenance has

attracted more and more attention, it was integrated in well established scientific

workflow systems [44]. For instance, a provenance recording was integrated

for the Kepler [14], Chimera [24] and Taverna [34] workflow systems. Therefore,

scientific research platform for biobanks could learn from the progress of data

provenance research and incorporate suitable provenance recording mechanisms.

In the context of biobanks, provenance recording can be applied to sample

or medical-case data. If a data object is integrated from external sources (for

instance, the follow up data of a patient), its associated provenance record may

include an identification of its source as well as its access method. Additionally,

the process of data generation during research activities may be documented.


Information Systems for Federated Biobanks 165

That is, if a data object is the result of a certain analysis, or is based on the

processing of several other data objects, the transformation is captured in corresponding

provenance records. If all data transformations are collected by recording

the input and output data in relations, data dependency graphs may be built.

From these graphs, data derivation graphs may be easily computed to answer

provenance queries like: which input data was used to produce result X? Data

provenance is closely related to the above-mentioned contextualization of objects.

While contextualization enables to combine and structure objects from

different sources, data provenance provides the inverse operation. It allows to

trace the origins of used data objects. Thus, provenance has an important role

regarding data quality assurance as it documents the process of data integration.

3.5 Architecture - Local Integration

Sample Management Systems. The necessity of elaborated sample management

systems for biobanks was recognized in several biobank initiatives. Though,

depending on the type of biobank and its research focus different systems have been

implemented. For instance, the UK biobank adapted a laborartory information

system (LIMS) supporting high throughput data capturing and automated quality

control of blood and urine samples [23]. Since the UK biobank is a based on prospective

collections of biological samples (more than 500,000) participants, the focus

is clearly on optimized data capturing of samples and automatization techniques

such as barcode reading. The UK Biorepository Information system [31] strives for

supporting multicenter studies in context of lung cancer research. A manageable

amount of samples is captured and linked to various types of data (life style, anamnesis

data). As commercial systems lack flexibility and customization capabilities,

a propriertary information system was designed and implemented. Another interesting

system was presented in [15], supporting blood sample management in

context of cancer research studies. The system clearly separates donor-related information

(informed consents, contact information) from storage information of

blood specimen and data extracted from epidemiologic questionnaires. In the context

of the European Human Frozen Tissue Bank (TuBaFrost), a central tissue

database for storing patient-related, tissue-related and image data was established

[30]. Since biobanks typically provide material for medical research projects and

studies, they are confronted with very detailed requirements from the medical domain.

For instance, the following enquiry was sent to the MUG Biobank:

A researcher requires about 10 samples with the following characteristics:

– male patients

– paraffin tissue

– with diagnose liver cancer

– including follow-up data (e.g. therapy description) from the oncology

This example illustrates the diversity of search criteria that may be combined

in a single enquiry. Criteria is defined on the type of sample (= paraffin tissue),

the availability of material (= quantity available), on the medical case


166 J. Eder et al.

of diagnose (= liver cancer) and on the patient, as patient sex (= male) and

follow-up data (= therapy description) are required. A catalogue of example

enquiries should be included in the requirements analysis of the sample management

system, as the system specification and design is to be strongly tailored to

medical domain requirements. An other challenge exists in an appropriate representations

of courses of disease. Patients suffering from cancer may be treated

over several years and various tissue samples may be extracted and diagnosed.

If a medical study on cancer is based on tissues of primary tumors and corresponding

metastasis, it is important that the temporal dependency between

the diagnosis of primary tumors and metastasis is captured correctly. Further,

the causal dependency between the primary tumor and the metastasis need to

be represented. Otherwise, a query would also return tissues with metastasis

of secondary tumors. However, the design of a sample management system is

strongly determined by the type, the quality and structure of available data.

As already mentioned clinical records and diagnoses are frequently available as

semi-structured data or even stored as plain-text in legacy medical information

systems.

SampleDB MUG. In the following we give a brief overview of the sample management

system (SampleDB) of the MUG biobank. We present this system as an

exemplary solution for a biobank integration project. For the sake of simplicity

we only present the core-elements of the UML database schema in Figure 2. Generally,

we may distinguish between three main perspectives: sample-information

perspective, medical case perspective and sample-management perspective. The

sample-information perspectives comprises information immediately related to

the stored samples. All samples have a unique identifier, which is a histological

number assigned by the instantaneous section of the pathology. Depending

on the type of sample, different quality and quantity-related attributes may be

stored. For instance, cryo tissue samples are conserved in vials in liquid nitrogen

tanks. By contrast, paraffin blocks may be stored in large-scale robotic storage

systems. Further, different attributes specifying the quality of samples exist such

as the ischemic time of tissue or the degree of contamination of cell lines. That is,

a special table exists for each type of sample (paraffin, cryo tissue, blood samples

etc.). Samples may be used as basic material for further processing. For instance,

RNA and DNA may be extracted from paraffin-embedded tissues. The different

usage types of samples are modelled by the bottom classes in the schema.

Operating a biobank requires an efficient sample management including inventory

changes, documentation of sample usage and storing of cooperation contracts.

The classes Borrow, Project and Coop Partner document which samples

have left the biobank in which cooperation project and how many samples were

returned. Since samples are a limited resource, they may be used up in context of

a research project. For instance, paraffin-embedded tissues may be used to construct

a tissue microarray, a high-throughput analysis platform for epidemiologybased

studies or gene expression profiling of tumours [28]. Thus, for ensuring

the sustainability of sample collections, appropriate guidelines and policies are

required. In this context, samples of rare diseases are of special interest, since


Information Systems for Federated Biobanks 167

Fig. 2. Outline of the CORE schema from the SampleDB (MUG biobank)

they represent invaluable resources that may be used in multicenter studies [17].

The medical case perspective allows for assessing the relevant diagnostic information

of samples. In the case of the MUG biobank, pathological diagnoses are

captured and assigned to the corresponding samples. Since the MUG biobank

has a clear focus on cancerous diseases, tumour-related attributes such as tumour

grading and staging or the International Classification of Diseases for Oncology,

ICD-O-3 classification are used [7]. Patient-related data is split in two tables:

the sensitive data such as the personal data is stored in a separate table while

an anonymous patient table contains a unique patient identifier. Personal data

of patients are not accessible for staff members of biobanks. However, medical

doctors may access sample and person-related data as part of their diagnostic

or therapeutic work.

3.6 Data Integration

Data integration in the biomedical areas is an emerging topic, as the linkage of

heterogeneous research, clinical and biobank information systems become more

and more important. Generally, several integration architectures may be applied


168 J. Eder et al.

for incorporating heterogeneous data sources. Data warehouses extract and consolidate

data from distributed sources in a global database that is optimized for

fast access. On the other hand, in database federations data is left at the sources,

and data access is accomplished by wrappers that map a global schema to distributed

local schemas [33]. Although database federations deliver data that is

up-to-date, they do not provide the same performance as data warehouses. However,

they do not require redundant data storage and expensive periodic data

extraction.

In the context of the MUG biobank several types of information systems are

accessed, as illustrated in figure 3. The different data sources are integrated in

a database federation, whereas interface wrappers have been created for the relevant

data. On the one hand, there are large clinical information systems which

are used for routine diagnostic and therapeutical activities of medical doctors. Patient

records from various medical institutes are stored in the OpenMedocs sytem,

pathological data in the PACS system and laboratory data in the laboratory information

system LIS. On the other hand research databases from several institutes

(e.g. the Archimed system) containing data about medical studies are incorporated

as well as the biological sample management system SampleDB and diverse

robot systems. Further, survival data of patients is provided by the external institution

Statistics Austria. Clinical and routine information systems (at the bottom

of figure 3) are strictly seperated from operational information systems of

the biobank. That is, sensitive patient-related data is only accessible for medical

Fig. 3. Data Integration in context of the MUG Biobank


Information Systems for Federated Biobanks 169

staff and anonymized otherwise. The MUG Biobank operates an own documentation

system in order to protocol and coordinate all cooperation projects. The

CSCW system at the top of the figure provides a scientific workbench for internal

and external project partners, allowing to share data, documents, analysis results

and services. A more detailed description of the system is given in section 5.1. A

modified version of the CSCW workbench will be used as user interface for the

European Bionbank initiative BBMRI, described in section 4.1.

3.7 Related Work

UK-Biobank. The aim of UK Biobank is to store health information about

500.000 people from all around the UK who are aged between 40-69. UK Biobank

has evolved over several years. Many subsystems, processes and even the system

architecture have been developed from experience gathered during pilot operations

[5]. UK Biobank integrated many different subsystems to work together. To

ensure access to a broad range of third party data sets it was essential that UK

Biobank meets the needs of other relevant groups (e.g. Patient Information Advisory

Group). Many external requirements had to be taken into consideration

to fulfil that needs [13]. Figure 4 shows a system overview of the UK Biobank

and its most important components.

The recruitment system is responsible to process patient invitation data received

from the National Health Service. The received data has to be cleaned

and passed to the Participant Booking System. The Booking System securely

transfers appointment data (name, date of birth, gender, address, ...) to the

Fig. 4. System architecture showing the most important system components of the

UK Biobank [13]


170 J. Eder et al.

Assessment Data Collection System. The Assessment Center also handles the

informed consent of each participant. The task of LIMS is to store identifiers

for all received samples without any participant identifying data such as name,

address, etc. The UK Biobank also provides interfaces to clinical and non-clinical

external data repositories. The Core Data Repository containing different data

repositories forms the basis for several different Data Warehouses. These Data

Warehouses provide all parameters needed to generate appropriate datasets for

answering validated research requests. Also disclosure control which prevents

patient de-identification is performed on these Data Warehouses. The research

community is able to post requests with the help of a User Portal which is positioned

right on the top of the Data Warehouses. Additional Query Tools allow

investigating the Data Warehouses as well as the Core Data Repository [13].

caBIG. The cancer Biomedical Informatics Grid (caBIG) has been initiated by

the National Cancer Institute (NCI) as a national-scale effort in order to develop

a federation of interoperable research information systems. The approach

to reach federated interoperability is a grid middleware infrastructure, called ca-

Grid. It is designed as a service-oriented architecture. Resources are exposed to

the environment as grid services with well-defined interfaces. Interaction between

services and clients is supported by grid communication and service invocation

protocols. The caGrid infrastructure consists of data, analytical and coordination

services which are required by clients and services for grid-wide functions.

According to [40] coordination services include services for metadata management,

advertisement and discovery, query and security. A key characteristic of

the framework is its focus on metadata and model driven service development

and deployment. This aspect of caGrid is particularly important for the support

of syntactic and semantic interoperability across heterogeneous collections

of applications. For more information see [40,3].

CRIP. The concept of CRIP (Central Research Infrastructure for molecular

Pathology) enables biobanks to annotate projects with additional necessary data

and to transfer them into valuable research resources. CRIP was started at the beginning of 2006 by the departments of Pathology of the Charité and the Medical University of Graz (MUG) [41]. CRIP offers virtual simultaneous access to tissue collections of participating pathology archives. Annotated valuable data comes from different heterogeneous data sources and is stored in a central CRIP

database. Academics and researchers with access rights are allowed to search for

interesting material. Workflows and data transfers of CRIP projects are regulated

in a special contract between CRIP partners and Fraunhofer IBMT.

4 Federation of Biobanks

Currently established national biobanks and biomolecular resources are a unique European strength, but these valuable collections typically suffer from the fragmentation of the European biobanking-related research community. This hampers the collation of



biological samples and data from different biobanks required to achieve sufficient

statistical power. Moreover, it results in duplication of effort and jeopardises sustainability

due to the lack of long-term funding. To overcome the issues stated

above a federation of biobanks can be used to provide access to comprehensive

data and sample sets, thus achieving results with better statistical power. Furthermore, it becomes possible to investigate rare and highly diverse diseases and to avoid the high costs caused by duplicate analysis of the same material or data. To benefit

European health-care, medical research, and ultimately, the health of the citizens

of the European Union the European Commission is funding a biobank integration

project called BBMRI (Biobanking and Biomolecular Resources Infrastructure).

4.1 Biobanking and Biomolecular Resources Infrastructure

The aim of BBMRI is to build a coordinated, large scale European infrastructure

of biomedically relevant, quality-assessed mostly already collected samples

as well as different types of biomolecular resources (antibody and affinity binder

collections, full ORF clone collections, siRNA libraries). In addition to biological

materials and related data, BBMRI will facilitate access to detailed and

internationally standardised data sets of sample donors (clinical data, lifestyle

and environmental exposure) as well as data generated by analysis of samples

using standardised analysis platforms. A large number of platforms (such as

high-throughput sequencing, genotyping, gene expression profiling technologies,

proteomics and metabolomics platforms, tissue microarray technology etc.) will

be available through BBMRI infrastructure [2].

Benefits. The benefits of BBMRI are manifold. In the short term, BBMRI leads to an increased quality of research as well as to a reduction of costs. The mid-term impact of BBMRI can be seen in an increased efficacy of drug discovery and development.

Long-term benefits of BBMRI are improved health care

possibilities in the area of personalized medicine/health care [11].

Data Harmonisation and IT-infrastructure. An important part of BBMRI

is responsible for designing the IT-infrastructure and database harmonisation,

which includes also solutions for data and process standardization. The harmonization

of data deals with the identification of the scope of needed information

and data structures. Furthermore, it analyses how available nomenclature and coding systems can be used for storing and retrieving (heterogeneous) biobank information. Several controlled terminologies and coding systems may be used for organizing the information about biobanks [11,35]. Since not all medical information is fully available in the local databases of biobanks, the retrieval of data poses major challenges. This implies the necessity of flexible data sharing

and collaboration between centers.

4.2 Enquiries in a Federation of Biobanks

Within an IT-infrastructure for federated biobanks authorized researchers should

have the possibility to search and obtain required material and data from all



participating biobanks, as needed e.g. to perform biomedical studies. Furthermore,

it should be possible to even update or link data from already performed

studies in the European federation. In the following we distinguish between five

different kinds of use cases:

1. Identification of biobanks. Retrieves a list with contact data of participating biobanks which have the desired material for a certain study.
2. Identification of cases. Retrieves the pseudonym identifiers of cases (in our context, a case is a set of jointly harvested samples of one donor) stored in biobanks which correspond to a given set of parameters.

3. Retrieval of data. Obtains available information (material, data, etc.) directly

from a biobank for a given set of parameters.

4. Upload or linking of data. Connecting samples with data generated from this

sample internally and externally.

5. Statistical queries. Performs analytical queries on a set of biobanks.

Extracting, categorizing and standardizing largely semi-structured records is a laborious task requiring medical domain knowledge and strong quality control. Therefore, automated retrieval of data remains a major challenge. Furthermore, to enable the retrieval, upload and linking of data, a lot of research, harmonization

and integration has to be done. For the moment we assume that researchers

use the contact information and pseudonym identifiers to retrieve data from a

biobank. An important issue within the upload and linking of data is the question of how the data has been generated. The generation of new data must follow a standardized procedure with preferably uniform tools, ontologies, etc. In this context, data quality as well as data provenance also play an important role. In section 3 we discussed integration issues within one biobank; now we are concerned with the integration of a set of heterogeneous biobanks. There exist several different proposals for the handling of enquiries within the BBMRI project. Our approach for enquiries is to ascertain where the desired material or data is located; subsequently, the researchers can get in contact with that biorepository themselves. This approach comprises the first two use cases mentioned above.

Workflow for Enquiries. In figure 5 we have modeled a possible workflow

for the identification of biobanks and cases, separated into different responsibility

parts. The most important participants within this workflow are the

requestor (researcher), the requestor's BBMRI host, other BBMRI hosts and

biobanks. Hosts act as global coordinators within the federation. The registration

of biobanks on BBMRI hosts takes place via a hub and spoke structure.

In the first step of our workflow an authenticated researcher chooses a service

request from a list of available services. Since a request on material or medical

data can have different conditions, a suggestion is to provide a list of possible

request templates like:

– Biobanks with diseased samples (cancer)

– Biobanks with diseased samples (metabolic)


– Cases with behavioral progression of a specific kind of tumor

– Cases with commonalities of two or more tumors

– ...

Fig. 5. Workflow for identification of biobanks and cases, separated into different responsibility parts

After the selection of an appropriate service request the researcher can declare service-specific filter criteria to constrain the result according to his or her needs. Additionally, the researcher is able to specify a level of importance for each filter criterion. This level of importance is an integer between 1 and 5, with 1 denoting the lowest and 5 the highest relevance. Without any specification the importance of a filter criterion is treated as the default value 3 (relevant). The level of importance

has direct effects on the output of the query. It is used for three major purposes:

– Specifying must-have values for the result. If the requestor defines the highest

level of importance for an attribute the query only returns databases that

match exactly.

– Specifying nice-to-have values for the result. This feature relaxes query formulations

in order to incorporate the aspect of semi-structured data.

– Ranking the result to show the best matches at the topmost position. The

ranking algorithm takes the resulting data of the query invocation process

and sorts the output according to the predefined levels of importance.
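To make the effect of the importance levels concrete, the following minimal Python sketch shows how filter criteria with importance 1-5 could be evaluated: level 5 acts as a hard constraint, lower levels contribute to a relevance score used for ranking. All function, class and attribute names here are our own illustrative assumptions, not part of any BBMRI specification.

# Minimal sketch: must-have filtering and importance-based ranking of biobanks.
from dataclasses import dataclass

@dataclass
class Criterion:
    attribute: str       # e.g. "diagnosis"
    value: str           # e.g. an ICD-10 code (illustrative)
    importance: int = 3  # 1 (lowest) .. 5 (highest), default 3

def score_biobank(biobank: dict, criteria: list[Criterion]):
    """Return None if a must-have criterion (importance 5) is violated,
    otherwise the sum of importance levels of all matching criteria."""
    score = 0
    for c in criteria:
        matches = biobank.get(c.attribute) == c.value
        if c.importance == 5 and not matches:
            return None            # must-have value missing: drop this biobank
        if matches:
            score += c.importance  # nice-to-have matches only raise the rank
    return score

def rank_biobanks(biobanks: list[dict], criteria: list[Criterion]):
    scored = [(score_biobank(b, criteria), b) for b in biobanks]
    kept = [(s, b) for s, b in scored if s is not None]
    return [b for s, b in sorted(kept, key=lambda x: x[0], reverse=True)]

# Example: diagnosis is a must-have, tissue type is nice-to-have.
criteria = [Criterion("diagnosis", "C22.0", 5), Criterion("material", "paraffin", 3)]
banks = [{"name": "BB-x", "diagnosis": "C22.0", "material": "cryo"},
         {"name": "BB-y", "diagnosis": "C50.8", "material": "paraffin"}]
print([b["name"] for b in rank_biobanks(banks, criteria)])  # ['BB-x']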

Researchers formulate their requests using query by example. BBMRI then appears to the researcher as one single system, performing query processing and disclosure of information from the participating biobanks transparently. Accordingly, the formulated query of the researcher is sent to the requestor's national

host as xml-document (see xml-document below). Afterwards the national host

(1) distributes the query to the other participating hosts in the federation using

disclosure information, (2) queries its own meta data repository as well as the

local databases from all registered biobanks, (3) applies a disclosure filter and (4)

ranks the result. Each invoked BBMRI-Host in the federation performs the same

procedure as the national host, but without distributing the incoming query. All

distributed ranked query results are sent back to the requestor's host and are merged there. Depending on how the policy is specified, the researcher gets a list

of biobanks or individual cases of biobanks as the final result for the enquiry.

Afterwards the researcher can get in contact with the desired biobanks. In case

of an insufficient result set the researcher has the opportunity to constrain the

result set or even to refine the query.
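A compact Python sketch of this host-side procedure is given below; it only mirrors the four steps named above (distribute, query the host's own resources, apply the disclosure filter, rank). Every class, field and function name is an illustrative assumption rather than a BBMRI interface.

# Minimal sketch of the host-side steps (1)-(4) for handling an incoming enquiry.
class Host:
    def __init__(self, name, local_results, peers=None):
        self.name = name
        self.local_results = local_results   # stand-in for meta-db + local biobank data
        self.peers = peers or []             # other BBMRI hosts in the federation

    def handle_enquiry(self, query, distribute=True):
        results = []
        if distribute:                       # (1) only the requestor's national host distributes
            for peer in self.peers:
                results += peer.handle_enquiry(query, distribute=False)
        results += [r for r in self.local_results           # (2) query own repositories
                    if query["diagnosis"] == r["diagnosis"]]
        results = [r for r in results if r["disclosable"]]   # (3) disclosure filter
        return sorted(results, key=lambda r: r["match_score"], reverse=True)  # (4) ranking

# Toy federation: a national host with one peer host.
peer = Host("host-b", [{"biobank": "BB-y", "diagnosis": "C50.8",
                        "disclosable": True, "match_score": 7}])
national = Host("host-a", [{"biobank": "BB-x", "diagnosis": "C50.8",
                            "disclosable": False, "match_score": 9}], peers=[peer])
print(national.handle_enquiry({"diagnosis": "C50.8"}))   # only BB-y survives the filter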

In the following we discuss different scenarios for enquiries in a federation of

biobanks. The scenarios differ in the

– kind of data accessed (only accessing the host’s meta-database or additionally

accessing the local databases from registered biobanks)

– information contained in the result (list of biobanks and their contact data or

individual anonymized cases from biobanks).

We assume that the meta-database stored on each host within the federation

only contains the uploaded schema information of the registered biobanks, in

order to avoid enormous data redundancies and rigidity in the system.

Scenario 1 - The Identification of Biobanks. Within the identification of

biobanks, enquiries from authorized researchers are answered only by searching the meta-databases of the federated hosts. Unfortunately, this may lead to rather coarse queries. An idea is to provide a small set of attributes in the meta-database with the opportunity to specify a certain content, given that this content is an enumeration. A good candidate, for example, is the attribute "diagnosis", standardized as ICD code (ICD-9, ICD-10, ICD-O-3), because it may be very useful to know which biobank(s) store information about specific diseases. Another good candidate may be "Sex" with the values "Female", "Male", "Unknown", etc. An important consideration within enquiries for the identification of biobanks is supporting the possibility to specify an order of magnitude for the desired material.

Example enquiry 1: A researcher wants to know which biobanks store 10 samples

with the following characteristics:

– male patients

– paraffin tissue

– with diagnosis liver cancer
– including follow-up data (e.g. therapy description) from the oncology department

XML Output for Example Enquiry 1 after definition of Filter Criteria
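As a purely hypothetical illustration of such an enquiry document (all element and attribute names below are our own assumptions, not taken from any BBMRI schema), the filter criteria and importance levels of example enquiry 1 could be encoded roughly as follows:

<enquiry template="identification-of-biobanks" requestor="researcher-42">
  <minimumNumberOfSamples>10</minimumNumberOfSamples>
  <filterCriteria>
    <criterion attribute="PatientSex" value="male" importance="3"/>
    <criterion attribute="Material" value="paraffin tissue" importance="4"/>
    <criterion attribute="Diagnosis" coding="ICD-10" value="C22" importance="5"/>
    <criterion attribute="FollowUpData" value="oncology" importance="3"/>
  </filterCriteria>
</enquiry>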






















Scenario 2 - The Identification of Cases. Since the meta-database located

on the hosts does not store any case or donor related information it is necessary

to additionally query the local databases. Take note that no information



of the local databases will be sent to the requestor except unique identifiers of

the appropriate cases. The querying of the local databases allows for a more detailed search and therefore yields a more precise result list. A special case within

the identification of cases is determined by a slight variance in the result set.

Depending on a policy the result set can also contain only a list of biobanks with

their contact information as discussed in scenario 1.

Example enquiry 2: A researcher requires the ID of about 20 cases and their

location (biobank) with the following characteristics:

– paraffin tissue

– with diagnosis breast cancer
– staging T1 N2 M0
– from donors of age 40-50 years
– including follow-up data (e.g. therapy) from the oncology department

There are two types of relationships between the samples: donor-related and

case-related relationships. Donor-related means that two or more samples have

been taken from the same donor. However, the samples may have been taken in

various medical contexts (different diseases, surgeries, etc.). In contrast, samples

are case-related when the associated diagnoses belong to the same disease.

4.3 Data Sharing and Collaboration between Different Biobanks

A desirable goal in our discussions was to build an IT-infrastructure which provides the ability to easily adapt to different research needs. With regard to this, we designed a feasible environment for the collaboration between different biobanks within BBMRI as a hybrid of a peer-to-peer and a hub-and-spoke structure. In our approach a BBMRI-Host (figure 6) represents a domain hub

in the IT-infrastructure and uses a meta structure to provide data sharing. Several

domain hubs are connected via a peer-to-peer structure and communicate

with each other via standardized and shared Communication Adapters. Each participating European biobank is connected with its specific domain hub, i.e. its BBMRI-Host, via a hub-and-spoke structure.

Biobanks provide their obtainable attributes and contents as well as their

contact data and biobank specific information via the BBMRI Upload Service

of the associated BBMRI-Host. A Mediator coordinates the interoperability issues

between BBMRI-Host and the associated biobanks. The information about

uploaded data from each associated biobank is stored in the BBMRI Content-

Meta-Structure. Permissions related to the uploaded data as well as contracts

between a BBMRI-Host and a specific biobank are managed by the Disclosure

Filter. A researcher can use the BBMRI Query Service for sending requests to

the federated system. The BBMRI Query Service is the entry point for such

requests. The BBMRI Query Service can be accessed via the local workbenches

of connected biobanks as well as via the BBMRI Scientific Workbenches.


4.4 Data Model


Our proposed data models for the BBMRI Content-Meta-Structure (in figure 6)

have the ability to hold a sufficient (complete) set of needed information structures.

The idea is that obtainable attributes (schema information) from local databases of biobanks can be mapped to the BBMRI Content-Meta-Structure in order to provide a federated knowledge base for life-science research. To avoid excessive data volumes we designed a kind of lower-bound schema that contains attributes usually occurring in most or even all of the participating biobanks.

Fig. 6. Architecture of IT-infrastructure for BBMRI

Our approach was to accomplish a hybrid solution of a federated system and an additional data warehouse acting as a kind of index, primarily to reduce the query overhead. This decision led to the design of a class (named ContentInformation, figure 7) which contains attributes with different meanings, similar to online analytical processing (OLAP), including:

– Content-attributes. A small set of attributes which provide information about their content in the local database (cf. 4.2). All content-attributes must

be provided by each participating biobank. The type of attributes stored

in the meta-dataset with content information must be an enumeration like

ICD-Code, patient sex or BMI-category.

– Number of Cases (NoC). An order of magnitude for all available cases of a specific disease in combination with all content-attributes.

– Existence-attributes accept two different kinds of characteristics:

1. Value as quantity. This kind of existence-attribute tells how many occurrences of a given attribute are available in a local database for a

specific instance tuple of all content-attributes. The knowledge is represented

by a numeric value greater than a defined k-value for an aggregated

set of cases. Conditions on this kind of existence-attributes are

OR-connected because they are independent from each other. They do

not give information on their inter-relationship.

2. Value as availability. With this kind of existence-attribute, the storage of values does not take place in an aggregated form as mentioned above, but as a bitmap (0/1), where 0 means not available and 1 available. This has the

consequence that each row in the relation contains one specific case.

Due to this fact AND-connected conditions on existence attributes can

be answered.
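A minimal sketch of how such a record could be represented, and of the difference between OR-connected quantity attributes and AND-connected availability bitmaps, is given below. The class-free dictionaries and all field names are our own illustrative assumptions, not the BBMRI data model.

# Sketch of the two kinds of existence-attributes attached to a tuple of content-attributes.

# Aggregated form ("value as quantity"): one row per content-attribute tuple.
aggregated = {
    ("C50.8", "female", "BMI 25-30"): {          # content-attributes (enumerations)
        "NoC": 120,                               # number of cases for this tuple
        "FollowUpData": 80,                       # cases with follow-up data
        "TherapyDescription": 95,                 # cases with a therapy description
    }
}
# Only OR-style questions can be answered here, e.g. "are there cases with
# follow-up data?" (yes, 80), but not "how many cases have BOTH attributes".

# Bitmap form ("value as availability"): one row per individual case.
per_case = [
    {"content": ("C50.8", "female", "BMI 25-30"), "FollowUpData": 1, "TherapyDescription": 1},
    {"content": ("C50.8", "female", "BMI 25-30"), "FollowUpData": 1, "TherapyDescription": 0},
]
# AND-connected conditions become possible:
both = sum(1 for c in per_case if c["FollowUpData"] and c["TherapyDescription"])
print(both)  # 1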

We compared two different approaches for the data model of the BBMRI Content-

Meta-Structure, a static and a dynamic one.

The Static Approach. In comparison to an OLAP data cube, our class ContentInformation (see figure 7) acts as the fact table with the content-attributes

as dimensions and existence-attributes (including the Number of Cases) as measures.

We call this approach static because it enforces a data model which includes

a common set of attributes on which all participating biobanks have to

agree. Biobanks have the opportunity to register their schema information via

an attribute catalogue and provide their content information as shown in figure 7.

Fig. 7. Example data of class ContentInformation as static approach

Within the static approach, the splitting of the set of obtainable attributes from the participating biobanks into content-attributes and existence-attributes makes data analysis more performant because queries do not get too complex. Also OLAP operations like roll-up and drill-down are enabled. A serious handicap is the missing flexibility to store information of biobanks that work in different areas; in this case all biobanks must use the same content information.

The Dynamic Approach. The previously stated data model for the BBMRI Content-Meta-Structure works on a statically defined set of attributes coupled as ContentInformation. Thus, all participating biobanks have the same ContentInformation. However, a biobank focused on metabolic diseases does not necessarily need or have a TNM classification, and conversely a cancer-focused biobank does not necessarily need or have metabolism-specific attributes. For this reason, this data model relies on a dynamic generation of the ContentInformation (figure 9). That is, each biobank first declares the attributes it stores in its local database. In particular, it declares which of them are content-attributes and which of them are existence-attributes. However, this can affect requests on material, therefore one must be careful with the declaration.

– Requests on existence of attributes. For a request on the existence of several

attributes it does not matter whether the requested attributes are declared

as content-attribute or existence-attribute. The only requirement for a query hit is that the searched attributes are declared by a biobank.

– Requests on content of attributes. For a request on the content of several

attributes, all requested attributes must be declared as content-attribute by

a biobank in order to get a query hit. For example, a request for female patients

who suffer from C50.8 (breast cancer) would not get a query-hit from BB-y

(figure 8) because the attribute PatientSex is declared as existence-attribute

and thus has no information about its content.

Fig. 8. Explicit declaration of attributes in the dynamic approach



Fig. 9. Dynamic generation for content information of BBMRI content meta structure

With the dynamic data model it is possible to support different kinds of content

information depending on the needs of the biobanks. Moreover, once a new attribute is introduced, this does not lead to changes in the database schema.
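The query-hit rules above can be made concrete with a small Python sketch. The attribute declarations for two biobanks are illustrative assumptions that loosely follow the spirit of figure 8; none of the names are taken from the actual implementation.

# Sketch of the dynamic approach: each biobank declares which attributes it
# stores and whether they are content- or existence-attributes.
declarations = {
    "BB-x": {"Diagnosis": "content", "PatientSex": "content", "FollowUpData": "existence"},
    "BB-y": {"Diagnosis": "content", "PatientSex": "existence"},
}

def existence_hit(biobank, attributes):
    # A request on the existence of attributes only needs them to be declared.
    return all(a in declarations[biobank] for a in attributes)

def content_hit(biobank, attributes):
    # A request on the content of attributes needs them declared as content-attributes.
    return all(declarations[biobank].get(a) == "content" for a in attributes)

# "Female patients with C50.8": a content request on Diagnosis and PatientSex.
print(content_hit("BB-x", ["Diagnosis", "PatientSex"]))   # True
print(content_hit("BB-y", ["Diagnosis", "PatientSex"]))   # False: PatientSex is existence-only
print(existence_hit("BB-y", ["PatientSex"]))              # True: declared, even if existence-only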

In the following table 1 we compare the static data model with the dynamic

data model.

Table 1. Comparison between static and dynamic approach

Approach   Flexibility in maintenance   Simplicity in query   Anonymity issues
static     -                            +                     +
dynamic    +                            -                     +

4.5 Disclosure Filter

The disclosure filter is a software component that helps the BBMRI-Hosts to

answer the following question: Who is allowed to receive what from whom under

which circumstances? For example, since it is planned to provide information exchange across national borders, the disclosure filter has to ensure that no data (physical or electronic) leaves the country illegally. The disclosure filter takes into account laws, contracts between participants, policies of participants and even rulings (e.g. by courts, ethics boards, ...).

The disclosure filter on a BBMRI-Host plays three different roles:

1. Provider host and local biobank remove items from query answers that are

not supposed to be seen by requestors.



2. Technical Optimization: Query system optimizes query processing using disclosure

information.

3. Requestor host removes providers which do not provide sufficient information

to the requestor. This role can be switched on / off.
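As a minimal sketch of the first role, a disclosure filter can be thought of as a function over (requestor, provider, item, context) that drops non-disclosable items from a query answer. The rule structure, the attributes and the sample data below are our own assumptions, not the actual BBMRI policy model.

# Sketch: "who is allowed to receive what from whom under which circumstances?"
RULES = [
    # (condition over requestor/provider/item/context, allow?)
    (lambda req, prov, item, ctx: ctx["cross_border"] and item["physical"], False),
    (lambda req, prov, item, ctx: prov["contract_with"] == req["host"],      True),
]

def disclose(requestor, provider, item, context):
    for condition, allow in RULES:                 # first matching rule decides
        if condition(requestor, provider, item, context):
            return allow
    return False                                   # default: do not disclose

def filter_answer(requestor, provider, items, context):
    return [i for i in items if disclose(requestor, provider, i, context)]

requestor = {"host": "AT"}
provider = {"contract_with": "AT"}
items = [{"id": "case-1", "physical": False}, {"id": "case-2", "physical": True}]
print(filter_answer(requestor, provider, items, {"cross_border": True}))
# only case-1 remains: in this toy policy, physical material must not cross the border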

The disclosure filter plays a central role in the workflow (figure 5) for use cases 1

and 2 as well as in the architecture (figure 6). Depending on the role of the disclosure

filter the location within the workflow can change. During requirements

analysis the possibility to switch the disclosure filter off on demand turned out

to be an important feature. With the help of this feature it is possible to operate

a more relaxed system.

5 Working with Biobanks

In this chapter, we want to point out some application areas of IT infrastructures

in the context of biobanks. We mainly focus on support of medical research,

since there is a strong demand for assisting, documenting and interconnecting

research activities. Additionally, these activities are tightly coupled with the data

management of a biobank, providing research results for samples and thereby

enhancing the scientific value of samples.

5.1 Computer Supported Cooperative Work (CSCW) System for Medical Research

Medical research is a collaborative process in an interdisciplinary environment

that may be effectively supported by a CSCW system. Such a system has to meet specific requirements in order to allow flexible integration of data, analysis

services and communication mechanisms. Persons with different expertise and

access rights cooperate in mutually influencing contexts (e.g. clinical studies,

research cooperations). Thus, appropriate virtual environments are needed to

facilitate context-aware communication, deployment of biomedical tools as well

as data and knowledge sharing. In cooperation with the University of Paderborn

we were able to leverage a CSCW system, that covers our demands, on the flexible

service-oriented architecture Wasabi, a reimplementation of Open sTeam

(www.open-steam.org) which is widely used in research projects to share data

and knowledge and cooperate in virtual knowledge spaces. We use Wasabi as

a middleware integrating distributed data sources and biomedical services. We

systematically elaborated the main requirements of a medical CSCW system

and designed a conceptual model, as well as an architectural proposal satisfying

our demands [42,45]. Finally we implemented a virtual workbench to support

collaboration activities in medical research and routine work. This workbench

had to fulfill several important requirements, in particular:

– R(1) User and Role Management. The CSCW has to be able to cope

with the organisational structure of the institutes and research groups of the

hospital. Data protection directives have to fit in the access right model of



the system. At the same time, the model has to be flexible enough to allow the creation of new research teams and information sharing across organisational borders.

– R(2) Transparency of physical Storage. Although data may be stored

in distributed locations, data retrieval and data storage should be solely dependent on access rights, irrespective of the physical location. That is,

the complexity of data structures is hidden from the end user. The CSCW

system has to offer appropriate search, join and transformation mechanisms.

– R(3) Flexible Data Presentation. Since data is accessed by persons having

different scientific background (biological, medical, technical expertise)

in order to support a variety of research and routine activities, flexible capabilities

to contextualise data are required. Collaborative groups should be

able to create on-demand views and perspectives, annotate and change data

in their contexts without interfering with other contexts.

– R(4) Flexible Integration and Composition of Services. A multitude of

data processing and data analysis tools exist in the biomedical context. Some

tools act as complementary parts in a chain of processing steps. For example, to

detect genes correlated with a disease, gene expression profiles are created by

measuring and quantifying gene activities. The resulting gene expression ratios

are normalised and candidate genes are preselected. Finally, significance

analysis is applied to identify relevant genes [49]. Each function may be provided

by a separate tool - for example by GeneSpring® and Genesis® [9,47]. In

some cases tools provide equal functionality and may be chosen as alternatives.

Through flexible integration of tools as services with standardised input and output interfaces, a dynamic composition of tools may be accomplished (a minimal sketch of such a composition is given after this list). From

the systems perspective services are technology neutral, loosely coupled and

support location transparency [36]. The execution of services is not limited

to specific operating systems and service callers do not know the internal structure of a service. Further, services may be physically distributed over departments and institutes, e.g. image scanning and processing is executed in a dedicated laboratory where the gene expression slides reside.

– R(5) Support of cooperative Functions. In order to support collaborative

work suitable mechanisms have to be supplied. One of the main aspects

is the common data annotation. Thus, data is augmented and shared within

a group and new content is created cooperatively. Here, Web 2.0 technologies like wikis and blogs provide a flexible framework for facilitating

intra- and inter-group activities.

– R(6) Data-coupled Communication Mechanisms. Cooperative working is tightly coupled with extensive information exchange. Appropriate communication

mechanisms are useful to coordinate project activities, organise

meetings and enable topic-related discussions. On the one hand, a seamless

integration of email exchange, instant messaging and VoIP tools facilitates

communication activities. We propose to reuse the organisational data defined

in R(1) within the communication tools. On the other hand, persons

should be able to include data objects in their communication acts. For example, images of diseased tissues may be diagnosed cooperatively, where marking and annotating image sections supports the decision-making process.



– R(7) Knowledge Creation and Knowledge Processing. Cooperative

medical activities frequently comprise the creation of new knowledge. Data

sources are linked with each other, similarities and differences are detected,

and involved factors are identified. Consider a set of genes that is assumed

to be strongly correlated with the genesis of a specific cancer subtype. If the

hypothesis is verified the information may be reused in subsequent research.

Thus, methods to formalise knowledge, share it in arbitrary contexts and

deduce new knowledge are required.
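As announced under R(4), the following minimal Python sketch illustrates how alternative analysis tools with standardised input and output interfaces could be composed into a processing chain. The registry, the function names and the toy transformations are illustrative assumptions only, not the Wasabi interfaces or real normalisation methods.

# Sketch of flexible service composition (cf. R(4)): tools with the same
# input/output signature are interchangeable steps of a processing chain.
def normalise_variant_a(matrix):    return [[round(v / max(row), 2) for v in row] for row in matrix]
def normalise_variant_b(matrix):    return [[v - min(row) for v in row] for row in matrix]
def preselect(matrix):              return [row for row in matrix if max(row) > 0.5]
def significance_analysis(matrix):  return {"candidate_rows": len(matrix)}

# Registry of alternative, functionally equivalent services per step.
REGISTRY = {"normalise": {"a": normalise_variant_a, "b": normalise_variant_b},
            "preselect": {"default": preselect},
            "analyse":   {"default": significance_analysis}}

def run_chain(matrix, choices):
    for step, variant in choices:                    # dynamic composition of tools
        matrix = REGISTRY[step][variant](matrix)
    return matrix

profile = [[1.0, 0.2, 0.4], [0.1, 0.05, 0.02]]
print(run_chain(profile, [("normalise", "a"), ("preselect", "default"),
                          ("analyse", "default")]))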

In the following Figure 10, the concept of virtual knowledge spaces is illustrated.

The main idea is to contextualize documents, data, services, annotations in virtual

knowledge spaces and make them accessible for cooperating individuals.

Rooms may be linked to each other or nested in each other. Communication is

tightly coupled to the shared resources, enabling discussion, quality control and

collaborative annotations.

Fig. 10. Wasabi virtual room

5.2 Workflow for Gene Expression Analysis for the Breast Cancer

Project

A detailed breast cancer data set was annotated at the Pathology Graz. In

this context much emphasis is put on detecting deviations in the behaviour

of gene groups. We support the entire analysis workflow by supplying an IT research platform allowing researchers to select and group patients arbitrarily, to preprocess and link the related gene expressions and finally to apply state-of-the-art analysis algorithms. We developed an appropriate database structure with import/export methods allowing arbitrary medical data sets and gene expressions to be managed.

We also implemented web service interfaces to various gene expression analysis

algorithms.

Currently, we are able to support the following steps of the research workflow (figure 11):

Fig. 11. Workflow for the support of gene expression analysis

(1) Case Selection: In the first step relevant medical cases are selected. The set of available breast cancer cases with associated cryo tissue samples is selected by querying the SampleDB. Since only cases with follow-up data from the oncology department are included in the project, those cases are filtered. Further filter criteria are:

metastasis and therapeutic documentation. Case selection is a composed activity,

as two separate databases of two different institutes (pathology and oncology)

are accessed, filtered and joined. After selecting the breast cancer cases, gene

expression profiles may be created.

Output description: Set of medical cases.

Output type: File or list of appropriate medical cases identified by unique keys

like patient ID.

(2) Normalization of Gene Expression Profiles: A set of GPR files is defined as the input source and the preferred normalisation method is applied. We use the

normalisation methods offered by the bioconductor library of the R-project. The

result of the normalisation is stored for further processing.

Output description: Normalised gene expression matrix

Output type: The result of the normalisation is a matrix where rows correspond

to genes and columns to medical cases. The matrix may be stored as file or table.

(3) Gene Annotation: In order to link genes with other resources, unique gene

identifiers are required (e.g. Ensembl Gene ID, RefSeq). Therefore, we integrated

mapping data supplied by the Operon chip producer.

Output description: Annotated gene expression matrix

Output type: Gene expression matrix with chosen gene identifiers. The matrix

may be stored as file or table.



(4) Link Gene Ontologies: We use gene ontologies (www.geneontology.org) in

order to map single genes to functional groups. Therefore, we imported the most

recent gene ontologies into our database. As an alternative, we also plan to integrate

pathway data from the KEGG database (www.genome.jp/kegg/) to allow

grouping of genes into functional groups.

Output description: Mapping from gene groups (gene ontologies, KEGG functional

groups) to single genes.

Output type: List of mappings, where in each mapping a group is mapped to a

list of single genes.

(5) Link annotated Patient Data: Each biological sample corresponds to a medical case and has to be linkable to the gene expression matrix. A file containing medical parameters is imported, and these parameters may be used to define groups of interest for the analysis.

Output description: A table storing all medical parameters for all cases.

Output type: A database table is created allowing to link medical parameters to

cases of the annotated gene expression matrix.

(6) Group Samples: A hypothesis is formulated by defining groups of medical

cases that are compared in the analysis. The subsequent analysis tries to detect

significant differences in gene groups between the medical groups.

Output description: The medical cases are grouped according to the chosen medical

parameters.

Output type: A list of mappings, where each mapping assigns a unique case identifier to a group identifier.

(7) Analysis: We implemented web service interfaces to the Bioconductor packages

'Global Test' [25] and 'GlobalAncova' [29]. We use the selected medical

parameters for sample grouping and the GO categories for gene grouping together

with the gene expression matrix as input parameters for both algorithms.

After the analysis is finished the results are written into an analysis database

and exported as Excel files. We also plan to integrate an additional analysis tool

called Matisse from Tel Aviv University [50].

Output description: A list of significant gene groups. The number of returned

gene groups may be customized. For instance, only the top 10 significant gene

groups are returned.

Output type: A list of significant gene groups, together with a textual description

of the group and its p-value.

(8) Plotting: Gene plots may be created, visualizing the influence of single genes

in a significant gene group. The plots are created using bioconductor libraries

which are encapsulated in a web service.

Output description: Gene plots of significant gene groups.

Output type: PNG image files that may be downloaded and saved into an analysis database.
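Taken together, steps (1)-(8) form a linear pipeline in which the output of each step is the input of the next. The sketch below chains toy Python stand-ins for these steps; the actual implementation uses the databases, R/Bioconductor packages and web service interfaces described above, so every function here is an illustrative assumption.

# Toy end-to-end sketch of the analysis workflow (steps 1-8).
def select_cases():               return ["case-1", "case-2"]
def normalise(cases):             return {c: [0.8, 1.2, 0.3] for c in cases}
def annotate(matrix):             return {"geneA": 0, "geneB": 1, "geneC": 2}
def gene_groups():                return {"GO:0006915": ["geneA", "geneC"]}
def link_patient_data(cases):     return {c: {"grade": "G2"} for c in cases}
def group_samples(patient_data):  return {c: d["grade"] for c, d in patient_data.items()}
def analyse(matrix, gene_index, groups, sample_groups):
    return [{"group": g, "p_value": 0.01} for g in groups]
def plot(results):                return [f"{r['group']}.png" for r in results]

cases = select_cases()                                   # (1) case selection
matrix = normalise(cases)                                # (2) normalised expression matrix
gene_index = annotate(matrix)                            # (3) gene annotation
groups = gene_groups()                                   # (4) GO / KEGG gene groups
samples = group_samples(link_patient_data(cases))        # (5)+(6) link and group patient data
results = analyse(matrix, gene_index, groups, samples)   # (7) significance analysis stand-in
print(plot(results))                                     # (8) gene plots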

We are able to show that a service-oriented CSCW system provides the functionality

to build a workbench for medical research supporting the collaboration



of researchers, allowing the definition of workflows and gathering all necessary

data for maintaining provenance information.

6 Data Privacy and Anonymization

When releasing patient-specific data (e.g. in medical research cooperations) privacy

protection has to be guaranteed for ethical and legal reasons. Even when

immediately identifying attributes like name, address or day of birth are eliminated,

other attributes (quasi-identifying attributes) may be used to link the

released data with external data to re-identify individuals. In recent research

much effort has been put on privacy preserving and anonymization methods. In

this context, k-anonymity [48] was introduced allowing to protect sensitive data

by generating a sufficient number of k data twins. These data twins prevent that

sensitive data is linkable to individuals.

K-anonymity may be accomplished by:

– transforming attribute values to more general values - nominal and categorical

attributes may be transformed by taxonomy trees or user-defined

generalization hierarchies

– mapping numerical attributes to intervals (for instance, age 45 may be transformed

to age interval 40-50)

– replacing a value with a less specific but semantically consistent value (e.g.

replace numeric (continuous) data for blood pressure with categorical data

like ’high’ blood pressure)

– combining several attributes, making them coarser grained (e.g. replace height and weight with BMI)

– fragmentation of the attribute vector

– data blocking (i.e. by replacing certain attributes of some data items with a

null value)

– dropping a sample from the selection set

– dropping an attribute
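A minimal sketch of two of these operations, interval generalization of a numeric attribute and a k-anonymity check over the quasi-identifying attributes, is given below. The data set and the chosen k are illustrative assumptions.

# Sketch: generalize quasi-identifiers and check k-anonymity (here k = 3).
from collections import Counter

def generalize_age(age, width=10):
    low = (age // width) * width
    return f"{low}-{low + width}"        # e.g. 45 -> "40-50"

def is_k_anonymous(records, quasi_ids, k):
    # every combination of quasi-identifier values must occur at least k times
    combos = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return all(count >= k for count in combos.values())

raw = [{"age": 45, "sex": "f", "diagnosis": "C50.8"},
       {"age": 47, "sex": "f", "diagnosis": "C50.8"},
       {"age": 43, "sex": "f", "diagnosis": "C50.8"}]

released = [{**r, "age": generalize_age(r["age"])} for r in raw]
print(is_k_anonymous(raw, ["age", "sex"], k=3))       # False: exact ages are unique
print(is_k_anonymous(released, ["age", "sex"], k=3))  # True: all fall into "40-50", "f"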

For a given data set, several k-anonymous anonymizations may be created depending

on how attributes are generalized. Transformations of attribute values

are always accompanied by an information loss, which may be used as a quality criterion for an anonymization. That is, an optimal anonymization may be defined as the k-anonymous anonymization with the minimal information loss. Information, the value of information and the significance of information loss are in the eye of the beholder, i.e. they depend on the requirements of the intended analysis.

Only the purpose can tell which of the generalizations is more suited and gives

more accurate results. Therefore, we developed a tool called Open anonymizer

(see https://sourceforge.net/projects/openanonymizer) which is based on

individual attribution of information loss. We implemented the anonymization

algorithm as a Java web application that may be deployed on a web application

server and accessed by a web browser. Open anonymizer is a highly customizable

anonymization tool providing the best anonymization for a certain



context. The anonymization process is strongly influenced by data quality requirements

of users. We allow users to specify the importance of attributes as

well as transformation limits for attributes. These parameters are considered in

the anonymization process, which delivers a solution that is guaranteed to fulfil

the user requirements and has a minimal information loss. Open anonymizer

provides a wizard-based, intuitive user interface which guides the user through

the anonymization process. Instead of anonymizing the entire data set of a data

repository, a simple query interface allows relevant subsets of data to be extracted for anonymization. For instance, in a biomedical context, diagnoses of a certain carcinoma type may be selected, anonymized and released without considering the rest of the diagnoses.
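A minimal sketch of how user-specified attribute importance and transformation limits could enter the choice among candidate anonymizations is shown below; the weighting scheme and all values are our own illustrative assumptions and not the actual Open anonymizer algorithm.

# Sketch: among k-anonymous candidates, pick the generalization with the lowest
# importance-weighted information loss, respecting per-attribute limits.
candidates = [
    # generalization level per attribute (0 = unchanged) and resulting loss per attribute
    {"levels": {"age": 1, "diagnosis": 0}, "loss": {"age": 0.2, "diagnosis": 0.0}},
    {"levels": {"age": 0, "diagnosis": 2}, "loss": {"age": 0.0, "diagnosis": 0.6}},
]
importance = {"age": 1.0, "diagnosis": 3.0}   # higher = user cares more about precision
limits = {"age": 2, "diagnosis": 1}           # maximal allowed generalization level

def weighted_loss(candidate):
    return sum(importance[a] * l for a, l in candidate["loss"].items())

feasible = [c for c in candidates
            if all(c["levels"][a] <= limits[a] for a in c["levels"])]
best = min(feasible, key=weighted_loss)
print(best["levels"], weighted_loss(best))    # keeps diagnosis exact, generalizes age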

7 Conclusion

Biobanks are challenging application areas for advanced information technology.

The foremost challenges for the information system support in a network of

biobanks as envisioned in the BBMRI project are the following:

– Partiality: A biobank is intended to be one node in a federation (cooperative

network) of biobanks. It needs the descriptive capabilities to be useful

for other nodes in the network and it needs the capability to make use of

other biobanks. This requires careful design of metadata about the contents of the biobank and the acceptance and interoperability of heterogeneous partner

resources. On the other hand a biobank will rely on data generated and

maintained in other systems (other centres, hospital information systems,

etc.).

– Auditability: Managing the provenance of data will be essential for advanced

biomedical studies. Documenting the origins and the quality of data and

specimens, documenting the sources used for studies and the methods and

tools and results of studies is essential for the reproducibility of results.

– Longevity: A biobank is intended to be a long lasting research infrastructure

and thus many changes will occur during its lifetime: new diagnostic codes,

new therapies, new analytical methods, new legal regulations, and new IT

standards. The biobank needs to be ready to incorporate such changes and

to be able to make best use of already collected data in spite of such changes.

– Confidentiality: A biobank stores or links to patient related data. Personal

data and genomic data are considered highly sensitive in many countries.

The IT-infrastructure must on the one hand provide means to protect the

confidentiality of protected data and on the other enable the best possible

use of data for studies respecting confidentiality constraints.

We presented biobanks and discussed the requirements for biobank information

systems. We have shown that many different research areas within the Databases

and Information Systems field contribute to this endeavor. We were only able to show some examples: advanced information modeling, (semantic) interoperability,

federated databases, approximate query answering, result ranking, computer

supported cooperative work (CSCW), and security and privacy. Some well



known solutions from different application areas have to be revisited given the

size, heterogeneity, diversity, dynamics, and complexity of data to be organized

in biobanks.

References

1. Biobankcentral, http://www.biobankcentral.org

2. Biobanking and biomolecular resources research infrastructure (bbmri),

http://www.bbmri.eu

3. Cabig - cancer biomedical informatics grid, https://cabig.nci.nih.gov

4. Geneontology, http://www.geneontology.org

5. Uk-biobank, http://www.ukbiobank.ac.uk

6. Who: International statistical classification of diseases and related health problems.

10th revision version for 2007 (2007)

7. Who: International classification of diseases for oncology, 3rd edn., icd-o-3 (2000)

8. Organisation for economic cooperation and development: Biological resource centres:

Underpinning the future of life sciences and biotechnology (2001)

9. Genespring: Cutting-edge tools for expression analysis (2005),

http://www.silicongenetics.com

10. Nih guide: Informed consent in research involving human participants (2006)

11. Bbmri: Construction of new infrastructures - preparatory phase. INFRA–2007–

2.2.1.16: European Bio-Banking and Biomolecular Resources (April 2007)

12. Organisation for economic cooperation and development. best practice guidelines

for biological resource centres (2007)

13. Uk biobank: Protocol for a large-scale prospective epidemiological resource. Protocol

No: UKBB-PROT-09-06 (March 2007)

14. Altintas, I., Barney, O., Jaeger-Frank, E.: Provenance collection support in the

kepler scientific workflow system, pp. 118–132 (2006)

15. Ambrosone, C.B., Nesline, M.K., Davis, W.: Establishing a cancer center data bank

and biorepository for multidisciplinary research. Cancer epidemiology, biomarkers

& prevention: a publication of the American Association for Cancer Research,

cosponsored by the American Society of Preventive Oncology 15(9), 1575–1577

(2006)

16. Asslaber, M., Abuja, P., Stark, K., Eder, J., Gottweis, H., Trauner, M., Samonigg,

H., Mischinger, H., Schippinger, W., Berghold, A., Denk, H., Zatloukal, K.: The

genome austria tissue bank (gatib). Pathobiology 2007 74, 251–258 (2007)

17. Asslaber, M., Zatloukal, K.: Biobanks: transnational, european and global networks.

Briefings in functional genomics & proteomics 6(3), 193–201 (2007)

18. Cambon-Thomsen, A.: The social and ethical issues of post-genomic human

biobanks. Nat. Rev. Genet. 5(11), 866–873 (2004)

19. Chamoni, P., Stock, S.: Temporal structures in data warehousing. In: Mohania, M.,

Tjoa, A.M. (eds.) DaWaK 1999. LNCS, vol. 1676, pp. 353–358. Springer, Heidelberg

(1999)

20. Eder, J., Dabringer, C., Schicho, M., Stark, K.: Data management for federated

biobanks. In: Proc. DEXA 2009 (2009)

21. Eder, J., Koncilia, C.: Changes of dimension data in temporal data warehouses.

In: Kambayashi, Y., Winiwarter, W., Arikawa, M. (eds.) DaWaK 2001. LNCS,

vol. 2114, pp. 284–293. Springer, Heidelberg (2001)



22. Eder, J., Koncilia, C., Morzy, T.: The comet metamodel for temporal data warehouses.

In: Pidduck, A.B., Mylopoulos, J., Woo, C.C., Ozsu, M.T. (eds.) CAiSE

2002. LNCS, vol. 2348, pp. 83–99. Springer, Heidelberg (2002)

23. Elliott, P., Peakman, T.C.: The uk biobank sample handling and storage protocol

for the collection, processing and archiving of human blood and urine. International

Journal of Epidemiology 37(2), 234–244 (2008)

24. Foster, I., Vöckler, J., Wilde, M., Zhao, Y.: Chimera: A virtual data system for

representing, querying, and automating data derivation. In: Proceedings of the 14th

Conference on Scientific and Statistical Database Management, pp. 37–46 (2002)

25. Goeman, J.J., van de Geer, S.A., de Kort, F., van Houwelingen, H.C.: A global

test for groups of genes: testing association with a clinical outcome. Bioinformatics

20(1), 93–99 (2004)

26. Goos, G., Hartmanis, J., Sripada, S., Leeuwen, J.V., Jajodia, S.: Temporal

Databases: Research and Practice. Springer, New York (1998)

27. Gottweis, H., Zatloukal, K.: Biobank governance: Trends and perspectives. Pathobiology

2007 74, 206–211 (2007)

28. Hewitt, S.: Design, construction, and use of tissue microarrays. Protein Arrays:

Methods and Protocols 264, 61–72 (2004)

29. Hummel, M., Meister, R., Mansmann, U.: Globalancova: exploration and assessment

of gene group effects. Bioinformatics 24(1), 78–85 (2008)

30. Isabelle, M., Teodorovic, I., Morente, M., Jaminé, D., Passioukov, A., Lejeune, S.,

Therasse, P., Dinjens, W., Oosterhuis, J., Lam, K., Oomen, M., Spatz, A., Ratcliffe,

C., Knox, K., Mager, R., Kerr, D., Pezzella, F.: Tubafrost 5: multifunctional central

database application for a european tumor bank. Eur. J. Cancer 42(18), 3103–3109

(2006)

31. Kim, S.: Development of a human biorepository information system at the university

of kentucky markey cancer center. In: International Conference on BioMedical

Engineering and Informatics, vol. 1, pp. 621–625 (2008)

32. Litwin, W., Mark, L., Roussopoulos, N.: Interoperability of multiple autonomous

databases. ACM Comput. Surv. 22(3), 267–293 (1990)

33. Louie, B., Mork, P., Martin-Sanchez, F., Halevy, A., Tarczy-Hornoch, P.: Methodological

review: Data integration and genomic medicine. J. of Biomedical Informatics

40(1), 5–16 (2007)

34. Missier, P., Belhajjame, K., Zhao, J., Roos, M., Goble, C.: Data lineage model for

taverna workflows with lightweight annotation requirements. In: Freire, J., Koop,

D., Moreau, L. (eds.) IPAW 2008. LNCS, vol. 5272, pp. 17–30. Springer, Heidelberg

(2008)

35. Muilu, J., Peltonen, L., Litton, J.: The federated database - a basis for biobank-based

post-genome studies, integrating phenome and genome data from 600 000

twin pairs in europe. European Journal of Human Genetics 15, 718–723 (2007)

36. Papazoglou, M.P.: Service-oriented computing: concepts, characteristics and directions.

In: Proceedings of the Fourth International Conference on Web Information

Systems Engineering, WISE 2003, pp. 3–12 (2003)

37. Ram, S., Liu, J.: A semiotics framework for analyzing data provenance research.

Journal of computing Science and Engineering 2(3), 221–248 (2008)

38. Rebulla, P., Lecchi, L., Giovanelli, S., Butti, B., Salvaterra, E.: Biobanking in the

year 2007. Transfusion Medicine and Hemotherapy 34, 286–292 (2007)

39. Riegman, P., Morente, M., Betsou, F., de Blasio, P., Geary, P.: Biobanking for better

healthcare. In: The Marble Arch International Working Group on Biobanking

for Biomedical Research (2008)



40. Saltz, J., Oster, S., Hastings, S., Langella, S., Kurc, T., Sanchez, W., Kher, M.,

Manisundaram, A., Shanbhag, K., Covitz, P.: Cagrid: design and implementation

of the core architecture of the cancer biomedical informatics grid. Bioinformatics

22(15), 1910–1916 (2006)

41. Schroeder, C.: Vernetzte gewebesammlungen fuer die forschung crip. Laborwelt 5,

26–27 (2007)

42. Schulte, J., Hampel, T., Stark, K., Eder, J., Schikuta, E.: Towards the next generation

of service-oriented flexible collaborative systems – a basic framework applied

to medical research. In: Cordeiro, J., Filipe, J. (eds.) ICEIS 2008 - Proceedings

of the Tenth International Conference on Enterprise Information Systems, number

978-989-8111-36-4, Barcelona, Spain, June 2008, pp. 232–239 (2008)

43. Sheth, A.P., Larson, J.A.: Federated database systems for managing distributed,

heterogeneous, and autonomous databases. ACM Comput. Surv. 22(3), 183–236

(1990)

44. Simmhan, Y.L., Plale, B., Gannon, D.: A framework for collecting provenance in

data-centric scientific workflows. In: ICWS 2006: Proceedings of the IEEE International

Conference on Web Services, Washington, DC, USA, pp. 427–436. IEEE

Computer Society, Los Alamitos (2006)

45. Stark, K., Schulte, J., Hampel, T., Schikuta, E., Zatloukal, K., Eder, J.: GATiB-

CSCW, medical research supported by a service-oriented collaborative system. In:

Bellahsène, Z., Léonard, M. (eds.) CAiSE 2008. LNCS, vol. 5074, pp. 148–162.

Springer, Heidelberg (2008)

46. Stevens, R., Zhao, J., Goble, C.: Using provenance to manage knowledge of in silico

experiments. Briefings in bioinformatics 8(3), 183–194 (2007)

47. Sturn, A., Quackenbush, J., Trajanoski, Z.: Genesis: cluster analysis of microarray

data. Bioinformatics 18(1), 207–208 (2002)

48. Sweeney, L., Samarati, P.: Protecting privacy when disclosing information: k-anonymity

and its enforcement through generalization and suppression. In: Proceedings

of the IEEE Symposium on Research in Security and Privacy (1998)

49. Tusher, V.G., Tibshirani, R., Chu, G.: Significance analysis of microarrays applied

to the ionizing radiation response. Proc. Natl. Acad. Sci. U S A 98(9), 5116–5121

(2001)

50. Ulitsky, I., Shamir, R.: Identification of functional modules using network topology

and high-throughput data. BMC Systems Biology 1(1) (2007)

51. Yang, J.: Temporal Data Warehousing. Stanford University (2001)

52. Zatloukal, K., Yuille, M.: Information on the proposal for european research infrastructure.

In: European Bio-Banking and Biomolecular Resources (2007)


Exploring Trust, Security and Privacy in Digital Business

Simone Fischer-Hübner 1, Steven Furnell 2, and Costas Lambrinoudakis 3 (authors are listed in alphabetical order)

1 Department of Computer Science

Karlstad University,

Karlstad, Sweden

simone.fischer-huebner@kau.se

2 School of Computing & Mathematics,

University of Plymouth,

Plymouth, United Kingdom

sfurnell@plymouth.ac.uk

3 Department of Information and Communication Systems Engineering,

University of the Aegean,

Samos, Greece

clam@aegean.gr

Abstract. Security and privacy are widely held to be fundamental requirements

for establishing trust in digital business. This paper examines the relationship between these factors, and the different strategies that may be needed in order to

provide an adequate foundation for users’ trust. The discussion begins by recognising

that users often lack confidence that sufficient security and privacy

safeguards can be delivered from a technology perspective, and therefore require

more than a simple assurance that they are protected. One contribution in

this respect is the provision of a Trust Evaluation Function, which supports the

user in reaching more informed decisions about the safeguards provided in different

contexts. Even then, however, some users will not be satisfied with technology-based

assurances, and the paper consequently considers the extent to

which risk mitigation can be offered via routes, such as insurance. The discussion

concludes by highlighting a series of further open issues that also require

attention in order for trust to be more firmly and widely established.

Keywords: Trust, Security, Privacy, Digital Business.

1 Introduction

The evolution in the way information and communication systems are currently utilised

and the widespread use of web-based digital services drives the transformation of

modern communities into modern information societies. Nowadays, personal data are

available or/and can be collected at different sites around the world. Even though the

utilisation of personal information leads to several advantages, including improved

customer services, increased revenues and lower business costs, it can be misused in

several ways and may lead to violation of privacy. For instance, in the framework of


e-commerce, several organisations develop new methods for collecting and processing personal data in order to identify the preferences of their customers and adapt their products accordingly. Modern data mining techniques can then be utilised in order

to further process the collected data, generating databases of the consumers’ profiles

through which each person’s preferences can be uniquely identified.

Therefore, such information can be utilised for invading users' privacy and thereby violating European Union Directive 95/46 on the protection of individuals

with regard to the processing of personal and sensitive data. In order to avoid confusion,

it is important to stress the difference between privacy and security; a piece of

information is secure when its content is protected, whereas it is private when the

identity of its owner is protected. It is true that, irrespective of the application domain (e.g. e-commerce, e-health, etc.), the major reservation of users about using the Internet is due to the lack of privacy rather than to cost, difficulties in using the service or

undesirable marketing messages. Considering that conventional security mechanisms,

like encryption, cannot ensure privacy protection (encryption for instance, can only

protect the message’s confidentiality), new Privacy-Enhancing Technologies (PETs)

have been developed. However, the sole use of technological countermeasures is not

enough. For instance, even if a company that collects personal data stores them in an

ultra-secure facility, the company may at any point in time decide to sell or otherwise

disseminate the data, thus violating the privacy of the individuals involved. Therefore

security and privacy are intricately related.

Privacy, as an expression of human dignity, is considered a core value in democratic

societies and is recognized either explicitly or implicitly as a fundamental

human right by most constitutions of democratic societies. Today, in many legal systems,

privacy is in fact defined as the right to informational self-determination, i.e. the

right of individuals to determine for themselves when, how, to what extent and for

what purposes information about them is communicated to others. For reinforcing

their right to informational self-determination, users need technical tools that allow

them to manage their (partial) identities and to control what personal data about them

is revealed to others under which conditions. Identity Management (IDM) can be defined

to subsume all functionality that supports the use of multiple identities, by the

identity owners (user-side IDM) and by those parties with whom the owners interact

(services-side IDM). According to Pfitzmann and Hansen, identity management

means managing various partial identities (i.e. sets of attributes, usually denoted by

pseudonyms) of a person, i.e. administration of identity attributes including the development

and choice of the partial identity and pseudonym to be (re-)used in a specific

context or role (Pfitzmann and Hansen 2008). Privacy-enhancing identity management technology enforcing the legal privacy principles of data minimisation, purpose binding and transparency has been developed within the EU FP6 project PRIME 1

(Privacy and Identity Management for Europe) and the EU FP7 project PrimeLife 2

(Privacy and Identity Management for Life). Trust has been playing an important role

in PRIME and PrimeLife, because users do not only need to trust their own platforms

(i.e. the user-side IDM) to manage their data accordingly, but also need to trust that the services sides process their data in a privacy-friendly and secure manner and according to the business agreements with the users.

1 https://www.prime-project.eu/

2 http://www.primelife.eu/



In considering the forms of protection that are needed, it is important to recognise

that user actions will often be based upon their perceptions of risk, which may not

always align very precisely with the reality of the situation. For example, they may under- or over-estimate the extent of the threats facing them, or be under- or over-assured by the presence of technical safeguards. Some people simply need to be told that a service is secure in order to use it with confidence. Meanwhile,

others will only be reassured by seeing an abundance of explicit safeguards in

use. As such, if trust is to be established, the security and privacy measures need to be

provided in accordance with what users expect to see and are comfortable to use in a

given context.

Furthermore, how much each person values her privacy is a subjective issue. When

a bank uses the credit history of a client without her consent in order to issue a pre-signed credit card, it is subjective whether or not the client will feel upset about it and press charges for breach of the personal data protection act. Providing a way to model this subjective nature of privacy would be extremely useful for organisations, in the sense that they would be able to estimate the financial losses that they may experience after a potential privacy violation incident. This would allow them to reach cost-effective decisions in terms of the money that they invest for security and

privacy protection reasons.

This paper examines the relationship between these factors, and the different strategies that may be needed in order to provide an adequate foundation for users' trust. It

has been recognised that users often lack confidence that sufficient security and privacy

safeguards can be delivered from a technology perspective, and therefore require

more than a simple assurance that they are protected. In this respect, Section 2 first

investigates social trust factors for establishing reliable end user trust and then presents

a Trust Evaluation Function, which utilises these trust factors and supports the

user in reaching more informed decisions about the trustworthiness of online services.

Even then, however, some users will not be satisfied with technology-based assurances.

As a consequence Section 3 considers the extent to which risk mitigation can

be offered via routes, such as insurance. The discussion concludes with Section 4 that

highlights a series of further open issues that also require attention in order for trust to

be more firmly and widely established.

2 Trust in Online Services

2.1 Users’ Perception of Security and Privacy and Lack of Trust

“Trust is important because if a person is to use a system to its full potential, be it an

e-commerce site or a computer program, it is essential for her to trust the system”

(Johnston et al. 2004).

For establishing trust, a significant issue will be the user’s perception of security

and privacy within a given context. Indeed, the way users feel about a given site or

service is very likely to influence their ultimate decision about whether or not to use

it. While some may be fully reassured by the presence of security technology, others

may be more interested in other facts, such as the mitigation and restitution available

to them in the event of breaches.



Research conducted in the UK as part of the Trustguide project has investigated

citizens’ trust in online services and their resultant views on the risks present in this