
Special theme: Scientific Data Sharing and Re-use

Beyond Data: Process Sharing and Reuse

by Tomasz Miksa and Andreas Rauber

Sharing and reuse of data is just an intermediate step on the way to reproducible computational science. The next step, sharing and reuse of processes that transform data, is enabled by process management plans, which benefit multiple stakeholders at all stages of research.

Nowadays almost every research domain depends on data that is accessed and processed using computers. Data processing may range from simple calculations made in a spreadsheet editor to distributed processes that transform data using dedicated software and hardware tools. The crucial focus is on the data, because it underlies new scientific breakthroughs and in many cases is irreproducible, as with climate data. Funding institutions have recognized the value of data, and as a result data management plans (DMPs) have become obligatory for many scientists who receive public funding. Data management plans are initiated before the start of a project and evolve during its course. They aim to ensure not only that the data is managed properly during the project, e.g. by performing backups and using file naming conventions, but also that it is preserved and remains available in the future.

In order to understand data, as well as research results, data acquisition and manipulation processes must also be curated. Unfortunately, the underlying processes are not included in DMPs. As a consequence, information needed to document, verify, preserve or re-execute the experiment is lost. For this reason, we extend DMPs to “process management plans” (PMPs) [1], which complement the description of scientific data with a process-centric view that treats data as the result of underlying processes such as capture, (pre-)processing, transformation, integration and analysis. A PMP is a living document, which is created at the time of process design and is maintained and updated during the lifetime of the experiment by various stakeholders. Its structure, as outlined below, necessarily appears relatively abstract because of the wide range of requirements and the different practices in the scientific domains in which PMPs should be used. The proposed structure of a PMP is depicted in Figure 1.

Figure 1: Structure of a Process Management Plan.

The implementation of PMPs is to some extent domain dependent, because it has to incorporate already existing best practices. During the course of the EU-funded FP7 project TIMBUS, our team at SBA Research in Vienna investigated well-structured Taverna workflows, but also unstructured processes from the civil-engineering domain. In all cases, the PMPs can be implemented by integrating already existing tools. For example, the automatically generated TIMBUS Context Model [2] can be used to describe the software and hardware dependencies. The process of verification and validation can be performed using the VFramework. Research Objects [3] can be used to aggregate the high-level information on the process, and the existing data management plan templates and tools can be refined to incorporate information on processes.

Figure 2: Stakeholders impacted by Process Management Plan.

Figure 2 depicts stakeholders impacted by PMPs. Project applicants will benefit by being able to better identify and plan the resources needed for the research. For example, if the process of transforming the experimental data assumes the use of proprietary software with an expensive license, this information will be revealed at an early stage and can be taken into account when applying for a grant.
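The licensing example can be made concrete with a small sketch. The following Python fragment is not the PMP format or any TIMBUS tool; it assumes a hypothetical, much simplified record of process steps and their software dependencies, and shows how proprietary licenses could be surfaced while a grant application is being prepared. All class names, field names, tool names and costs are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SoftwareDependency:
    """One software dependency of a process step (hypothetical schema)."""
    name: str
    license_type: str              # assumed values: "open-source" or "proprietary"
    annual_cost_eur: float = 0.0   # assumed cost field, not part of any real PMP template


@dataclass
class ProcessStep:
    """A single documented step of a process, e.g. capture or transformation."""
    name: str
    dependencies: List[SoftwareDependency] = field(default_factory=list)


def proprietary_costs(steps: List[ProcessStep]) -> List[Tuple[str, str, float]]:
    """Return (step name, dependency name, cost) for every proprietary dependency."""
    return [
        (step.name, dep.name, dep.annual_cost_eur)
        for step in steps
        for dep in step.dependencies
        if dep.license_type == "proprietary"
    ]


# Illustrative entries only; the tools and the cost are made up.
steps = [
    ProcessStep("capture", [SoftwareDependency("sensor-logger", "open-source")]),
    ProcessStep("transformation", [SoftwareDependency("StatSuite", "proprietary", 2500.0)]),
]

flagged = proprietary_costs(steps)
total = sum(cost for _, _, cost in flagged)
print(flagged, total)
```

Under these assumptions, only the transformation step is flagged, and its license cost can be carried into the budget section of the application. In practice the dependency information would come from whatever tooling a domain already uses rather than from hand-written records.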


ERCIM NEWS 100 January 2015
