EN100-web
Special theme: Scientific Data Sharing and Re-use
Beyond Data: Process Sharing and Reuse
by Tomasz Miksa and Andreas Rauber
Sharing and reuse of data is just an intermediate step on the way to reproducible computational science. The next step, sharing and reuse of processes that transform data, is enabled by process management plans, which benefit multiple stakeholders at all stages of research.
Nowadays almost every research domain depends on data that is accessed and processed using computers. Data processing may range from simple calculations made in a spreadsheet editor to distributed processes that transform data using dedicated software and hardware tools. The crucial focus is on the data, because it underlies new scientific breakthroughs and in many cases cannot be reproduced, e.g. climate data. Funding institutions have recognized the value of data, and as a result data management plans (DMPs) have become obligatory for many scientists who receive public funding. Data management plans are initiated before the start of a project and evolve during its course. They aim to ensure not only that the data is managed properly during the project, e.g. by performing backups and using file naming conventions, but also that it is preserved and available in the future.
In order to understand data, as well as research results, data acquisition and manipulation processes must also be curated. Unfortunately, the underlying processes are not included in DMPs. As a consequence, information needed to document, verify, preserve or re-execute the experiment is lost. For this reason, we extend DMPs to “process management plans” (PMPs) [1], which complement the description of scientific data with a process-centric view that treats data as the result of underlying processes such as capture, (pre-)processing, transformation, integration and analysis. A PMP is a living document, which is created at the time of process design and is maintained and updated during the lifetime of the experiment by various stakeholders. Its structure, as outlined below, necessarily appears relatively abstract because of the wide range of requirements and the different practices of the scientific domains in which PMPs should be used. The proposed structure of a PMP is depicted in Figure 1.
Figure 1: Structure of a Process Management Plan.<br />
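To make the process-centric view more concrete, the following is a minimal sketch of how data could be recorded as the output of process steps, so that the steps behind any result can be traced. It is not the PMP structure proposed by the authors: the class names `ProcessStep` and `ProcessPlan`, their fields, the `lineage` method and the file names are all hypothetical and chosen only for illustration. The sketch assumes that steps are added in execution order.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date


@dataclass
class ProcessStep:
    """One step of a research process (hypothetical schema, not from the article)."""
    name: str           # e.g. "capture", "pre-processing", "analysis"
    inputs: list[str]   # identifiers of the data the step consumes
    outputs: list[str]  # identifiers of the data the step produces


@dataclass
class ProcessPlan:
    """A living record of a process: its steps plus a log of revisions."""
    title: str
    steps: list[ProcessStep] = field(default_factory=list)
    revisions: list[tuple[date, str, str]] = field(default_factory=list)

    def add_step(self, step: ProcessStep, author: str, when: date) -> None:
        # Log every change so the history of the plan stays traceable.
        self.steps.append(step)
        self.revisions.append((when, author, f"added step '{step.name}'"))

    def lineage(self, data_id: str) -> list[str]:
        """Return the names of the steps that led to `data_id`, in execution order."""
        needed = {data_id}
        chain: list[str] = []
        # Walk backwards and collect each step that produced data we depend on.
        for step in reversed(self.steps):
            if needed & set(step.outputs):
                chain.append(step.name)
                needed |= set(step.inputs)
        return list(reversed(chain))


# Illustrative usage with made-up step and file names.
plan = ProcessPlan("example study")
plan.add_step(ProcessStep("capture", [], ["raw.csv"]), "alice", date(2015, 1, 5))
plan.add_step(ProcessStep("pre-processing", ["raw.csv"], ["clean.csv"]), "alice", date(2015, 1, 6))
plan.add_step(ProcessStep("analysis", ["clean.csv"], ["results.csv"]), "bob", date(2015, 1, 8))
print(plan.lineage("results.csv"))
```

In this sketch, asking for the lineage of `results.csv` lists the capture, pre-processing and analysis steps in that order, which is the kind of information that is lost when only the data itself is described.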
The implementation of PMPs is to some extent domain dependent, because it has to incorporate already existing best practices. During the course of the EU-funded FP7 project TIMBUS, our team at SBA Research in Vienna investigated both well-structured Taverna workflows and unstructured processes from the civil-engineering domain. In all cases, the PMPs can be implemented by integrating already existing tools. For example, the automatically generated TIMBUS Context Model [2] can be used to describe the software and hardware dependencies. The process of verification and validation can be performed using the VFramework.
Research Objects [3] can be used to aggregate the high-level information on the process, and the existing data management plan templates and tools can be refined to incorporate information on processes.
Figure 2: Stakeholders impacted by Process Management Plan.
Figure 2 depicts stakeholders impacted by PMPs. Project applicants will benefit by being able to better identify and plan the resources needed for the research. For example, if the process of transforming the experimental data assumes use of proprietary software with an expensive license, this information will be revealed at an early stage and can be taken into account when applying for a grant.
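The licensing example can be sketched in a few lines: once the software dependencies of a process are written down, the proprietary ones and their cost over the project can be listed before the grant application is submitted. This is only an illustration under stated assumptions. The `SoftwareDependency` fields, the `licensing_costs` function, the tool names and the cost figures are hypothetical and do not come from the article or from any TIMBUS tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoftwareDependency:
    """A tool a process step relies on (hypothetical fields for illustration)."""
    name: str
    proprietary: bool
    annual_license_cost: float  # in the grant's currency; 0.0 for free tools


def licensing_costs(dependencies, project_years):
    """Return the names of paid proprietary tools and their total cost over the project."""
    paid = [d for d in dependencies if d.proprietary and d.annual_license_cost > 0]
    total = sum(d.annual_license_cost for d in paid) * project_years
    return [d.name for d in paid], total


# Illustrative usage with made-up tools and prices.
dependencies = [
    SoftwareDependency("spreadsheet editor", proprietary=False, annual_license_cost=0.0),
    SoftwareDependency("simulation suite", proprietary=True, annual_license_cost=1200.0),
]
names, total = licensing_costs(dependencies, project_years=3)
print(names, total)
```

With the made-up figures above, the only paid tool is the simulation suite, and three project years add up to a license budget of 3600.0.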
ERCIM NEWS 100 January 2015