27.03.2014 Views

SEKE 2012 Proceedings - Knowledge Systems Institute

SEKE 2012 Proceedings - Knowledge Systems Institute

SEKE 2012 Proceedings - Knowledge Systems Institute

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

SciprovMiner: Provenance Capture using the OPM<br />

Model<br />

Tatiane O. M. Alves, Wander Gaspar, Regina M. M.<br />

Braga, Fernanda Campos<br />

Master's Program in Computer Science<br />

Department of Computer Science<br />

Federal University of Juiz de Fora, Brazil<br />

{tatiane.ornelas, regina.braga,<br />

fernanda.campos}@ufjf.edu.br, andergaspar@gmail.com<br />

Abstract—Provide historical scientific information to deal with<br />

loss of knowledge about scientific experiment has been the focus<br />

of several researches. However, the computational support for<br />

scientific experiment on a large scale is still incipient and is<br />

considered one of the challenges set by the Brazilian Computer<br />

Society for the period 2006 to 2016. This work aims to contribute<br />

in this area, presenting the SciProvMiner architecture, which<br />

main objective is to collect prospective and retrospective<br />

provenance of scientific experiments as well as apply data mining<br />

techniques in the collected provenance data to enable the<br />

discovery of unknown patterns in these data.<br />

Keywords-web services; ontology; provenance; OPM; OPMO;<br />

data mining.<br />

I. INTRODUCTION<br />

The large-scale computing has been widely used as a<br />

methodology for scientific research: there are several success<br />

stories in many domains, including physics, bioinformatics,<br />

engineering and geosciences [1].<br />

Although scientific knowledge continue to be generated in<br />

a traditional way, in-vivo and in-vitro, in recent decades<br />

scientific experiments began to use computational procedures<br />

to simulate their own execution environments, giving origin to<br />

in-virtuo scientific experimentation. Moreover, even the<br />

objects and participants in an experiment can be simulated,<br />

resulting in in-silico experimentation category. These largescale<br />

computations, which support a scientific process, are<br />

generally referred to as e-Science [1].<br />

The work presented in this article, SciProvMiner, is part of<br />

a project developed in Federal University of Juiz de Fora<br />

(UFJF), Brazil, with emphasis on studies to building an<br />

infrastructure to support e-Science projects and has as its main<br />

goal the specification of an architecture for the collection and<br />

management of provenance data and processes in the context<br />

of scientific experiments processed through computer<br />

simulations in collaborative research environments<br />

geographically dispersed and interconnected through a<br />

computational grid. The proposed architecture must provides<br />

an interoperability layer that can interact with SWfMS<br />

(Scientific Workflows Management <strong>Systems</strong>) and aims to<br />

capture retrospective and prospective provenance information<br />

generated from scientific workflows and provides a layer that<br />

Marco Antonio Machado, Wagner Arbex<br />

National Center for Research in Dairy Cattle<br />

Brazilian Enterprise for Agricultural Research -<br />

EMBRAPA<br />

Juiz de Fora, Brazil<br />

{arbex, machado}@cnpgl.embrapa.br<br />

allows the use of data mining techniques to discover important<br />

patterns in provenance data.<br />

The rest of the paper is structured as follows: In Section 2<br />

we present the theoretical foundations that underpin this work.<br />

Section 3 describes some related work. Section 4 presents the<br />

contributions proposed in this work, the architecture of<br />

SciprovMiner is detailed, showing each of its layers and<br />

features. Finally, section 5 presents final considerations and<br />

future work.<br />

II. CONCEPTUAL BACKGROUND<br />

Considering scientific experimentation, data provenance<br />

can be defined as information that helps to determine the<br />

historical derivation of a data product with regards to its<br />

origins. In this context, data provenance is being considered an<br />

essential component to allow reproducibility of results, sharing<br />

and reuse of knowledge by the scientific community [2].<br />

Considering the necessity of providing interoperability in<br />

order to exchange provenance data, there is an initiative to<br />

provide a standard to facilitate this interoperability. The Open<br />

Provenance Model (OPM) [3] defines a generic representation<br />

for provenance data. The OPM assumes that the object (digital<br />

or not) provenance can be represented by an annotated<br />

causality graph, which is a directed acyclic graph, enriched<br />

with notes captured from other information related to the<br />

execution [3].<br />

In the OPM model, provenance graphs are composed of<br />

three types of nodes [3]:<br />

• Artifacts: represent an immutable state data, which<br />

may have a physical existence, or have a digital<br />

representation in a computer system.<br />

• Processes: represent actions performed or caused by<br />

artifacts, and results in new artifacts.<br />

• Agents: represent entities acting as a catalyst of a<br />

process, enabling, facilitating, controlling, or affecting<br />

its performance.<br />

Figure 1 adapted from [3], illustrates these three entities<br />

and their possible relationships, also called dependencies. In<br />

Figure 1, the first and second edge in the right column express<br />

that a process used an artifact and that an artifact was<br />

generated by a process. These two relationships represent data<br />

491

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!