Integration of Data and Publications - Alliance for Permanent Access


REPORT ON INTEGRATION OF DATA AND PUBLICATIONS

October 17th, 2011

Susan Reilly a,*, Wouter Schallier a, Sabine Schrimpf b, Eefke Smit c, Max Wilkinson d

a LIBER – Association of European Research Libraries, Koninklijke Bibliotheek, National Library of the Netherlands, PO Box 90407, 2509 LK The Hague, The Netherlands

b Deutsche Nationalbibliothek Informationstechnik, Adickesallee 1, D-60322 Frankfurt am Main, Germany

c The International Association of STM Publishers, Prama House, 267 Banbury Road, Oxford OX2 7HT, United Kingdom

d The British Library, 96 Euston Road, London NW1 2DB, United Kingdom

* Corresponding author: Susan.Reilly@KB.nl

Abstract<br />

Scholarly communication is the foundation of modern research, where empirical evidence is interpreted and communicated as published, hypothesis-driven research. Many current and recent reports highlight the impact of advancing technology on modern research and the consequences this has for scholarly communication. As part of the ODE project, this report sought to coalesce current thought and opinions from numerous and diverse sources to reveal opportunities for supporting a more connected and integrated scholarly record. Four perspectives were considered: those of the Researcher who generates or reuses primary data, Publishers who provide the mechanisms to communicate research activities, and Libraries and Data Centres who maintain and preserve the evidence that underpins scholarly communication and the published record. This report finds the landscape fragmented and complex, where competing interests can sometimes confuse and confound requirements, needs and expectations. Equally, the report identifies clear opportunities for all stakeholders to directly enable a more joined-up and vital scholarly record of modern research.

This work is licensed under a Creative Commons Attribution 3.0 Unported License


Report on Integration of Data and Publications (Grant Agreement no.: 261530)

TABLE OF CONTENTS

EXECUTIVE SUMMARY

0. INTRODUCTION

1. INTEGRATION OF DATA AND PUBLICATIONS: GENERAL
1.1. INTRODUCTION AND SUMMARY
1.2. CURRENT PRACTICE – SOME NUMBERS RELATING RESEARCH DATA WITH PUBLICATIONS
1.3. WHY THE RELATIONSHIP BETWEEN DATA AND PUBLICATIONS IS SO IMPORTANT
1.4. THE DATA PUBLICATION PYRAMID

2. RESEARCHER PERSPECTIVE OF DATA/PUBLISHING INTEGRATION
2.1. RESEARCHERS ARE THE SOURCE OF DATA
2.2. WHAT IS THE CURRENT PRACTICE
2.3. CONCLUSIONS ON CURRENT PRACTICE
2.4. IS THERE A NEED OR CASE FOR CHANGE?
2. How to control data(?) sharing and access
4. How to get credit
5. Who pays for what
2.5. OPPORTUNITIES IN DATA EXCHANGE RELATING TO RESEARCHERS

3. INTEGRATION OF DATA AND PUBLICATIONS: THE PUBLISHERS’ PERSPECTIVE
3.1. HOW SCHOLARLY JOURNALS HANDLE THE INCREASING AMOUNT OF DATA ALONGSIDE THE ARTICLE
3.2. COMMON PRACTICE: SUPPLEMENTARY MATERIAL TO JOURNAL ARTICLES
3.3. NEW LIMITS ON SUPPLEMENTAL FILES TO JOURNAL ARTICLES: RESTRICTIONS TO SUPPLEMENTS
3.4. HOW SAFE IS DATA IN SUPPLEMENTARY JOURNAL ARTICLE FILES? (OR: QUALITY AND PRESERVATION OF SUPPLEMENTARY JOURNAL ARTICLE FILES)
3.5. DATA IN COMMUNITY-ENDORSED PUBLIC DATABASES, LINKED TO JOURNAL ARTICLES
3.6. DATA STORAGE AS A SERVICE BY THE JOURNAL
3.7. ARTICLES WITH INTERACTIVE DATA
3.8. SPECIAL DATA PUBLICATIONS AND DATA PAPERS
3.9. GAP ANALYSIS
3.10. FROM RAW DATA TO PROCESSED DATA TO DATA INTERPRETATIONS
3.11. DIVERGING AND CONVERGING TRENDS
3.12. OPPORTUNITIES FOR PUBLISHERS IN DATA EXCHANGE

4. DATA CENTRE AND LIBRARY PERSPECTIVE
4.1. LIBRARIES AND DATA CENTRES AS CUSTODIANS OF DATA
4.2. COMMON PRACTICE AND RATIONALE FOR ACTION
4.3. IMPLICATIONS OF DATA INTEGRATION FOR LIBRARIES AND DATA CENTRES
4.4. LIBRARIES AND DATA CENTRES ENGAGEMENT IN NEW SERVICES AND ALLIANCES
4.5. GAPS AND DILEMMAS
4.6. OPPORTUNITIES FOR LIBRARIES AND DATA CENTRES

5. REPORT EPILOGUE: MAPPING THE ROAD AHEAD
5.1. CAN LIBRARIES AND DATA CENTRES FILL THE MISSING LINK?
5.2. WHAT DOES THE DATA PUBLICATION PYRAMID MEAN FOR ROLES AND RESPONSIBILITIES?
5.3. IN DIALOGUE WITH RESEARCH LIBRARIANS: LIBER WORKSHOP 2011
5.4. EMERGING ISSUES
5.5. THE NEXT STEP: SURVEY TO DOCUMENT CURRENT AND PROJECT FUTURE ROLES

Opportunities for Data Exchange (ODE) – www.ode-project.eu


EXECUTIVE SUMMARY

This report sets out to identify examples of integration between datasets and publications. Findings from existing studies carried out by PARSE.Insight, RIN and SURF, together with various recent publications, are synthesized and examined in relation to three distinct disciplinary groups in order to identify opportunities in the integration of data. These groups are Researchers, Publishers and Libraries/Data Centres. Opportunities identified for each group have been scoped against seven criteria:

1. Availability<br />

2. Findability<br />

3. Interpretability<br />

4. Reusability<br />

5. Citability<br />

6. Curation<br />

7. Preservation<br />

Opportunities to improve the linking of data and publications have been identified for each stakeholder group and mapped against each of the criteria in tables at the end of this summary.

Based on an examination of the available research and literature, incentives and barriers relating to data exchange are identified for each disciplinary group.

The content of a draft of this report formed the basis of a workshop in June 2011 with professionals from research libraries. The workshop served to validate the opportunities and issues identified in this report.

From a researcher perspective, data are first class research objects that form the basis of their research. Researchers discover and use data and analyses from others to formulate new and testable hypotheses before extending the evidence base with empirical data. The implication of treating data as first class research objects is that they require preservation, recognition, validation, curation and dissemination, which in turn improve their availability, findability, interpretability and re-usability.

Researchers perceive and enforce their creator rights over the data, choose when and with whom they share it, and wish to maintain this control. This need for control is based on perceived legal barriers and misuse, on the absence of the trust network common in other forms of scholarly communication, or on a mixture of both. Researchers want somewhere safe to put their data while maintaining control in order to avoid legal redress and professional misuse, but expect some central organisational structure to pay for these infrastructures. They recognise that many lack sufficient skills to manage their data appropriately but, importantly, are enthusiastic to change this situation.

Researchers see the benefit in joining publications with data in a more formal and agreed convention, but there must be a recognition and credit mechanism for this. They accept this joining as good professional practice and agree that data supporting a traditional publication should be available with the publication. Technology can reduce the latency in joining data to publications, but there is a lack of common best practice conventions for scholarly publications. Distilled into statements, our desk research has revealed five abstract researcher requirements for integrating data and publication:

1. Researchers need somewhere to put data and make it safe for reuse

2. Researchers need to control its sharing and access

3. Researchers need the ability to integrate data and publication

4. Researchers need to get credit for data as a first class research object

5. Researchers need someone to pay for the costs of data availability and re-use

Publishers are beginning to embrace the opportunity to integrate data with publications, but barriers to the sustainability of this practice include the sheer volume of data, the huge variety of data formats and a question mark over exactly what data should be made available within, be made supplemental to or be linked with the publication. In addition, the quality of the data and attached metadata may be inconsistent, may lack peer review, or may not be made transparent.

The relationship between data and publications can be illustrated with a modified version of Jim Gray’s e-science pyramid, which in this report is presented as the Data Publication Pyramid (see Graph 1 below). As we descend the pyramid, the exclusive relationship between data and publication diminishes. At the top, for example, the journal (and author/researcher) takes full responsibility for the publication, including the aggregated data embedded in it and the way it is presented. For data published in the second layer, as supplementary files to articles, the link to the published Record of Science remains strong, but it is not always clear at what level the data is curated and preserved and whether the criteria for discoverability and re-usability are met. At the Data Collections and Structured Database layer, the publication includes a citation and links to the data, but the data resides in, and is the responsibility of, a separate repository. The publication of data becomes collaborative.

At the bottom layer of the pyramid, most datasets remain unpublished and hence unfindable and not re-usable.

As Jim Gray already made clear, the data published now within or with publications is only the tip of the data iceberg.


Graph 1: The Data Publication Pyramid, developed on the basis of the Jim Gray pyramid, to express the different manifestation forms that research data can have in the publication process. See Chapter 1 for a full explanation.

As more publishers respond to increasing author demand for making research data available, they are focused on:

1. establishing cross-publisher best practice to make data available and retrievable in a persistent way

2. collaboration with publicly endorsed community archives to make data and publications interlinkable

3. presenting data in more sophisticated formats to increase reuse

Libraries and data centres have overlapping and complementary roles in terms of data integration. Barriers to the integration of data include a lack of policies to address the concerns of researchers when it comes to making their data available, and a lack of uniformity in data preservation and curation strategies and practices.

New publishing models linking data and articles require libraries and data centres to address particular concerns:

1. preservation and persistence of data to ensure continued access to linked data

2. making data findable and reusable through the use of metadata and integration into retrieval services

3. working closely with researchers to encourage data sharing and best practice in data management

In general, the need for action has been recognized in the library and data centre community. Noteworthy initiatives like the ones selected for description in this report (DataCite, PANGAEA, DRYAD, and Dataverse) illustrate this, as well as how libraries and data centres support data integration. However, the degree of preparedness to take on the challenge varies between disciplines and between individual institutions.

It is important that libraries and data centres act in conformance with the requirements of the research community which they serve. They also need to be involved in the research process from the very beginning in order to ensure high data quality, which facilitates retrieval, usability, and preservation.

An examination of the research and noteworthy initiatives highlights that opportunities exist across the three disciplines. These opportunities exist particularly in the areas of availability, findability, interpretability, and re-usability.

Data Issue: Researchers' opportunities (Chapter 2)

Availability
• Researchers demand their data be treated as first class research objects
• Researchers loosen control over data
• Define roles of responsibility and control

Findability
• Agree convention to propose to publishers regarding data citation
• Use of persistent identifiers such as DOIs
• Ensure common metadata and citation practices

Interpretability
• Recognize that data require metadata and work towards community best practice in metadata development

Re-usability
• Be concerned about the long term ability for secondary use and consider or seek out responsible preservation actions

Citability
• Agree a convention for data citation
• Follow metadata standards for datasets
• Use of persistent identifiers such as DOIs

Curation
• Develop sustainable and realistic data management plans
• Collaboration with public data archives

Preservation
• Develop sustainable realistic preservation plans
• Active engagement with public data archives

Table 1: Data Opportunities for Researchers
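To make the Findability and Citability items above more concrete, the following is a minimal sketch of how a dataset citation built around a persistent identifier might be rendered. The report does not prescribe a citation convention; the field names, the "Creators (Year): Title. Repository. Identifier" layout and the example DOI are all assumptions chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Minimal descriptive metadata for a dataset (illustrative field names)."""
    creators: tuple[str, ...]
    year: int
    title: str
    repository: str
    doi: str  # hypothetical identifier, not a real registered DOI


def doi_url(doi: str) -> str:
    """Return the resolvable HTTPS form of a DOI via the doi.org resolver."""
    return f"https://doi.org/{doi.strip()}"


def format_citation(record: DatasetRecord) -> str:
    """Render 'Creators (Year): Title. Repository. DOI-URL' (assumed layout)."""
    creators = "; ".join(record.creators)
    return (
        f"{creators} ({record.year}): {record.title}. "
        f"{record.repository}. {doi_url(record.doi)}"
    )


# Entirely made-up example values.
example = DatasetRecord(
    creators=("Doe, J.", "Roe, R."),
    year=2011,
    title="Example survey responses",
    repository="Example Data Archive",
    doi="10.1234/example-dataset",
)
print(format_citation(example))
```

Whatever layout a community settles on, keeping the identifier in a resolvable form means the same string serves both as a citation element and as a working link to the data.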


Data Issue: Publishers' opportunities (Chapter 3)

Availability
• Articles with data provide richer content and higher usage
• Impose stricter editorial policies about availability of underlying data, in line with general funders' trends
• Ensure data is stored in a safe place, preferably a public repository
• Be transparent about curation and preservation of submitted data

Findability
• Ensure bi-directional links between data and publications
• Ensure common citation practices

Interpretability
• Provide services around data such as viewer apps for underlying data from within the article or interactive graphs, tables and images
• Data Publications

Re-usability
• Interactive data from within articles
• Links to the relevant datasets, not just to the database
• Data Publications

Citability
• Establish uniform data citation standards
• Follow metadata standards for datasets
• Use of persistent identifiers such as DOIs
• Data Publications

Curation
• Transparency about curation of submitted data
• Collaboration with public data archives

Preservation
• Transparency about preservation of submitted data
• Collaboration with public data archives

Table 2: Data Opportunities for Publishers
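The "bi-directional links between data and publications" item above can be pictured as a small data structure. The sketch below is only an illustration under assumed names: the `LinkRegistry` class and the identifier strings are hypothetical and do not describe any publisher's or repository's actual system. The point it shows is that each link is recorded in both directions, so a reader can move from an article to its datasets and from a dataset back to every article that uses it.

```python
from collections import defaultdict


class LinkRegistry:
    """Keeps article-to-dataset and dataset-to-article links in step."""

    def __init__(self) -> None:
        self._datasets_by_article: dict[str, set[str]] = defaultdict(set)
        self._articles_by_dataset: dict[str, set[str]] = defaultdict(set)

    def link(self, article_id: str, dataset_id: str) -> None:
        """Record one link; both directions are updated together."""
        self._datasets_by_article[article_id].add(dataset_id)
        self._articles_by_dataset[dataset_id].add(article_id)

    def datasets_for(self, article_id: str) -> list[str]:
        """Datasets cited by an article, sorted for stable output."""
        return sorted(self._datasets_by_article.get(article_id, ()))

    def articles_for(self, dataset_id: str) -> list[str]:
        """Articles that cite a dataset, sorted for stable output."""
        return sorted(self._articles_by_dataset.get(dataset_id, ()))


# Hypothetical identifiers for illustration.
registry = LinkRegistry()
registry.link("article-1", "dataset-A")
registry.link("article-1", "dataset-B")
registry.link("article-2", "dataset-A")

print(registry.datasets_for("article-1"))
print(registry.articles_for("dataset-A"))
```

In practice the two directions usually live in different systems (the journal platform and the data repository), which is why the table also lists collaboration with public data archives.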


Data Issue: Libraries' and data centres' opportunities (Chapter 4)

Availability
• Lower barriers to researchers to make their data available.
• Integrate data sets into retrieval services.

Findability
• Support of persistent identifiers.
• Engage in developing common metadescription schemas and common citation practices.
• Promote use of common standards and tools among researchers.

Interpretability
• Support crosslinks between publications and datasets.
• Provide and help researchers understand metadescriptions of datasets.
• Establish and maintain knowledge base about data and their context.

Re-usability
• Curate and preserve datasets.
• Archive software needed for re-analysis of data.
• Be transparent about conditions under which data sets can be re-used (expert knowledge needed, software needed).

Citability
• Engage in establishing uniform data citation standards.
• Support and promote persistent identifiers.

Curation/Preservation
• Transparency about curation of submitted data.
• Promote good data management practice.
• Collaborate with data creators.
• Instruct researchers on discipline specific best practices in data creation (preservation formats, documentation of experiment, …).

Table 3: Data Opportunities for Libraries and Data Centres


0. INTRODUCTION

Science is changing. The massive volume and variety of data pouring out of publicly funded science are transforming the face of research. These data belong to everyone. If we manage these precious resources properly, we may tackle the Grand Challenges of our times – even as budgets become more restricted. It is easy to take for granted that data in the public domain will be protected and remain both available and accessible. Researchers, publishers, policymakers and funders – among many others – have started to appreciate that a robust, sustainably funded infrastructure is absolutely necessary to protect the hard-earned fruits of publicly funded research.

Opportunities for Data Exchange (ODE)¹, a project funded by the European Commission (FP7) with Grant Agreement number 261530, is gathering evidence to support and promote data sharing, re-use and preservation. ODE partners are members of the Alliance for Permanent Access (APA) and represent stakeholders with significant influence within their communities. ODE is identifying, collating, interpreting and delivering evidence for emerging best practice in sharing, re-using, safeguarding and citing data. ODE is also documenting drivers of change, and barriers to progress in this important area.

The transition from science to e-Science is happening: a data deluge emerges from publicly-funded research facilities; a massive investment of public funds into the potential answer to the grand challenges of our times. This potential can only be realised by adding an interoperable data sharing, re-use and preservation layer to the emerging eco-system of e-Infrastructures. The importance of this layer, on top of emerging connectivity and computational layers, has not yet been addressed coherently at ERA or global level. All stakeholders in the scientific process must be involved in the design of this layer: policy makers, funders, infrastructure operators, data centres, data providers and users, libraries and publishers. They need evidence on which to base their decisions and shape the design of this layer.

The ODE partners are:

• European Organization for Nuclear Research (CERN)

• Alliance for Permanent Access (APA)

• CSC – IT Centre for Science

• Helmholtz Association

• Science and Technology Facilities Council (STFC)

• The British Library

• Deutsche Nationalbibliothek (DNB)

• International Association of STM Publishers (STM)

• Stichting LIBER Foundation

¹ www.ode-project.eu

All of them are members of the Alliance for Permanent Access, and collectively they represent all these stakeholder groups and have a significant sphere of influence within those communities. The project will identify, collate, interpret and deliver evidence of emerging best practices in sharing, re-using, preserving and citing data, the drivers for these changes and barriers impeding progress, in forms suited to each audience. ODE will:

• Enable operators, funders, designers and users of national and pan-European e-Infrastructures to compare their vision and explore shared opportunities

• Provide projections of potential data re-use within research and educational communities in and beyond the ERA, their needs and differences

• Demonstrate and improve understanding of best practices in the design of e-Infrastructures leading to more coherent national policies

• Document success stories in data sharing, visionary policies to enable data re-use, and the needs and opportunities for interoperability of data layers to fully enable e-Science

• Make that information available in readiness for FP8

This report sought to coalesce current thought and opinions from numerous and diverse
sources to reveal opportunities for supporting a more connected and integrated scholarly
record. Four perspectives were considered: those of the Researcher, who generates or
reuses primary data; Publishers, who provide the mechanisms to communicate research
activities; and Libraries and Data centres, who maintain and preserve the evidence that
underpins scholarly communication and the published record.


1. INTEGRATION OF DATA AND PUBLICATIONS: GENERAL

1.1. Introduction and summary

The web, the cloud and computational capabilities in general provide an ever-growing
infrastructure for scholarly communication that makes it much easier for researchers to
share their research data with others. At the same time, and often driven by the same
factors, nearly all scientific disciplines have a computational stream, generating ever
more research data. We seem to be on the verge of a Data Deluge, as a recent EU report
also concluded. 2

We know from previous research, carried out for the project PARSE.Insight 3, that
around 60 % of researchers would like to use the research data of others. A reluctance
to share has likewise been apparent in the WP3 interviews of the ODE project,
which built on PARSE.Insight, where over 40 % of researchers state that they have real
problems in sharing their own data. This is further elaborated in Chapter 2: Researcher’s
Perspective. In this sense, we may coin a new 60-40 rule: 60 % like to get data from
others, but 40 % have problems giving their own.

Andrew Treloar, Director of the Australian National Data Service, gave a talk on 28
March 2011 at a JISC workshop in Birmingham on the management of research data 4
and distinguished several basic problems for research data. He described, in a
cascading way, how research data are often:

1. Unavailable, and if at all available:
2. Unfindable, and if available AND findable:
3. Uninterpretable.

And even if all 3 of these obstacles have been overcome, the research data found may still
prove to be:

4. Not re-usable.

In this report, created and delivered in the context of project ODE (Opportunities for
Data Exchange), we investigate how integration of data and publications can help solve
these 4 obstacles. The questions we try to answer in this report are: How do research
data enter the stage of scholarly communication and what are the present practices and
policies? Where can we find important improvements for the accessibility and reusability
of research data? What roles and responsibilities may we expect for different
stakeholders in the scholarly information chain?

2 John Wood, EU, 2010, Riding the Wave: http://cordis.europa.eu/fp7/ict/e-infrastructure/docs/hlgsdi-report.pdf

3 Survey Report PARSE.Insight: http://www.parse-insight.eu/downloads/PARSE-Insight_D3-4_SurveyReport_final_hq.pdf

4 Andrew Treloar at JISC workshop, see http://www.jisc.ac.uk/whatwedo/programmes/mrd/rdmevents/mrdinternationalworkshop.aspx


We have used 3 different perspectives to shed light on this:

1. A researchers’ perspective, see Chapter 2
2. A publishers’ perspective, see Chapter 3
3. A libraries’ and data centres’ perspective, see Chapter 4

The final chapter summarizes the outcomes of a workshop held with librarians in
Barcelona on June 29, 2011 5 as part of the LIBER 2011 conference with a view to:

4. Report Epilogue: Mapping the Road ahead?, see Chapter 5.

In this general introduction, we summarize findings from previous research, such as
PARSE.Insight, and from newer desk research focused on the way research data are
connected with publications. The hard data from the PARSE.Insight study were of high
value for this project for 2 reasons: they span the responses from different stakeholder
groups (researchers, publishers, librarians and data managers), and the survey was truly
international and truly multidisciplinary. As the PARSE.Insight report points out, the
large spectrum of scientific disciplines was well represented, as was the spread of
respondents over different continents (with some dominance of the US and Europe).
This makes it one of the few studies that go beyond local or mono-disciplinary
inventories of current practices in the sharing of data.

This study attempts to add value to the previous PARSE.Insight report by reusing
the data gathered there, through a re-analysis of the survey responses from
the perspective of data sharing. We also included much unused data from
PARSE.Insight that proved very relevant for this study.

1.2. Current Practice – some numbers relating research data with publications

In PARSE.Insight, the 2008 survey included the question: Where do you currently store
your research data? From the responses it is clear that approximately 18 % of researchers
submit data with the manuscripts of their publications. This is a slightly higher figure
(but not much) than the number who deposit in archives; see the following diagram from
PARSE.Insight.

5 Barcelona workshop: http://www.libereurope.eu/news/liber-annual-conference-and-firsteuropean-project-workshops


Where do you currently store your research data? (multiple answers possible)

Graph 2: Source: PARSE.Insight 2 survey, held among researchers internationally, N = 1202 researchers

But this number is now growing fast, as evidence from a recent study into medical
journals 6 shows. That study found that, for a sample of 28 journals drawn from 138
high-impact medical journals, the number of articles carrying supplementary data files
roughly doubled every 2 years and is now over 25 %. In 2003, 32 % of the journals
provided the possibility to add web-based supplements to an article; this percentage
had grown to 50 % in 2005 and to 64 % in 2009. The share of articles offering online
supplements in these journals increased over the same period from 7 % to 14 % to the
already mentioned 25 %, respectively. There was also an increase in the percentage of
journals for which at least 20 % of articles have online-only supplements (4 % in 2003
to 11 % in 2005 to 32 % in 2009).

A marked increase in the number of video supplements was observed, while the largest
share of the supplementary files is for data represented in supplementary graphs and
tables. The number of articles with supplementary tables roughly doubled every 2 years
(10 to 22 to 55 to 100 from 2003 to 2009), as did the number of supplementary tables
(29 to 57 to 149 to 317, the last two numbers referring to 2007 and 2009, respectively).
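The doubling claim can be checked with simple arithmetic on the quoted figures. The short Python sketch below computes the ratio between consecutive counts; the function name `growth_ratios` is our own, and we assume that the four values in each series refer to 2003, 2005, 2007 and 2009, which the text states only in part.

```python
def growth_ratios(counts):
    """Return the ratio of each count to the count before it."""
    return [later / earlier for earlier, later in zip(counts, counts[1:])]


# Figures quoted in the text; the survey years are assumed to be
# 2003, 2005, 2007 and 2009 (the text names only some of them explicitly).
articles_with_tables = [10, 22, 55, 100]
supplementary_tables = [29, 57, 149, 317]

article_ratios = growth_ratios(articles_with_tables)
table_ratios = growth_ratios(supplementary_tables)

for label, ratios in [("articles with tables", article_ratios),
                      ("supplementary tables", table_ratios)]:
    print(label, [round(r, 2) for r in ratios])
```

A ratio near 2 for each two-year step is what "doubled every 2 years" implies; the quoted figures give ratios between roughly 1.8 and 2.6, so the doubling is approximate rather than exact.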

From the PARSE.Insight 2 survey, we also know that > 90 % of journals accept
supplementary material. The graph below presents the responses from different
publishers (small versus large). If the number of journals is factored in, it appears that
> 90 % accept supplementary files containing research data.

6 See as source: http://www.annemergmed.com/article/S0196-0644(10)01648-3/fulltext#sec3<br />


Can authors submit their underlying digital research data with their publication to you?

Graph 3, source PARSE.Insight 2, N = 134 publishers

1.3. Why the relationship between data and publications is so important

Whereas the data presented in the study quoted in the previous section 6 show the
substantial amount of supplementary data added to journal articles, as well as its high
growth rates, there is every reason to expect even more. The PARSE.Insight 2 survey
enquired where researchers would like to submit their research data, and the responses
gave the following picture (Graph 4):

Next to the most popular category of Data Archives (of their organisation: 81 %, in their
subject field: 60 %), more than half of the researchers (51 %) would like to submit their
data to publishers. Of particular interest here is the marked difference between the most
popular desired destinations (Graph 4) and the actual destinations (Graph 2). Archives
and publishers are the most favored, but remarkably the least used nowadays: in current
practice, institutional archives score below 20 %, subject archives below 10 %, and
publishers a little closer to, but still below, 20 %.
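The size of this gap between desired and actual destinations can be made explicit. The Python sketch below uses only the percentages quoted in the text; because current practice is given as upper bounds ("below 20 %", "below 10 %"), the result is a lower bound on the gap. The dictionary keys and the function name `minimum_gap` are our own labels, not terms from the survey.

```python
# Shares of researchers (in %) willing to submit data to each destination,
# as quoted in the text.
desired = {
    "institutional archive": 81,
    "subject archive": 60,
    "publisher": 51,
}
# The text gives only upper bounds for current practice ("below 20 %" etc.).
actual_upper_bound = {
    "institutional archive": 20,
    "subject archive": 10,
    "publisher": 20,
}


def minimum_gap(desired, actual_upper_bound):
    """Lower bound, in percentage points, on desired minus actual use."""
    return {
        destination: desired[destination] - actual_upper_bound[destination]
        for destination in desired
    }


gaps = minimum_gap(desired, actual_upper_bound)
for destination, gap in gaps.items():
    print(f"{destination}: more than {gap} percentage points")
```

On these figures the gap exceeds 60 percentage points for institutional archives, 50 for subject archives and 30 for publishers.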


Where would you be willing to submit your research data? (multiple answers)

Graph 4, Source: PARSE.Insight 2: > 50 % of researchers would like to submit their data to journals, N = 1202 researchers

The reason why researchers would like to submit their research data together with an
article probably relates to the way they find datasets. Again from PARSE.Insight 2 we
know that 63 % of researchers like to go to the formal literature to find and discover the
existence of data (see also Graph 5). This option is ranked second, immediately after
colleagues as a source (73 %) and equal to the use of general search engines (63 %):


Where do you locate and access digital research data? (multiple answers)

Graph 5, Source: PARSE.Insight 2, N = 1202 researchers


1.4. The Data Publication Pyramid

Microsoft researcher Jim Gray is an often quoted source on the way literature and data
would become more interrelated. He foresaw ways (Graph 6) in which the e-environment
would make it so much easier to move from the literature to data and back:

The Jim Gray Pyramid on e-science (layers, top to bottom: Literature; Derived and Recombined Data; Raw Data)

Graph 6, Source: Jim Gray on e-science 7: Publications are only the tip of the iceberg

For the purpose of this report, we adapt the Jim Gray Pyramid slightly and
introduce our own so-called Data Publication Pyramid. Four years after Jim Gray
expressed his thoughts, we see a new wave of practices emerge in which literature and
data are in fact being integrated, or at least best attempts are being made. This Data
Publication Pyramid (Graph 1) aims to show the different manifestations that data can
take when published within or in the context of publications, or even when not
published at all (but remaining in drawers and on disks of the institute):

7 http://research.microsoft.com/enus/collaboration/fourthparadigm/4th_paradigm_book_jim_gray_transcript.pdf


Graph 1 repeated, The Data Publication Pyramid, based on the Jim Gray Pyramid.

We wish to emphasize at this point that this pyramid does not describe which stages of
manifestation research data can go through in their evolution towards reusable data
products. The main purpose of this pyramid is to explain the different manifestations
research data can have in the context of their availability within, with, supplementary to
or referenced from an official scholarly article as the main manifestation of the record of
science.

In Chapter 3 (Integration of Data and Publications), in Chapter 4 (Data Centre and
Library perspective) and Chapter 5 (Next Steps, the Road ahead) this Data Publication
Pyramid is explained further and used to distinguish different best practices as they
currently occur.

In short, following the key issues raised by Andrew Treloar 4, we can summarize the
benefits of good integration of articles and research data as follows, and tabulate the
obstacles and the solutions provided by integrating Data and Publication (Table 4):

• publications help the data to be more discoverable
• publications help the data to be more interpretable
• publications give the author better credit for the data
• and conversely: the data add depth to the article and facilitate better understanding.


Obstacle | Integration of Data and Publication helps to
Data are unavailable | Indicate availability of data
Data are unfindable | Links from the publication help locate datasets
Data are uninterpretable | Publications will describe and explain datasets, links from data to publications help interpretability of datasets
Data are not re-usable | Description in article can improve usability or the article can provide access to a re-usable set, perhaps even offer an API to datasets?

Table 4: Data Obstacles

As will be explained in the following chapters, there are 3 additional necessary
conditions that need to be fulfilled for good accessibility and re-usability of research
data. They are:

1. Citability
2. Curation
3. Preservation

These elements are also addressed in depth in the following chapters and used as
criteria for drivers and potential opportunities for different players in the scholarly
communication landscape. Included in the following chapters are case studies of
laudable initiatives, reported unsolved issues, and opinions and desires as expressed to
the project team in email exchanges and interviews from the perspective of researchers,
publishers and libraries/data centres.


2. RESEARCHER PERSPECTIVE OF DATA/PUBLISHING INTEGRATION:

2.1. Researchers are the source of data

Researchers are able to generate more data than ever before. This has partially been
driven by technological advances that increase the accuracy, sensitivity and multiplicity
of empirical data collection across disciplines 8. But equally, researchers have been able
to track, aggregate, abstract, transform and generally re-purpose existing data to drive
forward data-driven research. Facilities and infrastructures that increase our
understanding of sub-atomic particles or extend our reach into the universe have led the
way in generating data; they have been quickly followed by social, environmental and
biomedical disciplines that capture complex and uncertain data for modelling biological
processes, social dynamics and environmental forecasting. Data collected in vivo or in
situ and modelled in silico reveal promising new ways of understanding and intervening
in all manner of human behaviour for benefit 9, 10. All these data are a fundamental
component of scholarly communication and are the evidence that underpins scholarly
publication. Their value extends beyond original use, and many represent substantial,
non-reproducible and valuable intellectual assets to many stakeholders 11, 12. Our ability
to manage and maintain such digital data has not always kept pace with our ability to
generate them, and this presents a risk to modern research; we are in danger of losing
the ability to link the evidence base that supports scholarly publication and, as a
consequence, of breaking the cycle of scholarly communication.

From a researcher’s perspective the value of data is that of a first class research object
which represents the basis of their research. Researchers discover and use data from
others and analyse them to formulate new and testable hypotheses before extending the
evidence base with empirical data. The implications of first class research objects are
that they require preservation, recognition, validation, curation and dissemination; in
doing so they become more available, discoverable, interpretable, re-usable and citable.
This section of our report will review contemporary evidence of current practice from a
researcher’s perspective and investigate any need for change. From these needs we will

8 Hanson B, Sugden A, Alberts B (2011). Making data maximally available. Science. 331(6018):649

9 Editorial: Crowdsourcing human mutations. Nature Genetics 2011, 43(4):279

10 Giardine B, Borg J, Higgs DR, Peterson KR, Philipsen S, Maglott D, Singleton BK, Anstee DJ, Basak AN, Clark B, Costa FC, Faustino P, Fedosyuk H, Felice AE, Francina A, Galanello R, Gallivan MV, Georgitsi M, Gibbons RJ, Giordano PC, Harteveld CL, Hoyer JD, Jarvis M, Joly P, Kanavakis E, Kollia P, Menzel S, Miller W, Moradkhani K, Old J, Papachatzopoulou A, Papadakis MN, Papadopoulos P, Pavlovic S, et al. (2011). Systematic documentation and analysis of human genetic variation in hemoglobinopathies using the microattribution approach. Nature Genetics 43(4):295-301

11 Mons B, van Haagen H, Chichester C, Hoen PB, den Dunnen JT, van Ommen G, van Mulligen E, Singh B, Hooft R, Roos M, Hammond J, Kiesel B, Giardine B, Velterop J, Groth P, Schultes E (2011). The value of data. Nature Genetics 43(4):281-3

12 Curry A (2011). Rescue of old data offers lesson for particle physicists. Science. 331(6018):694-5


seek opportunities in the form of enablers, drivers and incentives to support such
change.

2.2. What is the current practice

Data has always been part of scholarly communication, though increasingly scholarly
publication has been unable to maintain the join to this evidence base 13. Integrating
data and publications implies an outcome of data sharing. Thus a researcher perspective
is primarily concerned with data sharing; the integration of scholarly communication is
essentially an enabler of this. To understand why data and publications risk
moving apart, it is essential to understand what causes such divergence and whether
these are barriers we can influence.

As illustrated in Graph 4 in the Introduction chapter, the PARSE.Insight 14 survey provided
compelling and insightful evidence regarding the perspective of researchers on current
practice in data management and sharing. With support from additional
contemporary reports 15 and publications, a strong picture of researcher attitudes to data
management and requirements begins to emerge.

Researchers find data in the same way as other information (via literature,
colleagues), and IT has added to these rather than supplanted any one common
practice activity.

The methods used to disseminate and validate research seem tied to common
research practices where colleagues and formal literature predominate. The use of
search engines and institutional databases suggests that technology has made this
easier than with their analogue counterparts of catalogues and libraries. This is in
agreement with the report of the UK’s Research Information Network (RIN), in
which technology was used in the life sciences to increase efficiency in behaviour
rather than supplant it 16. Data archives offer a significant source of discovery, but
it is difficult to determine if this was in addition to, or independent of, the more
common discovery through literature and colleagues.

13 Aalbersberg, IJ, Kähler, O. (2011) Supporting Science through the Interoperability of Data and Articles, D-Lib. Volume 17, Number 1/2 doi:10.1045/january2011-aalbersberg

14 http://www.parse-insight.eu/index.php

15 If you build it they will come. How Researchers perceive and use web2.0. A Research Information Report, July 2010.

16 Patterns of information use and exchange. Case studies of researchers in the life sciences. Research Information Network, November 2009.


Researchers want control over the extent and manner of sharing their data,
consistent with the concept of data as a professional asset.

This particular finding, that data are largely shared with colleagues but that only
25 % of the respondents share them openly with everyone and more than 20 % do not
share any, reinforces the notion that researchers want to control their data
and reject the concept of uncontrolled open data sharing.

How openly available is your data? (bar chart, horizontal axis from 0% to 70%)
Answer options: My data is openly available for my research group / colleagues in research collaboration (58%); My data is openly available for everyone (25%). The remaining options each scored 16% or less: Access to my data is temporarily restricted; I do not share my data, but I would like to do so in the future; My data could be made available with appropriate changes (e.g. anonymous clinical data); My data is openly available for my research discipline; I do not share my data and I do not want to share it in the future; My data is available for a fee.

Graph 7: Which of the following applies to the digital research data of your current research? N=1270 (PARSE.Insight 14)

They are open with their nearest colleagues and when their professional practice
mandates it (e.g. publisher policy or good practice), though the extent of data sharing
decreases rapidly as their potential control over the data decreases, i.e. they wish to
choose when and with whom they share their data (Graph 7).

However, the picture becomes more complex when researchers are questioned about their
specific data needs, i.e. not what they do with their own data but rather what they
require from others.

According to the responses to the PARSE.Insight survey, 63 % of the researchers would
like to make use of data gathered by other researchers in their discipline (N=430), while
70 % of respondents already do (N=638). Interestingly, when asked about data gathered
by researchers in other disciplines, 40 % would still like to make use of it (N=689), and
46 % state that they do so already (N=1264).


Some researchers have identified a need for skills training or resources to assist
in the safe and appropriate maintenance processes of data management 17, and this is
likely to be reflected across other disciplines. Several organisations are producing guides
and tools to assist with this, but it is unclear if they are making as big an impact as they
perceive. A key to this was the extent and use of data standards together with their
effective uptake profile. A great majority of respondents claimed to have used no
standards when offered a selection of those in current use (Graph 8).

Knowledge of standards and guidelines: None 67%; Dublin Core 7%; OAIS 6%; METS 3%; MPEG21-DIDL 3%; OAI-ORE 3%; OAI-PMH 3%; PREMIS 3%; Other 3%; NISO 2%.

Graph 8: Which of the following standards or guidelines that are used in digital preservation are you familiar with? n=1202 (PARSE.Insight 14)

It is well known that the development of community standards is a slow and demanding process. Agreement is based on function, and often there are more opinions on pragmatic utility than can be accommodated in an easily implemented standard. There may be a case for supporting standards development more actively as a community activity, though how this could be achieved needs to be defined.

17 Reichman OJ, Jones MB, Schildhauer MP (2011). Challenges and opportunities of open data in ecology. Science. 331(6018):703-5

Researchers perceive legal and professional reasons for not sharing their data

Barriers for sharing research data: Legal issues 41%; Misuse of data 41%; Incompatible data types 33%; Lack of technical infrastructure, Lack of financial resources and Fear to lose scientific edge 27-28% each; Restricted access to data archive 21%; No problems foreseen 16%; Other 10%.

Graph 9: Do you experience or foresee any of the following problems in sharing your data? N=1270 (Source: PARSE.Insight 14)

Barriers to sharing data, whether real or perceived, were a mixture of reasons for not sharing and reasons for not using certain data types or particular data sources. For example, Graph 9 shows that many respondents reported legal barriers to sharing, though it was unclear whether these reflected fear of prosecution or responsibility for IPR, particularly in biomedical sciences that involve human subjects or commercial potential 18. Licensing research data is recognised as a complex and time-consuming activity 19, and there is a need to simplify and streamline the process by which researchers, or those in control of research data, assert control over their data assets, an opinion supported by the recent Hargreaves report in the UK 20. The next most frequent barrier was a fear of misuse, which may include the threat of analyses that contradict the original findings, discovery of additional findings, or exposure of the data creator to legal redress; it is thus strongly associated with the most common response, legal issues. For example, repurposing data in a way that breaches data protection legislation could leave the data publisher, or those responsible for the data, vulnerable to criminal proceedings. Finally, incompatibilities between data and the lack of a financial and technical infrastructure were cited as strong barriers to sharing, long recognised as a result of information technology and computational ability moving faster than our capability for data management and planning.

18 Mathews DJ, Graff GD, Saha K, Winickoff DE (2011). Access to stem cells and data: persons, property rights, and scientific progress. Science. 331(6018):725-7

19 Alex Ball (DCC) (2011). How To Licence Research Data. A Digital Curation Centre and JISC Legal ‘working level’ guide.

20 Digital Opportunity: A Review of Intellectual Property and Growth. An Independent Report by Professor Ian Hargreaves, May 2011

Taken together, these results indicate that researchers are willing to use others’ data provided they can validate them, but are wary about openly sharing their own. This need not be seen as a closed relationship. It could well indicate that sharing depends on deeper enablers than altruistic motivation alone, e.g. attribution, provenance and reliability.

Researchers want credit

Even if data can be shared or published there was, as expected, almost universal recognition that a ‘credit to the data creator’ facility must exist (Graph 10). Good research practice requires recognition for intellectual contributions, and these should include data. Just as citation of traditional publications serves to recognise individual intellectual work, a similar convention is required for data, though no agreed convention exists 21,22.

Do you want to be credited when your underlying digital research data is used by others? (PARSE.Insight) Yes 92%; No 8%.

Graph 10: Source PARSE.Insight 14, N=1171

However, when it comes to clear guidelines about how data can be made citable and how this can be integrated with traditional publications, there are few examples of common standards or shared good practice. This is supported by the findings in PARSE.Insight, where the vast majority of researchers understand the benefits of joined-up scholarly communication but are unaware of publisher policies (see Graphs 11 & 12 and Chapter 3).

21 Starr, J and Gastl, A (2011). isCitedBy: A Metadata Scheme for DataCite. D-Lib. Volume 17, Number 1/2. doi:10.1045/january2011-starr

22 Brase, J and Farquhar, A (2011). Access to Research Data. D-Lib. Volume 17, Number 1/2. doi:10.1045/january2011-brase

Do journals to which you typically submit your work require you to include relevant digital research data (i.e. data used to create tables, figures, etc.)? n=129 (PARSE.Insight 14) Yes 19%; No 81%.

Graph 11: Source PARSE.Insight 14, N=1171

Do you think it is useful to link underlying research data with formal literature? n=2289 (PARSE.Insight 14) Yes 85%; No 15%.

Graph 12: Source PARSE.Insight 14, N=1171

Many researchers see the problems with data getting worse

The problem of data volume is illustrated in the PARSE.Insight survey by researchers’ steadily increasing expectations of data volume, together with a growing ‘don’t know’ cohort. The shares of respondents expecting data volumes below the 1GB-1TB range decrease over time, while those above it increase (Graph 13).

Estimated amount of data stored per research project [bar chart: share of respondents (0% to 45%) per data volume category (0MB, 1-100MB, 100MB-1GB, 1GB-1TB, 1TB-1PB, 1PB-10PB, >10PB, Don't Know) for three series: Current, In 2 years, In 5 years]

Graph 13: Source PARSE.Insight 14, N=1202

Alongside growing data volume expectations, there are well-understood data preservation risks, including the lack of infrastructure and of custodian roles; this is indicated by data preservation issues being perceived as either important or very important by the majority of respondents (Graph 14).

Threats to digital preservation (percentage of respondents rating each threat as Very important / Important / Slightly important / Not important / Don't know):

Lack of sustainable hardware, software or support of computer environment may make the information inaccessible: 41 / 39 / 13 / 5 / 1
The current custodian of the data, whether an organisation or project, may cease to exist at some point in the future: 36 / 42 / 15 / 5 / 3
Users may be unable to understand or use the data e.g. the semantics, format or algorithms involved: 34 / 42 / 16 / 6 / 3
Evidence may be lost because the origin and authenticity of the data may be uncertain: 33 / 44 / 17 / 4 / 2
Loss of ability to identify the location of data: 25 / 44 / 23 / 5 / 4
The ones we trust to look after the digital holdings may let us down: 20 / 37 / 26 / 10 / 7
Access and use restrictions (e.g. Digital Rights Management) may not be respected in the future: 19 / 37 / 25 / 13 / 6

Graph 14: Researcher-perceived threats to digital preservation. Source PARSE.Insight 14, N=1201-1210

Researchers believe others should pay

There are strong views, possibly reinforced by an appreciation of preservation and hardware considerations, that once the data have been generated and used, responsibility for their preservation and archiving should rest with other organisational structures, with the exception of a non-specific ‘research community’ being identified. These were firm views, as few respondents either didn’t know or selected ‘other’ (Graphs 15 & 16).

Who should pay for preservation of publications? National government 57%; Commercial organisation 42%; Research community, European Union and My organisation 32-35% each; Don't know and Other 5-6% each.

Graph 15: Source PARSE.Insight 14, N=1188. Who, in your opinion, should pay for the preservation of publications? (multiple answers possible)

Who should pay for preservation of digital research data? National government 61%; My organisation 41%; European Union 36%; Research community 28%; Don't know 15%; Commercial organisation 10%; Other 4%.

Graph 16: Source PARSE.Insight 14, N=1188. Who, in your opinion, should pay for preservation of digital research data? (multiple answers possible)

2.3. Conclusions on current practice

Taken together, there appears to be a complex relationship between researchers and the data they collect and create. Researchers perceive and enforce their creator’s right over the data, choose when and with whom they share it, and wish to maintain this control. This need for control appears to be based on perceived legal barriers and risk of misuse, or on the absence of a trust network common in other forms of scholarly communication; it may be a mixture of both. Researchers want somewhere safe to put their data while maintaining control, in order to avoid legal redress and claims of misuse, but expect some central organisational structure to pay for these infrastructures. In a study of data sharing in the biomedical informatics domain, a training review indicated that many researchers recognise they lack sufficient skills to manage their data appropriately but, importantly, are enthusiastic to change this situation 23. Researchers would benefit from joining the publication with the data under a more formal and agreed convention, and recognition and credit mechanisms for this can serve as important drivers and incentives. They accept joining data to publication as good professional practice (see graph above) and agree that data supporting a traditional publication should be available with the publication. Technology can reduce the latency in joining data to publications, but publisher policies requiring the availability of data supporting publications are so far very much at a pioneering stage (see Chapter 3).

2.4. Is there a need or case for change?

Data and publications belong together 24, and researchers are the link between these two established intellectual objects. The evidence that supports scholarly discourse cannot be lost without severe consequences for scholarly communication. Distilled into statements, our desk research has revealed five abstract researcher requirements for integrating data and publication.

1. Researchers need somewhere to put data and make it safe for reuse

2. Researchers need to control its sharing and access

3. Researchers need the ability to integrate data and publication

4. Researchers need to get credit for data as a first class research object

5. Researchers need someone to pay for the costs of data availability for re-use

1. Where to put data and make it safe for reuse

Research data centres exist, but they are fragmented and operate in different ways. Generally, data centres are community or discipline focussed, and where significant investment is available, larger scale operations are established (as is evident in particle physics and astronomical disciplines). Such mature data archives are capable of taking full responsibility for the data they hold, with clear preservation and access policies, with some making use of community-developed quality tools like the Data Seal of Approval 25. These organisations can be concerned either with ‘big data’, as in the case of particle physics and astronomical data centres, or with complex social data, as in the case of the UK Data Archive. In contrast, there are numerous ad hoc collections of community activities, born of immediate need and with an uncertain future. These so-called ‘long tail’ data centres, serving numerous disciplines that generate data at a low level, e.g. ecology and evolution, have the potential to produce more volume and more complex data, for which ‘big data’ solutions will not be appropriate. In turn, these ‘long tail’ data centres will likely require much more resource both to establish and to maintain.

23 http://www.cancerinformatics.org.uk/training.html

24 Smit, E (2011). Abelard and Héloise: Why Data and Publications Belong Together. D-Lib. Volume 17, Number 1/2. doi:10.1045/january2011-smit

Sustainability models for various data needs across disciplines would assist in determining how much resource is required for what type of data and for how long. A number of pilot and small-scale projects have attempted to establish this, alongside the well known and established initiatives for ‘big’ data 26.

In summary, the research data landscape is both large and complex. PARSE.Insight provides evidence that these examples are not enough: researchers are either unaware of any facility (54%) or know there is no facility (37%) in which to deposit and maintain their data.

Is there a preservation facility for preserving digital research data which can be used by all projects within your discipline? Yes 9%; No 37%; Don't Know 54%.

Graph 17: Source PARSE.Insight 14, N=1198. Is there a preservation facility for preserving digital research data which can be used by all projects within your discipline?

25 Klump, J (2011). Criteria for the Trustworthiness of Data Centres. D-Lib. Volume 17, Number 1/2. doi:10.1045/january2011-klump

26 www.datadryad.org

2. How to control data sharing and access

Disciplines vary widely; the RIN report on patterns of information use and exchange provided compelling evidence from in-depth interviews on the differences between sharing practices across life science disciplines 16. There appeared to be many levels of control that were either encountered (while looking for data) or imposed (when the custodian of data). This suggests that the perception of and need for data sharing are confounded by confusion over the risks of data sharing. With the PARSE.Insight data suggesting fear of legal redress and misuse of data as the main concerns, it would seem that ownership and responsibility for data have become the default stance over authority; without a clear accreditation or registration framework for data, like citation, or an open sharing environment with accountability, reluctance to share wins out over professional transparency. Researchers have ownership of data as a consequence of generating it, but transferring licence or ownership to use or share data is a confusing barrier, and if there is any likelihood of exposure to prosecution for either data misuse or unethical data collection, then it is easier simply to close access to the data and severely limit any sharing. No clear authority or accountability role is available for research data outside its ownership.

So as data enters scholarly communication, the question arises: who will be responsible for it, securing the authority to persist, preserve and share it, and taking legal responsibility?

Some co-ordination and advice centres exist. The DCC 27 in the UK was established to support UK researchers engaged in digital research activities. It has since expanded into international partnerships and provides a rich resource for any researcher or institutional service (e.g. a digital repository or similar), offering advice on all manner of data curation issues, including licensing data. In addition, the UK Data Archive releases a series of best-practice guidelines for researchers from the social sciences that can be applied more widely.

3. The capability to integrate data and publications

Many researchers see the benefit of integrating the scholarly record, but few best-practice conventions exist. In contrast to the practicalities of what to do with data, there are a growing number of examples of how to re-join data to the publications they support (see Chapter 3).

4. How to get credit

Almost all respondents in PARSE.Insight agreed that, should their data be used, they should receive credit for it. This is in line with the professional impact that researchers receive from publication activity. Presently there is no agreed framework for citation of data, nor a capability to measure impact in a manner similar to the way traditional citation impact factors are created. Thus the question is raised: does data need a specific citation/recognition framework, or is the current framework sufficient to absorb the requirements of data citation? DataCite 28 believes data are first class research objects with requirements separate from those of scholarly publication, and as such require a citation framework that can accommodate these. DataCite is an international association that is implementing such a framework, independent of discipline, to support data citation via the registration of persistent identifiers (DOIs) that enable linking to and from these data sets (see also Chapter 4).

27 http://www.dcc.ac.uk/

5. Who pays for what

Researchers recognise that data preservation and archiving cost money, but feel they are unable to pay for them. In fact, the complex processes of data preservation are well understood in many disciplines outside research, especially those where severe and expensive legal requirements are imposed, e.g. financial institutions are required to keep data for many decades, and nuclear installations are required to keep digital records and data for perhaps centuries. In the UK at least, the cost of both data preservation and data sharing is recognised in many policies from the Research Councils. Assisting in developing and promoting these policies, with well documented and realistic scenarios and use cases for best-practice data management throughout the research process, would be advantageous to all.

2.5. Opportunities in data exchange relating to researchers

In summary, returning to the criteria against which we attempt to identify opportunities for data exchange, the following have been identified from a researcher’s perspective.

Data issue: Researchers’ opportunity to help improve the situation

Availability: Researchers demand their data be treated as first class research objects; researchers loosen control over data; define roles of responsibility and control.
Findability: Agree a convention to propose to publishers regarding data citation; use of persistent identifiers such as DOIs; ensure common citation practices.
Interpretability: Recognize that data require metadata and work towards community best practice in metadata development.
Re-usability: Be concerned about the long term ability for secondary use and consider or seek out responsible preservation actions. Further, consider this as part of good research practice rather than as a closing activity.
Citability: Agree a convention for data citation; follow metadata standards for datasets; use of persistent identifiers such as DOIs.
Curation: Develop sustainable and realistic data management plans; collaboration with public data archives.
Preservation: Develop sustainable, realistic preservation plans; active engagement with public data archives.

Table 1 repeated: Data Opportunities for Researchers

28 http://www.datacite.org

These have been further broken down into incentives, drivers and enablers.

Incentives

- Joining functions that complete scholarly communication by integrating data and publications
- Citation framework that encourages credit, attribution and re-use to remove hesitation on the researcher’s side
- Review/validation processes (e.g. data journal) that support trust

Drivers

- Impact and re-use metrics that support incentives
- Data management plans/data sharing plans as part of start-up activity
- Disentangled responsibility between data creators and data custodians

Enablers

- Clear and consistent IPR and other rights statements from stakeholders
- Mandated infrastructures like data centres and data archives that can persist and preserve
- Recognition frameworks that support data as first class research objects
- Submission processes that minimise overheads and effort
- Embedded training activities and practices that focus on data management skills rather than simply data manipulation skills

3. INTEGRATION OF DATA AND PUBLICATIONS: THE PUBLISHERS’ PERSPECTIVE

3.1. How scholarly journals handle the increasing amount of data alongside the article

Building on the previous chapters of this report, and well aware of the desire of researchers to publish their data in a citable way and to find data via the formal literature, we focus here on the ways publications and data are being integrated. In our survey of present practices and new initiatives in the field of STM journals, we have encountered (and will describe in this chapter) four basic categories of ways in which data and publications can be connected and/or integrated. They largely follow the Data Publication Pyramid as presented in the Introduction Chapter and the first four categories listed there (category 5, data in drawers and on disks, concerns unpublished data, which is addressed further in Chapter 5):

Graph 1 repeated: The <strong>Data</strong> Publication Pyramid, see chapter 1 <strong>for</strong> a full explanation.<br />

This is a high level overview <strong>of</strong> each <strong>of</strong> these categories be<strong>for</strong>e they are described more<br />

in-depth in the remainder <strong>of</strong> the chapter, with several examples:<br />


1. Data contained within peer reviewed articles

This is the traditional publishing model, in which the researcher fully analyzes and processes the data and describes the conclusions derived from them in the scholarly article. The conclusions are illustrated by summarizing the relevant data (or data outcomes) in tables, graphs and other illustrations and, increasingly, in multimedia applications.

Advantages are the tight embedding and integration of the data into the scholarly record, where they are citable, retrievable and available to all researchers and users. Authors get full credit for their article.

Limitations are that the data are presented at a high level of aggregation, are hard to find separately from the article and are usually not in a form that is friendly to re-use.

2. Data resides in supplementary files added to the journal article

Nearly all STM journals allow authors to add, in supplementary files to their article, any relevant material that is too big for, or does not fit, the traditional article format or its narrative, such as large datasets, multimedia files, large tables, animations, high resolution files, protocols and large bibliographies. With the increasingly computational nature of many disciplines, the use of supplementary files has risen sharply in recent years (see also Chapter 1, Introduction).

Advantages of using supplementary files are that the volume of the data is no longer an issue and that the data are still closely tied to the official scholarly record and remain citable, while authors are no longer restricted by the article format. It makes optimal use of online facilities.

Limitations are that file size is usually limited to not much more than 10 MB and that, from an author’s perspective, the arrangements for curation and preservation of the supplementary files are not always clear. Few standards exist across journals on how to indicate the presence of supplementary files or where to find them. Only in a few cases is the supplementary material given a separate DOI or other persistent identifier to enable linking independently of the main article.

a. Journal articles offer supplementary files with extra data, but restricted in volume and format

A sub-category under supplementary data files exists for journals that have restricted the use of supplements. These are the first examples of high-standing journals that can no longer manage the sheer volume of material in supplementary files and have therefore either put limits on what can be included (Cell) or no longer accept supplementary material at all (Journal of Neuroscience), other than multimedia files that are considered integral to the article content.


Advantages are that any supplements to a journal article remain tied to the official scholarly record and are part of its peer review process.

Limitations lie in the new restrictions: adding original research data is often no longer possible.

3. Data resides in Community-endorsed Public Repository with bi-directional linking to and from articles

In this model the data relating to a scholarly article are deposited in designated public repositories; the best-known examples are GenBank and the Worldwide Protein Data Bank. The accession numbers of the data in those databases are added to the journal manuscript and referenced, often within the article text as well as in the footnotes or reference list.

Advantages are that the data become part of larger datasets in the same area, thereby serving the research community as a whole, and that they are normalized, standardized, curated and preserved. The connection between data and publication is secured via the accession numbers embedded in the article. Even better are the examples (Pangaea, CCDC, PubChem) where linking is bidirectional: from the article to the data and likewise from the data to the article. There are no restrictions on volume.

Limitations are that such databases exist so far only in a few subject areas, mainly biology, life science, earth science and chemistry. Their future often depends on government funding and may be threatened by budget cuts.
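The linking model described for this category can be made concrete with a small sketch. The following Python fragment is illustrative only and is not taken from any publisher's or repository's system: the class names, the `link_status` helper, and the sample DOI and accession number are all hypothetical. It shows the difference between a one-way reference (an accession number cited in the article) and bidirectional linking (the repository record also points back to the article).

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Article:
    """A journal article that may cite repository accession numbers."""
    doi: str
    accession_numbers: Set[str] = field(default_factory=set)


@dataclass
class DatasetRecord:
    """A repository record that may list the articles describing it."""
    accession: str
    article_dois: Set[str] = field(default_factory=set)


def link_status(article: Article, record: DatasetRecord) -> str:
    """Classify the link between one article and one dataset record."""
    forward = record.accession in article.accession_numbers  # article -> data
    backward = article.doi in record.article_dois            # data -> article
    if forward and backward:
        return "bidirectional"
    if forward:
        return "article-to-data only"
    if backward:
        return "data-to-article only"
    return "unlinked"


# Hypothetical identifiers, for illustration only.
article = Article(doi="10.1234/example.article", accession_numbers={"ABC12345"})
record = DatasetRecord(accession="ABC12345")

print(link_status(article, record))   # article-to-data only
record.article_dois.add(article.doi)  # the repository adds the back-link
print(link_status(article, record))   # bidirectional
```

Keeping the two directions as separate checks mirrors the distinction drawn in the text: an embedded accession number secures only the forward link, and the back-link has to be maintained by the repository.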

a. Journals have set up their own storage facility for data

A sub-category covers data referenced from articles that reside in a special data-storage facility established by the publisher. An example is Thieme publishers in Germany, described later in paragraph 3.6. Thieme has recently started a collaboration with the data facilities of FIZ Karlsruhe to offer its chemistry authors the possibility of storing the raw and original research data alongside their articles.

Advantages lie in the strong link between the data and the scholarly record and in the availability of the data for further examination and re-use by all journal readers, editors and other users.

Limitations are the danger of creating new data silos per journal or per publisher, which can be a barrier to discoverability and reuse.

b. Journals offer dynamic data made interactive, data can reside with the article or in public repositories

Another sub-category exists for journals that present the relevant data from within the article, whether those data sit in an official repository, a data centre or anywhere else. This model emphasizes what readers of a journal article can do with the data rather than where the data actually reside. Via click-throughs from graphs and tables, readers can explore the underlying data and their visualizations. As an example, the Biochemical Journal from Portland Press does this via dynamic PDFs. Elsevier has some examples of data viewers that work from within the article but use data held in GenBank and the Protein Data Bank (PDB). This may increasingly emerge as a model for Linked Open Data and the emerging data web.

Advantages lie clearly in reuse and in the increased interpretability of the underlying data. Data become re-usable within the context of the scholarly record.

Limitations lie in the open availability of these data. The applications are usually only used for the data within the article, i.e. category 1 of this list.

4. Journals dedicated to so-called Data Publications only

In this model the journal publishes descriptive articles about datasets that are usually stored in a repository. Describing how the data were generated and how they might be used gives the authors credit for their work while strongly promoting interpretability and re-use. Examples are Earth System Science Data (ESSD) and the newly launched journal GigaScience. Other, already existing journals offer hybrid models in which they have opened up to descriptive data articles (Int Journal of Robotics) as a new article type next to traditional research papers.

Advantages are the credit for the author, the citability and the reuse.

Limitations lie in the challenges for high-quality peer review: very few peer review standards exist so far for datasets and their descriptions. The system depends on proper and persistent bidirectional linking.

We can summarize and compare these four categories (plus three sub-categories) in the following table. Subsequent paragraphs contain more extensive descriptions of current practices and policies.


Data (selections) underlying articles reside in:

1. Data contained within the peer reviewed article, in tables, graphs, plots, etc.
Advantages: Analyzed data and relevant data selections are an integral part of the Record of Science. Readers, users and peer reviewers can find and consult these data selections.
Limitations: Usually a high level of aggregation of the data; more data summaries than the full set of original data. Usually not findable or retrievable separately from the article. Not well reusable outside the context of the article.
Examples: All peer reviewed scholarly journals.

2. Supplementary files to journal articles, whatever they contain, with very few restrictions on size and format
Advantages: Datasets and publications are tightly connected; data are embedded in the public record of science and managed and preserved as such; the author gets full credit; reviewers and readers are able to access the data in combination with the article.
Limitations: Volume is a limitation, usually datasets not bigger than 10 MB. Curation sometimes unclear; preservation likely to remain limited to that for articles. Easy discovery and re-use hampered by fragmentation over journal silos. Not all supplements are linkable. Sometimes the files can only be accessed via the article and not independently.
Examples: The vast majority of STM journals, and demand for this from authors has been increasing rapidly. Some journals (Nature, PLoS) have made the availability of underlying research data a condition for publication. First examples appear of journals that can no longer handle the overload (Journal of Neuroscience, Cell).

2.a Supplementary files to journal articles, with restrictions and tightened instructions to manage the proliferation of supplemental material
Advantages: More clarity on the supplemental materials that journals can and will support. Better assurance of curation, preservation and perpetual access. Journals will encourage authors to place unsupported materials in a reliable repository.
Limitations: Volume is usually further restricted to underlying tables or explanatory graphs; full data sets are not included.
Examples: Cell, Journal of Neuroscience, and a growing number of journals contemplating such restrictions because they find it hard to handle the growing volume and variety of formats (example: NISO/NFAIS working group).

3. Community-endorsed Public Repository with bi-directional linking to and from articles
Advantages: Data reside in a place where proper curation and formatting are secured, as well as discoverability and reuse. The bidirectional linking ensures connections with the publications. Author credits are indirect.
Limitations: The life science area is well covered with many initiatives; in other areas the first initiatives are emerging. Many disciplines still lack a common solution. The future of existing repositories is sometimes under threat because of pending cuts in government funding.
Examples: Most journals in molecular biology and life sciences list these repositories and require authors to deposit there and submit the accession numbers of the databases. Strong supporters of this approach are PLoS, Science and Nature. Pangaea is an often cited example in earth sciences, collaborating with all Elsevier journals in this area.

3.a Journals have set up their own storage facility for data
Advantages: Authors can be assured that data are well curated and get the right metadata attached for findability and checking of provenance. Data remain closely connected to the article and become part of the public record of science.
Limitations: Data will be spread over many different journals and may end up fragmented over silos, hampering reuse across different platforms.
Examples: Thieme, for its chemistry journals, in collaboration with FIZ and TIB.

3.b Journals offer dynamic data made interactive, data can reside with the article or in public repositories
Advantages: Graphs that show the underlying data via an extra click add depth to a research article. Data remain in the context of the article and become reusable at the same time.
Limitations: Usually only applied to data presentations in the article, not to (large) raw data sets as such.
Examples: The Biochemical Journal by Portland Press; experimental special issues by the Optical Society of America (OSA); Elsevier for data in the Worldwide Protein Data Bank (PDB).

4. Journals dedicated to so-called Data Publications only
Advantages: Data are described in depth in these publications, facilitating findability, interpretability and re-use. Data remain in larger repositories and can be combined with other datasets. Data creators get the full credit of a public record of science. The data become citable.
Limitations: Peer review of large sets of data is a challenge. Curation and preservation are in the hands of the repository. The system requires persistent bidirectional linking to work well.
Examples: Earth System Science Data, working with Pangaea.

Table 5: Categories to publish research data.


In the following paragraphs, each of these categories will be further explained and illustrated with real-life cases.

3.2. Common practice: Supplementary Material to Journal Articles<br />

The large majority of journals accept research data in supplementary files, as the results of the PARSE.Insight survey 29 show:

Graph 18 from PARSE.Insight survey: Can authors submit their underlying research data with their publication?, N = 134 Publishers

If we weight the results by the size of the publishers (see the PARSE.Insight report: the largest 3 % of publishers publish 70 % of all journal articles, and they all accept supplementary files), then over 90 % of all journals accept supplementary files with research data.
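The weighting argument can be checked with simple arithmetic. The sketch below is ours and is not part of the PARSE.Insight analysis; it treats the share of articles as a stand-in for the share of journals (an assumption, since the report quotes article shares but concludes about journals), and the function names are hypothetical. It computes the acceptance rate that the remaining publishers would need for the overall figure to reach 90 %.

```python
def overall_acceptance(large_share: float, large_rate: float, small_rate: float) -> float:
    """Weighted acceptance rate across large and remaining publishers."""
    return large_share * large_rate + (1.0 - large_share) * small_rate


def min_small_rate(target: float, large_share: float, large_rate: float = 1.0) -> float:
    """Acceptance rate the remaining publishers need to reach the target overall."""
    return (target - large_share * large_rate) / (1.0 - large_share)


# Figures quoted in the text: the largest publishers cover 70 % of articles
# and all of them accept supplementary files.
needed = min_small_rate(target=0.90, large_share=0.70)
print(round(needed, 3))  # 0.667
```

Under these assumptions, roughly two thirds of the output of the remaining publishers would have to accept supplementary files for the overall figure to exceed 90 %.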

Publishers generally accept a wide range of file and data formats (source, again: PARSE.Insight):

29 PARSE.Insight survey, see http://www.parse-insight.eu/downloads/PARSE-Insight_D3-4_SurveyReport_final_hq.pdf


Graph 19 from PARSE.Insight survey: What file formats does your journal accept?, N = 134 Publishers

As was shown in the introductory chapter (Chapter 1) of this report, more than half of the researchers surveyed for PARSE.Insight would like to submit their data together with their manuscript to journals and publishers. Because fewer than 20 % currently do so, this points to likely growth in the submission of supplementary data files. Publishers confirm this growth trend, as will become clear from several examples provided in this chapter.

The instructions that these researchers find for most journals are fairly straightforward. This is the general instruction to authors from a large publisher with more than 2000 journals (in this case Springer, bold text by us) 30 :

Springer accepts electronic supplementary material (animations, movies, audio, large original data, etc.) which will be published in the online version along with the article or a book chapter. This feature can add dimension to the article, as certain information cannot be printed or is more convenient in electronic form.

30 Springer author instructions, see http://www.springer.com/authors/manuscript+guidelines?SGWID=0-40162-12-339499-0


From another publisher of similar size we find very similar instructions for most of its journals (bold typeface by us) 31 :

Elsevier accepts electronic supplementary material to support and enhance your scientific research. Supplementary files offer the author additional possibilities to publish supporting applications, high-resolution images, background datasets, sound clips and more. Supplementary files supplied will be published online alongside the electronic version of your article in Elsevier Web products, including ScienceDirect.

Elsevier expresses a preference for certain file formats similar to that found with other publishers.

Some journals have started to make the availability of underlying research data a condition for acceptance of the article. See for example the text by Nature in its instructions to authors (underlining added by us) 32 :

An inherent principle of publication is that others should be able to replicate and build upon the authors' published claims. Therefore, a condition of publication in a Nature journal is that authors are required to make materials, data and associated protocols promptly available to readers without undue qualifications in material transfer agreements. Any restrictions on the availability of materials or information must be disclosed to the editors at the time of submission.

Nature says that supporting data must be made available to editors and peer reviewers at the time of submission for the purposes of evaluating the manuscript. But the journal does not say where authors should make their materials, data and protocols available for readers and users, apart from a few exceptions regarding public repositories that will be mentioned in the next paragraph. The same is the case for the open access journal PLoS One 33 :

PLoS is committed to ensuring the availability of data and materials that underpin any articles published in PLoS journals. We believe the ideal is that all data relevant to a given article and all readily replaceable materials be made immediately available without restrictions (whilst not compromising confidentiality in the context of human-subject research). (….)

Failure to comply with this policy will be taken into account when publication decisions are made. We encourage researchers to contact journal editors if they encounter difficulties in obtaining data or materials from published articles. PLoS reserves the right to post corrections on articles, to contact authors’ institutions and funders, and in extreme cases to withdraw publication, if restrictions on data/materials access come to light after publication.

How effective these mandates are is not known, and anecdotal evidence suggests there is still a long way to go. An empirical study of data sharing by authors publishing in PLoS journals reports that requests for the data underlying published articles were successful in only one case out of ten, in spite of the journals’ clear policies (Savage and Vickers 2009) 34 .

31 Elsevier author instructions, see an example at: http://www.elsevier.com/wps/find/journaldescription.cws_home/505772/authorinstructions#87000

32 Nature instructions to authors on availability of data: http://www.nature.com/authors/policies/availability.html

33 PLoS: http://www.plosone.org/static/policies.action#sharing

3.3. New limits on Supplemental Files to journal articles: restrictions to supplements<br />

The notion is growing that scholarly journals struggle to handle the exponential growth in supplementary data files and the responsibility they carry for securing permanent access and long-term preservation. As a consequence, the Journal of Neuroscience announced a new policy in 2010 to no longer accept supplementary materials at all. Instead, authors may host supplemental material on an external web site and include in their article a footnote with a URL pointing to that site and a brief description of its contents. Reviewers and editors will no longer evaluate the supplemental material; the article as submitted will be treated as a self-contained entity.

On the journal's site we can read the reasoning behind this. There is a clear indication that the exploding volume, and the burden it placed on peer review, was becoming a real obstacle (underlining added by us) 35 :

Although The Journal has published electronically since 1996, supplemental material first appeared around 2003. Since then, the amount of material associated with a typical article has grown dramatically (….) The sheer volume of supplemental material is adversely affecting peer review.

The related editorial even speaks of a ‘proliferation among authors’ adding more and more material in the article supplements.

In a similar context, the journal Cell implemented a new policy in October 2009, again referring to the ever-growing amount of supplementary material (underlining added by us) 36 :

Supplemental Information is a useful resource for presenting essential supporting materials online, and Cell Press is committed to the publication of these materials. However, as the amount of Supplemental Information has grown, it has become increasingly difficult for authors, reviewers, and readers to navigate due to the volume of information and the lack of defined structure and limits. To address these problems, we are introducing these guidelines, which we believe will make Supplemental Information more useful and accessible to readers.

The restrictions that Cell puts in place concern a limit on the volume and total number of supplementary items:

The total number <strong>of</strong> supplemental data items <strong>of</strong> all types (figures, tables, movies<br />

<strong>and</strong> other) per paper may not exceed two times the number <strong>of</strong> figures <strong>and</strong> tables<br />

34 Savage <strong>and</strong> Vickers 2009 in PLoSOne:<br />

http://www.plosone.org/article/info:doi/10.1371/journal.pone.0007078<br />

36 Cell editorial: http://www.cell.com/retrieve/pii/S0092867409011817<br />

Opportunities <strong>for</strong> <strong>Data</strong> Exchange (ODE) –www.ode-project.eu 45


Report on <strong>Integration</strong> <strong>of</strong> <strong>Data</strong> <strong>and</strong> <strong>Publications</strong> Grant Agreement no.: 261530<br />

in the main paper. For example, a paper with 7 main figures can have up to 14<br />

supplemental items total, <strong>of</strong> which up to 7 may be figures.<br />
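
The arithmetic of the quoted limit can be sketched as follows. This is a minimal illustration, not Cell's own tooling: the function names are hypothetical, and the cap on supplemental figures (at most one per main figure) is an assumption inferred only from the "7 main figures" example above.

```python
def supplemental_limits(main_figures: int, main_tables: int = 0) -> dict:
    """Return the caps implied by the quoted rule for a single paper."""
    if main_figures < 0 or main_tables < 0:
        raise ValueError("counts must be non-negative")
    main_items = main_figures + main_tables
    return {
        # Quoted rule: at most twice the number of main figures and tables.
        "max_items": 2 * main_items,
        # Assumption: inferred from the quoted example (7 figures -> up to 7).
        "max_figures": main_figures,
    }


def within_limits(main_figures: int, main_tables: int,
                  supp_items: int, supp_figures: int) -> bool:
    """Check a submission's supplemental counts against both caps."""
    limits = supplemental_limits(main_figures, main_tables)
    return (supp_items <= limits["max_items"]
            and supp_figures <= limits["max_figures"])
```

For the paper in the quoted example, `supplemental_limits(7)` gives a cap of 14 items in total and 7 figures.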

Another restriction from Cell is that the supplemental material must bear a direct<br />

relationship to the main conclusions <strong>and</strong> content in the paper <strong>and</strong> must stay within the<br />

same scope:<br />

Supplemental In<strong>for</strong>mation is limited to data <strong>and</strong> other materials that directly<br />

support the main conclusions <strong>of</strong> a paper but are considered additional or<br />

secondary support <strong>for</strong> the main conclusions, or cannot be included in the main<br />

paper <strong>for</strong> reasons such as space or file <strong>for</strong>mat restrictions. Supplemental<br />

In<strong>for</strong>mation should be within the conceptual scope <strong>of</strong> the main paper <strong>and</strong> not<br />

extend beyond it.<br />

Quoted in The Scientist (February 2011) 37 , Emilie Marcus, Editor-in-Chief of Cell<br />

Press Journals, says "It had become a limitless bag <strong>of</strong> stuff." The publisher did not<br />

consider abolishing supplementary materials altogether because they have a diverse<br />

readership, with different levels <strong>of</strong> interest in a study's details, Marcus explained, but it<br />

was necessary to rein it in. "I do think there are different solutions <strong>for</strong> different<br />

journals," Marcus said. "Scientific communities <strong>and</strong> journals have probably not given<br />

enough thought to what to do with this capacity <strong>for</strong> supplemental materials. That needs<br />

to evolve."<br />

In 2010 a working group was set up by NISO/NFAIS 38 with a charge to define best<br />

practice recommendations on how publishers can best treat supplementary in<strong>for</strong>mation,<br />

in terms <strong>of</strong> inclusion, h<strong>and</strong>ling, display <strong>and</strong> preservation. These guidelines are expected<br />

to appear in the second half <strong>of</strong> 2011. They will cover best practices <strong>for</strong> supplementary<br />

files to journals <strong>and</strong> also roles <strong>and</strong> responsibilities <strong>for</strong> availability, findability, quality<br />

control <strong>and</strong> preservation.<br />

3.4. How safe is data in supplementary journal article files? (or: quality and preservation<br />

of supplementary journal article files)<br />

How do publishers treat the data in supplementary files <strong>and</strong> how safe is the data there?<br />

A complaint that publishers sometimes receive is about the lack <strong>of</strong> transparency over<br />

whether the supplementary files received from authors have been edited at all, peer<br />

reviewed or checked on <strong>for</strong>mat <strong>and</strong> general quality. Many publishers will just <strong>of</strong>fer the<br />

author the service <strong>of</strong> posting the files in connection to the article exactly in the way they<br />

were received from the author. In some cases the files were part <strong>of</strong> the peer review<br />

process, in others they were supplied after acceptance of the article. Some publishers<br />

convert the supplementary files into PDFs before posting, which hinders re-use<br />

and further, deeper analysis of the data.<br />

The ways in which the supplementary files are provided with metadata vary widely<br />

between publishers. Some leave this to authors entirely which does not add to<br />

37 The Scientist: Supplemental or detrimental? - The Scientist - Magazine <strong>of</strong> the Life<br />

Sciences http://www.the-scientist.com/news/display/58027/#ixzz1NRVKAV6F)<br />

38 NISO, see http://www.niso.org/workrooms/supplemental<br />


consistency across data files. The way supplementary files are mentioned in the article, or<br />

referenced from the article, also follows many different practices.<br />

Librarians sometimes struggle to make sure that a literature search they carry<br />

out for research groups contains all related supplements (see Chapter 4). The<br />

NISO/NFAIS Best Practice recommendations mentioned earlier aim to promote a<br />

clearer and more common practice across publishers, one that helps librarians find<br />

supplementary material when it exists.<br />

Again, the findings from PARSE.Insight confirm many aspects of this picture. The<br />

137 publishers responding to the PARSE.Insight survey say about research data submitted by<br />

the author:<br />

• only 51 % of publishers validate the data submitted, mostly checking the file<br />

formats<br />

• only 44 % facilitate direct links to it<br />

• 39 % require copyright transfer (against 57 % that do not)<br />

• 70 % have no preservation measures in place for the supplemental data other than<br />

those for the articles<br />

As shown in the following survey results :<br />

Graph 20, source PARSE.Insight, N=137 publishers<br />


Graph 21, source PARSE.Insight, N=137 publishers<br />

Graph 22, source PARSE.Insight, N=137 publishers<br />


Graph 23, source PARSE.Insight, N=137 publishers: "Do you have digital preservation arrangements for underlying research data?"<br />

3.5. Data in community-endorsed public databases, linked to journal articles<br />

In a number <strong>of</strong> subject areas, community archives have emerged where researchers are<br />

requested to deposit their data. Some <strong>of</strong> these examples were already mentioned in<br />

Chapter 2 <strong>of</strong> this report. Publishers have adopted these practices <strong>and</strong> a large number <strong>of</strong><br />

journals encourage the authors to deposit their data there, rather than sending it along<br />

with their article.<br />

The advantages of this are manifold: the databases become more comprehensive, the<br />

data become more discoverable and can be used in combination with other data, and<br />

the connection with publications is ensured via bidirectional linking.<br />

PLoS One has been advocating such a policy quite extensively 31 :<br />

1. <strong>Data</strong> <strong>for</strong> which public repositories have been established <strong>and</strong> are in general use<br />

should be deposited be<strong>for</strong>e publication, <strong>and</strong> the appropriate accession numbers or<br />

digital object identifiers published with the paper.<br />

2. If an appropriate repository does not exist, data should be deposited as<br />

supporting in<strong>for</strong>mation with the published paper. If this is not practical, data<br />

should be made freely available upon reasonable request.<br />

3. The conclusions <strong>of</strong> a study must not be dependent solely on the analysis <strong>of</strong><br />

proprietary data. If proprietary data were used to reach a conclusion, <strong>and</strong> the<br />

authors are unwilling or unable to make these data public, then the paper must<br />

include an analysis <strong>of</strong> public data that validates the conclusions so that others<br />

can reproduce the analysis <strong>and</strong> build on the findings.<br />

PLoS also adds about the ideal <strong>of</strong> data sharing:<br />

We appreciate, however, that this ideal is not yet the norm in all fields. We are<br />

there<strong>for</strong>e currently collaborating with a number <strong>of</strong> subject-specific initiatives in<br />

order to develop relevant policies. In the meantime, authors must comply<br />


with current best practice in their discipline <strong>for</strong> the sharing <strong>of</strong> data via databases:<br />

<strong>for</strong> example, deposition <strong>of</strong> microarray data in ArrayExpress or GEO; deposition <strong>of</strong><br />

gene sequences in GenBank or EMBL; <strong>and</strong> deposition <strong>of</strong> ecological data in<br />

DRYAD. We encourage all authors to comply with available field-specific<br />

st<strong>and</strong>ards <strong>for</strong> the preparation <strong>and</strong> recording <strong>of</strong> data.<br />

A similar policy encouraging authors to deposit their data in ‘approved databases’ is<br />

followed by the journal Science. Their instructions say 39 :<br />

Science supports the ef<strong>for</strong>ts <strong>of</strong> databases that aggregate published data <strong>for</strong> the<br />

use <strong>of</strong> the scientific community. There<strong>for</strong>e, appropriate data sets (including<br />

microarray data, protein or DNA sequences, atomic coordinates or electron<br />

microscopy maps <strong>for</strong> macromolecular structures, <strong>and</strong> climate data) must be<br />

deposited in an approved database, <strong>and</strong> an accession number or a specific access<br />

address must be included in the published paper.<br />

Where no such approved repository exists, Science<br />

wishes to host the datasets on its own website as supplementary material, or at least to<br />

hold the material in escrow if the author hosts the files on an institutional<br />

website.<br />

Further reading of the instructions suggests that Science is also seeking a careful balance for<br />

certain types of supplementary material: avoiding having to absorb<br />

too much while still ensuring that the underlying material can be examined by its readers.<br />

The journal asks authors to follow special procedures <strong>for</strong> making large or complex<br />

datasets available. See their special instructions <strong>for</strong> complex supporting data at their<br />

website.<br />

While these well-known titles are rather specific in their support for<br />

community-endorsed databases, it has become an established custom in many areas for publishers to<br />

collaborate with the main data archives. Typical author instructions for data deposits in<br />

public archives, linked to the publications, are (source Elsevier 29 ):<br />

If your article contains relevant unique identifiers or accession numbers<br />

(bioin<strong>for</strong>matics) linking to in<strong>for</strong>mation on entities (genes, proteins, diseases, etc.)<br />

or structures deposited in public databases, then please indicate those entities<br />

according to the st<strong>and</strong>ard explained below.<br />

Authors should explicitly mention the database abbreviation (as mentioned<br />

below) together with the actual database number, bearing in mind that an error<br />

in a letter or number can result in a dead link in the online version <strong>of</strong> the article.<br />

Please use the following <strong>for</strong>mat: <strong>Data</strong>base ID: xxxx<br />

And most publishers will specify the databases that allow links from (<strong>and</strong> increasingly<br />

to) the article:<br />

39 Science: see http://www.sciencemag.org/site/feature/contribinfo/prep/gen_info.xhtml#dataavail<br />

<strong>and</strong> http://www.sciencemag.org/site/feature/contribinfo/prep/prep_online.dtl <strong>and</strong><br />

http://www.sciencemag.org/site/feature/contribinfo/prep/prep_online_special.xhtml<br />


Links can be provided in your online article to the following databases (examples<br />

<strong>of</strong> citations are given in parentheses):<br />

• GenBank: Genetic sequence database at the National Center <strong>for</strong><br />

Biotechnical In<strong>for</strong>mation (NCBI) (GenBank ID: BA123456)<br />

• PDB: Worldwide Protein <strong>Data</strong> Bank (PDB ID: 1TUP)<br />

• CCDC: Cambridge Crystallographic <strong>Data</strong> Centre (CCDC ID: AI631510)<br />

• TAIR: The Arabidopsis In<strong>for</strong>mation Resource database (TAIR ID:<br />

AT1G01020)<br />

• NCT: ClinicalTrials.gov (NCT ID: NCT00222573)<br />

• OMIM: Online Mendelian Inheritance in Man (OMIM ID: 601240)<br />

• MINT: Molecular INTeractions database (MINT ID: 6166710)<br />

• MI: EMBL-EBI IntAct database <strong>for</strong> Molecular Interactions (MI ID: 0218)<br />

• UniProt: Universal Protein Resource Knowledgebase (UniProt ID:<br />

Q9H0H5)<br />
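
As an illustration of why the quoted "Database ID: xxxx" convention is easy to turn into links, the following minimal sketch extracts such mentions from article text. It is not Elsevier's implementation: the regular expression and function name are hypothetical, the abbreviation set is copied from the list above, and the accession numbers themselves are not validated.

```python
import re

# Abbreviations taken from the list quoted above.
KNOWN_DATABASES = {
    "GenBank", "PDB", "CCDC", "TAIR", "NCT", "OMIM", "MINT", "MI", "UniProt",
}

# Hypothetical pattern for the "<abbreviation> ID: <number>" convention.
MENTION = re.compile(r"\b([A-Za-z]+) ID: ([A-Za-z0-9]+)")


def find_database_ids(text: str) -> list:
    """Return (database, accession) pairs for known abbreviations only."""
    return [(db, acc) for db, acc in MENTION.findall(text)
            if db in KNOWN_DATABASES]
```

A single mistyped character in the accession number would still match this pattern, which is exactly the dead-link risk the author instructions warn about.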

Findability, interpretability <strong>and</strong> re-usability are best served if the database ensures<br />

links back from the data(sets) to the articles about that data. Elsevier put this in place<br />

in 2010 via a collaboration with earth data archive PANGAEA. <strong>Data</strong>sets deposited at<br />

PANGAEA are automatically linked to corresponding articles in Elsevier journals on its<br />

electronic platform ScienceDirect and vice versa. A single click brings the user from the<br />

data to the ScienceDirect article, or conversely from ScienceDirect to the underlying data<br />

at PANGAEA, by means of DOIs for both the article and the dataset (see<br />

Elsevier/PANGAEA press release, Elsevier 2010) 40 .<br />

Elsevier summarizes the process in 5 simple steps:<br />

1. Author submits article to publisher<br />

2. Author submits data set to repository<br />

3. At article publication, repository links article DOI to associated data set DOI,<br />

creating actual connection<br />

4. User sees link to ScienceDirect from PANGAEA<br />

5. User sees link to PANGAEA from ScienceDirect<br />
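
The linking in steps 3 to 5 can be sketched as a small two-way lookup keyed on DOIs. This is a minimal illustration under stated assumptions, not the PANGAEA or ScienceDirect implementation: the class, its methods and the DOIs used in the usage note are hypothetical placeholders.

```python
class LinkRegistry:
    """In-memory sketch of bidirectional article/dataset links by DOI."""

    def __init__(self):
        self._datasets_by_article = {}
        self._articles_by_dataset = {}

    def link(self, article_doi: str, dataset_doi: str) -> None:
        # Step 3: record the connection once, in both directions.
        self._datasets_by_article.setdefault(article_doi, set()).add(dataset_doi)
        self._articles_by_dataset.setdefault(dataset_doi, set()).add(article_doi)

    def datasets_for(self, article_doi: str) -> list:
        # Step 5: what the article page would link to.
        return sorted(self._datasets_by_article.get(article_doi, ()))

    def articles_for(self, dataset_doi: str) -> list:
        # Step 4: what the repository page would link to.
        return sorted(self._articles_by_dataset.get(dataset_doi, ()))


def doi_url(doi: str) -> str:
    """Build a resolver URL for a DOI via the public doi.org resolver."""
    return "https://doi.org/" + doi
```

For example, after `registry.link("10.1234/article.1", "10.1234/dataset.1")` (both DOIs invented for illustration), each identifier can be looked up from the other, so neither platform needs to store the other's content.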

A few other databases <strong>and</strong> data-archives are also capable <strong>of</strong> providing links from the<br />

data to the corresponding articles, <strong>for</strong> example CCDC (Cambridge Crystallographic <strong>Data</strong><br />

Centre) 41 and the PubChem Database 42 . These initiatives will make it much easier for<br />

authors to deposit their data in public archives, as they can be assured that future users of<br />

the data can easily find the corresponding articles.<br />

Important intermediary services have recently entered the field that facilitate the<br />

workflow of authors and publishers for the parallel submission of data to a repository<br />

and of the manuscript to the publisher, ensuring that bi-directional linking between<br />

publications and data in public repositories is in place. DataCite and Dryad are very<br />

40 Elsevier press release:<br />

http://www.elsevier.com/wps/find/authored_newsitem.cws_home/companynews05_01616<br />

41 CCDC http://www.ccdc.cam.ac.uk/<br />

42 PubChem http://pubchem.ncbi.nlm.nih.gov/<br />


good examples, explained further in the next Chapter about Libraries and Data<br />

Centres. For most publishers, any system that uses DOIs as persistent identifiers can be<br />

incorporated in the workflow.<br />

3.6. <strong>Data</strong> storage as a service by the journal<br />

We know <strong>of</strong> at least one example where the journal <strong>of</strong>fers storage facilities <strong>for</strong> primary<br />

research data to the authors <strong>of</strong> their articles. In 2010 Thieme established, <strong>for</strong> its<br />

chemistry journals, an easy process by which the authors can submit primary chemistry<br />

data with their articles. Thieme, based in Stuttgart, works together with FIZ Karlsruhe<br />

(who also host their publications plat<strong>for</strong>m) <strong>and</strong> the TIB National Technical Library in<br />

Hannover. The process consists <strong>of</strong> 5 easy steps:<br />

1. Together with the article, the author submits the research data to<br />

Thieme.<br />

2. Thieme hosts the research data in a data center (FIZ Karlsruhe).<br />

3. TIB assigns a DOI to the data.<br />

4. When the article is published, the primary data are published as an<br />

independent entity, but in connection with the article.<br />

5. The article cites the research data as reference items with the assigned DOI.<br />

As their motivation <strong>for</strong> this new initiative, Thieme says in their press release 43 (link<br />

http://www.thieme.de/SID-4BE1BD47-99107897/connect-en/tc_oct_06_09.html ):<br />

In the field <strong>of</strong> chemistry, (…) data is accumulated by a variety <strong>of</strong> analytical,<br />

spectroscopic or computer simulation methods. Thus far, the vast amount <strong>of</strong> data<br />

lies scattered on the computers <strong>of</strong> scientists, who have produced the in<strong>for</strong>mation.<br />

As no central repository exists, no archival storage is possible at the moment.<br />

Scientific results are solely published in journals – but not the primary data from<br />

which those results originate. Due to the missing credit that working up such<br />

data currently receives, primary data is <strong>of</strong>ten poorly documented, difficult to<br />

access <strong>and</strong> not saved <strong>for</strong> the long term.<br />

Dr. Susanne Haak, Managing Editor <strong>and</strong> responsible <strong>for</strong> the chemistry journals at<br />

Thieme explains, “<strong>Access</strong> to primary data is a fundamental condition <strong>for</strong> research work,<br />

particularly in the natural sciences.” There<strong>for</strong>e, Thieme <strong>and</strong> experts from TIB have<br />

developed a uni<strong>for</strong>m structure <strong>for</strong> publishing primary data. Through structuring <strong>and</strong><br />

central data registration, a Germany-wide unique service <strong>of</strong> TIB, valuable knowledge<br />

will be harnessed.<br />

Since the service's inception at the end of 2009, Thieme's chemistry journals have published 13 articles<br />

with data files added, each roughly 5 – 10 MB in size. Thieme does not<br />

alter the data; it simply collects the files in ZIP archives and checks whether the ordering of the<br />

subfiles is logical in the context of the article. The data files cannot be included in the<br />

43 Thieme press release http://www.thieme.de/SID-4BE1BD47-99107897/connect-en/tc_oct_06_09.html<br />


(PDF) peer-review files and are not sent to the reviewers unless they ask the Editor to<br />

see them.<br />

Thieme does not have the ambition to build up a database with chemistry in<strong>for</strong>mation.<br />

The main aim is to provide individual datasets that support better underst<strong>and</strong>ing <strong>of</strong> the<br />

related research article. As Dr Susanne Haak explains: “The philosophy <strong>of</strong> a database is<br />

quite different from the collection <strong>of</strong> new, raw data. Everything that is entered into a<br />

searchable database needs to be carefully verified (PubChem is the best example)<br />

whereas raw data from experiments in chemistry may turn out to just help a chemist to<br />

underst<strong>and</strong> what happened during a reaction.” She adds: “ CIF files that might be<br />

submitted to us as part <strong>of</strong> primary data have usually also been submitted to CCDC<br />

be<strong>for</strong>e as this is since decades THE database <strong>for</strong> crystallographic structure in<strong>for</strong>mation”.<br />

3.7. Articles with interactive data<br />

In previous chapters we have explained the data problem as not just about availability<br />

<strong>and</strong> findability. It is also about interpretability <strong>and</strong> re-usability (see Chapter 1). Several<br />

publishers have undertaken initiatives to <strong>of</strong>fer services around research data underlying<br />

the article that improve interpretability <strong>and</strong> sometimes even re-usability.<br />

The Biochemical Journal 44 , published by Portland Press, offers dynamic PDFs for its<br />

articles through the so-called Utopia Documents. These provide extra features that turn<br />

graphs into tables and tables into graphs, from which users can start viewing and<br />

using the data in their own fashion. The data are available as Excel sheets, so they can be reused<br />

in other calculations.<br />

The Optical Society of America (OSA), together with the US National Library of Medicine (NLM), started an<br />

Interactive Science Publishing (ISP) project 45 in 2008 to enable authors to submit their<br />

data and figures to journals and to give editors and readers of their journals the<br />

possibility to view, analyze and interact with the source data connected with a scholarly<br />

article. Their main focus was the journal Optics Express. OSA supplies special software<br />

(from their ISP: Interactive Scholarly Publishing division 43 ) that helps the reader apply<br />

all visualization features to the material underlying the article. ISP allows authors to<br />

publish large 2-D <strong>and</strong> 3-D datasets with original source data that can be viewed <strong>and</strong><br />

analyzed interactively by readers.<br />

Elsevier provides in its articles a visualization and interaction applet for all related data<br />

that authors deposit in the PDB, the Worldwide Protein Data Bank. The applet allows<br />

readers of the article to choose from several presentation formats to investigate the<br />

protein structures, in 2D or 3D, rotating or still. With this example Elsevier emphasizes<br />

that publishers can offer high-value-added services for<br />

the analysis of data regardless of whether they store the data themselves. The data can<br />

just as easily reside in a public repository, available to all and available to applets<br />

that run on the data. In close collaboration with the NCBI, Elsevier offers a genome Viewer<br />

44 BioChemical Journal http://www.biochemj.org/bj/default.htm <strong>and</strong><br />

http://www.biochemj.org/bj/424/bj4240317add.htm<br />

45 OSA-ISP project http://www.opticsinfobase.org/isp.cfm<br />


<strong>for</strong> all gene sequence data deposited by the author in GenBank. This viewer can also be<br />

applied from within the Elsevier article 46 . See Illustration 1:<br />

Illustration 1: Screen view <strong>of</strong> Gene-viewer on ScienceDirect, Elsevier 46<br />

3.8. Special <strong>Data</strong> <strong>Publications</strong> <strong>and</strong> <strong>Data</strong> Papers<br />

With ever-growing volumes of datasets and the push for more sharing and<br />

open availability of data, there have been numerous suggestions for a new kind<br />

of journal dedicated specifically to so-called data publications. In this context we examine<br />

one of the very first initiatives in this space, the journal Earth System<br />

Science Data (ESSD) 47 .<br />

This journal, as it states on its website:<br />

aims to establish a new subject <strong>of</strong> publication: to publish data according to the<br />

conventional fashion <strong>of</strong> publishing articles, applying the established principles <strong>of</strong><br />

quality assessment through peer-review to datasets. The goals are to make<br />

datasets a reliable resource to build upon <strong>and</strong> to reward the authors by<br />

establishing priority <strong>and</strong> recognition through the impact <strong>of</strong> their articles.<br />

The journal sets as a very strict condition that the data described is deposited in a long<br />

term repository <strong>and</strong> lists several <strong>for</strong> which collaboration has been established.<br />

Other criteria <strong>for</strong> the data sets described are:<br />

• Persistent Identifier: The data sets have to have a unique and persistent<br />

identifier, e.g. DOI, ARK, etc.<br />

46 Example from the journal Genomics: http://dx.doi.org/10.1016/j.ygeno.2010.06.001<br />

47 ESSD: see http://www.earth-system-science-data.net/<br />


• Open <strong>Access</strong>: The data sets have to be available free <strong>of</strong> charge <strong>and</strong> without any<br />

barriers except a usual registration to get a login free-<strong>of</strong>-charge.<br />

• Liberal Copyright: Anyone must be free to copy, distribute, transmit and adapt<br />

the data sets as long as he/she is giving credit to the original authors (equivalent<br />

to the Creative Commons Attribution License).<br />

• Long-term Availability: The repository has to meet the highest st<strong>and</strong>ards to<br />

guarantee a long-term availability <strong>of</strong> the data sets <strong>and</strong> a permanent access.<br />
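
The four criteria above lend themselves to a simple checklist. The sketch below is only an illustration of that idea and not the journal's own checklist: the record fields and function name are hypothetical, and each criterion is reduced to a yes/no value that an editor would have to establish by other means.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetRecord:
    """Hypothetical summary of what an editor knows about a data set."""
    persistent_id: Optional[str]   # e.g. a DOI or ARK, None if missing
    open_access: bool              # free of charge, at most a free login
    reuse_with_attribution: bool   # copy/adapt allowed if authors are credited
    long_term_repository: bool     # deposited in a long-term repository


def unmet_criteria(record: DatasetRecord) -> List[str]:
    """List the criteria from the text that the record does not satisfy."""
    checks = [
        ("Persistent Identifier", bool(record.persistent_id)),
        ("Open Access", record.open_access),
        ("Liberal Copyright", record.reuse_with_attribution),
        ("Long-term Availability", record.long_term_repository),
    ]
    return [name for name, ok in checks if not ok]
```

An empty result means all four conditions are reported as met; it says nothing about the quality of the data, which the journal leaves to peer review.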

Since its launch in 2010, the journal has accepted around 30 articles in three volumes,<br />

most of them in Special Issues. The papers describe the data, the planning,<br />

instrumentation <strong>and</strong> execution <strong>of</strong> experiments or collection <strong>of</strong> data. Any interpretation <strong>of</strong><br />

data is outside the scope <strong>of</strong> its regular articles. Articles on methods describe nontrivial<br />

statistical <strong>and</strong> other methods employed, e.g. to filter, normalize or convert raw data to<br />

primary, published data, as well as nontrivial instrumentation or operational methods.<br />

Any comparison to other methods is out <strong>of</strong> scope <strong>of</strong> regular articles. The peer review is<br />

public <strong>and</strong> follows an open discussion <strong>for</strong>mat on their website.<br />

The peer-review which checks on uniqueness, usefulness <strong>and</strong> completeness as well as<br />

quality, ensures that the data sets are:<br />

• at least plausible <strong>and</strong> contain no detectable problems;<br />

• <strong>of</strong> sufficiently high quality <strong>and</strong> their limitations are clearly stated;<br />

• open accessible (toll free), well annotated by st<strong>and</strong>ard metadata (e.g., ISO 19115)<br />

<strong>and</strong> available from a certified data center/repository;<br />

• customary with regard to their <strong>for</strong>mat(s) <strong>and</strong>/or access protocol, however not<br />

proprietary ones (e.g., Open Geospatial Consortium st<strong>and</strong>ards), expected to be<br />

useable <strong>for</strong> the <strong>for</strong>eseeable future.<br />

The main aim of the journal appears to be to promote (re-)usability of research data:<br />

The articles in this journal should enable the reviewer <strong>and</strong> the reader to review<br />

<strong>and</strong> use the data, respectively, with the least amount <strong>of</strong> ef<strong>for</strong>t. To this end, all<br />

necessary in<strong>for</strong>mation should be presented through the article text <strong>and</strong> references<br />

in a concise manner <strong>and</strong> each article should publish as much data as possible. The<br />

aim is to minimize the overall workload <strong>of</strong> reviewers, e.g., by reviewing one<br />

instead <strong>of</strong> many articles, <strong>and</strong> to maximize the impact <strong>of</strong> each article.<br />

[…]It is clear that some <strong>of</strong> these quite abstract criteria may soon unfold to more<br />

(technically) specific ones, depending on the discipline or type <strong>of</strong> data. If<br />

necessary, the editors will try to make sure that more specific help <strong>for</strong> authors as<br />

well as <strong>for</strong> reviewers will be developed over time. (…)<br />

To help streamline the review process, a more <strong>for</strong>mal list <strong>of</strong> criteria has been<br />

developed, which may serve as a checklist.<br />

In a personal interview conducted for this study, Hans Pfeiffenberger, one of the two<br />

Editors-in-Chief, explains that over 20 articles were part of special issues and that so far fewer<br />

than 10 spontaneous submissions have been received, mainly because authors are not yet<br />

accustomed to this new type of publication. From the peer review reports it is not<br />

completely clear how deeply the reviewers have examined the data, but referees do tend to<br />

look at the methodology and the value of the data. The main task of the Editor is to check whether the<br />

data are deposited in a safe repository and are persistently accessible. He thinks there is<br />

still an educational task to perform regarding the way peer review can help check and improve<br />

the quality of data.<br />

Pfeiffenberger also sees a need for better standards on the citation of data, especially<br />

regarding which version of the data should be cited: the raw form, the cleaned-up or gridded data, or only<br />

the final data product ready for re-use? He quotes Oxford scholar David Shotton, who<br />

advocates that data should be cited as “first described in”, followed by a link to a paper.<br />

The journal has an open peer review process, making the comments by the reviewers<br />

available <strong>for</strong> further transparency <strong>of</strong> the journal’s policies.<br />

Besides ESSD, which started in 2009, further initiatives in this area have now been<br />

launched. One of them is the journal GigaScience 48 , published by BioMed Central. The<br />

journal, which opened for submissions in the summer of 2011 and works together with<br />

the Beijing Genomics Institute (BGI), aims, according to its website, “to revolutionize data<br />

dissemination, organization, underst<strong>and</strong>ing, <strong>and</strong> use. An online open-access open-data<br />

journal, we publish 'big-data' studies from the entire spectrum <strong>of</strong> life <strong>and</strong> biomedical<br />

sciences. To achieve our goals, the journal has a novel publication <strong>for</strong>mat: one that links<br />

st<strong>and</strong>ard manuscript publication with an extensive database that hosts all associated<br />

data <strong>and</strong> provides data analysis tools <strong>and</strong> cloud-computing resources”.<br />

It is likely that other journals will follow this example, possibly also in a hybrid way,<br />

including data-articles as a new article type <strong>for</strong> existing journals, in the way the<br />

International Journal <strong>of</strong> Robotics Research accepts <strong>Data</strong> Papers 49 : “<strong>Data</strong> papers are<br />

short (circa 4 pages) submissions that support <strong>and</strong> summarize a substantial archival<br />

data set which has itself been peer reviewed with the same diligence that regular<br />

submissions receive. The contribution is expected to be in the quality <strong>and</strong> utility <strong>of</strong> the<br />

data to the robotics community”.<br />

Similar discussions have already appeared on the blogs around PLoS.<br />

3.9. Gap analysis.<br />

In the previous paragraphs we have provided an overview <strong>of</strong> emerging practices <strong>for</strong> the<br />

integration of data and publications. Descending the Data Publication Pyramid, we find<br />

that data have always been part of the traditional literature in a highly aggregated form.<br />

In the past decade new extended forms of data presentation have found their way into<br />

online supplements to journal articles. Initially the absence <strong>of</strong> volume restrictions <strong>for</strong><br />

added data was a blessing. Their recent proliferation has caused a halt <strong>and</strong> new<br />

48 GigaScience Journal http://www.gigasciencejournal.com/<br />

49 <strong>Data</strong> Papers in IJRR<br />

http://www.uk.sagepub.com/journalsProdDesc.nav?prodId=Journal201324&crossRegion=eur#tabv<br />

iew=manuscriptSubmission<br />

limitations are being put in place to keep the data added to journal manuscripts relevant<br />

<strong>and</strong> manageable.<br />

At the same time, certain disciplines show the emergence <strong>of</strong> community endorsed data<br />

centres supported by scientific journals that include the archives’ accession numbers <strong>for</strong><br />

links from the publication, or even add interactive viewers within the article to study the<br />

data in the archive within the context <strong>of</strong> the article.<br />

With more <strong>and</strong> more data deposited in archives, a new publication type has emerged,<br />

that <strong>of</strong> the <strong>Data</strong> Article.<br />

Returning to the particular problem of making research data conform to the four<br />

criteria listed in Chapter 1:<br />

• Availability<br />

• Findability<br />

• Interpretability<br />

• Reusability<br />

we can make a rough rating of the journal practices described above against these<br />

four criteria (the greener the higher the overall rating is):<br />

Publication practice | Availability | Findability | Interpretability | Re-usability<br />

Data presented within articles | + | +/- | ++ | -<br />

Data in Journal Supplements, unrestricted | ++ | ++ | +++ | +/-<br />

Data in Journal Supplements, but restricted | + | ++ | +++ | -<br />

Data in public archives, linked to publications | +++ | ++++ | +++ | +++<br />

Journals storing data | + | ++ | +++ | ++<br />

Journals making data interactive | ++ | +++ | ++++ | ++++<br />

Data Publications | ++++ | ++++ | ++++ | ++++<br />

Table 6: Rating the different ways to publish research data against the Treloar criteria<br />

The most common current practice, which is to add data in supplementary files to<br />

journal articles, is clearly not the ideal one if measured against these four criteria.<br />

Data deposited in public, community endorsed archives seem to have a better future,<br />

especially in terms of re-use and also for availability and findability. This is particularly<br />

true if they are accompanied by proper data publications that describe them <strong>and</strong> make<br />

them interpretable <strong>and</strong> re-usable. At the same time we know that only a few discipline<br />

areas have these public, community endorsed archives. Indeed, some <strong>of</strong> these archives<br />

appear to be threatened by cuts in government spending.<br />

If we add the three additional criteria <strong>of</strong> Chapter 1, namely citability, curation <strong>and</strong><br />

preservation, the ratings for each of the current practices are:<br />

Publication practice | Citability | Curation | Preservation<br />

Data presented within articles | +++ | +++ | +++<br />

Data in Journal Supplements, unrestricted | ++++ | +/- | +/-<br />

Data in Journal Supplements, but restricted | ++ | ++ | +<br />

Data in public archives, linked to publications | ++++ | ++++ | ++++<br />

Journals storing data | +++ | +++ | +++<br />

Journals making data interactive | ++++ | +++ | ++++<br />

Data Publications | ++++ | ++++ | ++++<br />

Table 7: Rating the different ways to publish research data against the criteria <strong>of</strong> citability,<br />

curation <strong>and</strong> preservation.<br />

Curation and preservation of data in journal supplements are not always ensured by<br />

publishers, as we see from the PARSE.Insight data. The specific requirements for<br />

specific data formats, set against an ever growing variety of data formats (including<br />

multimedia), call for better solutions. Public repositories are better placed to deal with<br />

this issue. Bidirectional linking and data publications can ensure better citability of<br />

data. Two aspects of citability play a role: the data as such<br />

should be citable (via DOIs, accession numbers or other persistent identifiers), but in<br />

addition, the people who created or generated the data deserve credit. An increasing<br />

number <strong>of</strong> data repositories provide citation means <strong>for</strong> the data itself (including the links<br />

to <strong>and</strong> from the data), but very few conventions exist <strong>for</strong> the way the people behind the<br />

data get citation counts <strong>and</strong> credits <strong>for</strong> the work. The best way to do that currently is<br />

following the traditional way people are cited, via a publication that describes the data<br />

or has the data included.<br />
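The two aspects of citability named above can be illustrated with a minimal sketch. It assumes a hypothetical citation record whose field names, citation layout and DOIs are placeholders chosen for illustration only; they are not taken from any repository or citation standard discussed in this report. The record carries a persistent identifier for the data as such, lists the people who created the data, and optionally points to the publication that first describes the data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetCitation:
    """Hypothetical record; field names are illustrative, not a standard."""
    creators: List[str]                     # people who generated the data
    year: int
    title: str
    repository: str                         # archive holding the data
    doi: str                                # persistent identifier of the dataset
    described_in_doi: Optional[str] = None  # publication describing the data


def format_citation(c: DatasetCitation) -> str:
    """Render a human-readable citation that credits the data creators."""
    names = "; ".join(c.creators)
    text = f"{names} ({c.year}): {c.title}. {c.repository}. https://doi.org/{c.doi}"
    if c.described_in_doi:
        # Link back to the article so that credit can also flow via the paper.
        text += f" (first described in: https://doi.org/{c.described_in_doi})"
    return text


example = DatasetCitation(
    creators=["A. Author", "B. Author"],
    year=2011,
    title="Example dataset",
    repository="Example Repository",
    doi="10.1234/example.dataset",             # placeholder, not a real DOI
    described_in_doi="10.1234/example.paper",  # placeholder, not a real DOI
)
print(format_citation(example))
```

Keeping the creators and the describing publication in the same record is one possible way to let both the data and the people behind the data receive credit; other layouts are equally conceivable.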

3.10. From raw data to processed data to data interpretations.<br />

In 2007 a majority of the larger publishers in the STM arena signed the so-called<br />

Brussels Declaration, which states 50 :<br />

Raw research data should be made freely available to all researchers. Publishers<br />

encourage the public posting <strong>of</strong> the raw data outputs <strong>of</strong> research. Sets or sub-sets<br />

<strong>of</strong> data that are submitted with a paper to a journal should wherever possible be<br />

made freely accessible to other scholars<br />

In this context, it is important to emphasize that the statement concerns raw data, not<br />

processed data, or data presentations or the data-interpretations as presented in a<br />

journal article. Using the <strong>Data</strong> Publication Pyramid as presented in Chapter 1, the<br />

Brussels statement concerns the public posting <strong>of</strong> raw data, the base layers <strong>of</strong> the<br />

pyramid. Preferably this public posting is done in an aggregated way in community<br />

endorsed archives <strong>for</strong> specific subject fields. It can be expected that many scientific<br />

disciplines will see a growing need to have such common solutions available that allow<br />

50 http://www.stm-assoc.org/brussels-declaration/<br />

interlinking with journal publications. Project Dryad is one such example, the<br />

establishment <strong>of</strong> <strong>Data</strong>Cite in 2009 another one (both described in Chapter 4).<br />

For the middle area in the pyramid, that <strong>of</strong> <strong>Data</strong> Selections, Processed <strong>Data</strong> <strong>and</strong> <strong>Data</strong><br />

Representations, several options exist. These can be cared <strong>for</strong> in the context <strong>of</strong> journals<br />

or in the context <strong>of</strong> archives <strong>and</strong> databases, depending on the level <strong>of</strong> processing <strong>of</strong> the<br />

data. Traditionally these kinds <strong>of</strong> data have been included in supplementary files to<br />

journals <strong>and</strong> this custom is expected to grow further. Increasingly, we can expect more<br />

<strong>and</strong> more journal policies to raise the level <strong>of</strong> required data selection <strong>and</strong> processing<br />

(example: Cell) <strong>and</strong> no longer accept anything <strong>and</strong> everything in the supplemental<br />

materials but instead pose restrictions on volume <strong>and</strong> <strong>for</strong>mat along the criteria <strong>of</strong><br />

relevancy <strong>and</strong> manageability.<br />

At the apex of this pyramid, where data interpretations are included in a publication, there<br />

are no indications of paradigm shifts. It is likely, however, that data and publications will<br />

integrate further, at different levels and in more novel ways. There will be<br />

innovations in how the data that confirm certain analyses and conclusions are displayed<br />

and presented. Examples given in this chapter concern data made interactive<br />

from within the article via protein viewers <strong>and</strong> genome viewers.<br />

3.11. Diverging <strong>and</strong> Converging Trends.<br />

We see diverging trends taking place as well as converging trends in the way publishers<br />

are h<strong>and</strong>ling the increasing amount <strong>of</strong> data alongside articles. This is probably a strong<br />

indication that this area is in transition at the moment.<br />

A clear example <strong>of</strong> diverging trends can be found in:<br />

• Journals have more <strong>and</strong> more data submitted in supplementary files, <strong>and</strong> most <strong>of</strong><br />

the journals accommodate this, also <strong>for</strong> a growing variety <strong>of</strong> file <strong>for</strong>mats. At the<br />

same time, the first few examples now exist where journals could no longer<br />

h<strong>and</strong>le this flow in view <strong>of</strong> the sheer volume <strong>and</strong> hence have stopped accepting<br />

supplementary files or have placed limits on them.<br />

Whereas a converging trend is emerging in these areas:<br />

• More journals support the principles <strong>of</strong> data sharing <strong>and</strong> data availability <strong>and</strong><br />

press (or even m<strong>and</strong>ate) authors to deposit data in public archives <strong>and</strong> to follow<br />

the conventions <strong>for</strong> this in their subject area.<br />

• More publishers collaborate with community endorsed, public archives to make<br />

data <strong>and</strong> publications inter-linkable <strong>and</strong> citable, thereby endorsing the Brussels<br />

declaration in practice, <strong>and</strong> with positive effects on the integration <strong>of</strong> data <strong>and</strong><br />

publications, their findability, discoverability, interpretability <strong>and</strong> re-use.<br />

• A growing number <strong>of</strong> publishers <strong>of</strong>fer services to present data in more<br />

sophisticated <strong>and</strong> even interactive ways, that increase their interpretability <strong>and</strong><br />

hence their re-use further.<br />

3.12. Opportunities <strong>for</strong> publishers in <strong>Data</strong> Exchange<br />

From the practices <strong>and</strong> laudable initiatives gathered <strong>and</strong> analyzed in this research<br />

study, we can summarize the following elements as important opportunities <strong>for</strong><br />

publishers to further improve the integration <strong>of</strong> data <strong>and</strong> publications.<br />

• Require availability <strong>of</strong> underlying research material as an editorial policy<br />

(example: Nature, PLoS)<br />

• Treat digital research data submitted to journals more carefully and ensure<br />

it is stored, curated <strong>and</strong> preserved in trustworthy places (several examples <strong>of</strong><br />

collaboration with community endorsed repositories)<br />

• Ensure (bi-directional) links <strong>and</strong> persistent identifiers (examples <strong>for</strong> listed public<br />

archives, <strong>Data</strong>Cite, Dryad)<br />

• Establish uni<strong>for</strong>m citation practices (examples Elsevier-PANGAEA, ESSD,<br />

<strong>Data</strong>Cite, Dryad, Thieme)<br />

• Establish common practice <strong>for</strong> peer review <strong>of</strong> data (example ESSD)<br />

• Develop data-publications <strong>and</strong> quality st<strong>and</strong>ards (example ESSD, GigaScience, IJ<br />

Robotics Research)<br />

In order to <strong>of</strong>fset these points against the listed issues around data we create the<br />

following table:<br />

<strong>Data</strong> Issue:<br />

Availability<br />

Findability<br />

Interpretability<br />

Re-usability<br />

Citability<br />

Curation<br />

Preservation<br />

Publishers opportunity to help improve situation:<br />

Articles with data provide richer content <strong>and</strong> higher usage<br />

Impose stricter editorial policies about availability <strong>of</strong> underlying data<br />

which is in line with general funder’s trends<br />

Ensure data is stored in a safe place, preferably a public repository<br />

Be transparent about curation <strong>and</strong> preservation <strong>of</strong> submitted data<br />

Ensure bi-directional links between data <strong>and</strong> publications<br />

Use <strong>of</strong> persistent identifiers such as DOI’s<br />

Ensure common citation practices<br />

Provide services around data such as viewer apps <strong>for</strong> underlying data<br />

within the article or interactive graphs, tables <strong>and</strong> images<br />

<strong>Data</strong> <strong>Publications</strong><br />

Interactive data from within articles<br />

Links to the relevant datasets, not just the database<br />

<strong>Data</strong> <strong>Publications</strong><br />

Establish uni<strong>for</strong>m data citation st<strong>and</strong>ards<br />

Follow metadata st<strong>and</strong>ards <strong>for</strong> datasets<br />

Use <strong>of</strong> persistent identifiers such as DOI’s<br />

<strong>Data</strong> <strong>Publications</strong><br />

Transparency about curation <strong>of</strong> submitted data<br />

Collaboration with public data archives<br />

Transparency about preservation <strong>of</strong> submitted data<br />

Collaboration with public data archives<br />

Table 2 repeated: <strong>Data</strong> Opportunities <strong>for</strong> Publishers.<br />

In general, we can say that publishers will tend to follow their authors’ wishes. With the<br />

trend clearly towards researchers who share more <strong>and</strong> more data, funders who make<br />

this conditional, <strong>and</strong> libraries <strong>and</strong> archives working towards better accessibility <strong>and</strong><br />

retrievability of data, publishers can play an important role in the integration of data<br />

and publications for the sake of better discoverability, interpretability and re-use of<br />

research data.<br />

4. DATA CENTRE AND LIBRARY PERSPECTIVE<br />

This chapter describes how libraries <strong>and</strong> data centres respond to the increasing amount<br />

<strong>of</strong> data that is produced <strong>and</strong> available <strong>and</strong> how they support availability, findability,<br />

interpretability, <strong>and</strong> re-usability <strong>of</strong> data. As such we assume that libraries <strong>and</strong> data<br />

centres deal, or have to deal, with any level <strong>of</strong> data that researchers <strong>and</strong>/or publishers<br />

want to make available: selected data representations, data collections <strong>and</strong> structured<br />

databases, raw data <strong>and</strong> original data sets. Based on desk research, we describe the<br />

current practice <strong>and</strong> rationale <strong>for</strong> action in libraries <strong>and</strong> data centres. We elaborate on<br />

the implications <strong>of</strong> increasing data integration in publication workflows, <strong>and</strong> present<br />

exemplary data initiatives, in which libraries <strong>and</strong> data centres are involved. Contact<br />

persons <strong>of</strong> each initiative were addressed with key questions, <strong>and</strong> their responses<br />

in<strong>for</strong>med the analysis. The chapter ends by highlighting gaps as well as opportunities.<br />

4.1. Libraries and data centres as custodians of data<br />

Libraries <strong>and</strong> data centres are traditionally positioned at opposite ends <strong>of</strong> the research<br />

lifecycle: <strong>Data</strong> centres help researchers collect <strong>and</strong> process their data, <strong>and</strong> libraries deal<br />

with the publications that result from research projects. Libraries also help arrange the<br />

input at the start of a new research cycle: the search for publications as the basis for new<br />

research. With the convergence <strong>of</strong> data <strong>and</strong> publications, <strong>and</strong> interdependencies between<br />

data <strong>and</strong> journal publications, such traditional roles become blurred. Reports like<br />

“Riding the Wave” 51 recognise that the requirements <strong>of</strong> e-science <strong>and</strong> enhanced scientific<br />

publishing necessitate a comprehensive infrastructure <strong>for</strong> scientific in<strong>for</strong>mation.<br />

Libraries <strong>and</strong> data centres have important, partly overlapping, but mostly<br />

complementary roles to fulfil.<br />

To create an infrastructure that systematically supports such data publication scenarios,<br />

libraries <strong>and</strong> data centres must align, or create new common conventions in data<br />

description <strong>and</strong> identification, <strong>and</strong> balance the relation between disciplinary<br />

particularities <strong>and</strong> large-scale interoperability. In this process, libraries <strong>and</strong> data centres<br />

complement each other:<br />

Research data centres can be best considered in this context as the experts in their<br />

respective research disciplines <strong>and</strong> in h<strong>and</strong>ling their discipline specific data. They are<br />

set up to support data creation <strong>and</strong> access. They provide research teams with storage<br />

space, <strong>and</strong> services around data creation as well as preservation, <strong>and</strong> they provide<br />

academics <strong>and</strong> other users with access to data files <strong>and</strong> with training <strong>and</strong> advice on how<br />

to use them. They are familiar with data protection <strong>and</strong> privacy regulations <strong>and</strong> with<br />

research ethic issues. They are well positioned to adopt new <strong>and</strong> emerging types <strong>of</strong> data<br />

in their discipline <strong>and</strong> increasingly sophisticated methods <strong>of</strong> record-linking <strong>and</strong><br />

statistical matching. They are aware <strong>of</strong> data quality issues <strong>and</strong> existing disciplinary<br />

st<strong>and</strong>ards.<br />

51 E.g., Riding the Wave: How Europe Can Gain From The Rising Tide <strong>of</strong> Scientific <strong>Data</strong>. Final<br />

report <strong>of</strong> the High Level Expert Group on Scientific <strong>Data</strong> (2010).<br />

http://ec.europa.eu/in<strong>for</strong>mation_society/newsroom/cf/document.cfm?action=display&doc_id=707<br />

Libraries have been keepers of knowledge for hundreds, and in the case of the<br />

Library of Alexandria thousands, of years. They are experts in categorizing fields of knowledge and<br />

in recording <strong>and</strong> cataloguing all relevant in<strong>for</strong>mation about a particular publication,<br />

including provenance <strong>and</strong> in<strong>for</strong>mation about its author. They have been collecting,<br />

organizing, describing, preserving <strong>and</strong> making available knowledge <strong>and</strong> in<strong>for</strong>mation<br />

manifested in printed books <strong>and</strong> articles <strong>and</strong> they are ready to transfer this experience<br />

to new <strong>for</strong>ms <strong>of</strong> (digital) collections.<br />

Several symposia and publications bear witness that the library community is in a<br />

transition process, rethinking its role in an increasingly digital environment in<br />

general and in e-research in particular:<br />

A study commissioned by the Research In<strong>for</strong>mation Network (RIN) in 2007 asked if data<br />

management was a job <strong>for</strong> academic librarians. The survey data provided mixed<br />

messages: “Many librarians see data curation as a natural extension <strong>of</strong> their current<br />

role, but there is also evidence <strong>of</strong> caution in terms <strong>of</strong> the curation <strong>of</strong> large-scale datasets<br />

linked to e-research.” 52<br />

A symposium organized by the US Council on Library <strong>and</strong> In<strong>for</strong>mation Resources (CLIR)<br />

explored functions <strong>of</strong> the research library in a changing in<strong>for</strong>mation l<strong>and</strong>scape in 2008. 53<br />

It came to the conclusion that libraries needed to engage in data management <strong>and</strong> data<br />

curation to reflect basic changes in how scholars work – in collaboration with faculty <strong>and</strong><br />

publishers. Rick Luce suggested in the same publication that traditional library roles<br />

must be augmented by new capabilities, centred on collaborative, data-intensive<br />

in<strong>for</strong>mation resources. 54<br />

The Association <strong>of</strong> European Research Libraries, LIBER, made “Scholarly<br />

Communication” one <strong>of</strong> its 5 strategic priorities 2009-2012. 55 The corresponding working<br />

group on e-science recently organised the workshop “Libraries <strong>and</strong> research data:<br />

exploring alternatives for services and partnerships”, which met with great interest (not yet<br />

published).<br />

While the library system has evolved over the centuries, the data centre landscape is a<br />

relatively young one, with the first scientific data centres dating back to the mid-20th<br />

century (e.g., US National Climatic Data Centre: 1951, World Data Centre: 1957/8). The<br />

data centre landscape is more fragmented than the library system. There are well-established<br />

disciplinary data centres in some data intensive research domains (see for<br />

example CESSDA member organisations <strong>for</strong> the Social Sciences, CERN computer centre<br />

<strong>for</strong> Particle Physics, the members <strong>of</strong> the International Virtual Observatory <strong>Alliance</strong> in<br />

52 RIN 2007: Researchers’ Use <strong>of</strong> Academic Libraries <strong>and</strong> their Services. A report commissioned<br />

by the Research In<strong>for</strong>mation Network <strong>and</strong> the Consortium <strong>of</strong> Research Libraries (2007).<br />

http://www.rin.ac.uk/system/files/attachments/Researchers-libraries-services-report.pdf<br />

53 CLIR 2008: No Brief C<strong>and</strong>le: Reconceiving Research Libraries <strong>for</strong> the 21st Century. Council on<br />

Library <strong>and</strong> In<strong>for</strong>mation Resources (2008). http://www.clir.org/pubs/abstract/pub142abst.html<br />

54 Richard E. Luce: A New Value Equation Challenge: The Emergence <strong>of</strong> eResearch <strong>and</strong> Roles <strong>for</strong><br />

Research Libraries. In: No Brief C<strong>and</strong>le: Reconceiving Research Libraries <strong>for</strong> the 21 st Century<br />

(2008). http://www.clir.org/pubs/reports/pub142/pub142.pdf<br />

55 http://www.libereurope.eu/committee/scholarly-communication<br />

Astronomy or the aforementioned World Data Centres for geophysical data), but, with few exceptions,<br />

centres in the Humanities are barely visible.<br />

One noteworthy example <strong>of</strong> a publicly funded, national disciplinary data archive, with a<br />

vast collection <strong>of</strong> digital data in the Social Sciences <strong>and</strong> Humanities is the UK <strong>Data</strong><br />

Archive (UKDA). It was established in 1967 by the UK Social Science Research Council,<br />

which has so far provided the long-term commitment <strong>of</strong> funds.<br />

The Life Science community is characterized by numerous specialized data centres, e.g.,<br />

GenBank at the US National Center for Biotechnology Information, the Worldwide<br />

Protein Data Bank, the Cambridge Crystallographic Data Centre and many more.<br />

4.2. Common practice <strong>and</strong> rationale <strong>for</strong> action<br />

Libraries <strong>and</strong> data centres serve the needs <strong>of</strong> the research community <strong>and</strong> in that role,<br />

must react to increasingly dem<strong>and</strong>ing user needs (see chapter 2 on researchers’<br />

perspective) <strong>and</strong> support increasingly sophisticated <strong>and</strong> complex publishers’ products<br />

(see chapter 3 on publishers’ perspective). In a society where in<strong>for</strong>mation is available<br />

abundantly <strong>and</strong> <strong>of</strong>ten <strong>for</strong> free on the internet, libraries <strong>and</strong> data centres are under<br />

pressure to strengthen their role as pr<strong>of</strong>essional in<strong>for</strong>mation suppliers.<br />

Another influencing factor is institutional requirements: Libraries <strong>and</strong> data centres are<br />

increasingly confronted with data management requirements from their funding bodies,<br />

involving them in data creation during the research workflow. Many academic institutes<br />

in a growing number <strong>of</strong> countries 56 have adopted Open <strong>Access</strong> policies which require<br />

their data centres to provide publishing support to the university’s research groups.<br />

More <strong>and</strong> more funders oblige their grant recipients to make their data (openly)<br />

available after the end <strong>of</strong> a project (see also “Who pays <strong>for</strong> what” in chapter 2). There is a<br />

natural expectation that libraries <strong>and</strong> data centres will support the principal<br />

investigators with data management plans <strong>and</strong> provide secure storage space <strong>for</strong> the<br />

created data.<br />

A large evidence base on the common practice as well as on trends in the scientific<br />

infrastructure is available from the PARSE.Insight project. As far as libraries <strong>and</strong> data<br />

centres are concerned, however, the underlying figures must be interpreted carefully.<br />

The PARSE.Insight report never consistently defined the difference between data<br />

centres <strong>and</strong> libraries as we try to do <strong>for</strong> this report. The stakeholder group “data<br />

management” <strong>of</strong> the PARSE.Insight survey was composed <strong>of</strong> 7 archives, 20 data centres,<br />

152 research libraries, 13 regional institutes, 24 national libraries <strong>and</strong> 3 institutions<br />

that identified themselves as “other” (see Graph 24).<br />

56 See ROARMAP, the Registry <strong>of</strong> Open <strong>Access</strong> Repositories M<strong>and</strong>atory Archiving Policies:<br />

http://roarmap.eprints.org/<br />

Graph 24, Source PARSE.Insight, N=241, background <strong>of</strong> ‘<strong>Data</strong> Management’ respondents<br />

The questionnaire distributed in the stakeholder group “data management” is available<br />

online, as are graphs <strong>of</strong> most survey results. 57 Here, a selection <strong>of</strong> the findings<br />

most relevant to data sharing is reanalyzed. The participating libraries <strong>and</strong> data<br />

centres agreed predominantly that data preservation was important or very important<br />

<strong>for</strong> the following reasons:<br />

• Publicly funded research output should be properly preserved (98%)<br />

• Preserved data stimulates the advancement <strong>of</strong> science (96%)<br />

• It allows <strong>for</strong> re-analysis <strong>of</strong> existing data (95%)<br />

“Researchers want someone else to pay <strong>for</strong> data preservation!”<br />

The PARSE.Insight survey showed that researchers look <strong>for</strong> an organisational structure<br />

to invest in data curation (see Graph 25, also chapter 2). In line with their high<br />

awareness <strong>of</strong> the importance <strong>of</strong> data preservation, libraries in particular consider<br />

themselves responsible <strong>for</strong> fulfilling this role:<br />

57 http://www.parse-insight.eu/downloads/PARSE-insight_survey_questions_datamanagement.pdf<br />

<strong>and</strong> https://www.swivel.com/people/1015959-PARSE-insight/group_assets/public<br />

Graph 25, Source PARSE.Insight, N=241, ‘<strong>Data</strong> Management’ respondents, N = 77<br />

However, in practice, only 44% <strong>of</strong> the responding institutions accept research data <strong>for</strong><br />

storage <strong>and</strong> preservation (Graph 26).<br />

Graph 26, Source PARSE.Insight, N=241, ‘<strong>Data</strong> Management’ respondents, N = 111<br />

It is likely that the percentage is unevenly distributed between the group <strong>of</strong><br />

participating data centres <strong>and</strong> that <strong>of</strong> the libraries. If 100% <strong>of</strong> the participating data<br />

centres <strong>and</strong> perhaps some or most <strong>of</strong> the archives accept research data <strong>for</strong> storage <strong>and</strong><br />

preservation <strong>and</strong> only a very small proportion <strong>of</strong> the libraries actually accept research<br />

data, this result can either indicate a large gap on the side <strong>of</strong> the libraries or indicate a<br />

strategy <strong>of</strong> specialization <strong>and</strong> division <strong>of</strong> labour.<br />
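The composition argument above can be made concrete with a small calculation. The sketch below is only an illustration: it simplifies the respondents to two groups, and the assumption that all 20 data centres from the stakeholder group answered this question and that all of them accept research data is hypothetical, not a survey finding. Only the 44% and N = 111 are taken from Graph 26.

```python
def implied_rate(overall_rate: float, n_total: int,
                 n_group_a: int, rate_group_a: float) -> float:
    """Acceptance rate of the remaining respondents (group B), given the
    overall rate and an assumed size and rate for group A."""
    if not 0 < n_group_a < n_total:
        raise ValueError("group A must be a proper, non-empty subset")
    accepted_total = overall_rate * n_total      # respondents who accept data
    accepted_a = rate_group_a * n_group_a        # of which in group A
    return (accepted_total - accepted_a) / (n_total - n_group_a)


# 0.44 and 111 come from Graph 26; 20 and 1.0 are hypothetical inputs
# (all 20 data centres responded and all accept research data).
rate_other_respondents = implied_rate(0.44, 111, 20, 1.0)
print(f"{rate_other_respondents:.1%}")
```

Under these assumed inputs, roughly a third of the remaining respondents would accept research data. Other assumed group sizes shift the figure, which is why the aggregate percentage alone says little about libraries.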

“Researchers want to be in control <strong>of</strong> their data!”<br />

According to the findings <strong>of</strong> PARSE.Insight (see chapter 2) <strong>and</strong> to findings <strong>of</strong> the SURF<br />

foundation 58 , it is <strong>of</strong> paramount importance <strong>for</strong> authors that they keep control <strong>of</strong> their<br />

data: “In all cases, when the data is transferred to another party, researchers wish to<br />

remain in control <strong>of</strong> their data.” 59 Consequently, libraries <strong>and</strong> data centres, as keepers<br />

<strong>of</strong> authors’ data must make sure that they respect this wish.<br />

58 SURF 2010: Martin Feijen: What researchers want. A review <strong>of</strong> literature describing what<br />

researchers want with regard to storage <strong>of</strong> <strong>and</strong> access to research data. Commissioned by the<br />

SURF Foundation (2010).<br />

http://www.surffoundation.nl/en/publicaties/Pages/Whatresearcherswant.aspx<br />

59 SURF 2010<br />

71.5% <strong>of</strong> the PARSE.Insight data management survey participants stated that they had<br />

security protocols in place that protect stored data from unauthorized modification,<br />

damage or deletion. In 19.2% <strong>of</strong> the participating organisations, action remains to be<br />

taken (Graph 27).<br />

Graph 27, Source PARSE.Insight, N=241, ‘<strong>Data</strong> Management’ respondents, N = 172<br />

However, only 54.1% confirmed that they had procedures <strong>for</strong> determining ownership <strong>and</strong><br />

<strong>for</strong> identifying <strong>and</strong> managing data rights – an important criterion <strong>for</strong> researchers to<br />

entrust an organisation with their data (Graph 28).<br />

Graph 28, Source PARSE.Insight, N=241, ‘<strong>Data</strong> Management’ respondents, N = 77<br />

“Researchers want credit <strong>for</strong> sharing their data!”<br />

Another key topic <strong>for</strong> researchers is that if they make their data available, it should be<br />

visible <strong>and</strong> it should be possible to receive credit <strong>for</strong> it.<br />

Graph 29, Source PARSE.Insight, N=241, ‘<strong>Data</strong> Management’ respondents, N = 164<br />

At present, only 54% <strong>of</strong> the libraries <strong>and</strong> data centres support linking to stored data from<br />

journal articles (Graph 29). However, ef<strong>for</strong>ts are under way to facilitate exactly this.<br />

<strong>Data</strong>Cite, <strong>for</strong> example, undertakes ef<strong>for</strong>ts to make research data citable <strong>and</strong> accessible in<br />

an internationally harmonized way 28 .<br />

The statistical findings suggest already that there is a high awareness in libraries <strong>and</strong><br />

data centres, but not yet a comprehensive preparedness to take on the challenge.<br />

4.3. Implications <strong>of</strong> data integration <strong>for</strong> libraries <strong>and</strong> data centres<br />

There are good reasons <strong>for</strong> fostering the integration <strong>of</strong> research data <strong>and</strong> publications.<br />

As shown in the introduction <strong>of</strong> this report (see Chapter 1) integration <strong>of</strong> publications<br />

<strong>and</strong> research data has the potential to facilitate findability <strong>and</strong> re-usability <strong>of</strong> data, <strong>and</strong><br />

to provide authors with better credits <strong>for</strong> their data. It also adds value <strong>and</strong> background<br />

to the publications.<br />

Publishers <strong>of</strong>fer several ways in which data <strong>and</strong> publications can be integrated already<br />

(see chapter 3). <strong>Data</strong> centres are, to a high degree, part <strong>of</strong> new publishing models,<br />

supporting data creation in the first place, or when publishers require authors to deposit<br />

underlying research data in public data archives <strong>and</strong> link to it. After all, many<br />

manifestations <strong>of</strong> data, as illustrated by the <strong>Data</strong> Publication Pyramid in the<br />

introductory chapter <strong>of</strong> this report (Graph 1) have an impact on libraries <strong>and</strong> data<br />

centres:<br />

1. <strong>Data</strong> contained <strong>and</strong> explained within the article<br />

Implication <strong>for</strong> libraries/data centres: Prepare <strong>for</strong> adequate preservation<br />

strategies. The preservation <strong>of</strong> enriched articles may be more dem<strong>and</strong>ing than<br />

preservation <strong>of</strong> traditional articles. Novel ways <strong>of</strong> embedding data within the<br />

article (clickable graphs that provide underlying tables) will require more<br />

sophisticated preservation means.<br />

2. <strong>Data</strong> published in supplementary files to articles<br />

Implication <strong>for</strong> libraries/data centres: Ensure that article <strong>and</strong> supplementary files<br />

stay together <strong>and</strong> that presentation <strong>and</strong> preservation mechanisms <strong>for</strong><br />

supplementary files are in place.<br />

3. <strong>Data</strong>sets referenced from the articles <strong>and</strong> held in data centres <strong>and</strong> repositories<br />

Implication <strong>for</strong> libraries/data centres: Distributed responsibility between article<br />

holder (publisher, library) <strong>and</strong> data holder (data centre or publisher). <strong>Data</strong>sets<br />

must be citable. The link between article <strong>and</strong> referenced data must be persistent.<br />

Presentation <strong>and</strong> preservation mechanisms must be ensured. If the data resides<br />

in a publisher‘s storage facility, perpetual access <strong>and</strong> eventually h<strong>and</strong>-<strong>of</strong>f<br />

mechanisms to either libraries or data centres must be developed.<br />

4. <strong>Data</strong> published independently from written publications, e.g. in databases or<br />

special data journals (“data publication”)<br />

Implication <strong>for</strong> libraries/data centres: Support publication processes. <strong>Data</strong>sets<br />

need curation <strong>and</strong> special treatment that considers the granularity <strong>and</strong> dynamics<br />

<strong>of</strong> research data. Add metadata to datasets <strong>for</strong> documentation, to support re-use,<br />

<strong>and</strong> to facilitate search <strong>and</strong> retrieval <strong>of</strong> data.<br />

5. <strong>Data</strong> in drawers <strong>and</strong> on disks at the institute<br />

Implication <strong>for</strong> libraries/data centres: Support scientists in preparing data<br />

management plans at an early stage <strong>of</strong> the research process to avoid “isolated<br />

data holdings” in the first place. Develop user friendly, low-threshold data<br />

publication or data deposit services.<br />

To summarise, libraries <strong>and</strong> data centres must support data publishing as a prerequisite<br />

<strong>for</strong> data availability, including persistent identification/citation <strong>of</strong> datasets, <strong>and</strong><br />

solutions <strong>for</strong> data description, documentation <strong>and</strong> retrieval, which together facilitate<br />

findability. They must also ensure long-term data archiving including data curation <strong>and</strong><br />

preservation as a condition <strong>for</strong> data interpretability <strong>and</strong> re-usability. Libraries <strong>and</strong> data<br />

centres have started to enter into new alliances (as will be described in more detail in<br />

the next section) to develop new strategies together or with other actors such as<br />

publishers or research institutions. They can be involved on several levels, e.g. as active<br />

service operator, as provider <strong>of</strong> a specific sub-service (e.g. assignment <strong>of</strong> persistent<br />

identifiers), or as custodian <strong>of</strong> the results <strong>of</strong> such services.<br />

4.4. Libraries <strong>and</strong> data centres engagement in new services <strong>and</strong> alliances<br />

During the course <strong>of</strong> our research we found several new flagship projects <strong>and</strong> initiatives<br />

where libraries <strong>and</strong> data centres are probing the integration <strong>of</strong> data <strong>and</strong> publications on<br />

different levels. We present examples <strong>of</strong> persistent identifier <strong>and</strong> linking initiatives<br />

(findability – <strong>Data</strong>Cite), data publication <strong>and</strong> data management support (availability &<br />

interpretability – Dryad, <strong>Data</strong>verse), <strong>and</strong> data archiving (re-usability – Pangaea).<br />

In <strong>Data</strong>Cite, libraries <strong>and</strong> data centres have allied to establish easier access to scientific<br />

research data online. 60 The goal <strong>of</strong> the international consortium is to make research data<br />

60 http://datacite.org/index.html<br />

citable <strong>and</strong> accessible in a harmonized, interoperable <strong>and</strong> persistent way. 15 members<br />

from 10 countries – among them 6 data centres or data services <strong>and</strong> 9 libraries – create<br />

<strong>and</strong> maintain an infrastructure to register data sets <strong>and</strong> assign unique persistent<br />

identifiers to them.<br />

By using DOI names <strong>for</strong> data registration, <strong>Data</strong>Cite <strong>of</strong>fers a simple <strong>and</strong> well-known<br />

solution <strong>for</strong> citing data from publications. Users – researchers, publishers, libraries – can<br />

use the same technical infrastructure <strong>for</strong> datasets that they already use <strong>for</strong> research<br />

articles.<br />
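To illustrate why a shared DOI infrastructure makes dataset citation simple, here is a minimal sketch that turns a DOI name into a resolvable link and assembles a citation string. The function names, the sample DOI `10.1234/example-dataset`, and the sample creator and publisher are hypothetical. The citation layout (creator, year, title, publisher, identifier) is one common arrangement and is an assumption here, not a format prescribed in this report.

```python
from typing import List
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"  # public DOI resolver


def normalize_doi(raw: str) -> str:
    """Strip common prefixes and check the basic '10.<registrant>/<suffix>' shape."""
    doi = raw.strip()
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI name: {raw!r}")
    return doi


def doi_url(raw: str) -> str:
    """Build a resolvable link; the same call works for articles and datasets."""
    return DOI_RESOLVER + quote(normalize_doi(raw), safe="/")


def format_citation(creators: List[str], year: int, title: str,
                    publisher: str, doi: str) -> str:
    """Assemble a dataset citation in an assumed, illustrative layout."""
    return f"{'; '.join(creators)} ({year}): {title}. {publisher}. {doi_url(doi)}"


# Hypothetical dataset record, for illustration only.
print(format_citation(["Doe, J."], 2011, "Example dataset",
                      "Example Data Centre", "doi:10.1234/example-dataset"))
```

Because the identifier syntax and the resolver are the same for articles and datasets, a reference manager or publisher system needs no dataset-specific logic to follow the link.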

The focus <strong>of</strong> <strong>Data</strong>Cite is on the registration <strong>of</strong> data sets <strong>and</strong> the assignment <strong>of</strong> persistent<br />

identifiers, not on storage <strong>and</strong> preservation <strong>of</strong> research data. The responsibility <strong>for</strong> the<br />

research data, including access, remains with the data centres or other trusted<br />

institutions. The content holders are held responsible <strong>for</strong> quality assurance, metadata<br />

creation, storage <strong>and</strong> access. <strong>Data</strong>Cite does, however, provide support, e.g. the <strong>Data</strong>Cite<br />

Metadata Scheme 61 , created by a working group <strong>of</strong> <strong>Data</strong>Cite members. Also, a <strong>Data</strong>Cite<br />

working group is defining criteria <strong>for</strong> trustworthy data centres with the rationale that<br />

stable descriptions <strong>of</strong> the duties <strong>and</strong> technical requirements <strong>for</strong> data centres which are<br />

using DOI names are needed. 62 With such criteria, <strong>Data</strong>Cite would create an instrument<br />

that helps publishers <strong>and</strong> researchers trust that research data is stored in a reliable <strong>and</strong><br />

persistent way.<br />

One <strong>of</strong> the greatest achievements <strong>of</strong> <strong>Data</strong>Cite is the inclusion <strong>of</strong> all relevant players in<br />

this arena: data centres, libraries, <strong>and</strong> publishers. <strong>Data</strong>Cite acts on the assumption that<br />

progress in integrating data <strong>and</strong> publication can only be achieved by a joint ef<strong>for</strong>t <strong>of</strong><br />

these stakeholders, combining the strengths <strong>and</strong> influence <strong>of</strong> each <strong>of</strong> them. As the <strong>Data</strong>Cite<br />

initiators confirmed to us, one <strong>of</strong> their main short-term goals is to raise awareness among<br />

editorial boards so that they allow referencing <strong>of</strong> datasets in publications. In the <strong>Data</strong>Cite concept,<br />

libraries are not expected to act as data centres themselves, but continue to be a source<br />

<strong>of</strong> in<strong>for</strong>mation <strong>for</strong> researchers. They should open their catalogues to scientific data <strong>and</strong><br />

other content types <strong>and</strong> mediate access to data in data centres as remote content.<br />

<strong>Data</strong>Cite addresses data from science <strong>and</strong> technology alike <strong>and</strong> in principle spans all<br />

disciplines. The German Leibniz Institute <strong>for</strong> the Social Sciences (GESIS) runs a pilot<br />

DOI registration agency <strong>for</strong> social science research data based on the <strong>Data</strong>Cite<br />

infrastructure.<br />

Dryad is an example <strong>of</strong> an international repository that is<br />

committed to the “long tail” <strong>of</strong> data from the more decentralised biosciences, where data<br />

is not necessarily kept in large-scale repositories, or from under-financed fields such as<br />

the humanities:<br />

Dryad is designed to preserve the underlying data reported in a paper at the time <strong>of</strong><br />

publication, when there is the greatest incentive <strong>and</strong> the ability <strong>for</strong> authors to share<br />

their data. This is particularly important in the case <strong>of</strong> data <strong>for</strong> which a specialized<br />

61 http://datacite.org/schema/<strong>Data</strong>Cite-MetadataKernel_v2.0.pdf<br />

62 Klump 2011: Jens Klump: Criteria <strong>for</strong> the Trustworthiness <strong>of</strong> <strong>Data</strong> Centres. D-Lib Magazine,<br />

Volume 17, Number 1/2 (2011). http://www.dlib.org/dlib/january11/klump/01klump.html<br />

repository does not exist. <strong>Data</strong>sets are assigned with persistent identifiers to enable data<br />

citations.<br />

Dryad was developed in the US by the National Evolutionary Synthesis Center <strong>and</strong> the<br />

University <strong>of</strong> North Carolina Metadata Research Center, in coordination with journals<br />

<strong>and</strong> societies in evolutionary biology <strong>and</strong> ecology. The initiative resulted from a<br />

workshop dedicated to “<strong>Data</strong> Preservation, Sharing, <strong>and</strong> Discovery: Challenges <strong>for</strong> Small<br />

Science in the Digital Era” in May 2007. To journals <strong>and</strong> societies in these “smaller”<br />

sciences, Dryad <strong>of</strong>fers a shared solution <strong>for</strong> data publication <strong>and</strong> archiving, thus<br />

relieving them from developing solutions on their own which would only lead to a<br />

fragmented l<strong>and</strong>scape.<br />

In the UK, there are attempts to integrate Dryad as a building block in the national<br />

scientific infrastructure. The initiative “Dryad UK” 63 is a 12-month JISC-funded project<br />

run by The British Library <strong>and</strong> Ox<strong>for</strong>d University in partnership with several<br />

associate organisations. The project aims at moving Dryad to a sustainable business<br />

proposition <strong>and</strong> establishing a UK mirror <strong>of</strong> the Dryad repository. In order to reach this<br />

goal, business models are being developed <strong>and</strong> publisher expansion is being<br />

promoted 64 . Cost recovery is a realistic <strong>and</strong> achievable goal but establishing the most<br />

appropriate model <strong>for</strong> this is the challenge <strong>for</strong> Dryad. For example, authors may fund<br />

Dryad submissions from their research grants.<br />

The Dryad UK initiators confirm to us that they have begun discussing potential<br />

business models together with publishers <strong>and</strong> funders. They point to the first<br />

integration <strong>of</strong> the Dryad repository with large publishers, e.g. PLoS <strong>and</strong> BioMed Central,<br />

<strong>and</strong> to new publisher workflow integration, allowing <strong>for</strong> peer review <strong>of</strong><br />

the data behind an academic publication. They also acknowledge that initial resistance<br />

from some large commercial publishers, who are wary <strong>of</strong> imposing a system on all <strong>of</strong><br />

their journals, or who still plan to explore commercial opportunities in the field<br />

themselves, could be overcome.<br />

While <strong>Data</strong>Cite <strong>and</strong> Dryad deal with readily produced datasets at the moment <strong>of</strong><br />

publication, <strong>Data</strong>verse is an example <strong>of</strong> an initiative <strong>of</strong>fering data management support<br />

throughout the research life cycle, thereby preventing data from getting lost on disks <strong>and</strong><br />

in drawers in the first place:<br />

The <strong>Data</strong>verse Network 65 is an application to publish, share, reference, extract <strong>and</strong><br />

analyze research data. It started as a collaboration between the Harvard-MIT <strong>Data</strong> Center<br />

(now Institute <strong>for</strong> Quantitative Social Science) <strong>and</strong> the Harvard University Library <strong>and</strong><br />

is presently implemented in Social Science disciplines. However, <strong>Data</strong>verse collaborates<br />

with researchers <strong>and</strong> archives to exp<strong>and</strong> the <strong>Data</strong>verse Network as a data management<br />

<strong>and</strong> publishing framework beyond social science.<br />

63 http://datadryad.org/dryaduk<br />

64 See Beagrie et al: Business Models <strong>and</strong> Cost Estimation: Dryad Repository Case Study.<br />

Proceedings <strong>of</strong> iPres 2010. http://www.ifs.tuwien.ac.at/dp/ipres2010/papers/beagrie-37.pdf<br />

65 http://thedata.org/home<br />

The open source <strong>Data</strong>verse Network s<strong>of</strong>tware allows <strong>for</strong> implementation <strong>of</strong> individual<br />

virtual archives, so-called “dataverses”. An institution can implement several such<br />

dataverses <strong>and</strong> in this way distribute ownership between multiple researchers or<br />

research groups. That way, <strong>Data</strong>verse addresses the desire <strong>of</strong> scientists to maintain<br />

control <strong>of</strong> their data, in particular to keep sovereignty over restrictions <strong>for</strong> their data sets<br />

(important <strong>for</strong> interpretability <strong>and</strong> <strong>for</strong> re-use). At the same time, <strong>Data</strong>verse <strong>of</strong>fers a<br />

central repository infrastructure with support <strong>for</strong> pr<strong>of</strong>essional archiving services such as<br />

backups, recovery, <strong>and</strong> persistent identification.<br />

A dataverse is designed to contain data organized in studies. Each study comprises the<br />

actual data files, complementary files, <strong>and</strong> metadata. Persistent identifiers are allocated<br />

to studies <strong>and</strong> can be used in publications to point to the respective evidence base. By<br />

allowing owners <strong>of</strong> the data to create persistent identifiers <strong>and</strong> a citation <strong>for</strong> their<br />

datasets even be<strong>for</strong>e public release <strong>of</strong> the data set, <strong>Data</strong>verse addresses the fear <strong>of</strong><br />

authors that their data may be (mis-)interpreted by others be<strong>for</strong>e their own analyses are<br />

published. The owner <strong>of</strong> the <strong>Data</strong>verse data can still use the persistent reference in an<br />

article, <strong>and</strong> then release the dataset once the article is published.<br />
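The study structure and the "cite first, release later" workflow described above can be sketched as a small data model. This is an illustrative sketch only, not the data model of the Dataverse Network software: the class name, field names, citation layout and the identifier `hdl:1234.5/EXAMPLE` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Study:
    """A study: metadata, data files, complementary files, persistent identifier."""
    persistent_id: str                      # allocated before public release
    title: str                              # descriptive metadata
    authors: List[str]                      # descriptive metadata
    data_files: List[str] = field(default_factory=list)
    complementary_files: List[str] = field(default_factory=list)
    released: bool = False                  # the owner controls when data opens up

    def citation(self) -> str:
        """The citation is available even while the data is still restricted."""
        return f"{'; '.join(self.authors)}: {self.title}. {self.persistent_id}"

    def release(self) -> None:
        """Called by the owner, e.g. once the related article is published."""
        self.released = True

    def readable_files(self) -> List[str]:
        """Files visible to other users; empty until the owner releases the study."""
        if not self.released:
            return []
        return self.data_files + self.complementary_files


# Hypothetical study: cite it in the manuscript first, release the data later.
study = Study("hdl:1234.5/EXAMPLE", "Example study", ["Doe, J."],
              data_files=["survey.tab"], complementary_files=["codebook.pdf"])
print(study.citation())        # usable in the article before release
print(study.readable_files())  # nothing is exposed yet
study.release()
print(study.readable_files())  # files become accessible after release
```

Separating the identifier from the release state is what lets the reference in the article stay stable while access to the files changes.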

The <strong>Data</strong>verse representatives point to the data citation mechanism as a major<br />

achievement <strong>of</strong> their initiative. The <strong>Data</strong>verse Network s<strong>of</strong>tware automatically<br />

generates a data citation <strong>for</strong> each data set published in a dataverse. Be<strong>for</strong>e the<br />

project, citations <strong>of</strong> data were inconsistent or nonexistent in many publications, which<br />

made data retrieval highly uncertain. <strong>Data</strong>verse facilitates referencing data sets from<br />

publications in a st<strong>and</strong>ardized <strong>and</strong> persistent way.<br />

Barriers are seen in researchers’ lack <strong>of</strong> knowledge <strong>of</strong> the services that allow<br />

them to share data, in insufficient recognition <strong>and</strong> incentives <strong>for</strong> researchers to<br />

publish their data sets, <strong>and</strong> in the inconsistency between requirements from funding<br />

agencies or journals to publish data <strong>and</strong> the insufficient funding provided to continue<br />

implementing <strong>and</strong> maintaining framework solutions.<br />

Pangaea can be considered a representative initiative in the area <strong>of</strong> disciplinary data<br />

publishing <strong>and</strong> curation:<br />

Pangaea 66 is an in<strong>for</strong>mation service run by the German Alfred Wegener Institute <strong>for</strong><br />

Polar <strong>and</strong> Marine Research (AWI) <strong>and</strong> the Center <strong>for</strong> Marine Environmental Sciences<br />

(MARUM), also located in Germany. It focuses on archiving, publishing <strong>and</strong> distributing<br />

georeferenced data from earth system research.<br />

Various international research projects use Pangaea as their data repository. Thereby,<br />

Pangaea addresses a need <strong>of</strong> publicly funded projects, which are <strong>of</strong>ten required to store<br />

primary data <strong>for</strong> a defined period after the end <strong>of</strong> the project. The German regulation, <strong>for</strong><br />

example, sets <strong>for</strong>th that “Primary data as the basis <strong>for</strong> publications shall be securely<br />

stored <strong>for</strong> ten years in a durable <strong>for</strong>m in the institution <strong>of</strong> their origin.” 67 Because not all<br />

66 http://www.pangaea.de/<br />

67 Recommendations <strong>of</strong> the Commission on Pr<strong>of</strong>essional Self Regulation in Science: Proposals <strong>for</strong><br />

Safeguarding Good Scientific Practice (1998)<br />

institutions can provide adequate storage capacity, let alone an infrastructure<br />

supporting description, persistent identification, search <strong>and</strong> retrieval <strong>of</strong> data, this task<br />

can be fulfilled by disciplinary data services.<br />

Pangaea has developed and offers a range of archiving and publishing services. For example, it acts as publishing service and long-term archive for the World Data Center for Marine Environmental Sciences. Pangaea aims to firmly establish the concept of data publishing in the Marine Environmental Community within the next couple of years. Linking data to journal articles is an important part of this process, and it happens in a bi-directional way: articles link to the data in Pangaea, and the Pangaea data link to the articles that use the data. Pangaea is already a designated archive for the Earth Science journals of Elsevier (see chapter 4) and is also the ‘home’ of the new data journal Earth System Science Data (ESSD).

The people responsible for Pangaea consider its most important achievement to be the collaborations it has established with science publishers for cross-referencing science data and articles. DataCite DOIs are used for persistent identification of datasets. The Pangaea data publication process follows established publication processes: submission including a metadescription, formatting rules, abstract, archiving with lead time for proofreading, defining a citation, registration and final publication. Search and retrieval of the data sets via library catalogues is ensured through cooperation with the German National Library of Science and Technology.
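The bi-directional cross-referencing between articles and datasets described above can be pictured as two lookup tables that are always updated together. The following is a minimal sketch in Python under stated assumptions, not a description of how Pangaea is implemented: the class name `LinkRegistry` and both DOIs are hypothetical placeholders chosen for illustration.

```python
from collections import defaultdict


class LinkRegistry:
    """Keeps article-to-dataset and dataset-to-article links in sync (illustrative only)."""

    def __init__(self):
        self._datasets_by_article = defaultdict(set)
        self._articles_by_dataset = defaultdict(set)

    def link(self, article_doi, dataset_doi):
        # Record the link in both directions in one step, so neither side can go stale.
        self._datasets_by_article[article_doi].add(dataset_doi)
        self._articles_by_dataset[dataset_doi].add(article_doi)

    def datasets_for(self, article_doi):
        # Datasets that a given article points to.
        return sorted(self._datasets_by_article.get(article_doi, ()))

    def articles_for(self, dataset_doi):
        # Articles that use a given dataset.
        return sorted(self._articles_by_dataset.get(dataset_doi, ()))


# Hypothetical identifiers, used only to show the two lookups.
registry = LinkRegistry()
registry.link("10.9999/example.article.1", "10.9999/example.dataset.1")
registry.link("10.9999/example.article.2", "10.9999/example.dataset.1")

print(registry.datasets_for("10.9999/example.article.1"))
print(registry.articles_for("10.9999/example.dataset.1"))
```

Recording both directions in a single `link` call is the design point: a reader arriving at the dataset can find every article that uses it, and a reader of the article can reach the underlying data.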

4.5. Gaps and dilemmas

The findings of the PARSE.Insight project, the initiatives presented in this chapter, and the sheer number of workshops, conferences and publications related to data management, data sharing, and (open) access to data suggest that the need for action has been recognised in the library and data centre communities alike. They are actively involved in developing persistent identifier systems for research data, data citation standards, and solutions for data description, documentation and retrieval, as well as in data curation and preservation.

Availability<br />

In terms of infrastructure, plenty of solutions already exist for the often-mentioned problem of making research data available. Researchers have many possibilities to make their data available via institutional or disciplinary repositories, and increasingly together with publications. One challenge is that not all researchers are aware of the services available to them. Another may be that the multitude of possibilities creates a fragmented landscape. Here, research libraries in particular, as information suppliers, have an important role to play: they should engage with researchers to raise awareness of good data management practices, the benefits of data sharing, and the options available in the different disciplines, and also to build acceptance of the best and most reliable services available for a specific discipline. As Martin Feijen highlights in a report by SURF (2010): “Researchers must be in control of what happens to their data, who has access to it, and under what conditions.” 68

67 (continued) http://www.dfg.de/download/pdf/dfg_im_profil/reden_stellungnahmen/download/self_regulation_98.pdf

When addressing researchers as both data users and data creators, a division of roles and responsibilities between libraries and data centres seems suitable, with libraries addressing researchers in general, of all professional levels and all kinds of disciplines, and data centres advising the professional users. On the service level, it seems suitable that libraries focus on data retrieval, cataloguing and registering (“retrieval focus”), and data centres focus on ensuring the long term availability of published data, including data storage, backups and replication in multiple locations (“data management focus”).

Findability<br />

A precondition for proper data retrieval is that good metadescriptions are added to the datasets and that links that lead to datasets, either from metadescriptions or from publications, are persistent. Metadata schemes must reflect the granularity of research data, the relationship between data sets, and the frequent updating of datasets. With persistent identifier initiatives like DataCite or the Data Documentation Initiative in the Social Sciences 69 , some research communities are on a good track for common solutions. The challenge lies in aligning progress in all scientific disciplines, not necessarily between disciplines, but at least within them. Retrieval services only make sense if researchers and institutions within a discipline adhere to certain common conventions. Most data centres provide data retrieval services for their own database. Libraries that integrate data material in their catalogues to facilitate findability and access are still a rare exception. One of the rare examples is the “GetInfo” service of the German National Library of Science and Technology, where different internal and external databases can be searched at one time (Illustration 2).
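The three requirements named above for metadata schemes (granularity, relationships between data sets, and frequent updating) can be made concrete with a small record type. This is a sketch under assumptions: the field names, the relation label `is_new_version_of` and all identifiers are invented for illustration and are not taken from DataCite, DDI or any other schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class DatasetRecord:
    identifier: str                      # persistent identifier (hypothetical values below)
    title: str
    version: int = 1                     # incremented on every update of the dataset
    part_of: Optional[str] = None        # granularity: identifier of the parent collection
    related: List[Tuple[str, str]] = field(default_factory=list)  # (relation, identifier)

    def new_version(self, new_identifier: str) -> "DatasetRecord":
        """Return a record for an updated dataset that points back to this one."""
        return DatasetRecord(
            identifier=new_identifier,
            title=self.title,
            version=self.version + 1,
            part_of=self.part_of,
            related=self.related + [("is_new_version_of", self.identifier)],
        )


# A collection, one of its parts, and an update of that part.
collection = DatasetRecord("example:collection", "Cruise measurements")
subset = DatasetRecord("example:subset-a", "Station A temperatures", part_of=collection.identifier)
updated = subset.new_version("example:subset-a.v2")

print(updated.version, updated.part_of, updated.related)
```

Giving each version its own identifier, rather than overwriting the old record, keeps earlier links persistent: a publication that cited the first version still resolves to exactly the data it used.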

68 SURF 2010<br />

69 http://www.ddialliance.org/<br />

Illustration 2: Screen view of the GetInfo service by TIB, Hannover, Germany.

Interpretability<br />

To interpret research data created by others, good descriptions and documentation must be available (as for findability). The more documentation available, the easier it is for researchers to interpret other researchers’ data. Documentation can range from descriptions of the data to so-called data publications all the way to (linking) the full publication using the data. Services like Pangaea that require researchers to submit metadescriptions with their data and adhere to certain formatting conventions (so that all datasets can be interpreted in a similar way) are a solid beginning. Crosslinks between articles and data are another means to support interpretability, because the verbalized interpretation of the dataset in a publication helps readers understand the original dataset. While links from articles to data are becoming increasingly common, links in the other direction, from data to articles, are not yet so widely used, but good examples exist: e.g., Pangaea, PubChem and the Cambridge Crystallographic Data Centre. From a technical viewpoint, the interpretability of datasets can be ensured by separating them from vulnerable data carriers like CD-ROMs or DVDs and storing them on hard drives, including backups, forward migration and replications. Data centres seem to be best equipped to take on this challenge. In disciplines where there are no established data centres (yet), the university’s institutional data centre, well equipped libraries, or library federations or initiatives like Dryad UK should stand in, although this may perpetuate the risk of fragmentation.

Re-usability<br />

Ensuring re-usability is the most difficult goal of data management in a data centre and library setting. In addition to all the preconditions needed to ensure interpretability, re-usability often requires software to be available for analysing the datasets. The researcher who wants to re-use another researcher’s dataset needs not only an intellectual, discipline-specific understanding of the available datasets, but also the skills to operate the appropriate software. Besides constant monitoring of the data holdings, libraries and data centres need to maintain format and software registries to plan for data preservation actions. First approaches to preservation of scientific data were, for example, developed in the CASPAR project 70 and are followed up in the APARSEN network of excellence 71 , but continued research is needed.

General dilemmas<br />

Altogether, the many new initiatives in the area of data integration are promising. However, measured against the expected explosion of research data (see chapters 1 and 2), they are still more or less exceptional cases. There are a couple of pioneering libraries, often embedded in big and capable universities and involved in several initiatives at one time. The danger is that a few actors master the transition to a data-intensive scholarly information infrastructure well, while the majority of stakeholders follow in a passive manner.

70 http://www.casparpreserves.eu/<br />

71 http://www.alliancepermanentaccess.org/current-projects/aparsen<br />

The results of the PARSE.Insight Gap Analysis of the “Scientific Libraries” community confirm this danger. 72 Although the analysis focused on preservation of digital research data, its findings can readily be transferred to the wider area of data integration.

Overall, the PARSE.Insight Gap Analysis indicated a gap between better and less prepared libraries. This gap could be found in almost all analyzed areas:

The more data a library has to store, the better it considers itself prepared and responsible for digital preservation, to the point that funding for digital preservation will be an important issue for the libraries. Another relation was shown between the amount of data stored at a library and the implementation of data and access management strategies. Libraries with preservation and selection policies in place have smaller preservation gaps (in terms of preservation strategies implemented) than those without. Effectively, the PARSE.Insight Gap Analysis found that libraries which were slower in addressing the problem were far behind the “early starters” in most categories.

Another, similar gap becomes apparent between disciplines: influential, policy-setting data centres do not exist in all scientific communities. A step in the right direction are two large programs implemented by the European Commission, CLARIN in the Linguistics and Humanities, and DARIAH in the Arts and Humanities.

Substantial and sustained funding is required to develop and market new services, but often there is a tendency to address novel challenges in time-limited projects. While it is absolutely necessary to initiate a first step into action that way, even promising results need to progress to sustainable services, and this needs to be directed by libraries and data centres in partnership with their funding bodies 73 .

4.6. Opportunities for libraries and data centres

The issue of data availability and data re-usability gets a lot of attention from users, funders, and decision makers 51, 74 . For libraries and data centres, this opens up the possibility to re-position themselves as complementary professional information providers in this field. To enable data availability, findability, interpretability and re-usability, the data managers of libraries and data centres need to be involved from the very beginning of the research process in order to ensure high data quality. 75

Datasets differ in many important ways from publications (e.g. granularity, iteration rate, and the fact that data can be dynamic), and libraries must adjust to these new requirements as part of their new role. The immediate future for libraries will likely be characterised by emerging and new roles in the librarianship professions. Considering the initiatives presented in 4.4, it seems obvious that there is a need for “data librarians”. The implementation of the “Datasets Programme” from The British Library 76 illustrates action in this area. Libraries in the US like the University of North Carolina at Chapel Hill Library or Penn State University Library have already developed data management toolkits to support researchers at the proposal stage. 77

72 PARSE 2010: PARSE.Insight. Deliverable D4.3. Gap Analysis Final Report (2010). http://www.parse-insight.eu/downloads/PARSE-Insight_D4-3_GapAnalysisFinalReport.pdf

73 Sustainable Economics for a Digital Planet: Ensuring Long-Term Access to Digital Information. Final Report of the Blue Ribbon Task Force on Sustainable Digital Preservation and Access (2010).

74 “Prepublication data sharing”, “Post-publication sharing of data and tools”, Nature: Vol 461/10 September 2009

75 Richard E. Luce (2008)

In all their actions, libraries and data centres as institutions serve the research communities. Many studies point out that researchers prefer local and discipline-specific data management support and want to retain control of the data until research is published. 78 , 79 This preference, balanced with the need to make data easily available and searchable, suggests there may be a role for libraries to act as an intermediary between researchers and larger data centres and, in disciplines where there are no large data centres available, between researchers and the institutional repository. Overall, dialogue and interaction between the stakeholders is crucial, and platforms to systematically enable it are desirable.

76 http://www.bl.uk/datasets<br />

77 http://www.lib.unc.edu/reference/data_services/researchdatatoolkit/index.html,<br />

http://www.libraries.psu.edu/psul/scholar/datamanagement.html<br />

78 SURF 2010<br />

79 PARSE.Insight. Deliverable D3.6. Insight Report (2010). http://www.parse-insight.eu/downloads/PARSE-Insight_D3-6_InsightReport.pdf

Data Issue: Availability
Libraries’ opportunity to help improve situation: Accept datasets for storage at library and/or open up library catalogue to research data sets to allow access to data at least as remote content.
Data centres’ opportunity to help improve situation: Be open and transparent and lower barriers to researchers to make their data available via discipline or institutional data centre.

Data Issue: Findability
Libraries’ and data centres’ opportunity to help improve situation: Support of persistent identifiers. Engage in developing common metadescription schemas and common citation practices. Promote use of common standards and tools among researchers.

Data Issue: Interpretability
Libraries’ opportunity to help improve situation: Provide metadescriptions to datasets. Support crosslinks between publications and datasets.
Data centres’ opportunity to help improve situation: Help researchers understand metadescriptions of datasets. Establish and maintain knowledge base about data and their context.

Data Issue: Re-usability
Libraries’ opportunity to help improve situation: Be transparent about conditions under which the data sets can be re-used (expert knowledge needed, software needed).
Data centres’ opportunity to help improve situation: Curate and preserve datasets. Archive software needed for reanalysis of data.

Data Issue: Citability
Libraries’ and data centres’ opportunity to help improve situation: Engage in establishing uniform data citation standards. Support and promote persistent identifiers.

Data Issue: Curation/Preservation
Libraries’ opportunity to help improve situation: Transparency about curation of submitted data. Collaboration with data creators and data centres. Promote good data management practice.
Data centres’ opportunity to help improve situation: Transparency about curation of submitted data. Collaboration with data creators and libraries. Instruct researchers on discipline specific best practices in data creation (preservation formats, documentation of experiment,…)

Table 3 repeated: Data Opportunities for Libraries and Data Centers.

5. REPORT EPILOGUE: MAPPING THE ROAD AHEAD

5.1. Can Libraries and Data centres fill the missing link?

This report has shown that the key to ensuring the long term success of integrating data and publications is to ensure that the data are managed and preserved in such a way that they remain:

• Available<br />

• Findable<br />

• Interpretable<br />

• Reusable<br />

• Citable<br />

These criteria are meaningful to, and act as incentives for, both researchers and publishers to engage in linking data to publications, but neither group believes it can fulfil this role on its own. No one group has responsibility across the whole communication chain or has the resources to satisfy all of these criteria. So a key question we wish to raise here is: can librarians and data centres help fill this missing link? To what extent do they have the existing relationships with researchers and/or the related knowledge (including skills) to ensure that many of these criteria are met?

We have established that our three stakeholder groups (researchers, publishers, and libraries and data centres) have much to gain from embracing the integration of data and publications, and each has opportunities to grasp. It is clear from the researcher’s perspective that making data available needs to be incentivised. Funding is one incentive, but it is also important that researchers are credited for their data and that making data available increases the author’s visibility; in other words, the data should be easily citable.

It is also clear that, whilst publishers are in principle open to further integration of data and publications, there are challenges associated with ensuring the quality and longevity of the data submitted. Accepting and storing data submitted in supplementary files can be hugely demanding on publishers’ resources. There is also the question of what level of data, as illustrated in ‘Gray’s Pyramid’ (see Graph 1), a publication should accept and in what manifestation.

There are exciting developments in publishing in relation to data, such as data publications whose main aim is to describe available datasets. Publishers are investing in developing services to enrich publications with data and are doing so in collaboration with public archive services such as libraries and data centres.

Data centres play an important role in the long term storage of data, and it can be surmised that libraries have a supporting role in this landscape, whether that be in supporting researchers in storing their data or in ensuring that data remains available to, and discoverable by, the end user when starting up new research projects. It is clear that, for libraries, a priority is to ensure that quality data can always be accessed easily by their users.

What is unclear is a definition of the roles that libraries should fill and what incentives and barriers exist for researchers to work with libraries on data management. This lack of clarity calls for a dialogue about why and how libraries should play a role in the integration of data and publications. There are gaps that need to be addressed, as the preceding chapter identifies. By drawing on data from previous research and on real life examples, that chapter presents some arguments and outlines opportunities, but a more complete picture may be drawn by engaging library professionals and researchers themselves in this dialogue.

5.2. What does the Data Publication Pyramid mean for roles and responsibilities?

The Data Publication Pyramid has appeared throughout this report (see Graph 1). Every layer of the Data Publication Pyramid presents different challenges and calls for a variety of approaches to improve data exchange. As we descend the layers of the data pyramid the division of roles and responsibilities becomes less clear.

The top layer, publications with data, is well established and already an integral part of the record of science, and the systems around it for discoverability, access and retrieval are well in place. The associated roles and responsibilities between researchers, libraries and publishers are well defined and delineated. But this layer represents only the tip of the iceberg of available data.

At this stage of the pyramid, there is limited potential for the reuse of data, as the data is usually embedded in the article in an aggregated form. This potential for re-use increases in the layer below, Processed Data and Data Representations, often presented in supplementary files to journal articles, but the criteria of discoverability and longevity are less assured there. As a result, an increasing number of publishers encourage authors to submit their data to the third layer, that of Data Collections and Structured Databases, from where and to which the publications can link.

At this level, the Data Collections and Structured Databases layer, boundaries begin to blur and roles and responsibilities of libraries and data centres need to be better defined. This layer also offers the most potential for data exchange as, at this level, the data are at their most findable, reusable, and well curated and preserved.

The ‘long tail’ of data at the bottom of the pyramid, where data tends to remain in drawers and on disks of the institute, has been wholly the domain and responsibility of the researcher. That this data meets none of our criteria for data exchange is both a problem and an opportunity. As discussed in chapters 2 and 4, there are several reasons why this data has not been integrated. Lack of incentives, lack of local or discipline-specific repositories, and fears of losing control over or credit for the data all contribute to this data remaining locked in its silos. Collaborative projects such as DataCite, Dryad and Dataverse are addressing these issues, but more needs to be done to encourage researchers to start embracing data management and to make their data available.

5.3. In dialogue with Research Librarians: LIBER Workshop 2011<br />

As research libraries are one of the key stakeholders in this report, ODE ran a workshop at the LIBER 2011 annual conference drawing on a draft of this report. Five speakers representing various backgrounds, including publishing, data centres, libraries, and research, were shown a copy of the draft report and asked to prepare a provocative statement related to their reading of the report.

Speakers and Provocations

“We need proper data citations in the reference section.” Merce Crosas

Merce Crosas, Project Director of the Dataverse Network at Harvard, reminded the audience that, for the most part, we all agree that making data available is important, that we have important reasons for it (advancement of science, verifiability of research results etc.), and that nowadays we have the technologies to realize it. So, she asked, shouldn’t it really be only a small problem that we can easily solve? In her opinion, upgrading data citations in scholarly articles from in-text citations to full citations in the reference sections would be a big step forward.

“Libraries must tackle the long tail data problem.” Brian Hole

Brian Hole from DryadUK at The British Library stated that a major barrier to data exchange is the “long tail of data”, which means that a large proportion of research data sits on researchers’ computers and doesn’t get into a repository. These data are not made available for reuse and are almost inevitably lost. This is particularly true of Humanities data, and Hole is convinced that there is a massive potential for re-use of these data.

Hole argued against the prescription that smaller libraries should leave data curation to large data centres. When no specific subject repositories exist for long tail data, which is particularly the case in the humanities, the data simply isn’t exploited, maintained or preserved. In Hole’s opinion, (research) libraries are very well placed to bridge this gap because they are already part of the researchers’ workflow. At the very least they can educate researchers and provide data management plans. Beyond that, they may create and maintain their own repositories or act as advocates for the establishment of data repositories within their institutions.


“Clear and easy to understand citation metrics for datasets and automatic mechanisms to count them are urgently needed” Maurits Van der Graaf

Maurits Van der Graaf, author of a SURF study on the quality of research data 80, advocates establishing a Dataset Impact Factor as an incentive for researchers to publish and properly cite datasets, and a Data Archive Impact Factor as a means to further professionalize data archive management. The Data Archive Impact Factor could help data archives to measure their relevance to scientific research, and help funders to evaluate the effectiveness of their support. Van der Graaf expects that this would influence publishers in developing additional services linking to high-impact data archives.

“No Publication without Data – no data without publication” Eefke Smit

Eefke Smit, Director of Standards and Technology at the International Association of STM Publishers, referred to the Data Publication Pyramid of this report to illustrate that currently only a very small fraction of all data created ever gets published.

Smit argues for a change in practices because publication has a large potential to make data visible and re-usable. In return, data publication enhances the traditional publication by providing supporting evidence and background to the official Record of Science. Also, via citations, it serves as a credit system for the people behind the data. Moreover, Smit calls for constructive collaboration throughout the information chain, where data is not necessarily kept at the publisher, but rather securely stored and preserved in certified, reliable data repositories, and remains linked to publications, bidirectionally, via persistent identifiers.

“The research community needs to establish a common format for data acquisition and interpretation” Rick Luce

Rick Luce, Vice Provost and Director of Libraries for Emory University, drew the audience’s attention to the complexity of roles and responsibilities in the data landscape. No institution can soundly manage data over time when it doesn’t know what it is allowed to do with the data or how to treat it correctly. Therefore, it is essential to clarify the question of ownership at the moment data is handed over.

In doing so, three principles should be observed:

• The data integrity principle: Ensure the integrity of research data to establish trust in research processes and enable researchers to verify published research results.

• The data access and sharing principle: Make data that is integral to publicly reported results publicly available.

• The data stewardship principle: Provide proper data documentation, curation, and long term preservation to enable re-use.

80 http://www.dlib.org/dlib/january11/waaijers/01waaijers.html and http://www.surffoundation.nl/nl/publicaties/Documents/SURFshare_Over_kwaliteit_van_onderzoeksdata_dec2010DEF.pdf

5.4. Emerging Issues

Feedback from the workshop was recorded and five main issues or concerns emerged:

• Citation Metrics

• Roles

• Why libraries?

• Approaches to data publishing

• Incentives

Much of what came out of the workshop reflected the speakers’ provocations and reaffirmed the relevance of the issues that have been highlighted throughout this report. The following are some of the issues that were emphasised by participants throughout the session. Due to the nature of the workshop, some opinions are in conflict and reflect the lively debate in the workshop.

Citation Metrics

Measuring the impact of published data may be central to the success of data publications, and measuring the impact of publications is a preoccupation of research libraries.

This is an area which requires further analysis. Journal Impact Factors, although established, have drawbacks, and it may not be ideal to connect the impact of data to the impact of a journal. Furthermore, it is important that whatever metrics are developed are simple and easy to interpret.

Roles

Unsurprisingly for a workshop with library professionals, the definition of roles was a major concern and drew many comments from the audience. Comments ranged from what roles libraries should play in data management and exchange, to what new skill set library professionals need to meet the new challenges implied in these changing roles.

It was acknowledged that libraries need to align themselves with what is a dynamic research life cycle, becoming more project-oriented rather than providing services on an as-needed basis. This reflects Rick Luce’s (2008) assertion that libraries need to reposition themselves to become involved in research at the beginning of the research life cycle.


The majority of participants believed that libraries should have a role as repositories of data. Some traditional library skills, such as collection management and information retrieval, are applicable to data management, and libraries’ established relationship with researchers as teachers of information literacy skills means that they are well placed to provide guidance to researchers in the creation of data management plans.

If libraries are to become more involved in the integration of data and publications, then they may well need to develop software development skills to support the exploitation of data, as well as expertise in the area of intellectual property (IP) to ensure that the rights of the creators of the data are protected.

The increase in the use of integrated data means that boundaries between libraries, data centres and publishing are blurring, possibly, as one participant stated, becoming one intellectual unit, and this has perhaps the greatest implication for future roles.

Why Libraries?

As a further provocation, participants were asked why libraries should have a role in data exchange. This question not only served to draw out the rationale for the repositioning of libraries but also helped highlight some of the barriers to library involvement in data exchange.

Reflecting an issue raised in chapter 2, researchers are not interested in archiving or curating data themselves, and there is a real danger of fragmentation if these activities are left up to individual researchers. Libraries already have a track record in supporting researchers in their work.

This could present an argument for institutional repositories. On the other hand, there may be a danger of fragmentation if libraries become repositories of data where discipline-specific international resources such as PANGAEA or GenBank are available and are a preferable solution for researchers. Also, libraries’ existing strengths lie in creating structure rather than in storage.

Approaches to data publishing

Much of the dialogue in the workshop focused on the approach that should be taken to encourage, sustain and support data publishing and data integration. First of all, there was a general sentiment that there is a need for a definition of metadata for research data.

There is also a need to focus on interoperability rather than specific subject domains, regions or institutions. This could pose a problem for libraries, as existing library information infrastructures are not always easy to make interoperable.

At the same time it is important to work with researchers and research groups at local and project level. As one participant put it, “local effort can result in global solutions”.


Incentives

Incentives for researchers to publish their data were discussed in chapter 2. From the research library perspective it is important that there is an incentive for researchers to work with libraries in making their data available.

Without a doubt, citation and impact are major incentives for researchers (hence the importance of citation metrics). In order for data to be widely cited it must be findable, reusable and citable. Librarians and libraries have the skills and resources to make this achievable.

Another incentive that libraries may provide (if they act as repositories) is the promise of sustainability: the fact that an institution is taking care of the data, preserving it and curating it. This is not just a boon to researchers but a valuable addition to research proposals.

5.5. The Next Step: Survey to Document Current and Project Future Roles

What is clear from the workshop analysis is that there are many questions that need to be answered. Research libraries are keen to engage in a dialogue about data exchange, and they are aware of the need to reposition library institutions in this changing landscape. The workshop also served to validate the issues raised in this report.

The report highlights an opportunity for collaboration. With the increasing demand from funders for researchers to make their data publicly available, the ensuing need for support in data management, and publishers supporting the principle of data sharing by signing up to the Brussels Declaration, libraries are faced with the opportunity to reposition themselves to become embedded in the research process.

In order to mine the potential of the bottom layer of the data pyramid it is important to understand how researchers can be supported in making data more available than is presently the case.

This report has provided evidence of the impact that data sharing and reuse have, and can have, on the scholarly communication chain, and of how important the integration of data and publications is.

The survey among research libraries and researchers within their institutes, which follows this report during the fall of 2011, aims to further clarify the roles of the stakeholders concerned by measuring their awareness of, and readiness for, more responsibility in research data. It draws on the findings of this report and on what we have learned from the LIBER workshop. It aims to reveal how stakeholders’ roles are changing. Perhaps more importantly, it presents information and guidance for the likely evolution of these roles, both to ensure the ongoing integrity of the scholarly record and to create incentives for stakeholders to support this.

Better insight into existing strengths versus weaknesses and opportunities versus threats should help create the conditions to increase data publication activity and, ultimately, data sharing and reuse.
