
Project Management Plan

Marine Geoscience Data System
&
Geoinformatics for Geochemistry Program

Version 1.7p
August 22, 2006

Lamont-Doherty Earth Observatory of Columbia University


1. Introduction
2. Geoinformatics Projects at LDEO
   2.1. Vision statement
   2.2. General Management Structure
   2.3. Advisory Structure
   2.4. Stakeholders
   2.5. Synergies
3. PEP for the Marine Geoscience Data System
   3.1. Project Definition
   3.2. Project Management
      3.2.1. Organizational Structure
      3.2.2. Roles and Responsibilities
      3.2.3. Work Tasks
      3.2.4. System Development
      3.2.5. Operations and Sustaining Engineering
      3.2.6. Education
      3.2.7. Reporting
      3.2.8. Metrics of Success
      3.2.9. Risk and Contingency Management
      3.2.10. Schedule/Milestones
4. Appendix
   4.1. System Architecture and Infrastructure for the Marine Geoscience Data System (MGDS)


1. Introduction

Federal agencies including the NSF, NASA, NOAA, and USGS fund a range of data management and Geoinformatics activities for the Earth and Ocean Sciences which are carried out at the Lamont-Doherty Earth Observatory (LDEO), the Center for International Earth Science Information Network (CIESIN), and the International Research Institute for Climate and Society (IRI). LDEO, CIESIN, and IRI are units of the Earth Institute at Columbia University, co-located on Columbia's Lamont Campus in Palisades, NY. Projects funded by the NSF include several mature activities with a long history of funding: the Sediment Core Repository, which houses sediment cores from the global oceans, and the ODP/IODP Borehole Group, which serves down-hole logging data in support of the ODP and IODP programs.

Recently the NSF has funded a suite of new activities to build and manage digital data collections for Marine Geology & Geophysics, Geochemistry, and the broader Geosciences. Though funded as independent activities with different durations and termination dates and at different stages of maturity, these projects widely overlap in their objectives, technical requirements, equipment, and personnel resources, with significant synergies among them.

This document describes the strategic and management approaches, organizational structures, and timelines under which these NSF-funded cyberinfrastructure projects will be executed. It is intended to be a dynamic document, reviewed and revised annually in discussion with the Advisory Committee and the National Science Foundation. Revisions will reflect changes in the scope and type of projects included, in the organizational structure, and in the funding. New projects will be integrated into this plan as appropriate.

2. Geoinformatics Projects at LDEO

2.1. Vision statement

"It is exceedingly rare that fundamentally new approaches to research and education arise. Information technology has ushered in such a fundamental change. Digital data collections are at the heart of this change." (Long-lived Data Collections, Report of the National Science Board to the NSF, 2005)


Geoinformatics projects at LDEO are focused on building and maintaining digital data collections that are essential resources for the Geosciences, with the aim of maximizing the accessibility and thus the application of scientific data in research and education. These data collections aid data discovery, access, and analysis; support cross-disciplinary approaches in science by facilitating data integration across disciplinary, spatial, and temporal boundaries; and advance the "principle of open access" to data and samples that were acquired with public funding.

Most of our systems can be viewed as Resource or Community Data Collections according to the classification of digital data collections in the NSB report. Resource Data Collections "serve a specific science and engineering community… They typically conform to community standards, where such standards exist. Often these digital collections can play key roles in bringing communities together to develop appropriate standards where a need exists. In many cases community database collections migrate to reference collections". Projects in the MGDS and the GfG program have taken leadership roles in defining standards for cruise metadata and geochemical data, respectively, working closely with the appropriate communities through their Advisory Committees and various outreach activities.

Management of digital data for the Geosciences has a long tradition at LDEO that dates back to the 1960s, which led to a rich database of underway geophysical data and subsequently to an early experiment in geographical database access (W. Menke's Geographical Database Browser XGB). In the mid to late 1990s, the RIDGE Multibeam Synthesis project and the PetDB database for geochemical data of ocean floor igneous and metamorphic rocks were developed and began to serve scientific data over the web to a broad audience. The new projects described in this Project Execution Plan build upon and extend these earlier efforts. Most of our projects were proposed based on recommendations from community workshops that outlined the objectives, scope, and functionality of data management systems that would best serve science and education (Michael et al., 1991; Smith et al., 2001; Shipley et al., 1999; Cervato et al., 2004).

The significant growth in data management activities at LDEO, built on the success of the initial efforts, was intentionally pursued in order to establish a critical mass of professionals who can take on the wide range of tasks necessary to design, develop, and maintain digital data collections. Data management activities at LDEO have grown incrementally, but have now reached a level where a well-defined management plan is required that documents the approaches used to accomplish design and development tasks and to sustain operation of the data collections. This plan also exposes the significant synergies among the individual projects and project clusters that have been critical for the efficiency with which the systems are built and operated. These synergies will be further exploited in the future through improved integration of our operations, management, and advisory structure.

Our primary goal for the future is to ensure the highest level of quality and long-term sustainability of our systems, data products, and the services they provide to the community. To achieve this goal, we need to maintain a level of funding that will allow us to support our operations with the superior expert teams that we have gathered and to continue infusing relevant new technologies. A potential decrease of resources due to termination of projects or reduction of funding will need to be balanced by new endeavors that complement the existing capabilities. We will primarily focus on expanding collaborations with other Geoinformatics efforts to leverage ongoing developments and further contribute to establishing a cyberinfrastructure for Geoscience research and education.

2.2. General Management Structure

The activities are managed in two thematically and organizationally distinct clusters: the Marine Geoscience Data System (MGDS) and the Geoinformatics for Geochemistry (GfG) system. The GfG projects are executed by a joint team from LDEO and CIESIN, with management responsibility for the IT effort resting with the team at CIESIN, while all IT and science directive activities for the MGDS are executed by a team from LDEO.

Figure 1 General Management and Advisory Structure for the MGDS and GfG

To simplify management of the activities in the future, we envision the projects within the two clusters merging into two larger efforts. We plan to achieve this by combining operation, maintenance, and further development of the individual systems in upcoming renewal proposals.

2.3. Advisory Structure

All digital data collections and information systems built and maintained under this Project Execution Plan are community resources for the broad Geosciences that manage, archive, and serve scientific data to benefit research and education. As such, they need to address user concerns and respond to broad scientific and educational needs. The recent NSB report on Long-lived Digital Data Collections emphasizes the obligation of data managers to maintain an "open and effective communication with the served community" and to "gain the trust of the community that the collection serves".

To establish the needed community input and oversight, and to ensure efficient communication with the community and other stakeholders, advisory committees have been established for individual projects and groups of projects with broad disciplinary, institutional, and agency representation. An Advisory Committee was set up in 2004 for the projects funded out of the MG&G program at NSF (all efforts combined under the MGDS plus PetDB). This committee met twice, in 2004 and 2005, to discuss progress of the projects and provide advice for ongoing and future development efforts. The EarthChem project set up an Advisory Committee in 2005 that advises on tools, functions, and data sets that may be desired to best benefit the user community, and helps define and implement data policies and metadata standards. The EarthChem AC met in December 2005 for the first time. The R2K and MARGINS data management systems also report to their respective program steering committees on a regular basis.

We propose to establish a single Advisory Committee for all data management projects executed under this plan to enhance integration of the disciplinary data sets and to align their goals and strategies. This Advisory Committee will be composed of a broad range of disciplinary scientists representing marine and terrestrial geology, geophysics, and geochemistry, as well as Geoinformatics and Information Technology practice.

2.4. Stakeholders

The following list outlines our current view of the stakeholders in this effort:

• Data Providers (system operators and scientists) – require that data contribution be as easy and as efficient as possible
• Data Users (researchers, educators, and the public) – need metadata and effective search tools
• Researchers – need to be able to access, visualize, and download data for further analysis, and the ability to limit access to some data
• Operating institutions (including vessel operators) – need to be able to find data they collected and to be recognized for their efforts
• Oceanographic institutions – have a vested interest with respect to ownership and recognition
• Shipboard technical support personnel – maintain, operate, and document the data acquisition equipment
• Mission scientists – have report-writing responsibilities
• Foreign collaborators – have data submission and distribution requirements
• Educators – use data and derived products for a broad range of activities
• Federal agencies including NSF, USGS, MMS, and NOAA – use and contribute data
• General public – pay the bills

2.5. Synergies

The construction and maintenance of the MGDS and GfG data systems benefit substantially from the natural synergies that exist among the individual projects as well as between the two project clusters. Objectives, science applications, and user communities overlap widely, as do the technical requirements related to data and metadata types, tools, and interoperability, offering extensive opportunities for collaboration and integration that positively impact the execution of the projects and, more importantly, the quality of the products. The synergies allow us to share design approaches, experiences, expertise, personnel, and management tools to continuously support each other's operation, ensure compatibility, accelerate progress, and promote the broadest application of our data collections. Figure 2 highlights the most prominent synergies among the projects.

Examples of synergies among MGDS and GfG include the application of GeoMapApp (the MGDS data visualization tool) as a map interface and plotting tool for the geochemical data systems, the use of sample metadata schemes developed by PetDB for the MGDS metadata catalog, seamless data exchange between PetDB and the MGDS, and the use of MGDS cruise metadata for SESAR sample profiles. We will further augment shared activities by integrating our advisory structure and building a joint web site that will serve as a common portal to all projects.

Figure 2 Synergies between projects in the MGDS and the GfG program

Additional benefits arise from close interaction with the broader data and sample management activities at LDEO and CIESIN, which include the IODP logging services data management, the Socioeconomic Data and Applications Center (SEDAC, one of the distributed active archive centers (DAACs) of NASA's Earth Observing System), the IRI Data Library, and the LDEO core repository.


3. PEP for the Marine Geoscience Data System

3.1. Project Definition

The projects grouped within the MGDS include the Antarctic Multibeam and Geophysical Data Synthesis (AMBS), the Ridge2000 Data Management System (DMS), the MARGINS DMS, the Seismic Reflection Field DMS (SRDMS), and the LEGACY project. Although these individual projects serve different sections of the geoscience community, the underlying goals of each effort are fully complementary. We have therefore evolved these projects as a single unified data system. The overall scope of these projects is fourfold: 1) to serve as a primary sensor database for underway geophysical data, multibeam sonar, and seismic reflection data from the R/V Palmer, Gould, Ewing, and Langseth; 2) to develop a comprehensive metadata catalog and data repository, including derived data products, for the Ridge2000 and MARGINS programs and for the wider spectrum of future and historical marine geoscience programs; 3) to provide a synthesis of global ocean bathymetry at the full spatial resolution of the available data; and 4) to provide tools for data access tailored to science needs. The projects within the MGDS are being developed as a single data system with a common backend database and common front-end tools for access. Because of the integrated nature of these efforts, developments funded under each project have been leveraged to the benefit of all the others, providing significant economies of scale. Development of the integrated data system has been underway since 2003 in collaboration with researchers at other academic institutions (Table 3.1). The core data system architecture is established, and public access to project components has been available since 2003/2004. The data system currently (August 2006) provides access to ~1.9 terabytes of data, corresponding to over 196,000 individual data objects associated with 1,429 cruises dating back to the 1970s. Data system use has been tracked since public access began. During 2006 we have tracked ~500-1,500 data download sessions per month and total monthly downloads of 160,000-510,000 individual data files.

Table 3.1 MGDS Projects and Collaborators

Project | Lead Institution | Collaborator | Non-LDEO Responsibilities
Seismic Reflection DMS | UTIG (T. Shipley) | LDEO | Development and operation of Seismic Reflection Processed DMS
RODES | LDEO | WHOI (T. Shank)* | Scientific guidance for biological data; biological data compilation
MARGINS DMS | LDEO | TAMU (D. Becker)* | Linkages with ODP database
LEGACY | LDEO | NGDC (Chris Fox), University of New Hampshire (L. Mayer) | No-cost collaborations to implement multibeam data exchange (NGDC) and develop multibeam QC metrics (UNH)
R2K/MARGINS renewal (proposed Feb 2006) | LDEO | SIO (E. Simms)*, WHOI (V. Ferrini, S. Lerner)* | Vent Image Data Bank image compilation (SIO); NDSF post-processing of Alvin/J2 navigation, metadata delivery (WHOI)

* Subcontract

1) Geophysical Sensor Database for R/V Ewing, Langseth, Palmer, and Gould

Unlike other kinds of data, which can be preserved and accessed via publications, digital data derived from sensors require a dedicated repository to ensure long-term data access and to provide the needed functions of data migration to new digital media and data documentation to facilitate re-use. The MGDS serves the marine geoscience community working on problems within the global oceans, including the Southern Ocean, by preserving key marine geoscience digital datasets collected from the R/V Ewing, Langseth, Palmer, and Gould. Geophysical sensor data that fall within the scope of the MGDS include multi-channel seismic (MCS) reflection, multibeam sonar, and underway geophysics (gravity, magnetics) data. Acquisition costs alone for a typical multi-channel seismic or polar ocean expedition are close to $1M per cruise, but prior to our effort no central repository for these data existed, and secondary reuse of these data has been minimal. Data have resided with the originating scientists, and re-use has required prior knowledge of a data set and direct contact with a PI who may or may not be able to read tapes and transfer the data. The MGDS undertakes data validation and quality assessment (QA) for these data, assumes responsibility for adequately documenting them using community-defined standards, and provides open public data access, secure storage, and long-term data preservation.

Under the AMBS, all multibeam sonar data from the Palmer and underway geophysical data from the Palmer and Gould have now been aggregated and are served. Under the SRDMS, we have established direct transfer of MCS reflection field data from the Ewing for programs during its final two years of operation and will implement a similar procedure for upcoming programs of the Langseth. Older MCS field data from the Ewing and processed MCS data, which reside with LDEO PIs, are being recovered from tape and incorporated into the data system. Under the LEGACY project, all multibeam sonar data from the Langseth will be processed, validated, and archived.
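As a simple illustration of the kind of automated check that data validation and QA can involve, the sketch below flags soundings that deviate strongly from a running median along track. It is a toy filter under assumed parameters (window size, deviation threshold, function name), not the QC metrics actually developed for the MGDS.

```python
import numpy as np

def flag_spikes(depths, window=9, max_dev=50.0):
    """Flag soundings deviating more than `max_dev` meters from a
    centered running median -- a toy along-track spike check, not
    the MGDS multibeam QC metrics."""
    depths = np.asarray(depths, dtype=float)
    half = window // 2
    flags = np.zeros(depths.shape, dtype=bool)
    for i in range(len(depths)):
        lo, hi = max(0, i - half), min(len(depths), i + half + 1)
        if abs(depths[i] - np.median(depths[lo:hi])) > max_dev:
            flags[i] = True
    return flags
```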

2) Metadata Catalog and Data Repository for R2K, MARGINS, and historical and future MG&G programs

The scope of the Metadata Catalog effort covers R2K- and MARGINS-funded field expeditions, future MG&G programs outside these program areas, and historical expeditions and data sets that can be obtained from the community and adequately documented. The primary goal of the Metadata Catalog component is to enable scientists to discover what data were collected, by whom, where, and when, and the location where the data reside. The geophysical sensor data listed above, as well as submitted data for which no designated national repository exists, are served locally from our repository. For data for which a designated national repository does exist, the MGDS links to these distributed data resources rather than duplicating data holdings. An important function of this component is to serve as a repository for derived data products, e.g. gridded data sets, velocity models, tomography solutions, GIS project files, etc., which are of high value to the science user community for future re-use. The metadata catalog effort provides essential proxy functions for the science user community with respect to defining vocabularies and ontologies for marine geoscience data and establishing protocols for data exchange.
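To make the discovery function concrete, the sketch below shows a minimal cruise-level metadata record of the kind such a catalog must capture. The class and field names are illustrative assumptions, not the actual MGDS schema, and the sample values are invented.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class CruiseMetadata:
    """Minimal cruise-level record for data discovery (illustrative only).

    Captures the who/what/where/when of an expedition plus pointers to
    where each data set actually resides (locally or at a national center).
    """
    cruise_id: str                 # platform + cruise designator
    platform: str                  # research vessel name
    chief_scientist: str           # "who"
    start_date: str                # ISO 8601, "when"
    end_date: str
    bounds: Tuple[float, float, float, float]  # (west, east, south, north)
    data_types: List[str] = field(default_factory=list)       # "what"
    repository_urls: List[str] = field(default_factory=list)  # where data reside

# A hypothetical record; the values are invented for illustration.
record = CruiseMetadata(
    cruise_id="EW0001",
    platform="R/V Ewing",
    chief_scientist="A. Scientist",
    start_date="2000-01-10",
    end_date="2000-02-05",
    bounds=(-130.0, -125.0, 44.0, 48.0),
    data_types=["MCS reflection", "multibeam", "gravity", "magnetics"],
    repository_urls=["https://www.marine-geo.org/"],
)
```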

3) Synthesis of seafloor bathymetry

This component is primarily a function of the AMBS and LEGACY projects and involves synthesis of publicly available multibeam bathymetry data from the ocean floor into an easy-to-access, multi-resolution gridded global digital elevation model (DEM). Multibeam bathymetry data are unique among the marine geophysical data types in their relevance for a broad range of scientific investigations as well as for non-academic uses. For example, they provide fundamental characterization of the physical environment for studies ranging from ocean bottom circulation to biological studies, and serve as primary base maps for multidisciplinary programs. Detailed bathymetry maps are also relevant for applications beyond academic research, including management of marine fisheries and other coastal resources as well as marine navigation (e.g., the January 2005 collision of the USS San Francisco with an uncharted seamount off Guam). At present, specialist expertise is needed to access and manipulate multibeam bathymetry data, which are typically available only as individual survey areas. The only existing broad-access global compilations of seafloor bathymetry do not include multibeam sonar data at their full spatial resolution (e.g., the 2x2 minute compilation of Smith and Sandwell, 1997). The MGDS provides a synthesis of expedition-based multibeam datasets at their full spatial resolution for non-specialist use by maintaining a continually updated, dynamic gridded global compilation.
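As a rough illustration of the gridding step behind such a compilation, the following sketch bins scattered multibeam soundings into a fixed-resolution depth grid by taking the median depth per cell. The actual MGDS synthesis is multi-resolution and considerably more sophisticated, so the cell size, cell statistic, and function name are assumptions for illustration only.

```python
import numpy as np

def grid_soundings(lon, lat, depth, cell_deg=0.01):
    """Bin scattered (lon, lat, depth) soundings into a regular grid,
    returning the median depth per cell (NaN where no data fall).
    A single-resolution stand-in for a multi-resolution DEM synthesis."""
    lon, lat, depth = map(np.asarray, (lon, lat, depth))
    # Grid edges spanning the data footprint.
    x_edges = np.arange(lon.min(), lon.max() + cell_deg, cell_deg)
    y_edges = np.arange(lat.min(), lat.max() + cell_deg, cell_deg)
    grid = np.full((len(y_edges) - 1, len(x_edges) - 1), np.nan)
    # Assign each sounding to a cell index.
    ix = np.digitize(lon, x_edges) - 1
    iy = np.digitize(lat, y_edges) - 1
    # Fill each occupied cell with the median of its soundings.
    for j, i in set(zip(iy, ix)):
        in_cell = (iy == j) & (ix == i)
        grid[j, i] = np.median(depth[in_cell])
    return x_edges, y_edges, grid
```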

4) Tools for data access tailored to science users' needs

In parallel with the data catalog and archiving functions, the MGDS works to provide tools for data access and interactive visualization tailored to science user needs. The goal of this component is to lower the barriers to data access and enable users to explore data holdings without requiring a specialist understanding of the underlying data structures. Our approach has been to develop a range of options for searching and accessing data. To enable specialist users to find particular data sets of interest or to quickly see what data have been collected in a region, we have developed a server-side text-based keyword search interface integrated with a server-side mapping tool (MapServer). To enable visual exploration for both specialist and non-specialist users, we have developed a domain-aware client-side application, GeoMapApp, which permits dynamic interaction with a variety of marine geoscience data including the multi-resolution global DEM. Open Geospatial Consortium compliant Web Services are being developed to enable MGDS data holdings to be accessed by other visualization tools.
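For instance, a standard OGC Web Map Service GetMap request would let an external client pull a rendered map of MGDS holdings. The sketch below builds such a request; the endpoint URL and layer name are placeholders, since the actual MGDS service addresses are not given in this document.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and layer name -- placeholders only; the real
# MGDS service addresses are not specified here.
WMS_ENDPOINT = "https://example.org/wms"

params = {
    "SERVICE": "WMS",
    "VERSION": "1.1.1",
    "REQUEST": "GetMap",
    "LAYERS": "global_dem",     # assumed layer name
    "STYLES": "",
    "SRS": "EPSG:4326",
    "BBOX": "-130,44,-125,48",  # west,south,east,north in degrees
    "WIDTH": 800,
    "HEIGHT": 640,
    "FORMAT": "image/png",
}

request_url = WMS_ENDPOINT + "?" + urlencode(params)
print(request_url)
```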

Long Term Goals

The projects within the MGDS cluster are being developed as a community data resource as defined by NSB report 0540 on Long-Lived Digital Data Collections. The long-term goal of the MGDS is to establish core databases for marine geoscience data and to serve as an active data system for these data. Related goals are to:

• Enable marine geoscience data to be discovered and reused by a diverse community for present and future use.
• Develop more than a data resource for marine geoscience specialists by establishing links to other geoinformatics activities and pursuing developments in interoperability.
• Compile global datasets to facilitate global syntheses.

Requirements arising from these goals drive the development of the data system and include the need to:

• Handle diverse and large multidisciplinary data types (e.g. seismic, sonar, geological, fluid, biological, rock, and sediment samples, temperature, photo imagery)
• Provide access for both specialist and non-specialist users
• Take advantage of emerging technologies
• Respond to evolving user needs

3.2. Project Management

3.2.1. Organizational Structure

One of the strengths of the MGDS has been the capacity to gather adequate resources to build an expert team with the diverse expertise needed to handle the variety of MGDS activities. This includes science experts and science community liaisons who can guide the database development; data scientists and database and application programmers with the essential domain expertise to contribute to database design; and data specialists who gather and format data and metadata for input to the database. Any single project of the MGDS would be unable to support a team with this needed range of expertise. Figure 3.1 illustrates the organizational structure for the MGDS, identifying the teams that perform tasks within six functional areas.

Figure 3.1 Organizational Structure of the Marine Geoscience Data System

As noted in section 3.1, several of the projects within the MGDS cluster include collaborations with investigators at other institutions (Table 3.1) who participate in Scientific Directorate activities as well as providing specific database components.

3.2.2. Roles and Responsibilities

The roles, along with the primary responsibilities of the MGDS team members, are outlined below.

Program Director: Oversees and coordinates all project activities; provides scientific guidance for data system design and priorities; manages and oversees resources; responsible for reporting; interfaces with NSF program managers for project components; coordinates activities with collaborating partners; scientific community liaison.

Senior Scientist: Guides interface development (GeoMapApp); contributes to data system design and priorities; scientific community liaison; coordinates activities with collaborating partners; supervises undergraduate and graduate research experiences associated with the data system.

Science Advisors: Contribute to data system design and priorities, including metadata requirements.

Senior Engineer: Contributes to data system design and priorities, with a focus on system engineering; coordinates activities with collaborating partners; directs the multibeam QA effort.

Applications Programmers: Responsible for development and maintenance of applications for data visualization and mapping (GeoMapApp) and of web services for implementing interoperability.

Database Developer and Programmers: Responsible for database design and implementation and for the web search tool for data access; aid data managers in development of data loading and validation procedures.

Project Data Managers: Responsible for data solicitation, formatting, QC, and entry; work with database programmers to develop procedures for data loading and validation; provide supervision for data specialists who work with the team on data entry; science user support and liaison.

System Analyst: Administration of servers, including upgrades and monitoring; software maintenance; system backup. Ensures system reliability, performance, and security.

Data Specialists: Aid in data and metadata entry, including reformatting and compiling metadata from legacy sources; recover legacy data from tape; user support.


3.2.3. Work Tasks

Activities of the MGDS are grouped into tasks related to the development of the shared infrastructure as well as project-specific tasks. The overarching activities that serve all projects represent the synergies gained by building a unified data system to serve the various science needs addressed by the projects within the MGDS. Significant cost savings are achieved by having all of these elements developed once rather than by multiple groups. Table 3.2 shows the general Work Tasks of the MGDS, with shared infrastructure and project-specific tasks indicated, as well as the primary staff members responsible for each component.

Work Task | Staff

Management
• Program Management & Directive | Carbotte, Ryan
• Program Administration & Coordination Assistance | Taylor (hire expected November 2006)

System Engineering & Development
• Database design, development, and review | Arko, OHara + Team
• Web site development | subcontract
• Search interface development & deployment | Arko, OHara + Team
• Interoperability development (web services, controlled vocabularies) | Arko, Melkonian
• Map Interfaces (GeoMapApp, GoogleEarth, MapServer) | Melkonian, Ryan, Arko
◊ Data loading procedures/scripts | OHara, Goodwillie
◊ MB quality assessment metrics development | Chayes

System Operation
• Database administration | Arko
◊ Data ingest | Goodwillie, OHara
• System maintenance, security, backups | Arko, Chayes + LDEO network support
• Web site maintenance | OHara, Goodwillie
• Data stewardship (long-term archiving, metadata documentation) | Arko, OHara

Data Development
◆ Data compilation/solicitation: legacy + modern | Carbotte (R2K), Goodwillie (Margins/Legacy), OHara (AMBS), Alsop (SRFDMS), Weissel, Barone
◆ Data and metadata entry & quality control | Goodwillie (R2K/Margins/Legacy), OHara (AMBS), Alsop (SRFDMS), Weissel
◆ User support | Goodwillie (R2K/Margins/Legacy), OHara (AMBS), Alsop (SRFDMS), Weissel (RT) + Team

Outreach & Community Relations, Education
◊ Publications & presentations | Carbotte, Ryan, Goodwillie, Arko, Chayes
◊ Workshops | Team
• Exhibits | Team
◆ Education | Ryan, Kastens

Table 3.2: Work Tasks for the MGDS program.
Legend: • = overarching activities; ◆ = project-specific tasks; ◊ = partly overarching

Development of specific components of the shared infrastructure has been funded through the individual projects of the MGDS, as reflected in the combined schedule for the system shown in section 3.2.10. The project schedule shows work tasks grouped by the project funding the task, rather than by the broad groupings shown above.

3.2.4. System Development

System development for the MGDS has been guided by community input regarding science needs as well as by new technological developments. The core architecture of the MGDS is designed to follow the recommendations of community workshop reports (Smith et al., 2001; Shipley et al., 2000) as well as the requirements defined in R2K and MARGINS data policy documents. The MGDS is based on open, well-documented standards and is implemented almost entirely with free, open-source software. Outreach and reporting to the science user community have been an integral part of system development (see section 3.2.7), along with participation, via meetings and workshops, in geoinformatics activities in other areas (www.marine-geo.org/meetings). Our most closely aligned activity is with the Marine Metadata Initiative (MMI), a community-based effort led by John Graybeal at MBARI to develop standard vocabularies and ontologies for metadata across the spectrum of marine science activities. Our involvement in this effort ensures that our metadata developments are coordinated with broader community efforts.

3.2.5. Operations and Sustaining Engineering

System operations include periodic review and refinement of the database design and data solicitation procedures (digital cruise metadata forms), as well as of data ingestion, validation, and archiving procedures. We currently operate under version 89 of our database schema. Our digital cruise metadata forms have been through two cycles of revision based on user feedback and will continue to be assessed on an annual basis. Periodic reviews of the search interfaces and GeoMapApp are conducted based on user feedback provided through the advisory and science steering committees. A major new release of our text-based search interface, Data Link, is scheduled for spring 2006. Updates to GeoMapApp functionality have been provided on a roughly quarterly basis.

Project goals, tasks, timelines, critical paths, and milestones are planned with FastTrack Schedule 9® software, chosen because it runs in both our networked Windows and Macintosh environments.

Document management is handled using Plone®, a web-based content management system with a full audit trail that lets the team enter, view, and edit documents. The MGDS currently uses Plone to manage documents describing data ingestion procedures, dynamic task lists, and documents associated with team meetings (agendas, action items).

Request Tracker® is employed for handling all user feedback and inquiries and also provides an audit trail of all user communications. Each user request is logged via Request Tracker along with subsequent correspondence, comments, and solutions.

Routine security checks and updates are done in collaboration with the network support group at Lamont. Software updates to our core services are applied after testing on a non-production system.

3.2.6. Education

The MGDS contributes to the goal of educating a new science workforce with an understanding of data stewardship by involving student interns in project activities. Our approach is to develop science research projects for these interns that include data compilation and entry into the database. Since project inception, graduate students and undergraduate summer interns as well as local high school students have been involved in many aspects of data harvesting and programming (including the Java code for GeoMapApp). Two of these individuals have presented their work at AGU and GSA meetings (Ryan, W.B., Muhlenkamp et al., 2003; Schon et al., 2004). The experience of these individuals contributes to the education of a new generation that understands the importance of metadata and of data contribution as part of the scientific research process.

The R2KDMS includes a dedicated Education Component, which has proceeded in parallel with development of the data system. This component has been led by Kim Kastens and involved collaboration with Tamara Ledley of the DLESE Data Services Group to host a workshop focused on obtaining input from education users for the design of data system tools. This workshop was held at Lamont on July 21-22, 2004 and is documented on the web at http://swiki.dlese.org/DataSvcsWkshp-04/140. Development of data-rich activities for K-16 education was supported by this component and has resulted in the following curriculum activities: a module on the Dynamics and Geomorphology of Mid-Ocean Ridges by Jeff Thomas of Teachers College (http://serc.carleton.edu/dev/eet/rodes_6/index.html), a Mid-Ocean Ridge Basalt example using PetDB and Excel by Matt Smith and Mike Perfit of the University of Florida, and a Calcium Carbonate exercise based on GeoMapApp assembled by Bill Ryan and Peter DeMenocal of LDEO/Columbia University. An additional example is an evaluation of earth science awareness developed by Sandra Swenson, a graduate student at Teachers College, working with 8th, 10th, and 12th grade Earth Science students in spring 2005.

3.2.7. Reporting

An external Advisory Committee covering all projects within the MGDS cluster has been established to provide advice and oversight from the scientific community (http://www.marine-geo.org/advisory.html). This committee meets annually to provide the MGDS team with advice on priorities, directions, and science community needs, as well as advice to NSF regarding the functioning of the DMS.

Reporting to the Ridge2000 and MARGINS science user communities occurs twice yearly, with updates provided at the steering committee meetings of both programs. This frequency of reporting has kept the data system closely connected to the science user community and has contributed to the success of data submission to the MGDS. It also ensures system development in response to needs and priorities defined by these science communities.

The MGDS has operated a booth in the Exhibition Hall at the past two fall AGU meetings and will continue this activity in future years. This activity contributes significantly to advertising the data system and engaging new users.

Annual reports are written for NSF for each project. We favor moving to a single annual report to NSF for all projects of the MGDS. This annual report could also serve as a summary document for the AC and a record of annual progress. NSF oversight of the MGDS could include facility site visits coincident with annual Advisory Committee meetings.

3.2.8. Metrics of Success

The success of the MGDS is monitored by tracking web site and GeoMapApp use, as well as the growth of the data holdings. The number of site visits and the volume of file downloads are tracked on a monthly basis using the Webalizer site use statistics package. For GeoMapApp usage, we track the number of new and returning users and the volumes downloaded. User feedback was directly solicited at our annual booth in the AGU Exhibition Hall in 2005 with a user logbook. Unsolicited user feedback received via email queries and comments is logged using Request Tracker. Other methods to solicit user feedback, including email surveys and web site questionnaires, will also be implemented in the future.
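Monthly download counts of this kind can be derived from ordinary web server access logs. The sketch below gives the flavor of such tracking by counting successful GET requests per month; it assumes Apache combined-format logs and a hypothetical /data/ path prefix, whereas the MGDS itself uses Webalizer for these statistics.

```python
import re
from collections import Counter

# Matches the timestamp, request line, and status of Apache combined-format logs.
LOG_RE = re.compile(r'\[(\d{2})/(\w{3})/(\d{4}):[^\]]+\] "GET (\S+) [^"]*" (\d{3})')

def monthly_downloads(log_lines, path_prefix="/data/"):
    """Count successful file downloads per month from access-log lines.
    The /data/ prefix is an assumption, not the real MGDS layout."""
    counts = Counter()
    for line in log_lines:
        m = LOG_RE.search(line)
        if m and m.group(4).startswith(path_prefix) and m.group(5) == "200":
            counts[f"{m.group(3)}-{m.group(2)}"] += 1  # e.g. "2006-Aug"
    return counts
```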

Given the nature of the data system as a data catalog as well as a repository of sensor data, growth of the data system is monitored through a variety of metrics, including the number of data objects within the DMS, the number of field programs cataloged, and the growth in data file volumes. Each data object cataloged within the MGDS is tied to the project that funded its ingestion, so that growth in data holdings can be tracked by MGDS project.


An additional metric of success will be the willingness of the user community to submit data to the MGDS. Since the development of our digital metadata forms, we have observed growth in the number of users who submit metadata forms without direct contact and solicitation from us. As the data system grows more useful to the user community, we hope data submission may become more automatic and timely, with less direct solicitation required.

Citations in publications are the standard measure of scientific impact and will be tracked as an additional measure of success. However, many of the data services provided by the MGDS are unlikely to lead to citations, and publications are likely to reflect only a small percentage of DMS impact. For example, use of the system to find the whereabouts of data collected within a region of interest as part of a pre-cruise or proposal-planning effort would not lead to a citation, and downloaded bathymetry data used as a base map for a study are rarely cited. Given the typical publication cycle of a manuscript, citations will also become a more useful measure of the impact of a data system over time.

3.2.9. Risk and Contingency Management

This section outlines our existing strategy for dealing with some possible risks, including physical damage, malicious hacking, and personnel changes.

1. Risk of physical destruction of the facility by fire, flood, or other catastrophic event.

Mitigation: The risk of physical destruction is mitigated by maintenance of off-site data backups and an off-site redundant server, as described in the MGDS infrastructure description in Appendix 4.1. In the event of physical destruction of our primary servers (or loss of physical access to the building), we would switch service to the off-campus system, which we are in the process of installing on the Morningside Campus of Columbia. We would then acquire new hardware and re-establish our servers in a different (temporary) location on our campus. In the event of campus-wide destruction, we would rely on our Morningside location. In either case, we would restore from our off-campus backups that are stored by Iron Mountain.

2. Network vandalism.

Mitigation: In the event of malicious network vandalism, we assume that such an attack would hit both our primary servers at Lamont and the Morningside campus. Such an attack would most likely not damage the hardware, so recovery would be relatively quick. It would take us off-line for a period on the order of 12-24 hours while we understand the breach, protect ourselves from a future instance, and restore from our backups.

3. Loss of key personnel. Due to limited funding, the data system operates with minimal redundancy in expertise.

Mitigation: The data system is operated with overlap in expertise in the essential areas of database programming (O'Hara and Arko), applications (Arko, Melkonian, and new personnel), and science directive (Carbotte, Ryan). Project leadership, reporting, and liaison with the science community have been shared within the team, ensuring that the loss of one individual can be absorbed. Adequate documentation of system components, including database structure, ingest procedures, and interfaces, ensures that DMS operations can be transitioned to new personnel as required.

4. The short-term and uncertain nature of the funding under which the MGDS has evolved means that the longevity and security of the projects are at risk. Projects are competed through peer review and are subject to the interests and priorities of the science community regarding the need for open data access and data preservation. In a tight funding climate, the need to ensure data preservation and data access may be perceived as in conflict with the need to fund new scientific research. Projects require highly expert personnel with training in both science and IT domains. Such people are difficult to find and keep in an uncertain and short-term funding climate.

Mitigation: The MGDS aims to ensure that the data system provides high-quality service to build the community support needed to secure ongoing funding under renewal grant periods. In the event of a short-term gap in funding, bridging support can be requested from the Observatory. The MGDS Advisory Committee (currently configured to include PetDB and SedDB) voiced its support for future funding of the DMS under a Cooperative Agreement (see recommendations from October 2005, www.marine-geo.org/advisory). More secure long-term funding would contribute significantly to ensuring that high-quality personnel can be trained and retained.

In the event that all NSF funding for the MGDS ceases, operation of the data system would continue as long as the servers remain operational. Addition of new data, user support, and further development of functionality would cease.

5. Enforcement of data policies for data submission falls outside the purview of the MGDS, and data submission requires the cooperation of the science community. Data submission requirements are defined by NSF and community data policies and, in the case of the R2K and MARGINS communities, include required submission to the MGDS. However, enforcement of these submission requirements falls outside the purview of the DMS, and data contribution can only be solicited from the community on a voluntary basis. This leads to partial, and in some cases difficult to obtain, data submission, and a large time investment in solicitation.

Mitigation: Develop a multi-pronged approach to improving data submission in the long term that includes direct solicitation of scientists, as well as working towards new automatic procedures for data documentation at sea and direct data transfer from the ship operators. Automatic transfer of data from the Palmer by Raytheon Polar Services personnel has been established through the directive of our OPP Program Manager. Direct transfer of seismic and multibeam data from the Langseth will be established in support of the SRDMS and LEGACY through our direct link to LDEO ship operations. The R2K and MARGINS DMS projects have dedicated science communities that support the data effort and can provide community pressure. The legacy data components of MGDS projects are focused on data sets within Lamont holdings as well as other public centers (NGDC) where data access is known. Mitigation also requires more rigorous NSF enforcement of data submission requirements, as well as engagement of NSF Program Managers for Ship Operations in devising multi-pronged data submission procedures that include both scientists and improved data documentation at sea.

6. Project scope is large, and project activities associated with data documentation and validation and all aspects of legacy data recovery are manpower intensive.

Mitigation: Target activities of maximum benefit to the science community, as determined through the proposal review process and via an advisory structure that participates in establishing priorities.

3.2.10. Schedule/Milestones

Figure 3.3 provides a combined schedule for all project tasks for the MGDS for the duration of current funding (until September 2010). Note that the current R2K and MARGINS DMS grants expire in summer 2006, and most project activities for these grants are completed. A renewal proposal for a combined R2K and MARGINS DMS was submitted for the February 15, 2006 MG&G deadline and is currently pending.


4. Appendix

4.1. System Architecture and Infrastructure for the Marine Geoscience Data System (MGDS)

The diagram below illustrates the core architecture on which MGDS systems are developed, tested, and operated. It consists of backend Platforms (Web and Database), connecting Services, and end-user Applications. This infrastructure serves both production and development.

Hardware

The production servers are Intel/Linux (Fedora Core) machines, which yield excellent price-performance and benefit from the open-source developer community. The development servers are a mix of Linux, Solaris, Mac OS X, and Windows systems, which allow us to test applications for cross-platform compatibility. The complete list is shown below:

Purpose | Operating System | Manufacturer/Model
Production #1 | Linux | Dell Precision 8300
Production #2 | Linux | Dell Precision 8300
Production #3 | Linux | Dell Precision 8300
Development | Linux | Dell Precision 450N
Development | Solaris | Sun Blade 100
Development | Mac OS X | Apple iMac G5
Development | Windows | Dell OptiPlex GX280

A master copy of all MGDS data objects is stored in the LDEO Mass Store system, an enterprise-class "deep archive" consisting of a quad dual-core Sun Fire V890 server and an ADIC Scalar i2000 tape library. The system holds 2 TB of online disk cache and ~80 TB of near-line LTO tape storage, replicated to secure, climate-controlled sites in two different buildings on campus via redundant gigabit switches to the campus backbone.

A wide range of magnetic and optical drives is available to transfer data packages, directly attached to the MGDS cluster or an adjoining subnet, including DLT (SuperDLT 320 and DLT-4000/7000), IBM 3480/3490, 8mm tape (Exabyte 82xx, 85xx, and 87xx), 4mm tape (DDS-2, 3, and 4), removable disk (DynaMO 2300, Zip 100 and 250, Jaz 1- and 2-GB), and optical disc (8x DVD+/-RW and 40x CD-RW).

Software

A wide range of open-source and commercial software supports MGDS development and operation. The complete list is shown below:

Package/Version | Purpose
Apache HTTP 2.2 | Web server
Apache Tomcat 5.5 | Java servlet engine
PHP 5.1 | Web scripting language
PostgreSQL 8.1 | Relational database
PostGIS 1.1 | Database geospatial extensions
MapServer 4.8 | Web GIS
Eclipse 3.1 | Integrated development environment
Subversion 1.4 | Version control
Plone 2.1 | Content management system
Request Tracker 3.6 | User feedback
NetWorker 7.3 | Backup/recovery

MGDS development also uses CU/LDEO site licenses for commercial scientific software packages including ArcGIS (ESRI), ENVI (RSI), and MATLAB (MathWorks).


Networking and Security

The MGDS server cluster is connected to the Lamont campus gigabit fiber backbone, which is in turn connected to the commodity Internet via redundant land (to the Morningside campus in New York City and Internet2) and microwave (to an alternate ISP routed through Boston) T3 links.

The entire Lamont campus network is protected by a Juniper NetScreen-208 (www.juniper.net) enterprise firewall. External access to MGDS production machines, such as user logins and incoming file transfers, is segregated or disabled.

Automatic backups are performed nightly on all DMS systems (both production and development). EMC NetWorker (www.legato.com) is used for general filesystem backups, and built-in (application-specific) export functions are used for the PostgreSQL relational database system and the Plone content management system. All hardware systems are covered by extended vendor warranties with on-site service. All software systems are reviewed and updated as needed on a quarterly basis.
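As an illustration of the application-specific export step, a nightly PostgreSQL dump can be driven by a small script like the sketch below. The database name, user, and output path are placeholders rather than the actual MGDS configuration, and pg_dump's custom-format archive (-Fc) is one reasonable choice, not necessarily the documented MGDS procedure.

```python
import datetime
import subprocess

# Placeholder connection details -- not the actual MGDS configuration.
DB_NAME = "mgds"
DB_USER = "backup"
BACKUP_DIR = "/backups/postgres"

def nightly_dump():
    """Write a date-stamped, custom-format pg_dump archive suitable
    for restore with pg_restore."""
    stamp = datetime.date.today().isoformat()
    outfile = f"{BACKUP_DIR}/{DB_NAME}-{stamp}.dump"
    subprocess.run(
        ["pg_dump", "-U", DB_USER, "-Fc", "-f", outfile, DB_NAME],
        check=True,
    )
    return outfile
```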

Contingency

An agreement has been reached with the Columbia University Information Technology (CUIT) department to install a remotely managed backup server at the CUIT Computer Center on the Morningside campus in New York City. A contract has been signed with Iron Mountain, Inc., to provide secure long-term digital media storage at their Data Vault facility in New Paltz, NY. Transfers to the Vault occur monthly by courier.
