Project Management Plan

Marine Geoscience Data System
&
Geoinformatics for Geochemistry Program

Version 1.7p
August 22, 2006

Lamont-Doherty Earth Observatory of Columbia University
1. Introduction
2. Geoinformatics Projects at LDEO
   2.1. Vision statement
   2.2. General Management Structure
   2.3. Advisory Structure
   2.4. Stakeholders
   2.5. Synergies
3. PEP for the Marine Geoscience Data System
   3.1. Project Definition
   3.2. Project Management
      3.2.1. Organizational Structure
      3.2.2. Roles and Responsibilities
      3.2.3. Work Tasks
      3.2.4. System Development
      3.2.5. Operations and Sustaining Engineering
      3.2.6. Education
      3.2.7. Reporting
      3.2.8. Metrics of Success
      3.2.9. Risk and Contingency Management
      3.2.10. Schedule/Milestones
4. Appendix
   4.1. System Architecture and Infrastructure for the Marine Geoscience Data System (MGDS)
1. Introduction

Federal agencies including the NSF, NASA, NOAA, and USGS fund a range of data management and Geoinformatics activities for the Earth and Ocean Sciences that are carried out at the Lamont-Doherty Earth Observatory (LDEO), the Center for International Earth Science Information Network (CIESIN), and the International Research Institute for Climate and Society (IRI). LDEO, CIESIN, and IRI are units of the Earth Institute at Columbia University, co-located on Columbia’s Lamont Campus in Palisades, NY. Projects funded by the NSF include several mature activities with a long history of funding: the Sediment Core Repository, which houses sediment cores from the global oceans, and the ODP/IODP Borehole Group, which serves down-hole logging data in support of the ODP and IODP programs.

Recently the NSF has funded a suite of new activities to build and manage digital data collections for Marine Geology & Geophysics, Geochemistry, and the broader Geosciences. Though funded as independent activities with different durations and termination dates and at different stages of maturity, these projects overlap widely in their objectives, technical requirements, equipment, and personnel resources, with significant synergies among them.

This document describes the strategic and management approaches, organizational structures, and timelines under which these NSF-funded cyberinfrastructure projects will be executed. It is intended to be a dynamic document, reviewed and revised annually in discussion with the Advisory Committee and the National Science Foundation. Revisions will reflect changes in the scope and type of projects included, in the organizational structure, and in the funding. New projects will be integrated into this plan as appropriate.
2. Geoinformatics Projects at LDEO

2.1. Vision statement

“It is exceedingly rare that fundamentally new approaches to research and education arise. Information technology has ushered in such a fundamental change. Digital data collections are at the heart of this change.” (Long-Lived Digital Data Collections, Report of the National Science Board to the NSF, 2005)
Geoinformatics projects at LDEO are focused on building and maintaining digital data collections that are essential resources for the Geosciences, with the aim of maximizing the accessibility and thus the application of scientific data in research and education. These data collections aid data discovery, access, and analysis; support cross-disciplinary approaches in science by facilitating data integration across disciplinary, spatial, and temporal boundaries; and advance the “principle of open access” to data and samples that were acquired with public funding.

Most of our systems can be viewed as Resource or Community Data Collections according to the classification of digital data collections in the NSB report. Resource Data Collections “serve a specific science and engineering community… They typically conform to community standards, where such standards exist. Often these digital collections can play key roles in bringing communities together to develop appropriate standards where a need exists. In many cases community database collections migrate to reference collections”. Projects in the MGDS and the GfG program have taken leadership roles in defining standards for cruise metadata and geochemical data, respectively, working closely with the appropriate communities through their Advisory Committees and various outreach activities.
Management of digital data for the Geosciences has a long tradition at LDEO that dates back to the 1960s, when it led to a rich database of underway geophysical data and subsequently to an early experiment in geographical database access (W. Menke’s Geographical Database Browser, XGB). In the mid to late 1990s, the RIDGE Multibeam Synthesis project and the PetDB database for geochemical data of ocean floor igneous and metamorphic rocks were developed and began to serve scientific data over the web to a broad audience. The new projects described in this Project Execution Plan build upon and extend these earlier efforts. Most of our projects were proposed based on recommendations from community workshops that outlined the objectives, scope, and functionality of data management systems that would best serve science and education (Michael et al., 1991; Smith et al., 2001; Shipley et al., 1999; Cervato et al., 2004).
The significant growth in data management activities at LDEO, built on the success of the initial efforts, was intentionally pursued to establish a critical mass of professionals who can take on the wide range of tasks necessary to design, develop, and maintain digital data collections. Data management activities at LDEO have grown incrementally, but have now reached a level where a well-defined management plan is required that documents the approaches used to accomplish design and development tasks and to sustain operation of the data collections. This plan also exposes the significant synergies among the individual projects and project clusters that have been critical to the efficiency with which the systems are built and operated. These synergies will be further exploited in the future through improved integration of our operations, management, and advisory structure.
Our primary goal for the future is to ensure the highest level of quality and long-term sustainability of our systems, our data products, and the services they provide to the community. To achieve this goal, we need to maintain a level of funding that will allow us to support our operations with the expert teams we have gathered and to continue the infusion of relevant new technologies. A potential decrease of resources due to termination of projects or reduction of funding will need to be balanced by new endeavors that complement the existing capabilities. We will primarily focus on expanding collaborations with other Geoinformatics efforts to leverage ongoing developments and further contribute to establishing a cyberinfrastructure for Geoscience research and education.
2.2. General Management Structure

The activities are managed in two thematically and organizationally distinct clusters, the Marine Geoscience Data System (MGDS) and the Geoinformatics for Geochemistry (GfG) system. The GfG projects are executed by a joint team from LDEO and CIESIN, with management responsibility for the IT team resting at CIESIN, while all IT and science directive activities for the MGDS are executed by a team from LDEO.
Figure 1 General Management and Advisory Structure for the MGDS and GfG
To simplify management of the activities in the future, we envision the projects within the two clusters merging into two larger efforts. We plan to achieve this by combining operation, maintenance, and further development of the individual systems in upcoming renewal proposals.
2.3. Advisory Structure

All digital data collections and information systems built and maintained under this Project Execution Plan are community resources for the broad Geosciences that manage, archive, and serve scientific data to benefit research and education. As such, they need to address user concerns and respond to broad scientific and educational needs. The recent NSB report on Long-Lived Digital Data Collections emphasizes the obligation of data managers to maintain an “open and effective communication with the served community” and to “gain the trust of the community that the collection serves”.
To establish the needed community input and oversight, and to ensure efficient communication with the community and other stakeholders, advisory committees have been established for individual projects and groups of projects, with broad disciplinary, institutional, and agency representation. An Advisory Committee was set up in 2004 for the projects funded out of the MG&G program at NSF (all efforts combined under the MGDS plus PetDB). This committee met twice, in 2004 and 2005, to discuss progress of the projects and provide advice for ongoing and future development efforts. The EarthChem project set up an Advisory Committee in 2005 that advises on tools, functions, and data sets that may be desired to best benefit the user community, and that helps define and assists with the implementation of data policies and metadata standards. The EarthChem AC met for the first time in December 2005. The R2K and MARGINS data management systems also report to their respective program steering committees on a regular basis.

We propose to establish a single Advisory Committee for all data management projects executed under this plan, to enhance integration of the disciplinary data sets and alignment of their goals and strategies. This Advisory Committee will be composed of a broad range of disciplinary scientists representing marine and terrestrial geology, geophysics, and geochemistry, as well as Geoinformatics and Information Technology practice.
2.4. Stakeholders

The following list outlines our current view of the stakeholders in this effort:

• Data Providers (system operators and scientists) – require that data contribution be as easy and as efficient as possible
• Data Users (researchers, educators, and the public) – need metadata and effective search tools
• Researchers – need to be able to access, visualize, and download data for further analysis, and need the ability to limit access to some data
• Operating institutions (including vessel operators) – need to be able to find data they collected and to be recognized for their efforts
• Oceanographic institutions – have vested interests with respect to ownership and recognition
• Shipboard technical support personnel – maintain, operate, and document the data acquisition equipment
• Mission scientists – have report-writing responsibilities
• Foreign collaborators – have data submission and distribution requirements
• Educators – use data and derived products for a broad range of activities
• Federal agencies including NSF, USGS, MMS, and NOAA – use and contribute data
• General public – pay the bills
2.5. Synergies

The construction and maintenance of the MGDS and GfG data systems benefit substantially from the natural synergies that exist among the individual projects as well as between the two project clusters. Objectives, science applications, and user communities overlap widely, as do the technical requirements related to data and metadata types, tools, and interoperability. This offers extensive opportunities for collaboration and integration that positively impact the execution of the projects and, more importantly, the quality of the products. The synergies allow us to share design approaches, experience, expertise, personnel, and management tools to continuously support each other’s operations, ensure compatibility, accelerate progress, and promote the broadest application of our data collections. Figure 2 highlights the most prominent synergies among the projects.
Examples of synergies between the MGDS and GfG include the application of GeoMapApp (the MGDS data visualization tool) as a map interface and plotting tool for the geochemical data systems, the use of sample metadata schemes developed by PetDB for the MGDS metadata catalog, seamless data exchange between PetDB and the MGDS, and the use of MGDS cruise metadata for SESAR sample profiles. We will further augment shared activities by integrating our advisory structure and building a joint web site that will serve as a common portal to all projects.

Figure 2 Synergies between projects in the MGDS and the GfG program
Additional benefits arise from close interaction with the broader data and sample management activities at LDEO and CIESIN. These include the IODP logging services data management; the Socioeconomic Data and Applications Center (SEDAC), one of the Distributed Active Archive Centers (DAACs) of NASA’s Earth Observing System; the IRI Data Library; and the LDEO core repository.
3. PEP for the Marine Geoscience Data System

3.1. Project Definition

The projects grouped within the MGDS include the Antarctic Multibeam and Geophysical Data Synthesis (AMBS), the Ridge2000 Data Management System (DMS), the MARGINS DMS, the Seismic Reflection Field DMS, and the Legacy project. Although these individual projects serve different sections of the geoscience community, the underlying goals of each effort are fully complementary. We have therefore evolved these projects as a single unified data system. The overall scope of these projects is fourfold: 1) to serve as a primary sensor database for underway geophysical data, multibeam sonar, and seismic reflection data from the R/V Palmer, Gould, Ewing, and Langseth; 2) to develop a comprehensive metadata catalog and data repository, including derived data products, for the Ridge2000 and MARGINS programs and for the wider spectrum of future and historical marine geoscience programs; 3) to provide a synthesis of global ocean bathymetry at the full spatial resolution of the available data; and 4) to provide tools for data access tailored to science needs. The projects within the MGDS are being developed as a single data system with a common backend database and common front-end tools for access. Because of the integrated nature of these efforts, developments funded under each project have been leveraged to the benefit of all the others, providing significant economies of scale. Development of the integrated data system has been underway since 2003 in collaboration with researchers at other academic institutions (Table 3.1). The core data system architecture is established, and public access to project components has been available since 2003/2004. The data system currently (August 2006) provides access to ~1.9 terabytes of data, corresponding to over 196,000 individual data objects associated with 1429 cruises dating back to the 1970s. Data system use has been tracked since public access began. During 2006 we have tracked ~500-1500 data download sessions per month and total monthly downloads of 160,000-510,000 individual data files.
Table 3.1 MGDS Projects and Collaborators

Project | Lead Institution | Collaborator | Non-LDEO Responsibilities
Seismic Reflection DMS | UTIG (T. Shipley) | LDEO | Development and operation of Seismic Reflection Processed DMS
RODES | LDEO | WHOI (T. Shank)* | Scientific guidance for biological data; biological data compilation
MARGINS DMS | LDEO | TAMU (D. Becker)* | Linkages with ODP database
LEGACY | LDEO | NGDC (Chris Fox); University of New Hampshire (L. Mayer) | No-cost collaborations to implement multibeam data exchange (NGDC) and develop multibeam QC metrics (UNH)
R2K/MARGINS renewal (proposed Feb 2006) | LDEO | SIO (E. Simms)*; WHOI (V. Ferrini, S. Lerner)* | Vent Image Data Bank – image compilation (SIO); NDSF post-processing of Alvin/J2 navigation, metadata delivery (WHOI)

* Subcontract
1) Geophysical Sensor Database for R/V Ewing, Langseth, Palmer, and Gould

Unlike other kinds of data, which can be preserved and accessed via publications, digital data derived from sensors require a dedicated repository to ensure long-term data access, to provide the needed functions of data migration to new digital media, and to supply data documentation that facilitates re-use. The MGDS serves the marine geoscience community working on problems within the global oceans, including the Southern Ocean, by preserving key marine geoscience digital datasets collected from the R/V Ewing, Langseth, Palmer, and Gould. Geophysical sensor data that fall within the scope of the MGDS include multi-channel seismic (MCS) reflection, multibeam sonar, and underway geophysics (gravity, magnetics) data. Acquisition costs alone for a typical multi-channel seismic or polar ocean expedition are close to $1M per cruise, but prior to our effort no central repository for these data existed, and secondary reuse of these data has been minimal. Data have resided with scientist originators, and re-use has required prior knowledge of a data set and direct contact with a PI who may or may not be able to read tapes and transfer the data. The MGDS undertakes data validation and quality assessment (QA) for these data, assumes responsibility for adequately documenting them using community-defined standards, and provides open public data access, secure storage, and long-term data preservation.
Under the AMBS, all multibeam sonar data from the Palmer and underway geophysical data from the Palmer and Gould have now been aggregated and are served. Under the SRDMS, we have established direct transfer of MCS reflection field data from the Ewing for programs during its final two years of operation and will implement a similar procedure for upcoming programs of the Langseth. Older MCS field data from the Ewing and processed MCS data, which reside with LDEO PIs, are being recovered from tape and incorporated into the data system. Under the LEGACY project, all multibeam sonar data from the Langseth will be processed, validated, and archived.
2) Metadata Catalog and Data Repository for R2K, MARGINS, and historical and future MG&G programs

The scope of the Metadata Catalog effort covers R2K- and MARGINS-funded field expeditions, future MG&G programs outside these program areas, and historical expeditions and data sets that can be obtained from the community and adequately documented. The primary goal of the Metadata Catalog component is to enable scientists to discover what data were collected, by whom, where, and when, and the location where the data reside. The geophysical sensor data listed above, as well as submitted data for which no designated national repository exists, are served locally from our repository. For data for which a designated national repository does exist, the MGDS links to these distributed data resources rather than duplicating data holdings. An important function of this component is to serve as a repository for derived data products, e.g. gridded data sets, velocity models, tomography solutions, GIS project files, etc., which are of high value to the science user community for future re-use. The metadata catalog effort provides essential proxy functions for the science user community with respect to defining vocabularies and ontologies for marine geoscience data and establishing protocols for data exchange.
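To make the catalog's discovery role concrete, the sketch below shows a minimal cruise metadata record and discovery filter of the kind described above. It is purely illustrative: the field names, cruise identifier, dates, and filter function are hypothetical and do not reflect the actual MGDS schema.

```python
# Illustrative only: field names and values are hypothetical, not the MGDS schema.
cruise = {
    "cruise_id": "EXAMPLE01",                  # placeholder identifier
    "vessel": "R/V Maurice Ewing",
    "dates": ("2002-09-14", "2002-10-20"),     # placeholder date range
    "chief_scientist": "J. Doe",               # placeholder name
    "program": "MARGINS",
    "data_types": ["MCS reflection", "multibeam", "gravity", "magnetics"],
    "data_location": "MGDS repository",        # or a link to a designated national repository
}

def collected(catalog, data_type):
    """Answer the basic discovery question: which cruises collected this data type?"""
    return [c["cruise_id"] for c in catalog if data_type in c["data_types"]]

print(collected([cruise], "multibeam"))        # -> ['EXAMPLE01']
```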
3) Synthesis of seafloor bathymetry

This component is primarily a function of the AMBS and LEGACY projects and involves synthesis of publicly available multibeam bathymetry data from the ocean floor into an easy-to-access, multi-resolution gridded global digital elevation model (DEM). Multibeam bathymetry data are unique among the marine geophysical data types in their relevance for a broad range of scientific investigations as well as for non-academic uses. For example, they provide fundamental characterization of the physical environment for studies ranging from ocean bottom circulation to biological studies, and they serve as primary base maps for multidisciplinary programs. Detailed bathymetry maps are also relevant for applications beyond academic research, including management of marine fisheries and other coastal resources as well as marine navigation (e.g. the January 2005 collision of the USS San Francisco with an uncharted seamount off Guam). At present, specialist expertise is needed to access and manipulate multibeam bathymetry data, which are typically available only as individual survey areas. The existing broad-access global compilations of seafloor bathymetry do not include multibeam sonar data at their full spatial resolution (e.g. the 2x2 minute compilation of Smith and Sandwell, 1997). The MGDS provides a synthesis of expedition-based multibeam datasets at their full spatial resolution for non-specialist use by maintaining a continually updated, dynamic gridded global compilation.
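For intuition, one common way to organize such a multi-resolution DEM is a tile pyramid in which each level doubles the grid spacing, so that a client can request the level matching its map scale. The sketch below illustrates the idea only; the base resolution, level count, and selection function are hypothetical and do not describe the actual MGDS grid layout.

```python
BASE_RES_M = 100   # hypothetical finest grid spacing, in meters
LEVELS = 8         # hypothetical number of pyramid levels

def best_level(requested_res_m, base_res_m=BASE_RES_M, levels=LEVELS):
    """Pick the coarsest pyramid level that is still at least as fine as
    the requested resolution, so no detail visible at that scale is lost."""
    level = 0
    while level + 1 < levels and base_res_m * 2 ** (level + 1) <= requested_res_m:
        level += 1
    return level

# A map window drawn at ~1 km per grid cell can use level 3 (800 m cells).
print(best_level(1000))   # -> 3
```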
4) Tools for data access tailored to science user needs

In parallel with the data catalog and archiving functions, the MGDS works to provide tools for data access and interactive visualization tailored to science user needs. The goal of this component is to lower the barriers to data access and enable users to explore data holdings without requiring a specialist understanding of the underlying data structures. Our approach has been to develop a range of options for searching and accessing data. To enable specialist users to find particular data sets of interest or to quickly see what data have been collected in a region, we have developed a server-side text-based keyword search interface integrated with a server-side mapping tool (MapServer). To enable visual exploration by both specialist and non-specialist users, we have developed a domain-aware client-side application, GeoMapApp, which permits dynamic interaction with a variety of marine geoscience data including the multi-resolution global DEM. Open Geospatial Consortium compliant Web Services are being developed to enable MGDS data holdings to be accessed by other visualization tools.
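OGC Web Services such as the Web Map Service (WMS) use a standard key-value URL encoding, which is what allows third-party visualization tools to consume the holdings. The sketch below shows how an external client might query such a service; the endpoint URL and layer name are placeholders, not actual MGDS service addresses.

```python
import urllib.parse
import urllib.request

# Placeholder endpoint; not an actual MGDS service address.
WMS_ENDPOINT = "http://example.org/cgi-bin/mapserv"

def wms_url(request, **params):
    """Build an OGC WMS 1.1.1 request URL using the standard key-value encoding."""
    query = {"SERVICE": "WMS", "VERSION": "1.1.1", "REQUEST": request}
    query.update(params)
    return WMS_ENDPOINT + "?" + urllib.parse.urlencode(query)

# Discover what layers the server offers.
capabilities_xml = urllib.request.urlopen(wms_url("GetCapabilities")).read()

# Render a hypothetical bathymetry layer over a lon/lat bounding box.
map_png = urllib.request.urlopen(wms_url(
    "GetMap",
    LAYERS="bathymetry",       # hypothetical layer name
    STYLES="",                 # default style
    SRS="EPSG:4326",
    BBOX="-130,40,-120,50",    # west,south,east,north
    WIDTH="512", HEIGHT="512",
    FORMAT="image/png",
)).read()
```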
Long-Term Goals

The projects within the MGDS cluster are being developed as a community data resource as defined by NSB report 0540 on Long-Lived Digital Data Collections. The long-term goal of the MGDS is to establish core databases for marine geoscience data and to serve as an active data system for these data. Related goals are to:

• Enable marine geoscience data to be discovered and reused by a diverse community for present and future use.
• Develop more than a data resource for marine geoscience specialists by establishing links to other geoinformatics activities and pursuing developments in interoperability.
• Compile global datasets to facilitate global syntheses.

Requirements arising from these goals drive the development of the data system and include the need to:

• Handle diverse and large multidisciplinary data types (e.g. seismic, sonar, geological, fluid, biological, rock, and sediment samples, temperature, photo imagery)
• Provide access for both specialist and non-specialist users
• Take advantage of newly emerging technologies
• Respond to evolving user needs
3.2. Project Management

3.2.1. Organizational Structure

One of the strengths of the MGDS has been the capacity to gather adequate resources to build an expert team with the diverse expertise needed to handle the variety of MGDS activities. This includes science experts and science community liaisons who can guide the database development; data scientists and database and application programmers with the essential domain expertise to contribute to database design; and data specialists who gather and format data and metadata for input to the database. No single project of the MGDS would be able to support a team with this range of expertise on its own. Figure 3.1 illustrates the organizational structure for the MGDS, identifying the teams that perform tasks within six functional areas.
Figure 3.1 Organizational Structure of the Marine Geoscience Data System

As noted in section 3.1, several of the projects within the MGDS cluster include collaborations with investigators at other institutions (Table 3.1) who participate in Scientific Directorate activities as well as provide specific database components.
3.2.2. Roles and Responsibilities

The roles and primary responsibilities of the MGDS team members are outlined below.

Program Director: Oversees and coordinates all project activities; provides scientific guidance for data system design and priorities; manages and oversees resources; responsible for reporting; interfaces with NSF program managers for project components; coordinates activities with collaborating partners; scientific community liaison.

Senior Scientist: Guides interface development (GeoMapApp); contributes to data system design and priorities; scientific community liaison; coordinates activities with collaborating partners; supervises undergraduate and graduate research experiences associated with the data system.

Science Advisors: Contribute to data system design and priorities, including metadata requirements.

Senior Engineer: Contributes to data system design and priorities, with a focus on system engineering; coordinates activities with collaborating partners; directs the multibeam QA effort.

Applications Programmers: Responsible for development and maintenance of applications for data visualization and mapping (GeoMapApp) and of web services for implementing interoperability.

Database Developer and Programmers: Responsible for database design and implementation and for the web search tool for data access; aid data managers in the development of data loading and validation procedures.

Project Data Managers: Responsible for data solicitation, formatting, QC, and entry; work with database programmers to develop procedures for data loading and validation; supervise the data specialists who work with the team on data entry; science user support and liaison.

System Analyst: Administers servers, including upgrades and monitoring; software maintenance; system backup. Ensures system reliability, performance, and security.

Data Specialists: Aid in data and metadata entry, including reformatting and compiling metadata from legacy sources; recover legacy data from tape; user support.
3.2.3. Work Tasks

Activities of the MGDS are grouped into tasks related to the development of the shared infrastructure as well as project-specific tasks. The overarching activities that serve all projects represent the synergies gained by building a unified data system to serve the various science needs addressed by the projects within the MGDS. Significant cost savings are achieved by having all of these elements developed once rather than by multiple groups. Table 3.2 shows the general Work Tasks of the MGDS, with shared infrastructure and project-specific tasks indicated, as well as the primary staff members responsible for each component.
Table 3.2: Work Tasks for the MGDS program

Work Task | Staff | Scope
Management
Program Management & Directive | Carbotte, Ryan | overarching
Program Administration & Coordination Assistance | Taylor (hire expected November 2006) | overarching

System Engineering & Development
Database design, development, and review | Arko, O’Hara + team | overarching
Web site development | subcontract | overarching
Search interface development & deployment | Arko, O’Hara + team | overarching
Interoperability development (web services, controlled vocabularies) | Arko, Melkonian | overarching
Map interfaces (GeoMapApp, GoogleEarth, MapServer) | Melkonian, Ryan, Arko | overarching
Data loading procedures/scripts | O’Hara, Goodwillie | partly overarching
MB quality assessment metrics development | Chayes | project-specific

System Operation
Database administration | Arko | overarching
Data ingest | Goodwillie, O’Hara | partly overarching
System maintenance, security, backups | Arko, Chayes + LDEO network support | overarching
Web site maintenance | O’Hara, Goodwillie | overarching
Data stewardship (long-term archiving, metadata documentation) | Arko, O’Hara | overarching

Data Development
Data compilation/solicitation: legacy + modern | Carbotte (R2K), Goodwillie (MARGINS/Legacy), O’Hara (AMBS), Alsop (SRFDMS), Weissel, Barone | project-specific
Data and metadata entry & quality control | Goodwillie (R2K/MARGINS/Legacy), O’Hara (AMBS), Alsop (SRFDMS), Weissel | project-specific
User support | Goodwillie (R2K/MARGINS/Legacy), O’Hara (AMBS), Alsop (SRFDMS), Weissel (RT) + team | project-specific

Outreach & Community Relations, Education
Publications & presentations | Carbotte, Ryan, Goodwillie, Arko, Chayes | partly overarching
Workshops | Team | partly overarching
Exhibits | Team | overarching
Education | Ryan, Kastens | project-specific
Development of specific components of the shared infrastructure has been funded through the individual projects of the MGDS, as reflected in the combined schedule for the system shown in section 3.2.10. The project schedule shows work tasks grouped by the project funding the task, rather than by the broad grouping shown above.
3.2.4. System Development

System development for the MGDS has been guided by community input regarding science needs as well as by new technological developments. The core architecture of the MGDS is designed to follow the recommendations of community workshop reports (Smith et al., 2001; Shipley et al., 2000) as well as requirements defined in R2K and MARGINS data policy documents. The MGDS is based on open, well-documented standards and implemented almost entirely with free, open-source software. Outreach and reporting to the science user community have been an integral part of system development (see section 3.2.7), along with participation, via meetings and workshops, in geoinformatics activities in other areas (www.marine-geo.org/meetings). Our most closely aligned activity is the Marine Metadata Initiative (MMI), a community-based effort led by John Graybeal at MBARI to develop standard vocabularies and ontologies for metadata across the spectrum of marine science activities. Our involvement in this effort ensures that our metadata developments are coordinated with broader community efforts.
3.2.5. Operations and Sustaining Engineering

System operations include periodic review and refinement of the database design, the data solicitation procedures (digital cruise metadata forms), and the data ingestion, validation, and archiving procedures. We currently operate under version 89 of our database schema. Our digital cruise metadata forms have been through two cycles of revision based on user feedback and will continue to be assessed on an annual basis. Periodic reviews of the search interfaces and GeoMapApp are conducted based on user feedback provided through the advisory and science steering committees. We are scheduled for a major new release of our text-based search interface, Data Link, in spring 2006. Updates to GeoMapApp functionality have been provided on a roughly quarterly basis.

Project goals, tasks, timelines, critical paths, and milestones are planned with FastTrack Schedule 9® software, chosen because it runs in both our networked Windows and Macintosh environments.

Document management is handled using Plone®, a web-based content management system with a full audit trail that lets the team enter, view, and edit documents. The MGDS currently uses Plone to manage documents describing data ingestion procedures, dynamic task lists, and documents associated with team meetings (agendas, action items).

Request Tracker® is employed for handling all user feedback and inquiries and also provides an audit trail of all user communications. Each user request is logged via Request Tracker along with subsequent correspondence, comments, and solutions.

Routine security checks and updates are done in collaboration with the network support group at Lamont. Software updates to our core services are applied after testing on a non-production system.
3.2.6. Education

The MGDS contributes to the goal of educating a new science workforce with an understanding of data stewardship by involving student interns in project activities. Our approach is to develop science research projects for these interns that include data compilation and entry into the database. Since project inception, graduate students and undergraduate summer interns as well as local high school students have been involved in many aspects of data harvesting and programming (including the Java code for GeoMapApp). Two of these individuals have presented their work at AGU and GSA meetings (Ryan, Muhlenkamp et al., 2003; Schon et al., 2004). The experience of these individuals contributes to the education of a new generation that understands the importance of metadata, as well as of data contribution as part of the scientific research process.
The R2KDMS includes a dedicated Education Component, which has proceeded in parallel with development of the data system. This component has been led by Kim Kastens and involved collaboration with Tamara Ledley of the DLESE Data Services Group to host a workshop focused on obtaining input from education users for the design of data system tools. This workshop was held at Lamont on July 21-22, 2004, and is documented on the web at http://swiki.dlese.org/DataSvcsWkshp-04/140. Development of data-rich activities for K-16 education was supported by this component and has resulted in the following curriculum activities: a module on the Dynamics and Geomorphology of Mid-Ocean Ridges by Jeff Thomas of Teachers College (http://serc.carleton.edu/dev/eet/rodes_6/index.html), a Mid-Ocean Ridge Basalt example using PetDB and Excel by Matt Smith and Mike Perfit of the University of Florida, and a Calcium Carbonate exercise based on GeoMapApp assembled by Bill Ryan and Peter DeMenocal of LDEO/Columbia University. An additional example is an evaluation of earth science awareness developed by Sandra Swenson, a graduate student at Teachers College, working with 8th, 10th, and 12th grade Earth Science students in spring 2005.
3.2.7. Reporting

An external Advisory Committee covering all projects within the MGDS cluster has been established to provide advice and oversight from the scientific community (http://www.marine-geo.org/advisory.html). This committee meets annually to provide the MGDS team with advice on priorities, directions, and science community needs, as well as advice to NSF regarding the functioning of the DMS.

Reporting to the Ridge2000 and MARGINS science user communities occurs twice yearly, with updates provided at the steering committee meetings of both of these programs. This frequency of reporting has kept the data system closely connected to the science user community and has contributed to the success of data submission to the MGDS. It also ensures system development in response to needs and priorities defined by these science communities.

The MGDS has operated a booth in the Exhibition Hall at the past two fall AGU meetings and will continue this activity in future years. This activity contributes significantly to advertising the data system and engaging new users.

Annual reports are written for NSF for each project. We favor moving to a single annual report to NSF for all projects of the MGDS. This annual report could also serve as a summary document for the AC and a record of annual progress. NSF oversight of the MGDS could include facility site visits coincident with annual Advisory Committee meetings.
3.2.8. Metrics of Success

The success of the MGDS is monitored by tracking web site and GeoMapApp use, as well as growth of the data holdings. The number of site visits and the volumes of file downloads are tracked on a monthly basis using the Webalizer site use statistics package. For GeoMapApp usage, we track the number of new and returning users and the volumes downloaded. User feedback was directly solicited at our booth in the AGU Exhibition Hall in 2005 with a user logbook. Unsolicited user feedback received via email queries and comments is logged using Request Tracker. Other methods to solicit user feedback, including email surveys and web site questionnaires, will also be implemented in the future.
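Webalizer produces these monthly tallies automatically from the web server logs. Purely as an illustration of the underlying idea, the sketch below counts successful file downloads per month from Apache combined-format access logs; the log path and the /data/ URL prefix are hypothetical.

```python
import re
from collections import Counter

# Apache combined log format: host ident user [date:time zone] "request" status bytes ...
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[(\d{2}/\w{3}/\d{4}):[^\]]*\] "GET (\S+) \S+" (\d{3}) (\S+)'
)

downloads_per_month = Counter()
bytes_per_month = Counter()

with open("access_log") as log:                  # path is illustrative
    for line in log:
        m = LOG_LINE.match(line)
        if not m:
            continue
        date, path, status, size = m.groups()
        month = date[3:]                         # e.g. "Aug/2006"
        if status == "200" and path.startswith("/data/"):   # hypothetical data directory
            downloads_per_month[month] += 1
            if size.isdigit():
                bytes_per_month[month] += int(size)

for month, count in downloads_per_month.items():
    print(month, count, "files,", bytes_per_month[month], "bytes")
```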
Given the nature of the data system as a data catalog as well as a repository of sensor data, growth of the data system is monitored through a variety of metrics, including the number of data objects within the DMS, the number of field programs cataloged, and growth in data file volumes. Each data object cataloged within the MGDS is tied to the project that funded its ingestion, so that growth in data holdings can be tracked by MGDS project.
An additional metric of success will be the willingness of the user community to submit data to the MGDS. Since development of our digital metadata forms, we have observed growth in the number of users who submit metadata forms without direct contact and solicitation from us. As the data system grows more useful to the user community, we hope data submission will become more automatic and timely, with less direct solicitation required.

Citations in publications are the standard measure of scientific impact and will be tracked as an additional measure of success. However, many of the data services provided by the MGDS are unlikely to lead to citations, and publications are likely to reflect only a small percentage of DMS impact. For example, use of the system to find the whereabouts of data collected within a region of interest as part of a pre-cruise or proposal-planning effort would not lead to a citation. Downloaded bathymetry data used as a base map for a study are rarely cited. Given the typical publication cycle of a manuscript, citations also become a more useful measure of the impact of a data system only over time.
3.2.9. Risk and Contingency Management

This section outlines our existing strategy for dealing with some possible risks, including physical damage, malicious hacking, and personnel changes.

1. Risk of physical destruction of the facility by fire, flood, or other catastrophic event.

Mitigation: Risk of physical destruction is mitigated by maintenance of off-site data backups and an off-site redundant server, as described in the MGDS infrastructure description in Appendix 4.1. In the event of physical destruction of our primary servers (or loss of physical access to the building), we would switch service to the off-campus system, which we are in the process of installing on the Morningside Campus of Columbia. We would then acquire new hardware and re-establish our servers in a different (temporary) location on our campus. In the event of campus-wide destruction, we would rely on our Morningside location. In either case, we would restore from our off-campus backups that are stored by Iron Mountain.
2. Network vandalism.

Mitigation: In the event of malicious network vandalism, we assume that such an attack would hit both our primary servers at Lamont and the Morningside campus. Such an attack would most likely not damage the hardware, so recovery would be relatively quick. It would take us off-line for a period on the order of 12-24 hours while we investigate the breach, protect ourselves from a future instance, and restore from our backups.
3. Loss of key personnel. Due to limited funding, the data system operates with minimal redundancy in expertise.

Mitigation: The data system is operated with overlap in expertise in the essential areas of database programming (O’Hara and Arko), applications (Arko, Melkonian, and new personnel), and science directive (Carbotte, Ryan). Project leadership, reporting, and liaison with the science community have been shared within the team, ensuring that the loss of one individual can be absorbed. Adequate documentation of system components, including database structure, ingest procedures, and interfaces, ensures that DMS operations can be transitioned to new personnel as required.
4. The short-term and uncertain nature of the funding under which the MGDS has evolved means that the longevity and security of the projects are at risk. Projects are competed through peer review and are subject to the interests and priorities of the science community regarding the need for open data access and data preservation. In a tight funding climate, the need to ensure data preservation and data access may be perceived as in conflict with the need to fund new scientific research. Projects require highly expert personnel with training in both the science and IT domains. Such people are difficult to find and keep in an uncertain and short-term funding climate.

Mitigation: The MGDS aims to ensure that the data system provides high-quality service to build the community support needed to secure ongoing funding under renewal grant periods. In the event of a short-term gap in funding, bridging support can be requested from the Observatory. The MGDS Advisory Committee (currently configured to include PetDB and SedDB) voiced its support for future funding of the DMS under a Cooperative Agreement (see recommendations from October 2005, www.marine-geo.org/advisory). More secure long-term funding would contribute significantly to ensuring that high-quality personnel can be trained and retained.

In the event that all NSF funding for the MGDS ceases, operation of the data system would continue as long as the servers remain operational. Addition of new data, user support, and further development of functionality would cease.
5. Enforcement of data policies for data submission falls outside the purview of the MGDS, and data submission requires the cooperation of the science community. Data submission requirements are defined by NSF and community data policies and, in the case of the R2K and MARGINS communities, include required submission to the MGDS. However, enforcement of these submission requirements falls outside the purview of the DMS, and data contributions can only be solicited from the community on a voluntary basis. This leads to partial, and in some cases difficult to obtain, data submission, and a large time investment in solicitation.

Mitigation: Develop a multi-pronged approach to improving data submission in the long term that includes direct solicitation of scientists as well as working towards new automatic procedures for data documentation at sea and direct data transfer from the ship operators. Automatic transfer of data from the Palmer by Raytheon Polar Services personnel has been established through the directive of our OPP Program Manager. Direct transfer of seismic and multibeam data from the Langseth will be established in support of the SRDMS and LEGACY through our direct link to LDEO ship operations. The R2K and MARGINS DMS projects have dedicated science communities that support the data effort and can provide community pressure. The legacy data components of MGDS projects are focused on data sets within Lamont holdings as well as at other public centers (NGDC) where data access is known. Mitigation also requires more rigorous NSF enforcement of data submission requirements, as well as engagement of the NSF Program Managers for Ship Operations in devising multi-pronged data submission procedures that include both scientists and improved data documentation at sea.
6. The project scope is large, and project activities associated with data documentation and validation and all aspects of legacy data recovery are manpower-intensive.

Mitigation: Target activities of maximum benefit to the science community, as determined through the proposal review process and via an advisory structure that participates in establishing priorities.
3.2.10. Schedule/Milestones

Figure 3.3 provides a combined schedule of all project tasks for the MGDS for the duration of current funding (until September 2010). Note that the current R2K and MARGINS DMS grants expire in summer 2006, and most project activities for these grants are completed. A renewal proposal for a combined R2K and MARGINS DMS was submitted to the February 15, 2006 MG&G deadline and is currently pending.
4. Appendix

4.1. System Architecture and Infrastructure for the Marine Geoscience Data System (MGDS)

The diagram below illustrates the core architecture on which MGDS systems are developed, tested, and operated. It consists of backend Platforms (Web and Database), connecting Services, and end-user Applications. This infrastructure serves both production and development.
Hardware

The production servers are Intel/Linux (Fedora Core) machines, which yield excellent price-performance and benefit from the open-source developer community. The development servers are a mix of Linux, Solaris, Mac OS X, and Windows systems, which allow us to test applications for cross-platform compatibility. The complete list is shown below:
Purpose | Operating System | Manufacturer/Model
Production #1 | Linux | Dell Precision 8300
Production #2 | Linux | Dell Precision 8300
Production #3 | Linux | Dell Precision 8300
Development | Linux | Dell Precision 450N
Development | Solaris | Sun Blade 100
Development | Mac OS X | Apple iMac G5
Development | Windows | Dell OptiPlex GX280
A master copy of all MGDS data objects is stored in the LDEO Mass Store system, an enterprise-class “deep archive” consisting of a quad dual-core Sun Fire V890 server and an ADIC Scalar i2000 tape library. The system holds 2 TB of online disk cache and ~80 TB of near-line LTO tape storage, replicated to secure, climate-controlled sites in two different buildings on campus via redundant gigabit switches on the campus backbone.

A wide range of magnetic/optical drives is available to transfer data packages, directly attached to the MGDS cluster or an adjoining subnet, including DLT (SuperDLT 320 and DLT-4000/7000), IBM 3480/3490, 8mm tape (Exabyte 82xx, 85xx, and 87xx), 4mm tape (DDS-2, 3, and 4), removable disk (DynaMO 2300, Zip100 and 250, Jaz 1- and 2-GB), and optical disc (8x DVD+/-RW and 40x CD-RW).
Software

A wide range of open-source and commercial software supports MGDS development and operation. The complete list is shown below:

Package/Version | Purpose
Apache HTTP 2.2 | Web server
Apache Tomcat 5.5 | Java servlet engine
PHP 5.1 | Web scripting language
PostgreSQL 8.1 | Relational database
PostGIS 1.1 | Database geospatial extensions
MapServer 4.8 | Web GIS
Eclipse 3.1 | Integrated development environment
Subversion 1.4 | Version control
Plone 2.1 | Content management system
Request Tracker 3.6 | User feedback
NetWorker 7.3 | Backup/recovery

MGDS development also utilizes CU/LDEO site licenses for commercial scientific software packages including ArcGIS (ESRI), ENVI (RSI), and MATLAB (MathWorks).
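To illustrate how the database pieces fit together, the sketch below issues the kind of PostGIS bounding-box query that a map-driven search interface generates. The connection parameters and the table and column names are hypothetical, not the actual MGDS schema.

```python
import psycopg2  # PostgreSQL client library; any adapter would do

# Connection parameters and schema names are hypothetical.
conn = psycopg2.connect(dbname="mgds", user="reader")
cur = conn.cursor()

# Find cruise tracks intersecting a lon/lat bounding box. The && operator
# performs a fast bounding-box overlap test against the PostGIS spatial index.
cur.execute("""
    SELECT cruise_id, vessel, start_date
    FROM   cruise_track
    WHERE  geom && SetSRID('BOX3D(-130 40, -120 50)'::box3d, 4326)
""")
for cruise_id, vessel, start_date in cur.fetchall():
    print(cruise_id, vessel, start_date)
```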
Networking and Security

The MGDS server cluster is connected to the Lamont campus gigabit fiber backbone, which is in turn connected to the commodity Internet via redundant T3 links: a land link (to the Morningside campus in New York City and Internet2) and a microwave link (to an alternate ISP routed through Boston).

The entire Lamont campus network is protected by a Juniper NetScreen-208 (www.juniper.net) enterprise firewall. External access to MGDS production machines, such as user logins and incoming file transfers, is segregated or disabled.
Automatic backups are performed nightly on all DMS systems (both production and development). EMC NetWorker (www.legato.com) is used for general filesystem backups, and built-in (application-specific) export functions are used for the PostgreSQL relational database system and the Plone content management system. All hardware systems are covered by extended vendor warranties with on-site service. All software systems are reviewed and updated as needed on a quarterly basis.
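The application-specific export for PostgreSQL is typically a scheduled pg_dump run alongside the filesystem backup. A minimal sketch of such a nightly export follows, with the database name and backup path as placeholders.

```python
import subprocess
from datetime import date

# Database name and backup path are placeholders.
DB = "mgds"
outfile = f"/backups/{DB}-{date.today():%Y%m%d}.dump"

# pg_dump -Fc writes PostgreSQL's compressed custom archive format,
# which can be restored selectively with pg_restore.
subprocess.run(["pg_dump", "-Fc", "-f", outfile, DB], check=True)
```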
Contingency

An agreement has been reached with the Columbia University Information Technology (CUIT) department to install a remotely managed backup server at the CUIT Computer Center on the Morningside campus in New York City. A contract has been signed with Iron Mountain, Inc., to provide secure long-term digital media storage at their Data Vault facility in New Paltz, NY. Transfers to the Vault occur monthly by courier.