Data Curation & Information Management

This long-term project illustrates how modern data archives cannot be stored statically in a single medium for the lifetime of the project. Original photographic film slides were later digitized. Video transects became series of image frames. Legacy data have been characterized to align with current data by defining a hierarchy of cover categories. The data model was designed to expand as legacy image data are re-analyzed at finer taxonomic resolution.

The National Science Foundation recognizes the need for foresight in data management. This project uses the principles and technology developed within the NSF Long Term Ecological Research (LTER) community, specifically collaborating with the Moorea Coral Reef LTER project, which shares data of similar types.

Types of data

Coral cover raw data are digital images, whether directly photographed, captured from video, or scanned from film slides. Derived data are coral photo quadrat cover analysis results in tabular form. Juvenile coral density raw data are tabular in situ survey counts.

Data standards

Data tables in the catalog are described in Ecological Metadata Language (EML) to integration level using the program oXygen (Oxygen XML Editor). Keywords are selected from the NBII Thesaurus where available. Units are selected from the LTER Unit Registry. Data packages, the data tables and their metadata combined, are registered with the Knowledge Network for Biocomplexity (KNB) with system-wide unique package identifiers. The most recent revision appears by default; previous revisions are archived and remain accessible by specifying the revision number.
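The revision behavior described above can be sketched in a few lines of Python. This is only an illustration, not the catalog's actual software: it assumes package identifiers take a three-part "scope.number.revision" form, which the text does not state, and the identifiers, the PackageId type, and the resolve function are all hypothetical.

```python
from typing import Iterable, NamedTuple, Optional


class PackageId(NamedTuple):
    scope: str
    number: int
    revision: int


def parse_package_id(text: str) -> PackageId:
    """Split 'scope.number.revision' (assumed form); the scope may contain dots."""
    scope, number, revision = text.rsplit(".", 2)
    return PackageId(scope, int(number), int(revision))


def resolve(ids: Iterable[str], scope: str, number: int,
            revision: Optional[int] = None) -> PackageId:
    """Return the requested revision, or the most recent one by default."""
    candidates = [
        p for p in map(parse_package_id, ids)
        if p.scope == scope and p.number == number
    ]
    if revision is not None:
        candidates = [p for p in candidates if p.revision == revision]
    if not candidates:
        raise LookupError(f"no matching package for {scope}.{number}")
    return max(candidates, key=lambda p: p.revision)


# Hypothetical identifiers, for illustration only.
registered = ["example-scope.12.1", "example-scope.12.2", "example-scope.7.5"]
latest = resolve(registered, "example-scope", 12)             # newest revision
first = resolve(registered, "example-scope", 12, revision=1)  # archived revision
```

Omitting the revision argument mirrors the default of showing the most recent revision, while passing it retrieves an archived one.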

Data policy

Data users are required to agree to the Data Use Agreement, which ensures proper attribution and defines allowed uses.

Data handling, Quality assurance

Image data are stored on the filesystem, backed up regularly disk-to-disk, and archived in offline media storage. Originally stored in Excel files, the tabular data were reformatted and uploaded into normalized tables in a relational database. Data tables in the data catalog are cached on the filesystem from stored queries in the database. The process of modeling the structure inherent in the data and implementing that data model in a relational database ensures a high level of consistency, as any data inconsistent with the expected structure will not load. Taxonomic codes and sample sites are linked to a controlled vocabulary in the database.
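As a minimal sketch of how a relational schema rejects inconsistent data at load time, the Python example below uses the standard sqlite3 module. The project's actual database system, table names, column names, codes, and values are not given in the text; everything here is hypothetical and chosen only to show the mechanism.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE taxon (code TEXT PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE site  (code TEXT PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE cover (
    site_code     TEXT    NOT NULL REFERENCES site(code),
    taxon_code    TEXT    NOT NULL REFERENCES taxon(code),
    year          INTEGER NOT NULL,
    percent_cover REAL    NOT NULL CHECK (percent_cover BETWEEN 0 AND 100),
    PRIMARY KEY (site_code, taxon_code, year)
);
""")

# Controlled vocabulary (hypothetical codes).
with conn:
    conn.executemany("INSERT INTO taxon VALUES (?, ?)",
                     [("POR", "Porites"), ("ACR", "Acropora")])
    conn.execute("INSERT INTO site VALUES (?, ?)", ("SITE1", "Example site"))


def load_rows(connection, rows):
    """Insert rows one at a time; return the rows the schema rejected."""
    rejected = []
    for row in rows:
        try:
            with connection:  # commits on success, rolls back on error
                connection.execute("INSERT INTO cover VALUES (?, ?, ?, ?)", row)
        except sqlite3.IntegrityError as exc:
            rejected.append((row, str(exc)))
    return rejected


rejected = load_rows(conn, [
    ("SITE1", "POR", 2010, 12.5),   # consistent with the vocabulary: loads
    ("SITE1", "XXX", 2010, 3.0),    # taxon code not in the vocabulary
    ("SITE1", "ACR", 2010, 140.0),  # percent cover outside 0-100
])
for row, reason in rejected:
    print(row, "->", reason)
```

Only the first row is stored. The other two are returned with the constraint that blocked them, which makes it possible to correct the source file instead of silently accepting bad values.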

Coral database

The Entity-Relationship Diagram below illustrates the data model. Sample site locations and cover type classification, including phylogenetic taxonomy, are stored in self-referential tables enforcing a tree structure in levels of aggregation. Input coral cover analysis data are loaded in a denormalized (“wide-format”) table, then queried into a normalized (“long-format”) table, a process commonly referred to as a “reverse pivot”.
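The two ideas in this paragraph, a self-referential table and a reverse pivot, can be sketched together in Python with the standard sqlite3 module. This is an illustration under assumptions and not the project's schema: the table names, column names, cover codes, and values are hypothetical, and a recursive query is just one possible way to aggregate up the tree.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Self-referential table: each cover type points at its parent category.
CREATE TABLE cover_type (
    code        TEXT PRIMARY KEY,
    parent_code TEXT REFERENCES cover_type(code)  -- NULL marks a top-level category
);
-- Wide format: one column per cover type.
CREATE TABLE cover_wide (quadrat TEXT PRIMARY KEY, POR REAL, ACR REAL, MACRO REAL);
-- Long format: one row per quadrat and cover type.
CREATE TABLE cover_long (
    quadrat       TEXT NOT NULL,
    cover_code    TEXT NOT NULL REFERENCES cover_type(code),
    percent_cover REAL NOT NULL,
    PRIMARY KEY (quadrat, cover_code)
);
""")
conn.executemany("INSERT INTO cover_type VALUES (?, ?)", [
    ("CORAL", None), ("ALGAE", None),                         # parents first
    ("POR", "CORAL"), ("ACR", "CORAL"), ("MACRO", "ALGAE"),
])
conn.executemany("INSERT INTO cover_wide VALUES (?, ?, ?, ?)", [
    ("Q1", 10.0, 5.0, 20.0),
    ("Q2", 0.0, 15.0, 30.0),
])

# Reverse pivot: turn each value column into rows of the long table.
# Column names come from this fixed list, so interpolating them is safe here.
for column in ["POR", "ACR", "MACRO"]:
    conn.execute(
        f"INSERT INTO cover_long SELECT quadrat, ?, {column} FROM cover_wide",
        (column,),
    )
conn.commit()

# Walk the tree with a recursive query and sum cover per top-level category.
rollup = conn.execute("""
WITH RECURSIVE ancestry(code, root) AS (
    SELECT code, code FROM cover_type WHERE parent_code IS NULL
    UNION ALL
    SELECT c.code, a.root
    FROM cover_type AS c JOIN ancestry AS a ON c.parent_code = a.code
)
SELECT l.quadrat, a.root, SUM(l.percent_cover)
FROM cover_long AS l JOIN ancestry AS a ON a.code = l.cover_code
GROUP BY l.quadrat, a.root
ORDER BY l.quadrat, a.root
""").fetchall()
```

Because the long table references cover_type, adding a finer taxonomic category later only requires a new row in cover_type with the right parent, not a new column in every data table.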