This is a summary of the Marine Life Data Network (MLDN) data flow for tabular data and metadata.

Marine Life Data Network Data Flow

%%{
  init: {
    'theme': 'base',
    'themeVariables': {
      'primaryColor': '#007396',
      'primaryTextColor': '#fff',
      'primaryBorderColor': '#003087',
      'lineColor': '#003087',
      'secondaryColor': '#007396',
      'tertiaryColor': '#CCD1D1'
    },
   'flowchart': { 'curve': 'basis' }
  }
}%%

flowchart LR

A["Marine Life data 
& 
metadata"] 

B[("Raw Data 
Access Point
(eg. RA ERDDAP)")]

C("Darwin Core
Alignment")

D[("NOAA's National Centers
for Environmental Information (NCEI)")]

E[("Ocean Biodiversity
Information System 
node")]

G[("Ocean Biodiversity
Information System (OBIS)")]

H[("Global Biodiversity Information Facility (GBIF)")]

I[("IOOS Data Catalog
(data.ioos.us)
(metadata only)")]

A -.-> B
B ----> I
B -.-> C
B -..-> D



C -.-> E
E --> G
E --> H
E -- OBIS-USA --> D

For data collected or managed by the US IOOS community, the project should ensure that data and information are readily available to resource managers, scientists, educators, and the public in an easily digestible form. Making these data available via ERDDAP services meets that goal and facilitates automated integration into the IOOS Data Catalog and, subsequently, Federal catalogs (e.g. Data.gov and NOAA OneStop). Using the services that ERDDAP provides, a data manager can develop a reproducible workflow for aligning data that record an animal observed at a place and time to the Darwin Core standard. The aligned data can then be shared with an OBIS node and subsequently passed to OBIS, GBIF, and (for OBIS-USA) archived at NOAA’s NCEI through automated processes (solid arrows). Finally, submitting the raw data to NCEI ensures that no observations are lost, provides long-term stewardship of the source data, and meets our Public Access to Research Results (PARR) requirements. The sections below provide more context, as well as tips and tricks, for each of the elements in the diagram above.

RA ERDDAP

For IOOS DMAC, ERDDAP is used as a mechanism for quickly and efficiently sharing observations with the broader community. While ERDDAP can provide data access following the FAIR principles, further alignment to Darwin Core and submission to OBIS are necessary to make these observations useful to a broader audience. Essentially, serving data through an ERDDAP is one part of a larger process and should be treated as such.

Key principles for data

When preparing a dataset to be served via ERDDAP, follow a few key principles for data management.

  • For organizing your data files, follow the Tidy data recommendations:
    • Variables as columns
    • Observations as rows
    • Don’t embed data in the column headers.
  • Follow ISO-8601 for dates
    • YYYY-MM-DDTHH:mm:ssZ (e.g. 2021-08-19T12:38:22Z)
    • Include the time zone.
  • Latitude and Longitude in decimal degrees (WGS84 preferred)
  • Identify units of measure
  • Check species names against WoRMS.
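The date and position principles above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed IOOS tool: the function names are hypothetical, longitudes are assumed to use the -180 to 180 convention, and timestamps are normalized to UTC so they can carry the trailing Z.

```python
from datetime import datetime, timedelta, timezone


def to_iso8601_utc(moment: datetime) -> str:
    """Format a timezone-aware datetime as YYYY-MM-DDTHH:mm:ssZ (UTC)."""
    if moment.tzinfo is None:
        # A naive timestamp has no time zone, so it cannot be converted safely.
        raise ValueError("timestamp has no time zone; refusing to guess")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def dms_to_decimal(degrees: int, minutes: float, seconds: float, hemisphere: str) -> float:
    """Convert degrees/minutes/seconds plus N, S, E or W to signed decimal degrees."""
    value = abs(degrees) + minutes / 60 + seconds / 3600
    return -value if hemisphere.upper() in ("S", "W") else value


def check_position(latitude: float, longitude: float) -> None:
    """Raise ValueError if a position falls outside the assumed valid ranges."""
    if not -90 <= latitude <= 90:
        raise ValueError(f"latitude out of range: {latitude}")
    if not -180 <= longitude <= 180:  # assumption: -180..180 convention
        raise ValueError(f"longitude out of range: {longitude}")


if __name__ == "__main__":
    # One observation per row, one variable per column (tidy layout).
    local = datetime(2021, 8, 19, 8, 38, 22, tzinfo=timezone(timedelta(hours=-4)))
    row = {
        "time": to_iso8601_utc(local),
        "latitude": dms_to_decimal(36, 30, 0, "N"),
        "longitude": dms_to_decimal(121, 45, 0, "W"),
    }
    check_position(row["latitude"], row["longitude"])
    print(row)
```

Rejecting timestamps without a time zone, rather than assuming one, keeps ambiguous values out of the published dataset.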

Additional Resources

ERDDAP Requirements

Below is a list of the absolute bare minimum pieces of metadata required by ERDDAP. Some dataset types might have other requirements specific to the data file formats.

ERDDAP Tips and Tricks

  • Date/time - It is not a requirement to have a variable assigned to time (i.e. <destinationName>time</destinationName>) in ERDDAP. If a variable’s destinationName is set to time, ERDDAP will use the units attribute to attempt to interpret the values as times. See the documentation on How ERDDAP Deals with Time for more information.
    • Trick - If you don’t want the variable interpreted as a time, set the <destinationName> to something other than time. For example, suppose the column time in your source file has a value of 2020-01-01, but you don’t want ERDDAP to interpret it. In that case, set the destinationName to time2 and ERDDAP will treat the field as a string.
    • Caution - If you do not have an assigned time variable in a dataset, some of the access formats might not be available (e.g. .esriCsv, .odvTxt). However, data will still be available in non-geospatial formats (e.g. .csv, .nc).
  • Latitude/Longitude - Similar to date/time above, it is not a requirement to have latitude/longitude variables. However, the dataset will have a limited number of access formats.
  • Trick - ERDDAP now has the capability to create derived variables from existing fields (since v2.10). See the documentation on Script SourceNames / Derived Variables.
  • Trick - ERDDAP can handle media files, such as image, audio and video files. See MediaFiles for more information.
    • Bonus Trick - EDDTableFromFileNames allows you to create a dataset from information about files in the file system. While it doesn’t serve data from within the files, it does provide a mechanism for sharing data in other formats (e.g. zip packages, Word docs, Excel spreadsheets, etc.). The resultant dataset in ERDDAP is composed of the following columns: url, name, lastModified, and size.
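To make the shape of such a file-listing dataset concrete, the sketch below builds rows with the same four columns (url, name, lastModified, size) from a local directory. It only illustrates the resulting table and is not how ERDDAP implements EDDTableFromFileNames; the function name and the base URL are hypothetical.

```python
import os
from datetime import datetime, timezone
from urllib.parse import quote


def list_files_as_rows(directory: str, base_url: str) -> list[dict]:
    """Describe each regular file in a directory as one row of metadata."""
    rows = []
    with os.scandir(directory) as entries:
        for entry in sorted(entries, key=lambda item: item.name):
            if not entry.is_file():
                continue  # skip subdirectories and other non-file entries
            info = entry.stat()
            modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
            rows.append(
                {
                    # base_url is an assumed public location for the files
                    "url": base_url.rstrip("/") + "/" + quote(entry.name),
                    "name": entry.name,
                    "lastModified": modified.strftime("%Y-%m-%dT%H:%M:%SZ"),
                    "size": info.st_size,  # bytes
                }
            )
    return rows


if __name__ == "__main__":
    for row in list_files_as_rows(".", "https://example.org/files"):
        print(row)
```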

Darwin Core alignment

When aligning a dataset to Darwin Core, it is recommended that a data manager start by serving the data via ERDDAP or a comparable online system that has an API (or another way to programmatically retrieve the data) and preferably follows some set of OGC standards. Working through the Darwin Core alignment with a scripting language (e.g. R or Python) that reads the data served via ERDDAP (or a comparable service) is highly recommended. A scripted translation provides provenance, transparency, and reproducibility, which helps reduce the number of errors and the back-and-forth between data managers and OBIS. If a scripting language is used, it is also highly recommended that the scripts be shared through a platform built on distributed version control, such as GitHub.
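A scripted alignment might look roughly like the sketch below. It builds a tabledap request URL and maps the columns of an ERDDAP .csv response (whose second line holds units rather than data) onto Darwin Core terms. The server, dataset ID, source column names, and the row-index occurrenceID are all hypothetical. A real workflow would need stable, globally unique identifiers and a mapping chosen for the dataset at hand.

```python
import csv
import io
from urllib.parse import quote

# Assumed source columns mapped to Darwin Core terms.
COLUMN_TO_DWC = {
    "time": "eventDate",
    "latitude": "decimalLatitude",
    "longitude": "decimalLongitude",
    "species": "scientificName",
}

# Shape of an ERDDAP .csv response: header line, units line, then data.
SAMPLE_RESPONSE = (
    "time,latitude,longitude,species,count\n"
    "UTC,degrees_north,degrees_east,,count\n"
    "2021-08-19T12:38:22Z,36.5,-121.75,Enhydra lutris,3\n"
    "2021-08-19T13:05:00Z,36.6,-121.80,Enhydra lutris,0\n"
)


def tabledap_csv_url(server: str, dataset_id: str, variables: list[str], constraints: list[str]) -> str:
    """Build a tabledap .csv request URL with percent-encoded constraints."""
    query = ",".join(variables)
    for constraint in constraints:
        query += "&" + quote(constraint, safe="=")
    return f"{server.rstrip('/')}/tabledap/{dataset_id}.csv?{query}"


def to_darwin_core(erddap_csv: str, dataset_id: str) -> list[dict]:
    """Translate rows of an ERDDAP .csv response into Darwin Core records."""
    records = []
    reader = csv.DictReader(io.StringIO(erddap_csv))
    for index, row in enumerate(reader):
        if index == 0:
            continue  # the first row after the header is the units line
        record = {dwc: row[source] for source, dwc in COLUMN_TO_DWC.items()}
        # Placeholder identifier; row order is not a stable ID in practice.
        record["occurrenceID"] = f"{dataset_id}:{index}"
        record["basisOfRecord"] = "HumanObservation"  # assumption about the survey
        record["occurrenceStatus"] = "present" if float(row["count"]) > 0 else "absent"
        records.append(record)
    return records


if __name__ == "__main__":
    print(tabledap_csv_url(
        "https://erddap.example.org/erddap",
        "hypothetical_survey",
        list(COLUMN_TO_DWC) + ["count"],
        ["time>=2020-01-01T00:00:00Z"],
    ))
    for record in to_darwin_core(SAMPLE_RESPONSE, "hypothetical_survey"):
        print(record)
```

Keeping the column mapping in one dictionary makes the translation easy to review and to version alongside the rest of the script.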

Throughout the Darwin Core alignment process, it is essential to collect as much metadata as possible about the data. We have built a how-to guide that describes the various metadata elements a data producer should strive to collect.

Recommendations TL;DR;

Additional Resources

Sending to OBIS-USA

Below are the various options for sending your data to OBIS-USA. OBIS-USA is part of an international data sharing network (Ocean Biodiversity Information System, OBIS) coordinated by the International Oceanographic Data and Information Exchange (IODE) of the Intergovernmental Oceanographic Commission of UNESCO (United Nations Educational, Scientific and Cultural Organization). OBIS-USA is the US node of OBIS and uses the Integrated Publishing Toolkit (IPT) as a platform to publish data, which can then be registered and automatically harvested by OBIS and GBIF. For more details on the publishing process, please review the Marine Biological Data Mobilization Workshop lesson on Metadata and Publishing.

Sending to NCEI

When planning on submitting data to NCEI, the data provider should coordinate submissions through the IOOS Office to identify which submission system should be used. This will ensure that the dataset is appropriately identified, tracked, and stewarded through the submission process.

Ideally, the raw data should be archived at NCEI. Typically, this will be the dataset served through the IOOS RA ERDDAP, following the key principles laid out above. Archiving the dataset in its raw form (rather than the Darwin Core-aligned form) ensures that no information is lost and that data providers can always go back to the source data if issues arise.

For more information about archiving data at NCEI, see https://www.ncei.noaa.gov/archive.

Briefly, the submission package sent to NCEI should indicate that the observations have some affiliation with IOOS. This will ensure there is appropriate tracking for the submission.

Below is a short summary of the two submission systems at NCEI and their intended uses:

  • ATRAC - Use the Advanced Tracking and Resource Tool for Archive Collections (ATRAC) to submit repeating or multiple delivery data, or data that exceeds 20 GB.
  • S2N - Use Send2NCEI to submit non-repeating or single delivery data less than 20 GB.
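
The rule of thumb in the two bullets above can be written as a small helper. This is only a sketch of the summary as stated here: the function name is hypothetical, and because the text does not say how a submission of exactly 20 GB is handled, sending it to S2N is an assumption. Actual submissions should still be coordinated through the IOOS Office.

```python
def choose_ncei_system(size_gb: float, repeating: bool) -> str:
    """Suggest an NCEI submission system from package size and delivery pattern."""
    if size_gb < 0:
        raise ValueError("size_gb must not be negative")
    if repeating or size_gb > 20:
        # Repeating or multiple deliveries, or anything over 20 GB.
        return "ATRAC"
    # Single, non-repeating delivery at or under 20 GB (the 20 GB edge is assumed).
    return "S2N"


if __name__ == "__main__":
    print(choose_ncei_system(size_gb=2.5, repeating=False))
    print(choose_ncei_system(size_gb=2.5, repeating=True))
    print(choose_ncei_system(size_gb=150, repeating=False))
```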