MBON Data Flow
Idealized data flow for IOOS biological observations.
For data collected or managed by an IOOS MBON project, the project should ensure data and information are readily available to resource managers, scientists, educators, and the public in an easily digestible way. Coordinating with an IOOS Regional Association (RA) to make the data available via ERDDAP services meets these goals. Using the services that ERDDAP provides, a data manager can develop a reproducible workflow for aligning the data to the Darwin Core standard. Finally, submission to NCEI ensures that no observations are lost, provides long-term stewardship of these data, and meets our Public Access to Research Results (PARR) requirements. The sections below provide more context as well as tips and tricks for each of the elements in the diagram above.
For the IOOS MBON projects, ERDDAP is used as a mechanism for quickly and efficiently sharing biological observations with the broader community. While ERDDAP can provide data access following the FAIR principles, further alignment to Darwin Core and submission to OBIS are necessary to make these observations useful to a broader audience. Essentially, serving data through an RA ERDDAP is one part of a larger process and should be treated as such.
Key principles for data
When preparing a dataset to be served via ERDDAP, it is recommended to follow a few key principles for data management.
- For organizing your data files, follow the Tidy data recommendations:
- Variables as columns
- Observations as rows
- Don’t embed data in the column headers.
- Follow ISO-8601 for dates
- Include the time zone.
- Latitude and Longitude in decimal degrees (WGS84 preferred)
- Identify units of measure
- Check species names against WoRMS.
- See the sections on Data and File Formatting and Metadata and Documentation for more recommendations and best practices.
- Configuring datasets.xml - The ERDDAP manual for configuring a dataset.
- ERDDAP Google Group - A great place to search for answers and ask your own questions.
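Some of the key principles listed above (ISO 8601 dates with a time zone, and latitude/longitude in decimal degrees) can be checked automatically before a file is handed to ERDDAP. The following is a minimal sketch in Python, not an official IOOS tool. The column names time, latitude, and longitude and the helper name check_record are assumptions made for illustration. Units and species-name checks against WoRMS are not covered.

```python
from datetime import datetime


def check_record(record):
    """Return a list of problems found in one tidy-data row (a dict of strings).

    The keys "time", "latitude", and "longitude" are assumed column names.
    """
    problems = []

    # ISO 8601 timestamp that carries a time zone, e.g. 2020-01-01T12:00:00Z
    try:
        stamp = datetime.fromisoformat(record["time"].replace("Z", "+00:00"))
        if stamp.tzinfo is None:
            problems.append("time has no time zone")
    except ValueError:
        problems.append("time is not ISO 8601")

    # Latitude and longitude as decimal degrees within their valid ranges
    for name, limit in (("latitude", 90.0), ("longitude", 180.0)):
        try:
            value = float(record[name])
        except ValueError:
            problems.append(f"{name} is not a decimal number")
            continue
        if abs(value) > limit:
            problems.append(f"{name} is outside +/-{limit:g} degrees")

    return problems
```

Running check_record on every row of a file and reporting any non-empty result gives a quick pre-flight check. An empty list means only that these particular checks passed.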
Below is a list of the absolute bare minimum pieces of metadata required by ERDDAP. Some dataset types might have other requirements specific to the data file formats.
- Global attributes
- Variable attributes
ERDDAP Tips and Tricks
- Date/time - It is not a requirement to have a variable assigned to <destinationName>time</destinationName> in ERDDAP. If a variable’s destinationName is set to time, ERDDAP will use the units attribute to attempt to interpret the datum. See the documentation on How ERDDAP Deals with Time for more information.
- Trick - If you don’t want the variable interpreted as a time, set the <destinationName> to something other than time. For example, in your source file the column time has a value of 2020-01-01, but you don’t want that interpreted by ERDDAP. Then, set the <destinationName> to time2 and ERDDAP will treat the field as a string.
- Caution - If you do not have an assigned time variable in a dataset, some of the access formats might not be available (e.g., .esriCsv, .odvTxt).
- Latitude/Longitude - Similar to date/time above, it is not a requirement to have latitude/longitude variables. However, without them the dataset will have a limited number of access formats.
- Trick - ERDDAP now has the capability to create derived variables from existing fields (since v2.10). See the documentation on Script SourceNames / Derived Variables.
- Trick - ERDDAP can handle media files, such as image, audio, and video files. See MediaFiles for more information.
- Bonus Trick - EDDTableFromFileNames allows you to create a dataset from information about files in the file system. While it doesn’t serve data from within the files, it does provide a mechanism for sharing data in other formats (e.g., zip packages, Word docs, Excel spreadsheets). The resultant dataset in ERDDAP is composed of the following columns:
Darwin Core alignment
When aligning a dataset to Darwin Core, it is recommended that a data manager start by serving the data via ERDDAP or a comparable online system that has an API (or another way to retrieve the data programmatically). Working through the Darwin Core alignment with a scripting language (e.g., R or Python) that reads the data served via ERDDAP (or a comparable service) is highly recommended. A script provides provenance, transparency, and reproducibility for the translation, which helps reduce the number of errors and the back-and-forth between data managers and OBIS. It is also highly recommended that the scripts be shared via a distributed version control platform such as GitHub.
- Follow the guidance at TDWG’s Darwin Core quick reference guide.
- Use a scripting language.
- Scripts should point to source data on a hosted web service.
- Scripts should be shared via GitHub.
- Standardizing Marine Biological Data Guide - A guide and examples of aligning datasets to Darwin Core.
- Aligning data to Darwin Core notebook in IOOS CodeLab - A Python notebook for aligning a dataset to Darwin Core.
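The scripted workflow recommended above (a script that points at data on a hosted ERDDAP service and translates it to Darwin Core) can be sketched in Python as follows. This is an illustration under stated assumptions, not the method used by OBIS or in the guides listed here. The server URL, the dataset ID, the source column names in COLUMN_TO_DWC, and the fixed basisOfRecord and occurrenceStatus values are all hypothetical, and only a handful of Darwin Core terms are filled in. The sketch relies on ERDDAP's tabledap .csv response placing column names on the first line and units on the second.

```python
import csv
import io
from urllib.parse import quote


def tabledap_csv_url(server, dataset_id, variables):
    """Build an ERDDAP tabledap request URL that returns CSV."""
    query = quote(",".join(variables), safe=",")
    return f"{server.rstrip('/')}/tabledap/{dataset_id}.csv?{query}"


# Hypothetical mapping from source column names to Darwin Core terms.
COLUMN_TO_DWC = {
    "time": "eventDate",
    "latitude": "decimalLatitude",
    "longitude": "decimalLongitude",
    "species": "scientificName",
}


def align_to_darwin_core(csv_text, dataset_id):
    """Rename columns to Darwin Core terms and add a few occurrence fields."""
    reader = csv.DictReader(io.StringIO(csv_text))
    next(reader, None)  # skip the units row that follows the header in .csv output
    rows = []
    for index, source in enumerate(reader, start=1):
        record = {dwc: source[col] for col, dwc in COLUMN_TO_DWC.items()}
        # Assumed conventions for this example only; real datasets need
        # stable identifiers and values chosen by the data manager.
        record["occurrenceID"] = f"{dataset_id}:{index}"
        record["basisOfRecord"] = "HumanObservation"
        record["occurrenceStatus"] = "present"
        rows.append(record)
    return rows
```

In practice the text returned from the URL (for example, read with urllib.request.urlopen) would be passed to align_to_darwin_core. Keeping the URL construction and the mapping in the script is what records where the data came from and how each field was translated.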
Sending to OBIS-USA
Below are the various options for sending your data to OBIS-USA.
- Attend the monthly Standardizing Marine Biological Data Working Group meeting and discuss transfer options.
- Contribute your dataset (and code) to the datasets/ directory in the ioos/bio_data_guide repository. See the Contribute example applications documentation for more information.
- Email Darwin Core aligned files to Abby Benson.
Sending to NCEI
When planning on submitting data to NCEI, the data provider should coordinate submissions through the IOOS Office to identify which submission system should be used. This will ensure that the dataset is appropriately identified, tracked, and stewarded through the submission process.
Ideally, the raw data should be archived at NCEI. Typically, this will be the dataset served through the IOOS RA ERDDAP and following the key principles laid out above. Archiving the dataset in its more raw form (vs the Darwin Core aligned form) ensures that no information is lost. This also ensures that data providers can always go back to the source data if issues arise.
For more information about archiving data at NCEI, see https://www.ncei.noaa.gov/archive.
Briefly, the submission package sent to NCEI should indicate that the observations are from an IOOS MBON project (or have some affiliation with IOOS). Below is a short summary of the two submission systems at NCEI and their intended uses.
- ATRAC - Use the Advanced Tracking and Resource Tool for Archive Collections (ATRAC) to submit repeating or multiple delivery data, or data that exceeds 20 GB.
- S2N - Use Send2NCEI to submit non-repeating or single delivery data less than 20 GB.
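The rule of thumb in the two bullets above can be written as a small Python function. This is only a sketch of the summary given here: the function name suggest_ncei_system is hypothetical, and the final choice should still be coordinated through the IOOS Office. The text does not say which system applies to a single delivery of exactly 20 GB, so the sketch assumes S2N in that case.

```python
def suggest_ncei_system(size_gb, repeating):
    """Suggest an NCEI submission system from the summary rules.

    size_gb   -- size of the delivery in gigabytes
    repeating -- True for repeating or multiple-delivery data
    """
    # ATRAC: repeating or multiple deliveries, or data that exceeds 20 GB
    if repeating or size_gb > 20:
        return "ATRAC"
    # S2N: non-repeating, single delivery (exactly 20 GB is an assumption)
    return "S2N"
```

For example, a one-time 5 GB survey file would map to S2N, while a monthly delivery of any size would map to ATRAC.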
Note: NCEI and OBIS-USA are developing a pathway to archive the datasets from the OBIS-USA IPT. This will archive the Darwin Core Archive version of the dataset. While this is an extremely valuable product, the raw data should be archived at NCEI as well.
Loading into MBON Portal
As depicted in the data flow diagram, the MBON data portal can retrieve data from a variety of sources. The two preferred sources are OBIS (or GBIF) and ERDDAP (hosted by a Regional Association); however, other web services could be acceptable for bringing data in. In some cases, the MBON data portal might bring in occurrence data through OBIS as well as additional observations that are served through ERDDAP. Below are the recommended steps to load data into the MBON Portal:
- The dataset should be registered in the MBON dataset registration form.
- This will ensure that we are aware of the dataset and have identified next actions to take.
- Identify that you would like the dataset visualized in the MBON portal and include a description of what that visualization might be.
- Share the dataset through OBIS/GBIF and/or through ERDDAP.
- Iterate with the MBON portal development team to ensure the visualizations are appropriate for the observations.
Note: There are additional pathways to share data with the MBON portal using the Research Workspace. For more information on that pathway, see Contribute Data in the MBON portal documentation.