7  Converting ATN netCDF file to Darwin Core

An R Markdown document converted from “atn_satellite_telemetry_netCDF2DwC.ipynb”

Created: 2022-03-23 Updated: 2023-11-16

Credit: Stephen Formel, Mathew Biddle

This notebook walks through downloading an example netCDF file from an archive package at NCEI and translating it to a Darwin Core Archive compliant package for easy loading and publishing via the Integrated Publishing Toolkit (IPT). The example file follows the ATN specification for satellite trajectory observations; more information about that netCDF specification can be found in the repository https://github.com/ioos/ioos-atn-data.

This example uses the tidync package to work with netCDF data.

Data used in this notebook are available from NCEI at https://www.ncei.noaa.gov/archive/accession/0282699.

# Load libraries

library(tidync)
library(obistools)
library(ncdf4)
library(tidyverse) #includes stringr
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(maps)

Attaching package: 'maps'

The following object is masked from 'package:purrr':

    map
library(mapdata)

7.1 Downloading and preprocessing the source data

See https://www.ncei.noaa.gov/archive/accession/0282699

# paths ----
url_nc = 'https://www.nodc.noaa.gov/archive/arc0217/0282699/1.1/data/0-data/atn_45866_great-white-shark_trajectory_20090923-20091123.nc'
dir_data <- here::here("datasets/atn_satellite_telemetry/data")
file_nc <- file.path(dir_data, "src", basename(url_nc))
stopifnot(dir.exists(dir_data))

if (!file.exists(file_nc))
  download.file(url_nc, file_nc, mode = "wb")

7.1.1 Open the netCDF file

Once the file is opened, we print out the details of what the netCDF file contains.

atn <- nc_open(file_nc)
atn
File /Users/runner/work/bio_data_guide/bio_data_guide/datasets/atn_satellite_telemetry/data/src/atn_45866_great-white-shark_trajectory_20090923-20091123.nc (NC_FORMAT_NETCDF4):

     36 variables (excluding dimension variables):
        string deploy_id[]   (Contiguous storage)  
            long_name: id for this deployment. This is typically the tag ptt
            comment: Friendly name given to the tag by the user. If no specific friendly name is given, this is the PTT id.
            coordinates: time z lon lat
            instrument: instrument_location
            platform: animal
            coverage_content_type: referenceInformation
            _FillValue: -9999
        double time[obs]   (Contiguous storage)  
            units: seconds since 1990-01-01 00:00:00Z
            standard_name: time
            axis: T
            _CoordinateAxisType: Time
            calendar: standard
            long_name: Time of the measurement, in seconds since 1990-01-01
            actual_min: 2009-09-23T00:00:00Z
            actual_max: 2009-11-23T05:12:00Z
            ancillary_variables: qartod_time_flag qartod_rollup_flag qartod_speed_flag
            instrument: instrument_location
            platform: animal
            coverage_content_type: coordinate
            _FillValue: NaN
        int z[obs]   (Contiguous storage)  
            _FillValue: -9999
            axis: Z
            long_name: depth of measurement
            positive: down
            standard_name: depth
            units: m
            actual_min: 0
            actual_max: 0
            instrument: 
            platform: animal
            comment: This variable is synthetically generated to represent the depth of observations
            coverage_content_type: coordinate
        double lat[obs]   (Contiguous storage)  
            axis: Y
            _CoordinateAxisType: Lat
            long_name: Latitude portion of location in decimal degrees North
            standard_name: latitude
            units: degrees_north
            valid_max: 90
            valid_min: -90
            actual_min: 23.59
            actual_max: 34.045
            ancillary_variables: qartod_location_flag qartod_rollup_flag qartod_speed_flag error_radius semi_major_axis semi_minor_axis ellipse_orientation offset offset_orientation
            instrument: instrument_location
            platform: animal
            coverage_content_type: coordinate
            _FillValue: NaN
        double lon[obs]   (Contiguous storage)  
            axis: X
            _CoordinateAxisType: Lon
            long_name: Longitude portion of location in decimal degrees East
            standard_name: longitude
            units: degrees_east
            valid_max: 180
            valid_min: -180
            actual_min: -166.18
            actual_max: -118.504
            ancillary_variables: qartod_location_flag qartod_rollup_flag qartod_speed_flag error_radius semi_major_axis semi_minor_axis ellipse_orientation offset offset_orientation
            instrument: instrument_location
            platform: animal
            coverage_content_type: coordinate
            _FillValue: NaN
        int ptt[obs]   (Chunking: [29])  (Compression: shuffle,level 1)
            _FillValue: -9999
            coordinates: time z lon lat
            long_name: Platform Transmitter Terminal (PTT) id used for Argos transmissions
            comment: PTT id for this deployment. PTT ids may be used on multiple deployments, but not concurrently. When combined with deployment dates, PTTs can uniquely identify a deployment.
            coverage_content_type: referenceInformation
            instrument: instrument_location
            platform: animal
        string instrument[obs]   (Contiguous storage)  
            coordinates: time z lon lat
            comment: Wildlife Computers instrument family. Variable may report manufacturer default values (e.g., Mk10) and may not match correctly defined instrument_location or instrument_tag variables and attributes.
            long_name: Instrument family
            instrument: instrument_location
            platform: animal
            coverage_content_type: referenceInformation
        string type[obs]   (Contiguous storage)  
            coordinates: time z lon lat
            comment: Type of location: Argos, FastGPS or User
            long_name: Type of location information - Argos, GPS satellite or user provided location
            instrument: instrument_location
            platform: animal
            coverage_content_type: referenceInformation
        string location_class[obs]   (Contiguous storage)  
            coordinates: time z lon lat
            standard_name: quality_flag
            comment: Quality codes from the ARGOS satellite (in meters): G,3,2,1,0,A,B,Z. See http://www.argos-system.org/manual/3-location/34_location_classes.htm
            long_name: Location Quality Code from ARGOS satellite system
            code_values: G,3,2,1,0,A,B,Z
            code_meanings: estimated error less than 100m and 1+ messages received per satellite pass, estimated error less than 250m and 4+ messages received per satellite pass, estimated error between 250m and 500m and 4+ messages per satellite pass, estimated error between 500m and 1500m and 4+ messages per satellite pass, estimated error greater than 1500m and 4+ messages received per satellite pass, no least squares estimated error or unbounded kalman filter estimated error and 3 messages received per satellite pass, no least squares estimated error or unbounded kalman filter estimated error and 1 or 2 messages received per satellite pass, invalid location (available for Service Plus or Auxilliary Location Processing)
            instrument: instrument_location
            platform: animal
            ancillary_variables: lat lon
            coverage_content_type: qualityInformation
        int error_radius[obs]   (Chunking: [29])  (Compression: shuffle,level 1)
            _FillValue: -9999
            coordinates: time z lon lat
            long_name: Error radius
            units: m
            comment: If the position is best represented as a circle, this field gives the radius of that circle in meters.
            instrument: instrument_location
            platform: animal
            ancillary_variables: lat lon offset offset_orientation
            coverage_content_type: qualityInformation
        int semi_major_axis[obs]   (Chunking: [29])  (Compression: shuffle,level 1)
            _FillValue: -9999
            coordinates: time z lon lat
            long_name: Error - ellipse semi-major axis
            units: m
            comment: If the estimated position error is best expressed as an ellipse, this field gives the length in meters of the semi-major elliptical axis (one half of the major axis).
            instrument: instrument_location
            platform: animal
            ancillary_variables: lat lon ellipse_orientation offset offset_orientation
            coverage_content_type: qualityInformation
        int semi_minor_axis[obs]   (Chunking: [29])  (Compression: shuffle,level 1)
            _FillValue: -9999
            coordinates: time z lon lat
            long_name: Error - ellipse semi-minor axis
            units: m
            comment: If the estimated position error is best expressed as an ellipse, this field gives the length in meters of the semi-minor elliptical axis (one half of the minor axis).
            instrument: instrument_location
            platform: animal
            ancillary_variables: lat lon ellipse_orientation offset offset_orientation
            coverage_content_type: qualityInformation
        int ellipse_orientation[obs]   (Chunking: [29])  (Compression: shuffle,level 1)
            _FillValue: -9999
            coordinates: time z lon lat
            long_name: Error - ellipse orientation in degrees clockwise from true north
            units: degrees
            comment: The angle in degrees of the ellipse from true north, proceeding clockwise (0 to 360). A blank field represents 0 degrees.
            instrument: instrument_location
            platform: animal
            ancillary_variables: lat lon semi_major_axis semi_minor_axis offset offset_orientation
            coverage_content_type: qualityInformation
        int offset[obs]   (Chunking: [29])  (Compression: shuffle,level 1)
            _FillValue: -9999
            coordinates: time z lon lat
            long_name: Error - offset in meters to center of error ellipse or circle
            units: m
            comment: This field is non-zero if the circle or ellipse are not centered on the (Latitude, Longitude) values on this row. "Offset" gives the distance in meters from (Latitude, Longitude) to the center of the ellipse.
            instrument: instrument_location
            platform: animal
            ancillary_variables: lat lon error_radius semi_major_axis semi_minor_axis offset_orientation
            coverage_content_type: qualityInformation
        int offset_orientation[obs]   (Chunking: [29])  (Compression: shuffle,level 1)
            _FillValue: -9999
            coordinates: time z lon lat
            long_name: Error - offset orientation angle to ellipse center
            units: degrees
            comment: If the "Offset" field is non-zero, this field is the angle in degrees from (Latitude, Longitude) to the center of the ellipse. Zero degrees is true north; a blank field represents 0 degrees.
            instrument: instrument_location
            platform: animal
            ancillary_variables: lat lon error_radius semi_major_axis semi_minor_axis offset
            coverage_content_type: qualityInformation
        double gpe_msd[obs]   (Chunking: [29])  (Compression: shuffle,level 1)
            coordinates: time z lon lat
            comment: Historical. No longer applicable.
            long_name: 
            units: 
            instrument: instrument_location
            platform: animal
            coverage_content_type: auxillaryInformation
            _FillValue: NaN
        double gpe_u[obs]   (Chunking: [29])  (Compression: shuffle,level 1)
            coordinates: time z lon lat
            comment: Historical. No longer applicable.
            long_name: 
            units: 
            instrument: instrument_location
            platform: animal
            coverage_content_type: auxillaryInformation
            _FillValue: NaN
        int count[obs]   (Chunking: [29])  (Compression: shuffle,level 1)
            _FillValue: -9999
            coordinates: time z lon lat
            comment: Total number of times a particular data item was received, verified, and successfully decoded.
            long_name: Count
            units: count
            instrument: instrument_location
            platform: animal
            coverage_content_type: auxillaryInformation
        unsigned byte qartod_time_flag[obs]   (Chunking: [29])  (Compression: shuffle,level 1)
            _FillValue: 241
            coordinates: time z lon lat
            standard_name: gross_range_test_quality_flag
            long_name: Time QC test - gross range test
            implementation: https://github.com/ioos/ioos_qc/
            flag_meanings: PASS NOT_EVALUATED SUSPECT FAIL MISSING
            flag_values: 1
             flag_values: 2
             flag_values: 3
             flag_values: 4
             flag_values: 9
            references: https://cdn.ioos.noaa.gov/media/2020/03/QARTOD_TS_Manual_Update2_200324_final.pdf
            coverage_content_type: qualityInformation
        unsigned byte qartod_speed_flag[obs]   (Chunking: [29])  (Compression: shuffle,level 1)
            _FillValue: 241
            coordinates: time z lon lat
            standard_name: gross_range_test_quality_flag
            long_name: Speed QC test - gross range test
            references: https://cdn.ioos.noaa.gov/media/2020/03/QARTOD_TS_Manual_Update2_200324_final.pdf
            implementation: https://github.com/ioos/ioos_qc/
            flag_meanings: PASS NOT_EVALUATED SUSPECT FAIL MISSING
            flag_values: 1
             flag_values: 2
             flag_values: 3
             flag_values: 4
             flag_values: 9
            coverage_content_type: qualityInformation
        unsigned byte qartod_location_flag[obs]   (Chunking: [29])  (Compression: shuffle,level 1)
            _FillValue: 241
            coordinates: time z lon lat
            standard_name: location_test_quality_flag
            long_name: Location QC test - Location test
            implementation: https://github.com/ioos/ioos_qc/
            flag_meanings: PASS NOT_EVALUATED SUSPECT FAIL MISSING
            flag_values: 1
             flag_values: 2
             flag_values: 3
             flag_values: 4
             flag_values: 9
            references: https://cdn.ioos.noaa.gov/media/2020/03/QARTOD_TS_Manual_Update2_200324_final.pdf
            coverage_content_type: qualityInformation
        unsigned byte qartod_rollup_flag[obs]   (Chunking: [29])  (Compression: shuffle,level 1)
            _FillValue: 241
            coordinates: time z lon lat
            standard_name: aggregate_quality_flag
            long_name: Aggregate QC value
            implementation: https://github.com/ioos/ioos_qc/
            flag_meanings: PASS NOT_EVALUATED SUSPECT FAIL MISSING
            flag_values: 1
             flag_values: 2
             flag_values: 3
             flag_values: 4
             flag_values: 9
            references: https://cdn.ioos.noaa.gov/media/2020/03/QARTOD_TS_Manual_Update2_200324_final.pdf
            coverage_content_type: qualityInformation
        int crs[]   (Contiguous storage)  
            epsg_code: EPSG:4326
            grid_mapping_name: latitude_longitude
            inverse_flattening: 298.257223563
            long_name: Coordinate Reference System - http://www.opengis.net/def/crs/EPSG/0/4326
            semi_major_axis: 6378137
            coverage_content_type: referenceInformation
        string trajectory[]   (Contiguous storage)  
            cf_role: trajectory_id
            long_name: trajectory identifier
        int animal_age[]   (Contiguous storage)  
            _FillValue: -9999
            units: 
            long_name: age of the animal as measured or estimated at deployment
            coverage_content_type: referenceInformation
            animal_age: Not provided
        string animal_life_stage[]   (Contiguous storage)  
            animal_life_stage: juvenile
            long_name: Lifestage of the animal at time of deployment 
            coverage_content_type: referenceInformation
        string animal_sex[]   (Contiguous storage)  
            animal_sex: male
            long_name: sex of the animal at time of tag deployment
            coverage_content_type: referenceInformation
        float animal_weight[]   (Contiguous storage)  
            _FillValue: NaN
            units: kg
            long_name: mass of the animal as measured or estimated at deployment
            animal_weight: Not provided
            coverage_content_type: referenceInformation
        float animal_length[]   (Contiguous storage)  
            _FillValue: NaN
            animal_length_type: total length
            units: cm
            animal_length: 213.0 (cm) total length
            long_name: length of the animal as measured or estimated at deployment
            coverage_content_type: referenceInformation
        float animal_length_2[]   (Contiguous storage)  
            _FillValue: NaN
            animal_length_2_type: Not provided
            units: 
            animal_length_2: Not provided
            long_name: length of the animal as measured or estimated at deployment
            coverage_content_type: referenceInformation
        string animal[]   (Contiguous storage)  
            rank: Species
            infraorder: 
            scientificname: Carcharodon carcharias
            long_name: tagged animal id
            superdomain: Biota
            order: Lamniformes
            authority: (Linnaeus, 1758)
            kingdom: Animalia
            species: Carcharodon carcharias
            genus: Carcharodon
            megaclass: 
            family: Lamnidae
            taxonRankID: 220
            class: Elasmobranchii
            cf_role: trajectory_id
            coverage_content_type: referenceInformation
            subphylum: Vertebrata
            phylum: Chordata
            AphiaID: 105838
            valid_name: Carcharodon carcharias
            infraphylum: Gnathostomata
            subclass: Neoselachii
            suborder: 
        string instrument_tag[]   (Contiguous storage)  
            manufacturer: Wildlife Computers
            make_model: SPOT5
            serial_number: 07S0230
            long_name: telemetry tag applied to animal
            coverage_content_type: referenceInformation
            calibration_date: Not Provided
        string instrument_location[]   (Contiguous storage)  
            manufacturer: Wildlife Computers
            make_model: SPOT5
            serial_number: 07S0230
            long_name: Wildlife Computers SPOT5
            location_type: argos / modeled
            comment: Location
            coverage_content_type: referenceInformation
            calibration_date: Not Provided
        string taxon_name[]   (Contiguous storage)  
            standard_name: biological_taxon_name
            long_name: most precise taxonomic classification for the tagged animal
            coverage_content_type: referenceInformation
            source: Froese, R. and D. Pauly. Editors. (2023). FishBase. Carcharodon carcharias (Linnaeus, 1758). Accessed through: World Register of Marine Species at: https://www.marinespecies.org/aphia.php?p=taxdetails&id=105838 on 2023-08-16
            url: https://www.marinespecies.org/aphia.php?p=taxdetails&id=105838
        string taxon_lsid[]   (Contiguous storage)  
            standard_name: biological_taxon_lsid
            long_name: Namespaced Taxon Identifier for the tagged animal
            coverage_content_type: referenceInformation
            source: Froese, R. and D. Pauly. Editors. (2023). FishBase. Carcharodon carcharias (Linnaeus, 1758). Accessed through: World Register of Marine Species at: https://www.marinespecies.org/aphia.php?p=taxdetails&id=105838 on 2023-08-16
            url: https://www.marinespecies.org/aphia.php?p=taxdetails&id=105838
        string comment[obs]   (Contiguous storage)  
            long_name: Comment
            comment: Optional text field
            coordinates: time z lon lat
            instrument: instrument_location
            platform: animal
            coverage_content_type: auxillaryInformation

     1 dimensions:
        obs  Size:29 (no dimvar)

    89 global attributes:
        date_created: 2023-08-16T20:00:00Z
        featureType: trajectory
        cdm_data_type: Trajectory
        Conventions: CF-1.10, ACDD-1.3, IOOS-1.2
        argos_program_number: 2414
        creator_email: chris.lowe@csulb.edu
        id: 5f0668a86321be13bc7ef628
        tag_type: SPOT5
        source: Service Argos
        acknowledgement: NOAA IOOS, Axiom Data Science, Navy ONR, NOAA NMFS, Wildlife Computers, Argos, IOOS ATN
        creator_name: Chris G. Lowe
        creator_url: 
        geospatial_lat_units: degrees_north
        geospatial_lon_units: degrees_east
        infoUrl: https://portal.atn.ioos.us/#metadata/6e2ba85c-2f61-4bc5-8c2b-34d6734155ed/project
        institution: California State University Long Beach
        keywords: EARTH SCIENCE > AGRICULTURE > ANIMAL SCIENCE > ANIMAL ECOLOGY AND BEHAVIOR, EARTH SCIENCE > BIOSPHERE > ECOLOGICAL DYNAMICS > SPECIES/POPULATION INTERACTIONS > MIGRATORY RATES/ROUTES, EARTH SCIENCE > OCEANS, EARTH SCIENCE > CLIMATE INDICATORS > BIOSPHERIC INDICATORS > SPECIES MIGRATION, EARTH SCIENCE > OCEANS, EARTH SCIENCE > BIOLOGICAL CLASSIFICATION > ANIMALS/VERTEBRATES, EARTH SCIENCE > BIOSPHERE > ECOSYSTEMS > MARINE ECOSYSTEMS, PROVIDERS > GOVERNMENT AGENCIES-U.S. FEDERAL AGENCIES > DOC > NOAA > IOOS, PROVIDERS > COMMERCIAL > Axiom Data Science
        license: These data may be used and redistributed for free, but are not intended for legal use, since they may contain inaccuracies. No person or group associated with these data makes any warranty, expressed or implied, including warranties of merchantability and fitness for a particular purpose, or assumes any legal liability for the accuracy, completeness or usefulness of this information. This disclaimer applies to both individual use of these data and aggregate use with other data. It is strongly recommended that users read and fully comprehend associated metadata prior to use. Please acknowledge the U.S. Animal Telemetry Network (ATN) or the specified citation as the source from which these data were obtained in any publications and/or representations of these data. Communication and collaboration with dataset authors are strongly encouraged.
        metadata_link: 
        naming_authority: com.wildlifecomputers
        platform_category: animal
        platform: fish
        platform_vocabulary: https://vocab.nerc.ac.uk/collection/L06/current/
        processing_level: NetCDF file created from position data obtained from Wildlife Computers API.
        project: Project White Shark: Juvenile Satellite Biotelemetry, 2001-2020
        publisher_email: atndata@ioos.us
        publisher_institution: US Integrated Ocean Observing System Office
        publisher_name: US Integrated Ocean Observing System (IOOS) Animal Telemetry Network (ATN)
        publisher_url: https://atn.ioos.us
        publisher_country: USA
        standard_name_vocabulary: CF-v78
        vendor: Wildlife Computers
        geospatial_lat_min: 23.59
        geospatial_lat_max: 34.045
        geospatial_lon_min: -166.18
        geospatial_lon_max: -118.504
        geospatial_bbox: POLYGON ((-118.504 23.59, -118.504 34.045, -166.18 34.045, -166.18 23.59, -118.504 23.59))
        geospatial_bounds: POLYGON ((-166.18 23.59, -118.581 34.038, -118.53 34.045, -118.504 33.989, -118.534 33.972, -119.75 33.517, -166.18 23.59))
        geospatial_bounds_crs: EPSG:4326
        time_coverage_start: 2009-09-23T00:00:00Z
        time_coverage_end: 2009-11-23T05:12:00Z
        time_coverage_duration: P61DT5H12M0S
        time_coverage_resolution: P2DT2H39M43S
        date_issued: 2023-08-16T20:00:00Z
        date_modified: 2023-08-16T20:00:00Z
        history: 2023-08-07T20:24:04Z - Created by the IOOS ATN DAC from the Wildlife Computers API
        summary: Wildlife Computers SPOT5 tag (ptt id 45866) deployed on a great white shark (Carcharodon carcharias) by Chris G. Lowe in the North Pacific Ocean from 2009-09-23 to 2009-11-23
        title: Great white shark (Carcharodon carcharias) location data from a satellite telemetry tag (ptt id 45866) deployed in the North Pacific Ocean from 2009-09-23 to 2009-11-23, deployment id 5f0668a86321be13bc7ef628
        uuid: ff554ebf-bf4b-5a82-8a90-9c0ceb799d96
        platform_name: Carcharodon carcharias
        platform_id: 105838
        vendor_id: 5f0668a86321be13bc7ef628
        sea_name: North Pacific Ocean
        arbitrary_keywords: ATN, Animal Telemetry Network, IOOS, Integrated Ocean Observing System, trajectory, satellite telemetry tag
        contributor_role_vocabulary: https://vocab.nerc.ac.uk/collection/G04/current/
        creator_role_vocabulary: https://vocab.nerc.ac.uk/collection/G04/current/
        creator_sector_vocabulary: https://mmisw.org/ont/ioos/sector
        creator_type: person
        date_metadata_modified: 20230816
        instrument: Satellite telemetry tag
        instrument_vocabulary: 
        keywords_vocabulary: GCMD Science Keywords v15.1
        ncei_template_version: NCEI_NetCDF_Trajectory_Template_v2.0
        product_version: 
        program: IOOS Animal Telemetry Network
        publisher_type: institution
        references: 
        animal_common_name: great white shark
        animal_id: 09_13
        animal_scientific_name: Carcharodon carcharias
        deployment_id: 5f0668a86321be13bc7ef628
        deployment_start_datetime: 2009-09-23T00:00:00Z
        deployment_end_datetime: 2009-11-23T00:00:00Z
        wmo_platform_code: 
        comment: 09_13-45866
        ptt_id: 45866
        deployment_start_lat: 34.03
        deployment_start_lon: -118.56
        contributor_name: Thomas Farrugia
        contributor_email: tjfarrugia@alaska.edu
        contributor_role: collaborator
        contributor_institution: California State University Long Beach
        contributor_url: 
        creator_role: principalInvestigator
        creator_sector: academic
        creator_country: USA
        creator_institution: California State University Long Beach
        creator_institution_url: https://www.csulb.edu/shark-lab
        citation: Lowe, Chris G.; Farrugia, Thomas. (2023) great white shark (Carcharodon carcharias) location data from a satellite telemetry tag (ptt id 45866) deployed in the North Pacific Ocean from 2009-09-23 to 2009-11-23, deployment id 5f0668a86321be13bc7ef628. [Dataset]. US Integrated Ocean Observing System Office.

7.1.2 Collect all the metadata from the netCDF file

This gathers not only the global attributes, but the variable-level attributes as well. As you can see in the variable column, the term NC_GLOBAL refers to global attributes.

metadata <- ncmeta::nc_atts(file_nc)
metadata
# A tibble: 381 × 4
      id name                  variable  value       
   <int> <chr>                 <chr>     <named list>
 1     0 long_name             deploy_id <chr [1]>   
 2     1 comment               deploy_id <chr [1]>   
 3     2 coordinates           deploy_id <chr [1]>   
 4     3 instrument            deploy_id <chr [1]>   
 5     4 platform              deploy_id <chr [1]>   
 6     5 coverage_content_type deploy_id <chr [1]>   
 7     6 _FillValue            deploy_id <dbl [1]>   
 8     0 units                 time      <chr [1]>   
 9     1 standard_name         time      <chr [1]>   
10     2 axis                  time      <chr [1]>   
# ℹ 371 more rows
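
Note that the value column is a named list rather than a plain character vector, so pulling a single attribute out of this long table takes a filter plus an unlist, a pattern we reuse throughout the crosswalk below. For example, to grab the global title attribute:

# Extract one attribute value from the long metadata table.
# Global attributes live under variable == "NC_GLOBAL".
metadata %>%
    dplyr::filter(variable == "NC_GLOBAL", name == "title") %>%
    dplyr::pull(value) %>%
    unlist(use.names = FALSE)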

7.1.3 Store the data as a tibble

Collect the observational data from the netCDF file as a tibble. Then, print the first four rows.

atn <- tidync(file_nc)

atn_tbl <- atn %>% hyper_tibble(force=TRUE)

head(atn_tbl, n=4)
# A tibble: 4 × 23
       time     z   lat   lon   ptt instrument type  location_class error_radius
      <dbl> <int> <dbl> <dbl> <int> <chr>      <chr> <chr>                 <int>
1 622512000     0  34.0 -119. 45866 SPOT       User  nan                      NA
2 622708920     0  23.6 -166. 45866 SPOT       Argos A                        NA
3 622724940     0  34.0 -119. 45866 SPOT       Argos 1                        NA
4 622725060     0  34.0 -119. 45866 SPOT       Argos 0                        NA
# ℹ 14 more variables: semi_major_axis <int>, semi_minor_axis <int>,
#   ellipse_orientation <int>, offset <int>, offset_orientation <int>,
#   gpe_msd <dbl>, gpe_u <dbl>, count <int>, qartod_time_flag <int>,
#   qartod_speed_flag <int>, qartod_location_flag <int>,
#   qartod_rollup_flag <int>, comment <chr>, obs <chr>

7.1.4 Dealing with time

Notice the data in the time column aren’t formatted as times. We need to read the metadata associated with the time variable to understand what the units are. Below, we print a tibble of all the attributes from the time variable.

Notice the units attribute and its value of seconds since 1990-01-01 00:00:00Z. We need to use that information to convert the time variable to something useful that ggplot can handle.

time_attrs <- metadata %>% dplyr::filter(variable == "time")
time_attrs
# A tibble: 13 × 4
      id name                  variable value       
   <int> <chr>                 <chr>    <named list>
 1     0 units                 time     <chr [1]>   
 2     1 standard_name         time     <chr [1]>   
 3     2 axis                  time     <chr [1]>   
 4     3 _CoordinateAxisType   time     <chr [1]>   
 5     4 calendar              time     <chr [1]>   
 6     5 long_name             time     <chr [1]>   
 7     6 actual_min            time     <chr [1]>   
 8     7 actual_max            time     <chr [1]>   
 9     8 ancillary_variables   time     <chr [1]>   
10     9 instrument            time     <chr [1]>   
11    10 platform              time     <chr [1]>   
12    11 coverage_content_type time     <chr [1]>   
13    12 _FillValue            time     <dbl [1]>   

So, we grab the value from the units attribute, split the string to collect the origin date, and pass that date to the time conversion function as.POSIXct.

#library(stringr) - loaded with tidyverse
# grab origin date from time variable units attribute
tunit <- time_attrs %>% dplyr::filter(name == "units")
lunit <- str_split(tunit$value,' ')[[1]]
atn_tbl$time <- as.POSIXct(atn_tbl$time, origin=lunit[3], tz="GMT")

str(atn_tbl)
tibble [29 × 23] (S3: tbl_df/tbl/data.frame)
 $ time                : POSIXct[1:29], format: "2009-09-23 00:00:00" "2009-09-25 06:42:00" ...
 $ z                   : int [1:29] 0 0 0 0 0 0 0 0 0 0 ...
 $ lat                 : num [1:29] 34 23.6 34 34 34 ...
 $ lon                 : num [1:29] -119 -166 -119 -119 -119 ...
 $ ptt                 : int [1:29] 45866 45866 45866 45866 45866 45866 45866 45866 45866 45866 ...
 $ instrument          : chr [1:29] "SPOT" "SPOT" "SPOT" "SPOT" ...
 $ type                : chr [1:29] "User" "Argos" "Argos" "Argos" ...
 $ location_class      : chr [1:29] "nan" "A" "1" "0" ...
 $ error_radius        : int [1:29] NA NA NA NA NA NA NA NA NA NA ...
 $ semi_major_axis     : int [1:29] NA NA NA NA NA NA NA NA NA NA ...
 $ semi_minor_axis     : int [1:29] NA NA NA NA NA NA NA NA NA NA ...
 $ ellipse_orientation : int [1:29] NA NA NA NA NA NA NA NA NA NA ...
 $ offset              : int [1:29] NA NA NA NA NA NA NA NA NA NA ...
 $ offset_orientation  : int [1:29] NA NA NA NA NA NA NA NA NA NA ...
 $ gpe_msd             : num [1:29] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
 $ gpe_u               : num [1:29] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
 $ count               : int [1:29] NA NA NA NA NA NA NA NA NA NA ...
 $ qartod_time_flag    : int [1:29] 1 1 1 1 1 1 1 1 1 1 ...
 $ qartod_speed_flag   : int [1:29] 2 4 4 4 1 1 1 1 1 1 ...
 $ qartod_location_flag: int [1:29] 1 1 1 1 1 1 1 1 1 1 ...
 $ qartod_rollup_flag  : int [1:29] 1 4 4 4 1 1 1 1 1 1 ...
 $ comment             : chr [1:29] "" "" "" "" ...
 $ obs                 : chr [1:29] "1" "2" "3" "4" ...
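
As a quick sanity check, the converted range should match the actual_min and actual_max attributes recorded on the time variable in the file header above (2009-09-23T00:00:00Z through 2009-11-23T05:12:00Z).

# Sanity check: compare against the time variable's actual_min/actual_max.
range(atn_tbl$time)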

7.2 Converting to Darwin Core

Now let’s work through converting this netCDF file to Darwin Core, following the guidance published at https://github.com/tdwg/dwc-for-biologging/wiki/Data-guidelines and https://github.com/ocean-tracking-network/biologging_standardization/tree/master/examples/braun-blueshark/darwincore-example.

7.2.1 Occurrence Core

Below is the mapping table from Darwin Core terms to their sources in the netCDF file.

DarwinCore Term Status netCDF source
occurrenceStatus Required hardcoded to present.
basisOfRecord Required data contained in the type variable where type of User = HumanObservation and Argos = MachineObservation.
occurrenceID Required eventDate, plus data contained in z variable, plus animal_common_name global attribute.
organismID Required platform_id global attribute plus the animal_common_name global attribute.
eventDate Required data contained in time variable. Converted to ISO8601.
decimalLatitude & decimalLongitude Required data in lat and lon variable, respectively.
geodeticDatum Required attribute epsg_code in the crs variable.
scientificName Required data from the variable taxon_name.
scientificNameID data from the variable taxon_lsid.
eventID Strongly recommended animal_common_name global attribute plus the eventDate.
samplingProtocol Strongly recommended
kingdom Strongly recommended kingdom attribute in the animal variable.
taxonRank Strongly recommended rank attribute in the animal variable.
coordinateUncertaintyInMeters Share if available maximum value of the data from the variables error_radius, semi_major_axis, and offset.
lifeStage Share if available data from the variable animal_life_stage.
sex Share if available data from the variable animal_sex.

Now start working through the crosswalk. A few thoughts about some of the functions we use:

  1. case_when is a function from dplyr that is essentially a ‘vectorized’ ifelse function. The take-home is that it plays nice with other tidyverse functions, like mutate, and IMO is a bit more readable than a complex ifelse statement (see the micro-example after this list).
  2. rename is another nice dplyr function for renaming columns. It works well following mutate because you can see the mutation applied to a column and then the column renamed, rather than a complex creation of a new column and dropping of the old column.
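
To make the first point concrete, here is a tiny illustration with hypothetical inputs; values matching no condition fall through to NA unless a TRUE ~ ... default is supplied.

# Micro-example of case_when with made-up inputs.
type_demo <- c('User', 'Argos', 'FastGPS')
dplyr::case_when(type_demo == 'User' ~ 'HumanObservation',
                 type_demo == 'Argos' ~ 'MachineObservation')
# returns "HumanObservation" "MachineObservation" NA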
# Defined to grab attributes in subsequent code
nc <- nc_open(file_nc)

occurrencedf <- atn_tbl %>%  
    select( # Select desired columns
        
        time, 
        lat,
        lon,
        type,
        location_class,
        qartod_time_flag,
        qartod_speed_flag,
        qartod_location_flag,
        qartod_rollup_flag
        
          ) %>%
    mutate( # add and mutate columns.
        
        type = case_when(type == 'User' ~ 'HumanObservation',
                         type == 'Argos' ~ 'MachineObservation'),
        
        time = format(time, '%Y-%m-%dT%H:%M:%SZ'),
        
        kingdom = metadata %>% dplyr::filter(variable == "animal" & name == "kingdom") %>% pull(value) %>% unlist(use.names = FALSE),
        
        taxonRank = metadata %>% dplyr::filter(variable == "animal" & name == "rank") %>% pull(value) %>% unlist(use.names = FALSE),
        
        occurrenceStatus = "present",
        
        sex = ncvar_get( nc, 'animal_sex'),
        
        lifeStage = ncvar_get( nc, 'animal_life_stage'),
        
        scientificName = ncvar_get( nc, 'taxon_name'),
        
        scientificNameID = ncvar_get( nc, "taxon_lsid")
        
        ) %>%

    rename(  # rename columns to Darwin Core terms
        
        basisOfRecord = type,
        eventDate = time,
        decimalLatitude = lat,
        decimalLongitude = lon) %>% 

    arrange(eventDate) #arrange by increasing date

# minimumDepthInMeters = z,
occurrencedf$minimumDepthInMeters = atn_tbl$z

# maximumDepthInMeters = z,
occurrencedf$maximumDepthInMeters = atn_tbl$z

# organismID - {platformID}_{common_name}
common_name_tbl <- metadata %>% dplyr::filter(variable == "NC_GLOBAL") %>% dplyr::filter(name == "animal_common_name")
common_name <- chartr(" ", "_", common_name_tbl$value)
platform_id_tbl <- metadata %>% dplyr::filter(variable == "NC_GLOBAL") %>% dplyr::filter(name == "platform_id")
platform_id <- chartr(" ", "_", platform_id_tbl$value)
occurrencedf$organismID <- paste(platform_id , common_name, sep = "_") 

# occurrenceID - {eventDate}_{depth}_{common_name}
occurrencedf$occurrenceID <- sub(" ", "_", paste(occurrencedf$eventDate, atn_tbl$z, common_name, sep = "_"))

# geodeticDatum
gd_tbl <- metadata %>% dplyr::filter(variable == "crs") %>% dplyr::filter(name == "epsg_code")
occurrencedf$geodeticDatum <- paste(gd_tbl$value)

# eventID
#eventID - {common_name}_{dateTime}
cname = metadata %>% dplyr::filter(variable == "NC_GLOBAL") %>% dplyr::filter(name == "animal_common_name")
occurrencedf$eventID <- gsub(" ", "_", paste0(cname$value, "_", occurrencedf$eventDate)) # gsub, not sub, so every space becomes an underscore
str(occurrencedf)
tibble [29 × 22] (S3: tbl_df/tbl/data.frame)
 $ eventDate           : chr [1:29] "2009-09-23T00:00:00Z" "2009-09-25T06:42:00Z" "2009-09-25T11:09:00Z" "2009-09-25T11:11:00Z" ...
 $ decimalLatitude     : num [1:29] 34 23.6 34 34 34 ...
 $ decimalLongitude    : num [1:29] -119 -166 -119 -119 -119 ...
 $ basisOfRecord       : chr [1:29] "HumanObservation" "MachineObservation" "MachineObservation" "MachineObservation" ...
 $ location_class      : chr [1:29] "nan" "A" "1" "0" ...
 $ qartod_time_flag    : int [1:29] 1 1 1 1 1 1 1 1 1 1 ...
 $ qartod_speed_flag   : int [1:29] 2 4 4 4 1 1 1 1 1 1 ...
 $ qartod_location_flag: int [1:29] 1 1 1 1 1 1 1 1 1 1 ...
 $ qartod_rollup_flag  : int [1:29] 1 4 4 4 1 1 1 1 1 1 ...
 $ kingdom             : chr [1:29] "Animalia" "Animalia" "Animalia" "Animalia" ...
 $ taxonRank           : chr [1:29] "Species" "Species" "Species" "Species" ...
 $ occurrenceStatus    : chr [1:29] "present" "present" "present" "present" ...
 $ sex                 : chr [1:29] "male" "male" "male" "male" ...
 $ lifeStage           : chr [1:29] "juvenile" "juvenile" "juvenile" "juvenile" ...
 $ scientificName      : chr [1:29] "Carcharodon carcharias" "Carcharodon carcharias" "Carcharodon carcharias" "Carcharodon carcharias" ...
 $ scientificNameID    : chr [1:29] "urn:lsid:marinespecies.org:taxname:105838" "urn:lsid:marinespecies.org:taxname:105838" "urn:lsid:marinespecies.org:taxname:105838" "urn:lsid:marinespecies.org:taxname:105838" ...
 $ minimumDepthInMeters: int [1:29] 0 0 0 0 0 0 0 0 0 0 ...
 $ maximumDepthInMeters: int [1:29] 0 0 0 0 0 0 0 0 0 0 ...
 $ organismID          : chr [1:29] "105838_great_white_shark" "105838_great_white_shark" "105838_great_white_shark" "105838_great_white_shark" ...
 $ occurrenceID        : chr [1:29] "2009-09-23T00:00:00Z_0_great_white_shark" "2009-09-25T06:42:00Z_0_great_white_shark" "2009-09-25T11:09:00Z_0_great_white_shark" "2009-09-25T11:11:00Z_0_great_white_shark" ...
 $ geodeticDatum       : chr [1:29] "EPSG:4326" "EPSG:4326" "EPSG:4326" "EPSG:4326" ...
 $ eventID             : chr [1:29] "great_white_shark_2009-09-23T00:00:00Z" "great_white_shark_2009-09-25T06:42:00Z" "great_white_shark_2009-09-25T11:09:00Z" "great_white_shark_2009-09-25T11:11:00Z" ...

7.2.1.1 Add coordinateUncertaintyInMeters AND filter by location_class

When we add coordinateUncertaintyInMeters we also filter out the rows where location_class is A, B, or Z.

In these data we also have additional information about the Location Quality Code from the ARGOS satellite system. Below are the codes and their meanings.

code_values code meanings
G estimated error less than 100m and 1+ messages received per satellite pass
3 estimated error less than 250m and 4+ messages received per satellite pass
2 estimated error between 250m and 500m and 4+ messages per satellite pass
1 estimated error between 500m and 1500m and 4+ messages per satellite pass
0 estimated error greater than 1500m and 4+ messages received per satellite pass
A no least squares estimated error or unbounded kalman filter estimated error and 3 messages received per satellite pass
B no least squares estimated error or unbounded kalman filter estimated error and 1 or 2 messages received per satellite pass
Z invalid location (available for Service Plus or Auxilliary Location Processing)

Since codes A, B, and Z are essentially bad values, I propose that we filter those out.

Also, create a mapping table for coordinateUncertaintyInMeters that corresponds to the ARGOS code maximum error as shown in the table below:

code coordinateUncertaintyInMeters
G 100
3 250
2 500
1 1500
0 10000 (ref)

Below we create a lookup between the location_class values we agree are good and the coordinateUncertaintyInMeters appropriate for each location class. Observations whose location_class doesn't match one of those values are not carried over (i.e., they are filtered out).
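
If you prefer the lookup table to be explicit, the same step can be written as a tribble plus an inner_join. This sketch is equivalent to the filter-plus-case_when chunk below; cu_lookup and occurrencedf_alt are illustrative names not used elsewhere.

# Equivalent "lookup table + merge" formulation: inner_join keeps only rows
# whose location_class appears in the lookup table, attaching
# coordinateUncertaintyInMeters in the process.
cu_lookup <- tibble::tribble(
    ~location_class, ~coordinateUncertaintyInMeters,
    'nan',     0,
    'G',     100,
    '3',     250,
    '2',     500,
    '1',    1500,
    '0',   10000)
occurrencedf_alt <- inner_join(occurrencedf, cu_lookup, by = "location_class")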

occurrencedf <- occurrencedf %>%
    filter(location_class %in% c('nan','G','3','2','1','0')) %>%
    mutate(  # case_when returns NA for any values other than those defined below
        coordinateUncertaintyInMeters = case_when(location_class == 'nan' ~ 0,
                                                     location_class == 'G' ~ 100,
                                                     location_class == '3' ~ 250,
                                                     location_class == '2' ~ 500,
                                                     location_class == '1' ~ 1500,
                                                     location_class == '0' ~ 10000) # https://github.com/ioos/bio_data_guide/issues/145#issuecomment-1805739244
          ) %>% 
    arrange(eventDate) # arrange by increasing date


occurrencedf
# A tibble: 19 × 23
   eventDate       decimalLatitude decimalLongitude basisOfRecord location_class
   <chr>                     <dbl>            <dbl> <chr>         <chr>         
 1 2009-09-23T00:…            34.0            -119. HumanObserva… nan           
 2 2009-09-25T11:…            34.0            -119. MachineObser… 1             
 3 2009-09-25T11:…            34.0            -119. MachineObser… 0             
 4 2009-09-27T17:…            34.0            -119. MachineObser… 1             
 5 2009-10-08T20:…            34.0            -119. MachineObser… 2             
 6 2009-10-15T11:…            34.0            -119. MachineObser… 0             
 7 2009-10-17T06:…            34.0            -119. MachineObser… 0             
 8 2009-10-17T09:…            34.0            -119. MachineObser… 2             
 9 2009-10-17T10:…            34.0            -119. MachineObser… 3             
10 2009-10-18T08:…            34.0            -119. MachineObser… 1             
11 2009-10-18T10:…            34.0            -119. MachineObser… 2             
12 2009-10-18T11:…            34.0            -119. MachineObser… 0             
13 2009-10-23T23:…            34.0            -119. MachineObser… 2             
14 2009-10-24T00:…            34.0            -119. MachineObser… 0             
15 2009-10-26T10:…            34.0            -119. MachineObser… 3             
16 2009-10-27T16:…            34.0            -119. MachineObser… 1             
17 2009-10-27T16:…            34.0            -119. MachineObser… 2             
18 2009-10-29T11:…            34.0            -119. MachineObser… 2             
19 2009-10-31T21:…            34.0            -119. MachineObser… 0             
# ℹ 18 more variables: qartod_time_flag <int>, qartod_speed_flag <int>,
#   qartod_location_flag <int>, qartod_rollup_flag <int>, kingdom <chr>,
#   taxonRank <chr>, occurrenceStatus <chr>, sex <chr>, lifeStage <chr>,
#   scientificName <chr>, scientificNameID <chr>, minimumDepthInMeters <int>,
#   maximumDepthInMeters <int>, organismID <chr>, occurrenceID <chr>,
#   geodeticDatum <chr>, eventID <chr>, coordinateUncertaintyInMeters <dbl>

Notice how we went from 29 rows down to 19 rows by keeping only the selected location_class values.
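
If you want to see exactly which codes were dropped, a quick tally of the raw table makes it explicit; the classes outside our lookup (A, B, and Z) account for the 10 removed rows.

# Optional check: tally location_class values in the raw table.
atn_tbl %>% count(location_class, sort = TRUE)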

7.2.2 Create a dataGeneralizations column to describe how many duplicates were found for each decimated series

Add a dataGeneralizations column containing a string like ‘first of # records’ to indicate there are more records in the raw dataset to be discovered by the super-curious.

The dataGeneralizations string is compiled by counting the number of consecutive duplicates and inserting that into a standard string. That string is “first of [n] records” which will make more sense once we’ve filtered down to keep the first occurrence of the hour.

In the next step below, we keep only the first observation of each hour.

# sort by date
occurrencedf <- occurrencedf %>% arrange(eventDate)

occurrencedf <- occurrencedf %>%
    mutate(eventDateHrs = format(as.POSIXct(eventDate, format="%Y-%m-%dT%H:%M:%SZ"),"%Y-%m-%dT%H")
           ) %>%
    add_count(eventDateHrs) %>%
    mutate(dataGeneralizations = case_when(n == 1 ~ "",
                                           TRUE ~ paste("first of ", n ,"records")
                                           )
           ) %>%
    select(-n)

occurrencedf
# A tibble: 19 × 25
   eventDate       decimalLatitude decimalLongitude basisOfRecord location_class
   <chr>                     <dbl>            <dbl> <chr>         <chr>         
 1 2009-09-23T00:…            34.0            -119. HumanObserva… nan           
 2 2009-09-25T11:…            34.0            -119. MachineObser… 1             
 3 2009-09-25T11:…            34.0            -119. MachineObser… 0             
 4 2009-09-27T17:…            34.0            -119. MachineObser… 1             
 5 2009-10-08T20:…            34.0            -119. MachineObser… 2             
 6 2009-10-15T11:…            34.0            -119. MachineObser… 0             
 7 2009-10-17T06:…            34.0            -119. MachineObser… 0             
 8 2009-10-17T09:…            34.0            -119. MachineObser… 2             
 9 2009-10-17T10:…            34.0            -119. MachineObser… 3             
10 2009-10-18T08:…            34.0            -119. MachineObser… 1             
11 2009-10-18T10:…            34.0            -119. MachineObser… 2             
12 2009-10-18T11:…            34.0            -119. MachineObser… 0             
13 2009-10-23T23:…            34.0            -119. MachineObser… 2             
14 2009-10-24T00:…            34.0            -119. MachineObser… 0             
15 2009-10-26T10:…            34.0            -119. MachineObser… 3             
16 2009-10-27T16:…            34.0            -119. MachineObser… 1             
17 2009-10-27T16:…            34.0            -119. MachineObser… 2             
18 2009-10-29T11:…            34.0            -119. MachineObser… 2             
19 2009-10-31T21:…            34.0            -119. MachineObser… 0             
# ℹ 20 more variables: qartod_time_flag <int>, qartod_speed_flag <int>,
#   qartod_location_flag <int>, qartod_rollup_flag <int>, kingdom <chr>,
#   taxonRank <chr>, occurrenceStatus <chr>, sex <chr>, lifeStage <chr>,
#   scientificName <chr>, scientificNameID <chr>, minimumDepthInMeters <int>,
#   maximumDepthInMeters <int>, organismID <chr>, occurrenceID <chr>,
#   geodeticDatum <chr>, eventID <chr>, coordinateUncertaintyInMeters <dbl>,
#   eventDateHrs <chr>, dataGeneralizations <chr>
7.2.2.0.1 Decimate occurrences down to the first detection/location per hour

Here we’ve done the decimation in Python: https://gist.github.com/MathewBiddle/d434ac2b538b2728aa80c6a7945f94be

Essentially we build a new column that is the date plus the two-digit hour. Then we find where that column has duplicates and keep the first entry.

In R, we do something slightly different: we keep only the distinct (i.e., unique) rows and, where there are duplicates, keep the first row of each duplicate set.

# sort by date
occurrencedf_dec <- occurrencedf %>% arrange(eventDate)

# filter table to only unique date + hour and pick the first row.
occurrencedf_dec <- distinct(occurrencedf_dec,eventDateHrs,.keep_all = TRUE) %>%
    select(-eventDateHrs)

occurrencedf_dec
# A tibble: 17 × 24
   eventDate       decimalLatitude decimalLongitude basisOfRecord location_class
   <chr>                     <dbl>            <dbl> <chr>         <chr>         
 1 2009-09-23T00:…            34.0            -119. HumanObserva… nan           
 2 2009-09-25T11:…            34.0            -119. MachineObser… 1             
 3 2009-09-27T17:…            34.0            -119. MachineObser… 1             
 4 2009-10-08T20:…            34.0            -119. MachineObser… 2             
 5 2009-10-15T11:…            34.0            -119. MachineObser… 0             
 6 2009-10-17T06:…            34.0            -119. MachineObser… 0             
 7 2009-10-17T09:…            34.0            -119. MachineObser… 2             
 8 2009-10-17T10:…            34.0            -119. MachineObser… 3             
 9 2009-10-18T08:…            34.0            -119. MachineObser… 1             
10 2009-10-18T10:…            34.0            -119. MachineObser… 2             
11 2009-10-18T11:…            34.0            -119. MachineObser… 0             
12 2009-10-23T23:…            34.0            -119. MachineObser… 2             
13 2009-10-24T00:…            34.0            -119. MachineObser… 0             
14 2009-10-26T10:…            34.0            -119. MachineObser… 3             
15 2009-10-27T16:…            34.0            -119. MachineObser… 1             
16 2009-10-29T11:…            34.0            -119. MachineObser… 2             
17 2009-10-31T21:…            34.0            -119. MachineObser… 0             
# ℹ 19 more variables: qartod_time_flag <int>, qartod_speed_flag <int>,
#   qartod_location_flag <int>, qartod_rollup_flag <int>, kingdom <chr>,
#   taxonRank <chr>, occurrenceStatus <chr>, sex <chr>, lifeStage <chr>,
#   scientificName <chr>, scientificNameID <chr>, minimumDepthInMeters <int>,
#   maximumDepthInMeters <int>, organismID <chr>, occurrenceID <chr>,
#   geodeticDatum <chr>, eventID <chr>, coordinateUncertaintyInMeters <dbl>,
#   dataGeneralizations <chr>

Notice that we have gone from 19 rows to 17 rows: the rows observed at 2009-09-25T11:11:00Z and 2009-10-27T16:22:00Z were removed because they were the second points within their respective hours.

7.2.2.0.2 Filter on QARTOD flags?

We also have QARTOD flags and they are as follows:

value meaning
1 PASS
2 NOT_EVALUATED
3 SUSPECT
4 FAIL
9 MISSING

The QARTOD tests are:

variable long_name
qartod_time_flag Time QC test - gross range test
qartod_speed_flag Speed QC test - gross range test
qartod_location_flag Location QC test - Location test
qartod_rollup_flag Aggregate QC value

I’m not sure what to do here. My preference would be to include all rows where qartod_rollup_flag == 1 and drop the rest. But I’m open to suggestions.

# perform filter but don't save it.
filter(occurrencedf_dec, qartod_rollup_flag == 1)
# A tibble: 16 × 24
   eventDate       decimalLatitude decimalLongitude basisOfRecord location_class
   <chr>                     <dbl>            <dbl> <chr>         <chr>         
 1 2009-09-23T00:…            34.0            -119. HumanObserva… nan           
 2 2009-09-27T17:…            34.0            -119. MachineObser… 1             
 3 2009-10-08T20:…            34.0            -119. MachineObser… 2             
 4 2009-10-15T11:…            34.0            -119. MachineObser… 0             
 5 2009-10-17T06:…            34.0            -119. MachineObser… 0             
 6 2009-10-17T09:…            34.0            -119. MachineObser… 2             
 7 2009-10-17T10:…            34.0            -119. MachineObser… 3             
 8 2009-10-18T08:…            34.0            -119. MachineObser… 1             
 9 2009-10-18T10:…            34.0            -119. MachineObser… 2             
10 2009-10-18T11:…            34.0            -119. MachineObser… 0             
11 2009-10-23T23:…            34.0            -119. MachineObser… 2             
12 2009-10-24T00:…            34.0            -119. MachineObser… 0             
13 2009-10-26T10:…            34.0            -119. MachineObser… 3             
14 2009-10-27T16:…            34.0            -119. MachineObser… 1             
15 2009-10-29T11:…            34.0            -119. MachineObser… 2             
16 2009-10-31T21:…            34.0            -119. MachineObser… 0             
# ℹ 19 more variables: qartod_time_flag <int>, qartod_speed_flag <int>,
#   qartod_location_flag <int>, qartod_rollup_flag <int>, kingdom <chr>,
#   taxonRank <chr>, occurrenceStatus <chr>, sex <chr>, lifeStage <chr>,
#   scientificName <chr>, scientificNameID <chr>, minimumDepthInMeters <int>,
#   maximumDepthInMeters <int>, organismID <chr>, occurrenceID <chr>,
#   geodeticDatum <chr>, eventID <chr>, coordinateUncertaintyInMeters <dbl>,
#   dataGeneralizations <chr>

Drop the quality flag columns to align with the Darwin Core standard.

occurrencedf_dec <- occurrencedf_dec %>%
    select(
        -c(location_class,
           qartod_time_flag,
           qartod_speed_flag,
           qartod_location_flag,
           qartod_rollup_flag
           ))
        
names(occurrencedf_dec)
 [1] "eventDate"                     "decimalLatitude"              
 [3] "decimalLongitude"              "basisOfRecord"                
 [5] "kingdom"                       "taxonRank"                    
 [7] "occurrenceStatus"              "sex"                          
 [9] "lifeStage"                     "scientificName"               
[11] "scientificNameID"              "minimumDepthInMeters"         
[13] "maximumDepthInMeters"          "organismID"                   
[15] "occurrenceID"                  "geodeticDatum"                
[17] "eventID"                       "coordinateUncertaintyInMeters"
[19] "dataGeneralizations"          
7.2.2.0.3 Write decimated occurrence file as csv
tag_id <- metadata %>% dplyr::filter(variable == "NC_GLOBAL" & name == "ptt_id")

occurrencedf_dec_csv <- glue::glue("{dir_data}/dwc/atn_{tag_id$value}_occurrence.csv")

write.csv(occurrencedf_dec, file=occurrencedf_dec_csv, row.names=FALSE, fileEncoding="UTF-8", quote=TRUE, na="")
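
Since obistools was loaded at the top, we can also run a couple of optional sanity checks on the occurrence table before publishing. A sketch (check_onland queries a web service, so it needs network access):

# Optional QC with obistools: report missing required Darwin Core fields and
# flag any coordinates that fall on land.
obistools::check_fields(occurrencedf_dec)
obistools::check_onland(occurrencedf_dec)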

7.2.2.1 Measurement or Fact

Since we have additional observations beyond the occurrences themselves, we can create a measurement or fact file to include those data. It might be worthwhile to include tag/device metadata, some of the animal measurements, and the detachment information. Each term should have a definition URI.

The measurementOrFact file will only contain information referencing occurrences where basisOfRecord == HumanObservation, since these measurements were made in person at the time the animal was tagged.

DarwinCore Term Status netCDF source
organismID The platform_id global attribute plus the animal_common_name global attribute.
occurrenceID Required eventDate, plus data contained in z variable, plus animal_common_name global attribute.
measurementType Required long_name attribute of the animal_weight, animal_length, animal_length_2 variables.
measurementValue Required The data from the animal_weight, animal_length, animal_length_2 variables.
eventID Strongly Recommended animal_common_name global attribute plus the eventDate.
measurementUnit Strongly Recommended unit attribute of the animal_weight, animal_length, animal_length_2 variables.
measurementMethod Strongly Recommended animal_weight, animal_length, animal_length_2 attributes of their respective variables.
measurementTypeID Strongly Recommended mapping table somewhere?
measurementMethodID Strongly Recommended mapping table somewhere?
measurementUnitID Strongly Recommended mapping table somewhere?
measurementAccuracy Share if available
measurementDeterminedDate Share if available
measurementDeterminedBy Share if available
measurementRemarks Share if available
measurementValueID Share if available
7.2.2.1.1 Extracting variables for Extended Measurement Or Fact (eMOF)

Here there are two approaches to transforming a variable to the eMOF Darwin Core extension. The goal is to collapse the measurement name, value, unit, related identifiers, and remarks into a generalized long format that can be linked to occurrences and events. For more info see:

  1. The OBIS manual
  2. The Marine Biological Data Mobilization Workshop 2023

One approach is to pull out each variable's attributes and map them to the eMOF terms one by one. The chunk below does the same thing more efficiently (though less readably) by looping over a vector of variable names:

# Supply vector of variable names
c("animal_length",
  "animal_length_2",
  "animal_weight") %>%

      # For each name in the vector above, build a named list of the
      # variable's data and attributes, then row-bind into a data frame.
      purrr::map_df(function(x) {
        list(measurementValue = ncvar_get(nc, x),
             measurementType = ncatt_get(nc, x)$long_name,
             measurementUnit = ncatt_get(nc, x)$units,
             measurementMethod = ncatt_get(nc, x)[[paste0(x, '_type')]])
        })
# Measurement or Fact extension
# Need the occurrence where basisOfRecord == HumanObservation, then pull the organism.

emof_data <-
    # var_names %>%
    #     filter(str_starts(name, pattern = "animal_[lw]e")) %>%  # example using regex to parse names
    #     pull(name) %>%

    # Example using a vector of variable names
    c("animal_length",
      "animal_length_2",
      "animal_weight") %>%
    purrr::map_df(function(x) {
        list(measurementValue = ncvar_get(nc, x),
             measurementType = ncatt_get(nc, x)$long_name,
             measurementUnit = ncatt_get(nc, x)$units,
             measurementMethod = ncatt_get(nc, x)[[paste0(x, '_type')]])
    }) %>%
    # Drop fill values (NaN) for measurements that were not recorded
    filter(!is.nan(measurementValue))


emofdf <- occurrencedf %>%
    filter(basisOfRecord == 'HumanObservation') %>%
    select(organismID, eventID, occurrenceID) %>%
    cbind(emof_data)

str(emofdf)
'data.frame':   1 obs. of  7 variables:
 $ organismID       : chr "105838_great_white_shark"
 $ eventID          : chr "great_white shark_2009-09-23T00:00:00Z"
 $ occurrenceID     : chr "2009-09-23T00:00:00Z_0_great_white_shark"
 $ measurementValue : num 213
 $ measurementType  : chr "length of the animal as measured or estimated at deployment"
 $ measurementUnit  : chr "cm"
 $ measurementMethod: chr "total length"
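
The measurementTypeID, measurementMethodID, and measurementUnitID terms flagged as open questions in the table above could be filled from a small local lookup joined onto emofdf. A sketch of the shape of that mapping; the vocabulary URIs are placeholders rather than confirmed NERC entries, and the weight long_name is assumed:

# Hypothetical lookup from measurementType to a controlled-vocabulary URI.
# The URIs below are placeholders, not real NERC P01 entries.
measurement_type_ids <- tibble::tribble(
    ~measurementType,                                               ~measurementTypeID,
    "length of the animal as measured or estimated at deployment", "http://vocab.nerc.ac.uk/collection/P01/current/EXAMPLE1/",
    "weight of the animal as measured or estimated at deployment", "http://vocab.nerc.ac.uk/collection/P01/current/EXAMPLE2/"
)

# Join onto the eMOF table without overwriting it
emofdf %>% dplyr::left_join(measurement_type_ids, by = "measurementType")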
7.2.2.1.2 Write eMOF file as CSV
tag_id <- metadata %>% dplyr::filter(variable == "NC_GLOBAL") %>% dplyr::filter(name == "ptt_id")

emof_csv <- glue::glue("{dir_data}/dwc/atn_{tag_id$value}_emof.csv")

write.csv(emofdf, file=emof_csv, row.names=FALSE, fileEncoding="UTF-8", quote=TRUE, na="")

7.2.2.2 Metadata creation

Now that we know our data are aligned to Darwin Core, we can start collecting metadata. Using the R package EML, we can create the EML metadata to associate with the data above.

Some good sources to help identify what is required in the EML metadata:

  • https://github.com/gbif/ipt/wiki/GMPHowToGuide

  • https://github.com/gbif/ipt/wiki/GMPHowToGuide#dataset-resource

# library(EML)

The first thing we need to do is collect all of the relevant pieces of metadata for our EML record.

# me <- list(individualName = list(givenName = "Matt", surName = "Biddle"))
# my_eml <- list(dataset = list(
#               title = "A Minimal Valid EML Dataset",
#               creator = me,
#               contact = me
#               ))
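
For reference, the commented skeleton above follows the minimal example from the EML package documentation; run end-to-end it looks like this (the person and title are placeholders, not values from the netCDF file):

# A minimal, runnable EML record following the EML package's own example
library(EML)
me <- list(individualName = list(givenName = "Given", surName = "Sur"))
my_eml_minimal <- list(dataset = list(
    title = "A Minimal Valid EML Dataset",
    creator = me,
    contact = me))
write_eml(my_eml_minimal, "eml_minimal.xml")
eml_validate("eml_minimal.xml")  # returns TRUE for a valid record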
# geographicDescription <- metadata %>% dplyr::filter(variable == "NC_GLOBAL") %>% dplyr::filter(name == "sea_name")
# west  <- metadata %>% dplyr::filter(variable == "NC_GLOBAL") %>% dplyr::filter(name == "geospatial_lon_min")
# east  <- metadata %>% dplyr::filter(variable == "NC_GLOBAL") %>% dplyr::filter(name == "geospatial_lon_max")
# north <- metadata %>% dplyr::filter(variable == "NC_GLOBAL") %>% dplyr::filter(name == "geospatial_lat_max")
# south <- metadata %>% dplyr::filter(variable == "NC_GLOBAL") %>% dplyr::filter(name == "geospatial_lat_min")
# altitudeMin   <- metadata %>% dplyr::filter(variable == "NC_GLOBAL") %>% dplyr::filter(name == "geospatial_vertical_min")
# altitudeMax   <- metadata %>% dplyr::filter(variable == "NC_GLOBAL") %>% dplyr::filter(name == "geospatial_vertical_max")
# altitudeUnits <- metadata %>% dplyr::filter(variable == "NC_GLOBAL") %>% dplyr::filter(name == "geospatial_vertical_units")
#
# coverage <-
#   set_coverage(begin = format(min(atn_tbl$time), '%Y-%m-%d'), end = format(max(atn_tbl$time), '%Y-%m-%d'),
#                sci_names = RNetCDF::var.get.nc(RNetCDF::open.nc("atn_trajectory_template.nc"), "taxon_name"),
#                geographicDescription = paste(geographicDescription$value),
#                west = paste(west$value),
#                east = paste(east$value),
#                north = paste(north$value),
#                south = paste(south$value),
#                altitudeMin = paste(altitudeMin$value),
#                altitudeMaximum = paste(altitudeMax$value),
#                altitudeUnits = ifelse(paste(altitudeUnits$value) == 'm', "meter", "?"))
# creator_name <- metadata %>% dplyr::filter(variable == "NC_GLOBAL") %>% dplyr::filter(name == "creator_name")
# creator_email <- metadata %>% dplyr::filter(variable == "NC_GLOBAL") %>% dplyr::filter(name == "creator_email")
# creator_sector <- metadata %>% dplyr::filter(variable == "NC_GLOBAL") %>% dplyr::filter(name == "creator_sector")
#
# creator <- eml$creator(
#             eml$individualName(
#                 givenName = paste(creator_name$value),
#                 surName = paste(creator_name$value)
#                 ),
#             position = paste(creator_sector$value),
#             electronicMailAddress = paste(creator_email$value)
#             )
# #contact_name = metadata %>% dplyr::filter(variable == "NC_GLOBAL") %>% dplyr::filter(name == "contact_name")
#
# contact <- eml$contact(
#             eml$individualName(
#                 givenName = paste(creator_name$value),
#                 surName = paste(creator_name$value)),
#             position = paste(creator_sector$value),
#             electronicMailAddress = paste(creator_email$value)
#             )
# #metadata_name
#
# metadataProvider <- eml$metadataProvider(
#             eml$individualName(
#                 givenName = paste(creator_name$value),
#                 surName = paste(creator_name$value)),
#             position = paste(creator_sector$value),
#             electronicMailAddress = paste(creator_email$value)
#             )
# ## these are the entries in contributor; need to iterate since it is a comma-separated list.
# contrib_name <- metadata %>% dplyr::filter(variable == "NC_GLOBAL") %>% dplyr::filter(name == "contributor_name")
# contrib_position <- metadata %>% dplyr::filter(variable == "NC_GLOBAL") %>% dplyr::filter(name == "contributor_role")
# contrib_email <- metadata %>% dplyr::filter(variable == "NC_GLOBAL") %>% dplyr::filter(name == "contributor_email")
#
# associatedParty <- eml$associatedParty(
#                     eml$individualName(
#                         givenName = paste(contrib_name$value),
#                         surName = paste(contrib_name$value)),
#                     position = paste(contrib_position$value),
#                     electronicMailAddress = paste(contrib_email$value)
#                     )
# abstract <- metadata %>% dplyr::filter(variable == "NC_GLOBAL") %>% dplyr::filter(name == "summary")
#
# # keywords
# keywords <- metadata %>% dplyr::filter(variable == "NC_GLOBAL") %>% dplyr::filter(name == "keywords")
# kw_vocab <- metadata %>% dplyr::filter(variable == "NC_GLOBAL") %>% dplyr::filter(name == "keywords_vocabulary")
#
# keywordSet <- list(
#     list(
#         keywordThesaurus = kw_vocab$value$keywords_vocabulary,
#         keyword = as.list(strsplit(keywords$value$keywords, ", "))
#         ))
#
# title <- metadata %>% dplyr::filter(variable == "NC_GLOBAL") %>% dplyr::filter(name == "title")
# methods <- "NEED TO MAP FROM NCFILE"
# license <- metadata %>% dplyr::filter(variable == "NC_GLOBAL") %>% dplyr::filter(name == "license")

Now build the eml file.

# library(uuid)
#
# physical <- set_physical(file_name_occur)
#
# # attributeList <-
# #   set_attributes(attributes,
# #                  factors,
# #                  col_classes = c("character",
# #                                  "Date",
# #                                  "Date",
# #                                  "Date",
# #                                  "factor",
# #                                  "factor",
# #                                  "factor",
# #                                  "numeric"))
#
# my_eml <- eml$eml(
#            packageId = paste(uuid_tbl$value),
#            system = "uuid",
#            dataset = eml$dataset(
#                alternateIdentifier = UUIDgenerate(use.time = TRUE),
#                title = title$value,
#                creator = creator,
#                metadataProvider = metadataProvider,
#                #associatedParty = associatedParty,
#                contact = contact,
#                pubDate = format(Sys.time(), '%Y-%m-%d'),
#                language = "English",
#                intellectualRights = eml$intellectualRights(
#                    para = "To the extent possible under law, the publisher has waived all rights to these data and has dedicated them to the <ulink url=\"http://creativecommons.org/publicdomain/zero/1.0/legalcode\"><citetitle>Public Domain (CC0 1.0)</citetitle></ulink>. Users may copy, modify, distribute and use the work, including for commercial purposes, without restriction."
#                    #para = paste(license$value),
#                    ),
#                abstract = eml$abstract(
#                    para = abstract$value$summary,
#                    ),
#                keywordSet = keywordSet,
#                coverage = coverage,
# #              license = eml$license(
# #                          licenseName = "CC0 1.0",
# #                          #licenseName = paste(license$value),
# #                          ),
#                #dataTable = eml$dataTable(
#                #  entityName = file_name_occur,
#                #  entityDescription = "Occurrences",
#                #  physical = physical)
#                ))

Validate EML

# val <- eml_validate(my_eml)
# attr(val, "errors")

Write eml to file.

# file_name_eml <- 'eml.xml'
# write_eml(my_eml, file_name_eml)

Raw EML

# my_eml
7.2.2.2.1 Create meta.xml

Below is an example of the contents of meta.xml. The <id index="0" /> element in the core and the <coreid index="0" /> element in the extension identify the shared column (here the first one) used to join extension rows to their core occurrence rows:


<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
  <core encoding="UTF-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n" fieldsEnclosedBy="" ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files>
      <location>occurrence.txt</location>
    </files>
    <id index="0" />
    <field index="1" term="http://rs.tdwg.org/dwc/terms/datasetID"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/institutionCode"/>
    <field index="3" term="http://rs.tdwg.org/dwc/terms/collectionCode"/>
    <field index="4" term="http://rs.tdwg.org/dwc/terms/basisOfRecord"/>
    <field index="5" term="http://rs.tdwg.org/dwc/terms/occurrenceID"/>
    <field index="6" term="http://rs.tdwg.org/dwc/terms/catalogNumber"/>
    <field index="7" term="http://rs.tdwg.org/dwc/terms/occurrenceRemarks"/>
    <field index="8" term="http://rs.tdwg.org/dwc/terms/individualCount"/>
    <field index="9" term="http://rs.tdwg.org/dwc/terms/sex"/>
    <field index="10" term="http://rs.tdwg.org/dwc/terms/occurrenceStatus"/>
    <field index="11" term="http://rs.tdwg.org/dwc/terms/eventDate"/>
    <field index="12" term="http://rs.tdwg.org/dwc/terms/year"/>
    <field index="13" term="http://rs.tdwg.org/dwc/terms/decimalLatitude"/>
    <field index="14" term="http://rs.tdwg.org/dwc/terms/decimalLongitude"/>
    <field index="15" term="http://rs.tdwg.org/dwc/terms/coordinateUncertaintyInMeters"/>
    <field index="16" term="http://rs.tdwg.org/dwc/terms/scientificNameID"/>
    <field index="17" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
  </core>
  <extension encoding="UTF-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n" fieldsEnclosedBy="" ignoreHeaderLines="1" rowType="http://rs.iobis.org/obis/terms/ExtendedMeasurementOrFact">
    <files>
      <location>extendedmeasurementorfact.txt</location>
    </files>
    <coreid index="0" />
    <field index="1" term="http://rs.tdwg.org/dwc/terms/occurrenceID"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/measurementType"/>
    <field index="3" term="http://rs.tdwg.org/dwc/terms/measurementValue"/>
    <field index="4" term="http://rs.tdwg.org/dwc/terms/measurementUnit"/>
    <field index="5" term="http://rs.iobis.org/obis/terms/measurementUnitID"/>
    <field index="6" term="http://rs.tdwg.org/dwc/terms/measurementDeterminedDate"/>
  </extension>
</archive>

Check out the XML package for R:

conda install -c conda-forge r-xml

There is another example in this GitHub repository. Or use the GUI here to create meta.xml.
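
Alternatively, since xml2 is already available as a tidyverse dependency, the occurrence core of meta.xml can be generated with a few xml2 calls. A minimal sketch, assuming the occurrence CSV written earlier is the core file and that eml.xml sits in the same dwc directory:

# Build meta.xml for the occurrence core with xml2 (a sketch; field
# indices follow column order, with the first column as the id).
library(xml2)

doc <- xml_new_document()
archive <- xml_add_child(doc, "archive")
xml_set_attrs(archive, c(xmlns = "http://rs.tdwg.org/dwc/text/",
                         metadata = "eml.xml"))

core <- xml_add_child(archive, "core")
xml_set_attrs(core, c(encoding = "UTF-8",
                      fieldsTerminatedBy = ",",
                      linesTerminatedBy = "\\n",
                      ignoreHeaderLines = "1",
                      rowType = "http://rs.tdwg.org/dwc/terms/Occurrence"))

files <- xml_add_child(core, "files")
loc <- xml_add_child(files, "location")
xml_set_text(loc, basename(occurrencedf_dec_csv))

id <- xml_add_child(core, "id")
xml_set_attr(id, "index", "0")

# One <field> element per occurrence column, in column order
for (i in seq_along(names(occurrencedf_dec))) {
  f <- xml_add_child(core, "field")
  xml_set_attrs(f, c(index = as.character(i),
                     term = paste0("http://rs.tdwg.org/dwc/terms/",
                                   names(occurrencedf_dec)[i])))
}

write_xml(doc, file.path(dir_data, "dwc", "meta.xml"))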

# library(XML)
#
# doc = newXMLDoc()
# archiveNode = newXMLNode("archive", attrs = c(metadata=file_name_eml), namespaceDefinitions=c("http://rs.tdwg.org/dwc/text/"), doc=doc)
#
# ## For the core occurrence
# coreNode = newXMLNode("core", attrs = c(encoding="UTF-8", linesTerminatedBy="\\r\\n", fieldsTerminatedBy=",", fieldsEnclosedBy='\"', ignoreHeaderLines="1", rowType="http://rs.tdwg.org/dwc/terms/Occurrence"), parent = archiveNode)
# filesNode = newXMLNode("files", parent = coreNode)
# locationNode = newXMLNode("location", file_name_occur, parent = filesNode)
# idnode = newXMLNode("id", attrs = c(index="9"), parent = coreNode)
#
# # iterate over the columns in the occurrence file to create field elements
# i=0
# for (col in colnames(occurrencedf))
#     {
#     termstr = paste("http://rs.tdwg.org/dwc/terms/", col, sep="")
#     i=i+1
#     fieldnode = newXMLNode("field", attrs = c(index=i, term=termstr), parent=coreNode)
# }
#
# ## for the extensions
# extensionNode = newXMLNode("extension", attrs = c(encoding="UTF-8", linesTerminatedBy="\\r\\n", fieldsTerminatedBy=",", fieldsEnclosedBy='\"', ignoreHeaderLines="1", rowType="http://rs.tdwg.org/dwc/terms/Event"), parent = archiveNode)
# filesNode = newXMLNode("files", parent = extensionNode)
# locationNode = newXMLNode("location", file_name_event, parent = filesNode)
# idnode = newXMLNode("id", attrs = c(index="0"), parent = extensionNode)
#
# # iterate over the columns in the event file to create field elements
# i=0
# for (col in colnames(eventdf))
#     {
#     if (col == 'modified'){
#         termstr = paste("http://purl.org/dc/terms/", col, sep="")
#     } else {
#         termstr = paste("http://rs.tdwg.org/dwc/terms/", col, sep="")
#     }
#     i=i+1
#     fieldnode = newXMLNode("field", attrs = c(index=i, term=termstr), parent=extensionNode)
# }
#
# print(doc)
#
# saveXML(doc, file="meta.xml")

7.2.2.3 Build the Darwin Core Archive zip package

# library(zip)
#
# files = c(file_name_occur, file_name_event, file_name_eml, "meta.xml")
# zip::zip(
#     "atn.zip",
#     files,
#     root = ".",
#     mode = "mirror",
# )
#
# zip_list("atn.zip")
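
A runnable version of the packaging step, assuming eml.xml and meta.xml were written into the same dwc directory as the occurrence and eMOF CSVs created above:

# Collect the four Darwin Core Archive components written earlier
library(zip)
dwca_files <- c(occurrencedf_dec_csv,
                emof_csv,
                file.path(dir_data, "dwc", "eml.xml"),
                file.path(dir_data, "dwc", "meta.xml"))

dwca_zip <- glue::glue("{dir_data}/dwc/atn_{tag_id$value}_dwca.zip")

# "cherry-pick" mode places the files at the top level of the archive
zip::zip(dwca_zip, dwca_files, mode = "cherry-pick")
zip::zip_list(dwca_zip)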

7.2.3 sessionInfo()

sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.6.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: UTC
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] mapdata_2.3.1   maps_3.4.2      lubridate_1.9.3 forcats_1.0.0  
 [5] stringr_1.5.1   dplyr_1.1.4     purrr_1.0.2     readr_2.1.5    
 [9] tidyr_1.3.1     tibble_3.2.1    ggplot2_3.5.1   tidyverse_2.0.0
[13] ncdf4_1.23      obistools_0.1.0 tidync_0.4.0   

loaded via a namespace (and not attached):
 [1] rappdirs_0.3.3    utf8_1.2.4        generics_0.1.3    xml2_1.3.6       
 [5] stringi_1.8.4     hms_1.1.3         digest_0.6.37     magrittr_2.0.3   
 [9] evaluate_1.0.0    grid_4.4.1        timechange_0.3.0  fastmap_1.2.0    
[13] rprojroot_2.0.4   jsonlite_1.8.9    ncmeta_0.4.0      fansi_1.0.6      
[17] crosstalk_1.2.1   scales_1.3.0      cli_3.6.3         RNetCDF_2.9-2    
[21] rlang_1.1.4       munsell_0.5.1     withr_3.0.1       yaml_2.3.10      
[25] tools_4.4.1       tzdb_0.4.0        colorspace_2.1-1  here_1.0.1       
[29] vctrs_0.6.5       R6_2.5.1          lifecycle_1.0.4   leaflet_2.2.2    
[33] htmlwidgets_1.6.4 pkgconfig_2.0.3   pillar_1.9.0      gtable_0.3.5     
[37] glue_1.8.0        xfun_0.48         tidyselect_1.2.1  data.tree_1.1.0  
[41] knitr_1.48        htmltools_0.5.8.1 rmarkdown_2.28    compiler_4.4.1