Chapter 2 Applications

This chapter contains a series of example applications to convert source data to the Darwin Core standard. You can find these examples (and more!) in the GitHub repository under the datasets/ directory.

2.1 Aligning Data to Darwin Core - Event Core with Extended Measurement or Fact

Abby Benson
January 9, 2022

2.1.1 General information about this notebook

Script to process the Texas Parks and Wildlife Department (TPWD) Aransas Bay bag seine data from the format used by the Houston Advanced Research Center (HARC) for bays in Texas. Taxonomy was processed using a separate script (TPWD_Taxonomy.R) with a taxa list pulled from the PDF “2009 Resource Monitoring Operations Manual”. All original data, processed data, and scripts are stored in an item on USGS ScienceBase.

# Load some of the libraries
library(reshape2)
library(tidyverse)
library(readr)
# Load the data
BagSeine <- read.csv("https://www.sciencebase.gov/catalog/file/get/53a887f4e4b075096c60cfdd?f=__disk__6e%2F6a%2F67%2F6e6a678c41cf928e025fd30339789cc8b893a815&allowOpen=true", stringsAsFactors=FALSE, strip.white = TRUE)

Note that, if you haven't already done so, you'll need to run the TPWD_Taxonomy.R script to get the taxaList file squared away, or load the taxonomy file into the World Register of Marine Species Taxon Match Tool: https://www.marinespecies.org/aphia.php?p=match

2.1.2 Event file

To start we will create the Darwin Core Event file. This is the file that will have all the information about the sampling event, such as date, location, depth, and sampling protocol. Basically, anything about the cruise or the way the sampling was done will go in this file. You can see all the Darwin Core terms that are part of the event file here: http://tools.gbif.org/dwca-validator/extension.do?id=dwc:Event.

The original format for these TPWD HARC files has all of the information associated with the event in approximately the first 50 columns, and then all of the information about the occurrences (species) as one column per species. We will need to start by limiting the data to the event information only.

event <- BagSeine[,1:47]

Next, there are several pieces of information that need to be 1) added, like the geodeticDatum; 2) pieced together from multiple columns, like the datasetID; or 3) changed slightly, like the minimum and maximum depth.

event <- event %>%
  mutate(type = "Event",
         modified = lubridate::today(),
         language = "en",
         license = "http://creativecommons.org/publicdomain/zero/1.0/legalcode",
         institutionCode = "TPWD",
         ownerInstitutionCode = "HARC",
         coordinateUncertaintyInMeters = "100",
         geodeticDatum = "WGS84",
         georeferenceProtocol = "Handheld GPS",
         country = "United States",
         countryCode = "US",
         stateProvince = "Texas",
         datasetID = gsub(" ", "_", paste("TPWD_HARC_Texas", event$Bay, event$Gear_Type)),
         eventID = paste("Station", event$station_code, "Date", event$completion_dttm, sep = "_"),
         sampleSizeUnit = "hectares",
         CompDate = lubridate::mdy_hms(event$CompDate, tz="America/Chicago"), 
         StartDate = lubridate::mdy_hms(event$StartDate, tz="America/Chicago"),
         minimumDepthInMeters = ifelse(start_shallow_water_depth_num < start_deep_water_depth_num, 
                                       start_shallow_water_depth_num, start_deep_water_depth_num),
         maximumDepthInMeters = ifelse(start_deep_water_depth_num > start_shallow_water_depth_num,
                                       start_deep_water_depth_num, start_shallow_water_depth_num))
head(event[,48:64], n = 10)
    type   modified language                                                    license institutionCode
1  Event 2022-01-09       en http://creativecommons.org/publicdomain/zero/1.0/legalcode            TPWD
2  Event 2022-01-09       en http://creativecommons.org/publicdomain/zero/1.0/legalcode            TPWD
3  Event 2022-01-09       en http://creativecommons.org/publicdomain/zero/1.0/legalcode            TPWD
4  Event 2022-01-09       en http://creativecommons.org/publicdomain/zero/1.0/legalcode            TPWD
5  Event 2022-01-09       en http://creativecommons.org/publicdomain/zero/1.0/legalcode            TPWD
6  Event 2022-01-09       en http://creativecommons.org/publicdomain/zero/1.0/legalcode            TPWD
7  Event 2022-01-09       en http://creativecommons.org/publicdomain/zero/1.0/legalcode            TPWD
8  Event 2022-01-09       en http://creativecommons.org/publicdomain/zero/1.0/legalcode            TPWD
9  Event 2022-01-09       en http://creativecommons.org/publicdomain/zero/1.0/legalcode            TPWD
10 Event 2022-01-09       en http://creativecommons.org/publicdomain/zero/1.0/legalcode            TPWD
   ownerInstitutionCode coordinateUncertaintyInMeters geodeticDatum georeferenceProtocol       country
1                  HARC                           100         WGS84         Handheld GPS United States
2                  HARC                           100         WGS84         Handheld GPS United States
3                  HARC                           100         WGS84         Handheld GPS United States
4                  HARC                           100         WGS84         Handheld GPS United States
5                  HARC                           100         WGS84         Handheld GPS United States
6                  HARC                           100         WGS84         Handheld GPS United States
7                  HARC                           100         WGS84         Handheld GPS United States
8                  HARC                           100         WGS84         Handheld GPS United States
9                  HARC                           100         WGS84         Handheld GPS United States
10                 HARC                           100         WGS84         Handheld GPS United States
   countryCode stateProvince                             datasetID                                eventID
1           US         Texas TPWD_HARC_Texas_Aransas_Bay_Bag_Seine Station_95_Date_09JAN1997:14:35:00.000
2           US         Texas TPWD_HARC_Texas_Aransas_Bay_Bag_Seine Station_95_Date_18AUG2000:11:02:00.000
3           US         Texas TPWD_HARC_Texas_Aransas_Bay_Bag_Seine Station_96_Date_28JUN2005:08:41:00.000
4           US         Texas TPWD_HARC_Texas_Aransas_Bay_Bag_Seine Station_96_Date_23AUG2006:11:47:00.000
5           US         Texas TPWD_HARC_Texas_Aransas_Bay_Bag_Seine Station_96_Date_17OCT2006:14:23:00.000
6           US         Texas TPWD_HARC_Texas_Aransas_Bay_Bag_Seine Station_96_Date_19FEB1996:10:27:00.000
7           US         Texas TPWD_HARC_Texas_Aransas_Bay_Bag_Seine Station_96_Date_11JUN2001:14:12:00.000
8           US         Texas TPWD_HARC_Texas_Aransas_Bay_Bag_Seine Station_96_Date_16MAR1992:09:46:00.000
9           US         Texas TPWD_HARC_Texas_Aransas_Bay_Bag_Seine Station_96_Date_25SEP1996:11:28:00.000
10          US         Texas TPWD_HARC_Texas_Aransas_Bay_Bag_Seine Station_96_Date_08MAY1997:13:20:00.000
   sampleSizeUnit minimumDepthInMeters maximumDepthInMeters
1        hectares                  0.0                  0.6
2        hectares                  0.1                  0.5
3        hectares                  0.4                  0.6
4        hectares                  0.2                  0.4
5        hectares                  0.7                  0.8
6        hectares                  0.1                  0.3
7        hectares                  0.4                  0.5
8        hectares                  0.0                  0.4
9        hectares                  0.3                  0.7
10       hectares                  0.4                  0.6
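
As an aside, the paired ifelse() calls above for the depths can be written more compactly with the vectorized base R helpers pmin() and pmax(). A minimal equivalent sketch:

# Equivalent depth assignment using vectorized pmin()/pmax()
event <- event %>%
  mutate(minimumDepthInMeters = pmin(start_shallow_water_depth_num,
                                     start_deep_water_depth_num),
         maximumDepthInMeters = pmax(start_shallow_water_depth_num,
                                     start_deep_water_depth_num))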

For this dataset there were start and end timestamps that we can use to quantify the sampling effort, which can be really valuable information for downstream users trying to reuse data from multiple projects.

## Calculate duration of bag seine event
event$samplingEffort <- ""
for (i in 1:nrow(event)){
  event[i,]$samplingEffort <- abs(lubridate::as.duration(event[i,]$CompDate - event[i,]$StartDate))
}
event$samplingEffort <- paste(event$samplingEffort, "seconds", sep = " ")
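
The loop above works, but the same column can be computed in one vectorized step, which is much faster on large tables. A minimal equivalent sketch using base difftime():

# Vectorized equivalent of the loop above
event$samplingEffort <- paste(
  abs(as.numeric(difftime(event$CompDate, event$StartDate, units = "secs"))),
  "seconds")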

Finally, there were a few columns that were a direct match to a Darwin Core term and therefore just need to be renamed to follow the standard.

event <- event %>%
  rename(samplingProtocol = Gear_Type,
         locality = Estuary,
         waterBody = SubBay,
         decimalLatitude = Latitude,
         decimalLongitude = Longitude,
         sampleSizeValue = surface_area_num,
         eventDate = CompDate)

2.1.3 Occurrence file

The next file we need to create is the Occurrence file. This file includes all the information about the species that were observed. An occurrence in Darwin Core is the intersection of an organism at a time and a place. We have already done the work to identify the time and place in the event file, so we don't need to do that again here. What we do need to do is identify all the information about the organisms. Another piece of information that goes in here is basisOfRecord, which is a required field and has a controlled vocabulary. For the data we work with you'll usually use HumanObservation or MachineObservation. If it's eDNA data you'll use MaterialSample. If your data are part of a museum collection you'll use PreservedSpecimen.

It's important to note that there is overlap in the Darwin Core terms that are "allowed" to be in the event file and in the occurrence file. This is because data can be submitted as "Occurrence Only", where you don't have a separate event file. In that case, the location and date information needs to be included in the occurrence file. Since we are formatting this dataset as a sampling event, we will not include location and date information in the occurrence file. To see all the Darwin Core terms that can go in the occurrence file, go here: https://tools.gbif.org/dwca-validator/extension.do?id=dwc:occurrence.

This dataset in its original format is in “wide format”. All that means is that data that we would expect to be encoded as values in the rows are instead column headers. We have to pull all the scientific names out of the column headers and turn them into actual values in the data.

occurrence <- melt(BagSeine, id=1:47, measure=48:109, variable.name="vernacularName", value.name="relativeAbundance")

You’ll notice when we did that step we went from 5481 obs (or rows) in the data to 334341 obs. We went from wide to long.

dim(BagSeine)
[1] 5481  109
dim(occurrence)
[1] 334341     49
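
As an aside, the same reshape can be done with tidyr::pivot_longer(), the tidyverse successor to reshape2::melt(). A sketch, using the same column positions as above (tidyr is already loaded via tidyverse):

# Equivalent wide-to-long reshape with tidyr
occurrence <- BagSeine %>%
  pivot_longer(cols = 48:109,
               names_to = "vernacularName",
               values_to = "relativeAbundance")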

Now as with the event file we have several pieces of information that need to be added or changed to make sure the data are following Darwin Core. We always want to include as much information as possible to make the data as reusable as possible.

occurrence <- occurrence %>%
  mutate(vernacularName = gsub("\\.",' ', vernacularName),
         eventID = paste("Station", station_code, "Date", completion_dttm, sep = "_"),
         occurrenceStatus = ifelse(relativeAbundance == 0, "Absent", "Present"),
         basisOfRecord = "HumanObservation",
         organismQuantityType = "Relative Abundance",
         collectionCode = paste(Bay, Gear_Type, sep = " "))

We will match the taxa list with our occurrence file data to bring in the taxonomic information that we pulled from WoRMS. To save time, you'll just import the processed taxa list, which includes the taxonomic hierarchy and the required term scientificNameID, one of the most important pieces of information to include for OBIS.

taxaList <- read.csv("https://www.sciencebase.gov/catalog/file/get/53a887f4e4b075096c60cfdd?f=__disk__49%2F0a%2F73%2F490a7337fa94039715809496b22f5d003b8a79a2&allowOpen=true", stringsAsFactors = FALSE)
## Merge taxaList with occurrence
occurrence <- merge(occurrence, taxaList, by = "vernacularName", all.x = T)
## Test that all the vernacularNames found a match in taxaList_updated
Hmisc::describe(occurrence$scientificNameID)
       n  missing distinct 
  334341        0       61 

lowest : urn:lsid:marinespecies.org:taxname:105792 urn:lsid:marinespecies.org:taxname:107034 urn:lsid:marinespecies.org:taxname:107379 urn:lsid:marinespecies.org:taxname:126983 urn:lsid:marinespecies.org:taxname:127089
highest: urn:lsid:marinespecies.org:taxname:367528 urn:lsid:marinespecies.org:taxname:396707 urn:lsid:marinespecies.org:taxname:421784 urn:lsid:marinespecies.org:taxname:422069 urn:lsid:marinespecies.org:taxname:443955

For that last line of code we expect to see no missing values for scientificNameID. Every row in the file should have a value for scientificNameID: a WoRMS LSID that looks like this: urn:lsid:marinespecies.org:taxname:144531.
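
To make that expectation explicit in the script, a one-line assertion (a minimal sketch) will stop the run if any row failed to match:

# Fail loudly if any scientificNameID is missing after the merge
stopifnot(!anyNA(occurrence$scientificNameID))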

We need to create a unique ID for each row in the occurrence file. This is known as the occurrenceID and is a required term. The occurrenceID needs to be globally unique and permanent: it must be kept in place if any updates to the dataset are made, and you should not create brand-new occurrenceIDs when you update a dataset. To facilitate this, I like to build the occurrenceID from pieces of information available in the dataset. For this dataset I used the eventID (Station + Date) plus the scientific name. This only works if there is one scientific name per station per date, so if you have different ages or sexes of a species at the same station and date, this method of creating the occurrenceID won't work for you.

occurrence$occurrenceID <- paste(occurrence$eventID, gsub(" ", "_",occurrence$scientificName), sep = "_")
occurrence[1,]$occurrenceID
[1] "Station_95_Date_09JAN1997:14:35:00.000_Atractosteus_spatula"

For the occurrence file we only have one column to rename. We could have avoided this step by naming the column organismQuantity up above, but I kept the original name as a reminder of what the data providers had called it.

occurrence <- occurrence %>%
  rename(organismQuantity = relativeAbundance)

2.1.4 Extended Measurement or Fact extension file

The final file we are going to create is the Extended Measurement or Fact extension (emof). This is a bit like a catch-all for any measurements or facts that are not captured by Darwin Core terms: Darwin Core does not have terms for things like temperature, salinity, gear type, cruise number, length, weight, etc. We are going to create a long-format file where each of these is a set of rows in the extended measurement or fact file. You can find all the terms in this extension here: https://tools.gbif.org/dwca-validator/extension.do?id=http://rs.iobis.org/obis/terms/ExtendedMeasurementOrFact.

OBIS uses the BODC NERC Vocabulary Server to provide explicit definitions for each of the measurements https://vocab.nerc.ac.uk/search_nvs/.

For this dataset I was only able to find code definitions, provided by the data providers, for some of the measurements; I included those and left out any for which I couldn't find definitions. The ones with code definitions were Total.Of.Samples_Count, gear_size, start_wind_speed_num, start_barometric_pressure_num, start_temperature_num, start_salinity_num, and start_dissolved_oxygen_num.

totalOfSamples <- event[c("Total.Of.Samples_Count", "eventID")]
totalOfSamples <- totalOfSamples[which(!is.na(totalOfSamples$Total.Of.Samples_Count)),]
totalOfSamples <- totalOfSamples %>% 
  mutate(measurementType = "Total number of samples used to calculate relative abundance",
         measurementUnit = "",
         measurementTypeID = "",
         measurementUnitID = "",
         occurrenceID = "") %>%
  rename(measurementValue = Total.Of.Samples_Count)

gear_size <- event[c("gear_size", "eventID")]
gear_size <- gear_size[which(!is.na(gear_size$gear_size)),]
gear_size <- gear_size %>% 
  mutate(measurementType = "gear size",
         measurementUnit = "meters",
         measurementTypeID = "http://vocab.nerc.ac.uk/collection/P01/current/MTHAREA1/",
         measurementUnitID = "http://vocab.nerc.ac.uk/collection/P06/current/ULAA/",
         occurrenceID = "") %>%
  rename(measurementValue = gear_size)

start_wind_speed_num <- event[c("start_wind_speed_num", "eventID")]
start_wind_speed_num <- start_wind_speed_num[which(!is.na(start_wind_speed_num$start_wind_speed_num)),]
start_wind_speed_num <- start_wind_speed_num %>% 
  mutate(measurementType = "wind speed",
         measurementUnit = "not provided",
         measurementTypeID = "http://vocab.nerc.ac.uk/collection/P01/current/EWSBZZ01/",
         measurementUnitID = "",
         occurrenceID = "") %>%
  rename(measurementValue = start_wind_speed_num)

start_barometric_pressure_num <- event[c("start_barometric_pressure_num", "eventID")]
start_barometric_pressure_num <- start_barometric_pressure_num[which(!is.na(start_barometric_pressure_num$start_barometric_pressure_num)),]
start_barometric_pressure_num <- start_barometric_pressure_num %>% 
  mutate(measurementType = "barometric pressure",
         measurementUnit = "not provided",
         measurementTypeID = "http://vocab.nerc.ac.uk/collection/P07/current/CFSN0015/",
         measurementUnitID = "",
         occurrenceID = "") %>%
  rename(measurementValue = start_barometric_pressure_num)

start_temperature_num <- event[c("start_temperature_num", "eventID")]
start_temperature_num <- start_temperature_num[which(!is.na(start_temperature_num$start_temperature_num)),]
start_temperature_num <- start_temperature_num %>% 
  mutate(measurementType = "water temperature",
         measurementUnit = "Celsius",
         measurementTypeID = "http://vocab.nerc.ac.uk/collection/P01/current/TEMPPR01/",
         measurementUnitID = "http://vocab.nerc.ac.uk/collection/P06/current/UPAA/",
         occurrenceID = "") %>%
  rename(measurementValue = start_temperature_num)

start_salinity_num <- event[c("start_salinity_num", "eventID")]
start_salinity_num <- start_salinity_num[which(!is.na(start_salinity_num$start_salinity_num)),]
start_salinity_num <- start_salinity_num %>% 
  mutate(measurementType = "salinity",
         measurementUnit = "ppt",
         measurementTypeID = "http://vocab.nerc.ac.uk/collection/P01/current/ODSDM021/",
         measurementUnitID = "http://vocab.nerc.ac.uk/collection/P06/current/UPPT/",
         occurrenceID = "") %>%
  rename(measurementValue = start_salinity_num)

start_dissolved_oxygen_num <- event[c("start_dissolved_oxygen_num", "eventID")]
start_dissolved_oxygen_num <- start_dissolved_oxygen_num[which(!is.na(start_dissolved_oxygen_num$start_dissolved_oxygen_num)),]
start_dissolved_oxygen_num <- start_dissolved_oxygen_num %>% 
  mutate(measurementType = "dissolved oxygen",
         measurementUnit = "ppm",
         measurementTypeID = "http://vocab.nerc.ac.uk/collection/P09/current/DOX2/",
         measurementUnitID = "http://vocab.nerc.ac.uk/collection/P06/current/UPPM/",
         occurrenceID = "") %>%
  rename(measurementValue = start_dissolved_oxygen_num)

alternate_station_code <- event[c("alternate_station_code", "eventID")]
alternate_station_code <- alternate_station_code[which(!is.na(alternate_station_code$alternate_station_code)),]
alternate_station_code <- alternate_station_code %>% 
  mutate(measurementType = "alternate station code",
         measurementUnit = "",
         measurementTypeID = "",
         measurementUnitID = "",
         occurrenceID = "") %>%
  rename(measurementValue = alternate_station_code)

organismQuantity <- occurrence[c("organismQuantity", "eventID", "occurrenceID")]
organismQuantity <- organismQuantity[which(!is.na(organismQuantity$organismQuantity)),]
organismQuantity <- organismQuantity %>% 
  mutate(measurementType = "relative abundance",
         measurementUnit = "",
         measurementTypeID = "http://vocab.nerc.ac.uk/collection/S06/current/S0600020/",
         measurementUnitID = "") %>%
  rename(measurementValue = organismQuantity)

# Bind the separate measurements together into one file  
mof <- rbind(totalOfSamples, start_barometric_pressure_num, start_dissolved_oxygen_num, 
             start_salinity_num, start_temperature_num, start_wind_speed_num, gear_size,
             alternate_station_code, organismQuantity)
head(mof)
 measurementValue                                eventID
1               18 Station_95_Date_09JAN1997:14:35:00.000
2              103 Station_95_Date_18AUG2000:11:02:00.000
3              401 Station_96_Date_28JUN2005:08:41:00.000
4               35 Station_96_Date_23AUG2006:11:47:00.000
5               57 Station_96_Date_17OCT2006:14:23:00.000
6                5 Station_96_Date_19FEB1996:10:27:00.000
                                               measurementType measurementUnit measurementTypeID
1 Total number of samples used to calculate relative abundance                                  
2 Total number of samples used to calculate relative abundance                                  
3 Total number of samples used to calculate relative abundance                                  
4 Total number of samples used to calculate relative abundance                                  
5 Total number of samples used to calculate relative abundance                                  
6 Total number of samples used to calculate relative abundance                                  
  measurementUnitID occurrenceID
1                               
2                               
3                               
4                               
5                               
6                               
tail(mof)
       measurementValue                                 eventID    measurementType measurementUnit
334336        0.0000000 Station_217_Date_03APR2003:13:28:00.000 relative abundance                
334337        0.0000000 Station_217_Date_24FEB2006:10:12:00.000 relative abundance                
334338        0.1428571 Station_217_Date_23JUN2001:12:28:00.000 relative abundance                
334339        0.0000000 Station_212_Date_23MAY1990:10:43:00.000 relative abundance                
334340        0.1224490 Station_212_Date_24JUL1990:09:34:00.000 relative abundance                
334341        0.0000000 Station_212_Date_21MAR2001:11:52:00.000 relative abundance                
                                              measurementTypeID measurementUnitID
334336 http://vocab.nerc.ac.uk/collection/S06/current/S0600020/                  
334337 http://vocab.nerc.ac.uk/collection/S06/current/S0600020/                  
334338 http://vocab.nerc.ac.uk/collection/S06/current/S0600020/                  
334339 http://vocab.nerc.ac.uk/collection/S06/current/S0600020/                  
334340 http://vocab.nerc.ac.uk/collection/S06/current/S0600020/                  
334341 http://vocab.nerc.ac.uk/collection/S06/current/S0600020/                  
                                                        occurrenceID
334336 Station_217_Date_03APR2003:13:28:00.000_Litopenaeus_setiferus
334337 Station_217_Date_24FEB2006:10:12:00.000_Litopenaeus_setiferus
334338 Station_217_Date_23JUN2001:12:28:00.000_Litopenaeus_setiferus
334339 Station_212_Date_23MAY1990:10:43:00.000_Litopenaeus_setiferus
334340 Station_212_Date_24JUL1990:09:34:00.000_Litopenaeus_setiferus
334341 Station_212_Date_21MAR2001:11:52:00.000_Litopenaeus_setiferus
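
The event-level measurement blocks above all follow an identical pattern, so they could also be generated with a small helper function. A minimal sketch (make_measurement() is a hypothetical name, not part of the original script; the defaults mirror the empty strings used above):

# Hypothetical helper that builds one long-format measurement table
make_measurement <- function(df, column, type, unit = "",
                             typeID = "", unitID = "") {
  df %>%
    select(measurementValue = all_of(column), eventID) %>%
    filter(!is.na(measurementValue)) %>%
    mutate(measurementType = type,
           measurementUnit = unit,
           measurementTypeID = typeID,
           measurementUnitID = unitID,
           occurrenceID = "")
}

# Example: rebuild the gear_size table from above in one call
gear_size <- make_measurement(event, "gear_size", "gear size", "meters",
                              "http://vocab.nerc.ac.uk/collection/P01/current/MTHAREA1/",
                              "http://vocab.nerc.ac.uk/collection/P06/current/ULAA/")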

# Write out the file
write.csv(mof, file = (paste0(event[1,]$datasetID, "_mof_", lubridate::today(),".csv")), fileEncoding = "UTF-8", row.names = F, na = "")

2.1.5 Cleaning up Event and Occurrence files

Now that we have all of our files created, we can clean up the Event and Occurrence files to remove the columns that do not follow Darwin Core. We had to leave the extra bits in before because we needed them to create the emof file above.

event <- event[c("samplingProtocol","locality","waterBody","decimalLatitude","decimalLongitude",
                 "eventDate","sampleSizeValue","minimumDepthInMeters",
                 "maximumDepthInMeters","type","modified","language","license","institutionCode",
                 "ownerInstitutionCode","coordinateUncertaintyInMeters",
                 "geodeticDatum", "georeferenceProtocol","country","countryCode","stateProvince",
                 "datasetID","eventID","sampleSizeUnit","samplingEffort")]
head(event)
  samplingProtocol                locality   waterBody decimalLatitude decimalLongitude
1        Bag Seine Mission-Aransas Estuary Aransas Bay        28.13472        -97.00833
2        Bag Seine Mission-Aransas Estuary Aransas Bay        28.13528        -97.00722
3        Bag Seine Mission-Aransas Estuary Aransas Bay        28.13444        -96.99611
4        Bag Seine Mission-Aransas Estuary Aransas Bay        28.13444        -96.99611
5        Bag Seine Mission-Aransas Estuary Aransas Bay        28.13444        -96.99611
6        Bag Seine Mission-Aransas Estuary Aransas Bay        28.13472        -96.99583
            eventDate sampleSizeValue minimumDepthInMeters maximumDepthInMeters  type   modified language
1 1997-01-09 14:35:00            0.03                  0.0                  0.6 Event 2022-01-09       en
2 2000-08-18 11:02:00            0.03                  0.1                  0.5 Event 2022-01-09       en
3 2005-06-28 08:41:00            0.03                  0.4                  0.6 Event 2022-01-09       en
4 2006-08-23 11:47:00            0.03                  0.2                  0.4 Event 2022-01-09       en
5 2006-10-17 14:23:00            0.03                  0.7                  0.8 Event 2022-01-09       en
6 1996-02-19 10:27:00            0.03                  0.1                  0.3 Event 2022-01-09       en
                                                     license institutionCode ownerInstitutionCode
1 http://creativecommons.org/publicdomain/zero/1.0/legalcode            TPWD                 HARC
2 http://creativecommons.org/publicdomain/zero/1.0/legalcode            TPWD                 HARC
3 http://creativecommons.org/publicdomain/zero/1.0/legalcode            TPWD                 HARC
4 http://creativecommons.org/publicdomain/zero/1.0/legalcode            TPWD                 HARC
5 http://creativecommons.org/publicdomain/zero/1.0/legalcode            TPWD                 HARC
6 http://creativecommons.org/publicdomain/zero/1.0/legalcode            TPWD                 HARC
  coordinateUncertaintyInMeters geodeticDatum georeferenceProtocol       country countryCode stateProvince
1                           100         WGS84         Handheld GPS United States          US         Texas
2                           100         WGS84         Handheld GPS United States          US         Texas
3                           100         WGS84         Handheld GPS United States          US         Texas
4                           100         WGS84         Handheld GPS United States          US         Texas
5                           100         WGS84         Handheld GPS United States          US         Texas
6                           100         WGS84         Handheld GPS United States          US         Texas
                              datasetID                                eventID sampleSizeUnit
1 TPWD_HARC_Texas_Aransas_Bay_Bag_Seine Station_95_Date_09JAN1997:14:35:00.000       hectares
2 TPWD_HARC_Texas_Aransas_Bay_Bag_Seine Station_95_Date_18AUG2000:11:02:00.000       hectares
3 TPWD_HARC_Texas_Aransas_Bay_Bag_Seine Station_96_Date_28JUN2005:08:41:00.000       hectares
4 TPWD_HARC_Texas_Aransas_Bay_Bag_Seine Station_96_Date_23AUG2006:11:47:00.000       hectares
5 TPWD_HARC_Texas_Aransas_Bay_Bag_Seine Station_96_Date_17OCT2006:14:23:00.000       hectares
6 TPWD_HARC_Texas_Aransas_Bay_Bag_Seine Station_96_Date_19FEB1996:10:27:00.000       hectares
  samplingEffort
1    120 seconds
2    120 seconds
3    120 seconds
4    120 seconds
5    120 seconds
6    120 seconds

write.csv(event, file = paste0(event[1,]$datasetID, "_event_", lubridate::today(),".csv"), fileEncoding = "UTF-8", row.names = F, na = "")                    

occurrence <- occurrence[c("vernacularName","eventID","occurrenceStatus","basisOfRecord",
                           "scientificName","scientificNameID","kingdom","phylum","class",
                           "order","family","genus",
                           "scientificNameAuthorship","taxonRank", "organismQuantity",
                           "organismQuantityType", "occurrenceID","collectionCode")]
head(occurrence)
  vernacularName                                eventID occurrenceStatus    basisOfRecord
1  Alligator gar Station_95_Date_09JAN1997:14:35:00.000           Absent HumanObservation
2  Alligator gar Station_95_Date_18AUG2000:11:02:00.000           Absent HumanObservation
3  Alligator gar Station_96_Date_28JUN2005:08:41:00.000           Absent HumanObservation
4  Alligator gar Station_96_Date_23AUG2006:11:47:00.000           Absent HumanObservation
5  Alligator gar Station_96_Date_17OCT2006:14:23:00.000           Absent HumanObservation
6  Alligator gar Station_96_Date_19FEB1996:10:27:00.000           Absent HumanObservation
        scientificName                          scientificNameID  kingdom   phylum       class
1 Atractosteus spatula urn:lsid:marinespecies.org:taxname:279822 Animalia Chordata Actinopteri
2 Atractosteus spatula urn:lsid:marinespecies.org:taxname:279822 Animalia Chordata Actinopteri
3 Atractosteus spatula urn:lsid:marinespecies.org:taxname:279822 Animalia Chordata Actinopteri
4 Atractosteus spatula urn:lsid:marinespecies.org:taxname:279822 Animalia Chordata Actinopteri
5 Atractosteus spatula urn:lsid:marinespecies.org:taxname:279822 Animalia Chordata Actinopteri
6 Atractosteus spatula urn:lsid:marinespecies.org:taxname:279822 Animalia Chordata Actinopteri
             order        family        genus scientificNameAuthorship taxonRank organismQuantity
1 Lepisosteiformes Lepisosteidae Atractosteus         (Lacepède, 1803)   Species                0
2 Lepisosteiformes Lepisosteidae Atractosteus         (Lacepède, 1803)   Species                0
3 Lepisosteiformes Lepisosteidae Atractosteus         (Lacepède, 1803)   Species                0
4 Lepisosteiformes Lepisosteidae Atractosteus         (Lacepède, 1803)   Species                0
5 Lepisosteiformes Lepisosteidae Atractosteus         (Lacepède, 1803)   Species                0
6 Lepisosteiformes Lepisosteidae Atractosteus         (Lacepède, 1803)   Species                0
  organismQuantityType                                                occurrenceID        collectionCode
1   Relative Abundance Station_95_Date_09JAN1997:14:35:00.000_Atractosteus_spatula Aransas Bay Bag Seine
2   Relative Abundance Station_95_Date_18AUG2000:11:02:00.000_Atractosteus_spatula Aransas Bay Bag Seine
3   Relative Abundance Station_96_Date_28JUN2005:08:41:00.000_Atractosteus_spatula Aransas Bay Bag Seine
4   Relative Abundance Station_96_Date_23AUG2006:11:47:00.000_Atractosteus_spatula Aransas Bay Bag Seine
5   Relative Abundance Station_96_Date_17OCT2006:14:23:00.000_Atractosteus_spatula Aransas Bay Bag Seine
6   Relative Abundance Station_96_Date_19FEB1996:10:27:00.000_Atractosteus_spatula Aransas Bay Bag Seine
                           
write.csv(occurrence, file = paste0(event[1,]$datasetID, "_occurrence_",lubridate::today(),".csv"), fileEncoding = "UTF-8", row.names = F, na = "")

2.2 Salmon Ocean Ecology Data

2.2.1 Intro

One of the goals of the Hakai Institute and the Canadian Integrated Ocean Observing System (CIOOS) is to facilitate Open Science and FAIR (findable, accessible, interoperable, reusable) ecological and oceanographic data. In a concerted effort to adopt or establish how best to do that, several Hakai and CIOOS staff attended an Integrated Ocean Observing System (IOOS) Code Sprint in Ann Arbor, Michigan, from October 7 to 11, 2019, to discuss how to implement FAIR data principles for biological data collected in the marine environment.

Darwin Core is a highly structured data format that standardizes data table relations and vocabularies, and defines field names. Darwin Core defines three table types: event, occurrence, and measurementOrFact. This intuitively captures the way most ecologists conduct their research: typically, a survey (event) is conducted, and measurements, counts, or observations (collectively measurementOrFacts) are made regarding a specific habitat or species (occurrence).

In the following script I demonstrate how I go about converting a subset of the data collected by the Hakai Institute Juvenile Salmon Program, and I discuss challenges, solutions, pros and cons, and when and what it's worthwhile to convert to Darwin Core.

The conversion of a dataset to Darwin Core is much easier if your data are already tidy (normalized), meaning you represent your data in separate tables that reflect the hierarchical and related nature of your observations. If your data are not already in a consistent and structured format, the conversion will likely be arduous and unintuitive.

2.2.2 event

The first step is to consider what you will define as an event in your data set. I defined the capture of fish using a purse seine net as the event. Therefore, each row in the event table is one deployment of a seine net and is assigned a unique eventID.

My process for conversion was to make a new table called event and map the standard Darwin Core column names to pre-existing columns that serve the same purpose in my original seine_data table and populate the other required fields.

# Packages used in this section
library(tidyverse)
library(lubridate)

event <- tibble(eventID = survey_seines$seine_id,
                eventDate = date(survey_seines$survey_date),
                decimalLatitude = survey_seines$lat,
                decimalLongitude = survey_seines$long,
                geodeticDatum = "EPSG:4326 WGS84",
                minimumDepthInMeters = 0,
                maximumDepthInMeters = 9, # seine depth is 9 m
                samplingProtocol = "http://dx.doi.org/10.21966/1.566666" # DOI for the Hakai Salmon Data Package, which contains the sampling protocol as well as the complete data package
               ) 

write_csv(event, here::here("datasets", "hakai_salmon_data", "event.csv"))

2.2.3 occurrence

Next you’ll want to determine what constitutes an occurrence for your data set. Because each event captures fish, I consider each fish to be an occurrence. Therefore, the unit of observation (each row) in the occurrence table is a fish. To link each occurrence to an event, you need to include the eventID column for every occurrence so that you know which seine (event) each fish (occurrence) came from. You must also provide a globally unique identifier for each occurrence. I already have a locally unique identifier for each fish in the original fish_data table called ufn. To make it globally unique, I prepend the organization and research program metadata to the ufn column.

#TODO: Include bycatch data as well

## make table long first
seines_total_long <- survey_seines %>% 
  select(seine_id, so_total, pi_total, cu_total, co_total, he_total, ck_total) %>% 
  pivot_longer(-seine_id, names_to = "scientificName", values_to = "n")

seines_total_long$scientificName <- recode(seines_total_long$scientificName, so_total = "Oncorhynchus nerka", pi_total = "Oncorhynchus gorbuscha", cu_total = "Oncorhynchus keta", co_total = "Oncorhynchus kisutch", ck_total = "Oncorhynchus tshawytscha", he_total = "Clupea pallasii") 

seines_taken_long <- survey_seines %>%
  select(seine_id, so_taken, pi_taken, cu_taken, co_taken, he_taken, ck_taken) %>% 
  pivot_longer(-seine_id, names_to = "scientificName", values_to = "n_taken") 

seines_taken_long$scientificName <- recode(seines_taken_long$scientificName, so_taken = "Oncorhynchus nerka", pi_taken = "Oncorhynchus gorbuscha", cu_taken = "Oncorhynchus keta", co_taken = "Oncorhynchus kisutch", ck_taken = "Oncorhynchus tshawytscha", he_taken = "Clupea pallasii") 

## remove records that have already been assigned an ID  
seines_long <-  full_join(seines_total_long, seines_taken_long, by = c("seine_id", "scientificName")) %>% 
  drop_na() %>% 
  mutate(n_not_taken = n - n_taken) %>% #so_total includes the number taken so I subtract n_taken to get n_not_taken
  select(-n_taken, -n) %>% 
  filter(n_not_taken > 0)

all_fish_caught <-
  seines_long[rep(seq.int(1, nrow(seines_long)), seines_long$n_not_taken), 1:3] %>% 
  select(-n_not_taken) %>% 
  mutate(prefix = "hakai-jsp-",
         suffix = 1:nrow(.),
         occurrenceID = paste0(prefix, suffix)
  ) %>% 
  select(-prefix, -suffix)
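
The rep(seq.int(...)) expansion above can also be written with tidyr::uncount(), which replicates each row according to a weights column. A sketch of the equivalent:

# Equivalent row expansion; uncount() drops the weights column automatically
all_fish_caught <- seines_long %>%
  uncount(n_not_taken) %>%
  mutate(occurrenceID = paste0("hakai-jsp-", row_number()))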


# Change species names to full Scientific names 
latin <- fct_recode(fish_data$species, "Oncorhynchus nerka" = "SO", "Oncorhynchus gorbuscha" = "PI", "Oncorhynchus keta" = "CU", "Oncorhynchus kisutch" = "CO", "Clupea pallasii" = "HE", "Oncorhynchus tshawytscha" = "CK") %>% 
  as.character()

fish_retained_data <- fish_data %>% 
  mutate(scientificName = latin) %>% 
  select(-species) %>% 
  mutate(prefix = "hakai-jsp-",
         occurrenceID = paste0(prefix, ufn)) %>% 
  select(-semsp_id, -prefix, -ufn, -fork_length_field, -fork_length, -weight, -weight_field)

occurrence <- bind_rows(all_fish_caught, fish_retained_data) %>% 
  mutate(basisOfRecord = "HumanObservation",
         occurrenceStatus = "present") %>% 
  rename(eventID = seine_id)

For each occurrence of the six fish species that I caught, I need to match the species name that I provide with the official scientificName that is part of the World Register of Marine Species database: http://www.marinespecies.org/

# I went directly to the WoRMS website (http://www.marinespecies.org/) to download the full taxonomic levels for the salmon species I have and put the WoRMS output (species_matched.xls) table in this project directory, which is read in below and joined with the occurrence table

species_matched <- readxl::read_excel(here::here("datasets", "hakai_salmon_data", "raw_data", "species_matched.xls"))

occurrence <- left_join(occurrence, species_matched, by = c("scientificName" = "ScientificName")) %>% 
  select(occurrenceID, basisOfRecord, scientificName, eventID, occurrenceStatus, Kingdom, Phylum, Class, Order, Family, Genus, Species)

write_csv(occurrence, here::here("datasets", "hakai_salmon_data", "occurrence.csv"))
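
If you prefer not to download the match file by hand, the worrms R package can fetch the same records from the WoRMS API. A sketch (the returned column names, e.g. scientificname and kingdom, differ from the Excel download, so the join above would need adjusting):

# Programmatic alternative to the manual WoRMS match tool download
library(worrms)
worms_records <- wm_records_names(name = unique(occurrence$scientificName)) %>%
  bind_rows() %>%  # one tibble per queried name
  select(scientificname, lsid, kingdom, phylum, class, order, family, genus)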

2.2.4 measurementOrFact

To convert all your measurements or facts from your normal format to Darwin Core you essentially need to put all your measurements into one column called measurementType and a corresponding column called measurementValue. This standardizes the column names in the measurementOrFact table. There are a number of predefined measurementTypes listed in the NERC vocabulary database that should be used where possible, though I found it difficult to navigate that page to find the correct measurementType.

Here I convert the fork length and weight measurements, which relate to both an event and an occurrence, and record them under the measurementTypes fork length and mass.

fish_data$weight <- coalesce(fish_data$weight, fish_data$weight_field)
fish_data$fork_length <- coalesce(fish_data$fork_length, fish_data$fork_length_field)

fish_length <- fish_data %>%
  mutate(occurrenceID = paste0("hakai-jsp-", ufn)) %>% 
  select(occurrenceID, eventID = seine_id, fork_length, weight) %>% 
  mutate(measurementType = "fork length", measurementValue = fork_length) %>% 
  select(eventID, occurrenceID, measurementType, measurementValue) %>% 
  mutate(measurementUnit = "millimeters",
         measurementUnitID = "http://vocab.nerc.ac.uk/collection/P06/current/UXMM/")

fish_weight <- fish_data %>% 
  mutate(occurrenceID = paste0("hakai-jsp-", ufn)) %>% 
  select(occurrenceID, eventID = seine_id, fork_length, weight) %>% 
  mutate(measurementType = "mass", measurementValue = weight) %>% 
  select(eventID, occurrenceID, measurementType, measurementValue) %>% 
  mutate(measurementUnit = "grams",
         measurementUnitID = "http://vocab.nerc.ac.uk/collection/P06/current/UGRM/")

measurementOrFact <- bind_rows(fish_length, fish_weight) %>% 
  drop_na(measurementValue)

rm(fish_length, fish_weight)

write_csv(measurementOrFact, here::here("datasets", "hakai_salmon_data", "measurementOrFact.csv"))
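
The two nearly identical blocks above could also be collapsed with tidyr::pivot_longer(). A sketch under the same assumptions (column names from fish_data, unit URIs as used above):

# Build both measurement types in one pass
measurementOrFact <- fish_data %>%
  mutate(occurrenceID = paste0("hakai-jsp-", ufn)) %>%
  select(eventID = seine_id, occurrenceID,
         `fork length` = fork_length, mass = weight) %>%
  pivot_longer(c(`fork length`, mass),
               names_to = "measurementType",
               values_to = "measurementValue",
               values_drop_na = TRUE) %>%
  mutate(measurementUnit = ifelse(measurementType == "fork length",
                                  "millimeters", "grams"),
         measurementUnitID = ifelse(measurementType == "fork length",
                                    "http://vocab.nerc.ac.uk/collection/P06/current/UXMM/",
                                    "http://vocab.nerc.ac.uk/collection/P06/current/UGRM/"))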

2.3 Hakai Seagrass

2.3.1 Setup

This section clears the workspace, checks the working directory, installs packages if required, loads packages, and loads the necessary datasets.

library("knitr")
# Knitr global chunk options
opts_chunk$set(message = FALSE,
               warning = FALSE,
               error   = FALSE)

2.3.1.1 Load Data

First load the seagrass density survey data, set variable classes, and have a quick look

# Load density data
seagrassDensity <- 
  read.csv("raw_data/seagrass_density_survey.csv",
           colClasses = "character") %>%
  mutate(date             = ymd(date),
         depth            = as.numeric(depth),
         transect_dist    = factor(transect_dist),
         collected_start  = ymd_hms(collected_start),
         collected_end    = ymd_hms(collected_end),
         density          = as.numeric(density),
         density_msq      = as.numeric(density_msq),
         canopy_height_cm = as.numeric(canopy_height_cm),
         flowering_shoots = as.numeric(flowering_shoots)) %T>%
  glimpse()
## Rows: 3,031
## Columns: 22
## $ X                <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "1…
## $ organization     <chr> "HAKAI", "HAKAI", "HAKAI", "HAKAI", "HAKAI", "HAKAI",…
## $ work_area        <chr> "CALVERT", "CALVERT", "CALVERT", "CALVERT", "CALVERT"…
## $ project          <chr> "MARINEGEO", "MARINEGEO", "MARINEGEO", "MARINEGEO", "…
## $ survey           <chr> "PRUTH_BAY", "PRUTH_BAY", "PRUTH_BAY", "PRUTH_BAY", "…
## $ site_id          <chr> "PRUTH_BAY_INTERIOR4", "PRUTH_BAY_INTERIOR4", "PRUTH_…
## $ date             <date> 2016-05-13, 2016-05-13, 2016-05-13, 2016-05-13, 2016…
## $ sampling_bout    <chr> "4", "4", "4", "4", "4", "4", "4", "6", "6", "6", "6"…
## $ dive_supervisor  <chr> "Zach", "Zach", "Zach", "Zach", "Zach", "Zach", "Zach…
## $ collector        <chr> "Derek", "Derek", "Derek", "Derek", "Derek", "Derek",…
## $ hakai_id         <chr> "2016-05-13_PRUTH_BAY_INTERIOR4_0", "2016-05-13_PRUTH…
## $ sample_type      <chr> "seagrass_density", "seagrass_density", "seagrass_den…
## $ depth            <dbl> 6.0, 6.0, 6.0, 6.0, 5.0, 6.0, 6.0, 9.1, 9.0, 8.9, 9.0…
## $ transect_dist    <fct> 0, 5, 10, 15, 20, 25, 30, 10, 15, 20, 25, 30, 0, 5, 1…
## $ collected_start  <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ collected_end    <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ density          <dbl> 13, 10, 18, 22, 16, 31, 9, 5, 6, 6, 6, 3, 13, 30, 23,…
## $ density_msq      <dbl> 208, 160, 288, 352, 256, 496, 144, 80, 96, 96, 96, 48…
## $ canopy_height_cm <dbl> 60, 63, 80, 54, 55, 50, 63, 85, 80, 90, 95, 75, 60, 6…
## $ flowering_shoots <dbl> NA, NA, NA, NA, NA, NA, NA, 0, 0, 0, 0, 0, NA, NA, NA…
## $ comments         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ quality_log      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

Next, load the habitat survey data, and same as above, set variable classes as necessary, and have a quick look.

# load habitat data, set variable classes, have a quick look
seagrassHabitat <-
  read.csv("raw_data/seagrass_habitat_survey.csv",
           colClasses = "character") %>%  
  mutate(date            = ymd(date),
         depth           = as.numeric(depth),
         hakai_id        = str_pad(hakai_id, 5, pad = "0"),
         transect_dist   = factor(transect_dist),
         collected_start = ymd_hms(collected_start),
         collected_end   = ymd_hms(collected_end)) %T>%
  glimpse()
## Rows: 2,052
## Columns: 28
## $ X                <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "1…
## $ organization     <chr> "HAKAI", "HAKAI", "HAKAI", "HAKAI", "HAKAI", "HAKAI",…
## $ work_area        <chr> "CALVERT", "CALVERT", "CALVERT", "CALVERT", "CALVERT"…
## $ project          <chr> "MARINEGEO", "MARINEGEO", "MARINEGEO", "MARINEGEO", "…
## $ survey           <chr> "CHOKED_PASS", "CHOKED_PASS", "CHOKED_PASS", "CHOKED_…
## $ site_id          <chr> "CHOKED_PASS_INTERIOR6", "CHOKED_PASS_INTERIOR6", "CH…
## $ date             <date> 2017-11-22, 2017-11-22, 2017-11-22, 2017-11-22, 2017…
## $ sampling_bout    <chr> "6", "6", "6", "6", "6", "6", "1", "1", "1", "1", "1"…
## $ dive_supervisor  <chr> "gillian", "gillian", "gillian", "gillian", "gillian"…
## $ collector        <chr> "zach", "zach", "zach", "zach", "zach", "zach", "kyle…
## $ hakai_id         <chr> "10883", "2017-11-22_CHOKED_PASS_INTERIOR6_5 - 10", "…
## $ sample_type      <chr> "seagrass_habitat", "seagrass_habitat", "seagrass_hab…
## $ depth            <dbl> 9.2, 9.4, 9.3, 9.0, 9.2, 9.2, 3.4, 3.4, 3.4, 3.4, 3.4…
## $ transect_dist    <fct> 0 - 5, 10-May, 15-Oct, 15 - 20, 20 - 25, 25 - 30, 0 -…
## $ collected_start  <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ collected_end    <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ bag_uid          <chr> "10883", NA, NA, "11094", NA, "11182", "7119", NA, "7…
## $ bag_number       <chr> "3557", NA, NA, "3520", NA, "903", "800", NA, "318", …
## $ density_range    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ substrate        <chr> "sand,shell hash", "sand,shell hash", "sand,shell has…
## $ patchiness       <chr> "< 1", "< 1", "02-Jan", "< 1", "< 1", "< 1", "< 1", "…
## $ adj_habitat_1    <chr> "seagrass", "seagrass", "seagrass", "seagrass", "seag…
## $ adj_habitat_2    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ sample_collected <chr> "TRUE", "FALSE", "FALSE", "TRUE", "FALSE", "TRUE", "T…
## $ vegetation_1     <chr> NA, NA, NA, NA, NA, NA, "des", NA, "des", NA, NA, NA,…
## $ vegetation_2     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ comments         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ quality_log      <chr> "1: Flowering shoots 0 for entire transects", NA, NA,…

Finally, load coordinate data for surveys, and subset necessary variables

coordinates <- 
  read.csv("raw_data/seagrassCoordinates.csv",
           colClasses = c("Point.Name" = "character")) %>%
  select(Point.Name, Decimal.Lat, Decimal.Long) %T>%
  glimpse()
## Rows: 70
## Columns: 3
## $ Point.Name   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ Decimal.Lat  <dbl> 52.06200, 52.05200, 51.92270, 51.92500, 51.80900, 51.8090…
## $ Decimal.Long <dbl> -128.4120, -128.4030, -128.4648, -128.4540, -128.2360, -1…

2.3.1.2 Merge Datasets

Now that all the datasets have been loaded and briefly formatted, we'll join together the habitat and density surveys, along with their coordinates.

The seagrass density surveys collect data at discrete points (i.e. 5 metres) along the transects, while the habitat surveys collect data over sections (i.e. 0 - 5 metres) of the transects. In order to fit these two surveys together, we'll narrow the habitat surveys from a range to a point so the locations will match. Based on how the habitat data are collected, the point the habitat survey is applied to will be the distance at the end of the swath (i.e. 10-15 m becomes 15 m). To account for there being no preceding distance, the 0 m distance will use the 0-5 m section of the survey.

First, we'll make the necessary transformations to the habitat dataset.

# Reformat seagrassHabitat to merge with seagrassDensity
## replicate 0 - 5m transect dist to match with 0m in density survey;
## rest of habitat bins can map one to one with density (ie. 5 - 10m -> 10m)
seagrass0tmp <- 
  seagrassHabitat %>%
  filter(transect_dist %in% c("0 - 5", "0 - 2.5")) %>%
  mutate(transect_dist = factor(0))

## collapse various levels to match with seagrassDensity transect_dist
seagrassHabitat$transect_dist <- 
  fct_collapse(seagrassHabitat$transect_dist,
               "5" = c("0 - 5", "2.5 - 7.5"),
               "10" = c("5 - 10", "7.5 - 12.5"),
               "15" = c("10 - 15", "12.5 - 17.5"),
               "20" = c("15 - 20", "17.5 - 22.5"),
               "25" = c("20 - 25", "22.5 - 27.5"),
               "30" = c("25 - 30", "27.5 - 30"))

## merge seagrass0tmp into seagrassHabitat to account for 0m samples,
## set class for date, datetime variables
seagrassHabitatFull <- 
  rbind(seagrass0tmp, seagrassHabitat) %>%
  filter(transect_dist != "0 - 2.5")  %>% # already captured in seagrass0tmp 
  droplevels(.)  # remove now unused factor levels

With the distances of the habitat and density surveys now corresponding, we can merge these two datasets, plus their coordinates, together, combining redundant fields and removing unnecessary ones.

# Merge seagrassHabitatFull with seagrassDensity, then coordinates
seagrass <- 
  full_join(seagrassHabitatFull, seagrassDensity, 
            by = c("organization",
                   "work_area",
                   "project",
                   "survey",
                   "site_id", 
                   "date",
                   "transect_dist")) %>%
  # merge hakai_id.x and hakai_id.y into single variable field;
  # use combination of date, site_id, transect_dist, and field uid (hakai_id 
  # when present)
  mutate(field_uid = ifelse(sample_collected == TRUE, hakai_id.x, "NA"),
         hakai_id = paste(date, "HAKAI:CALVERT", site_id, transect_dist, sep = ":"),
         # below, aggregate metadata that didn't merge naturally (ie. due to minor 
         # differences in watch time or depth gauges)
         dive_supervisor = dive_supervisor.x,
         collected_start = ymd_hms(ifelse(is.na(collected_start.x),
                                          collected_start.y, 
                                          collected_start.x)),
         collected_end   = ymd_hms(ifelse(is.na(collected_end.x),
                                          collected_end.y,
                                          collected_end.x)),
         depth_m         = ifelse(is.na(depth.x), depth.y, depth.x),
         sampling_bout   = sampling_bout.x) %>%
  left_join(., coordinates,  # add coordinates
            by = c("site_id" = "Point.Name")) %>%
  select( - c(X.x, X.y, hakai_id.x, hakai_id.y,  # remove unnecessary variables
              dive_supervisor.x, dive_supervisor.y,
              collected_start.x, collected_start.y,
              collected_end.x, collected_end.y,
              depth.x, depth.y,
              sampling_bout.x, sampling_bout.y)) %>%
  mutate(density_msq = as.character(density_msq),
         canopy_height_cm = as.character(canopy_height_cm),
         flowering_shoots = as.character(flowering_shoots),
         depth_m = as.character(depth_m)) %T>%
  glimpse()
## Rows: 3,743
## Columns: 38
## $ organization     <chr> "HAKAI", "HAKAI", "HAKAI", "HAKAI", "HAKAI", "HAKAI",…
## $ work_area        <chr> "CALVERT", "CALVERT", "CALVERT", "CALVERT", "CALVERT"…
## $ project          <chr> "MARINEGEO", "MARINEGEO", "MARINEGEO", "MARINEGEO", "…
## $ survey           <chr> "CHOKED_PASS", "CHOKED_PASS", "CHOKED_PASS", "PRUTH_B…
## $ site_id          <chr> "CHOKED_PASS_INTERIOR6", "CHOKED_PASS_EDGE1", "CHOKED…
## $ date             <date> 2017-11-22, 2017-05-19, 2017-05-19, 2017-07-03, 2017…
## $ collector.x      <chr> "zach", "kyle", NA, "tanya", "zach", "zach", "zach", …
## $ sample_type.x    <chr> "seagrass_habitat", "seagrass_habitat", "seagrass_hab…
## $ transect_dist    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bag_uid          <chr> "10883", "7119", "7031", "2352", "10255", "10023", "1…
## $ bag_number       <chr> "3557", "800", "301", "324", "3506", "3555", "3534", …
## $ density_range    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ substrate        <chr> "sand,shell hash", "sand,shell hash", "sand,shell has…
## $ patchiness       <chr> "< 1", "< 1", "< 1", "< 1", "< 1", "05-Apr", "04-Mar"…
## $ adj_habitat_1    <chr> "seagrass", "sand", "standing kelp", "seagrass", "sea…
## $ adj_habitat_2    <chr> NA, NA, NA, NA, NA, NA, "standing kelp", NA, NA, NA, …
## $ sample_collected <chr> "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE…
## $ vegetation_1     <chr> NA, "des", "des", "zm", "des", NA, NA, NA, NA, NA, NA…
## $ vegetation_2     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "…
## $ comments.x       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ quality_log.x    <chr> "1: Flowering shoots 0 for entire transects", NA, NA,…
## $ collector.y      <chr> "derek", "ondine", "ondine", "derek", "derek", "derek…
## $ sample_type.y    <chr> "seagrass_density", "seagrass_density", "seagrass_den…
## $ density          <dbl> 4, 10, 6, 13, 6, 1, 2, 6, 21, 3, 7, 4, 3, 14, 17, 11,…
## $ density_msq      <chr> "64", "160", "96", "208", "96", "16", "32", "96", "33…
## $ canopy_height_cm <chr> "80", "80", "110", "60", "125", "100", "100", "125", …
## $ flowering_shoots <chr> "0", NA, NA, NA, NA, NA, NA, "0", NA, NA, NA, "0", NA…
## $ comments.y       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ quality_log.y    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "…
## $ field_uid        <chr> "10883", "07119", "07031", "02352", "10255", "10023",…
## $ hakai_id         <chr> "2017-11-22:HAKAI:CALVERT:CHOKED_PASS_INTERIOR6:0", "…
## $ dive_supervisor  <chr> "gillian", "gillian,gillian.sadlierbrown", "gillian,g…
## $ collected_start  <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ collected_end    <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ depth_m          <chr> "9.2", "3.4", "4.8", "2.4", "5.3", "5.6", "4.4", "2.5…
## $ sampling_bout    <chr> "6", "1", "3", "5", "5", "3", "5", "2", "1", "2", "6"…
## $ Decimal.Lat      <dbl> 51.67482, 51.67882, 51.67493, 51.64532, 51.67349, 51.…
## $ Decimal.Long     <dbl> -128.1195, -128.1148, -128.1237, -128.1193, -128.1180…

2.3.2 Convert Data to Darwin Core - Extended Measurement or Fact format

The Darwin Core ExtendedMeasurementOrFact (eMoF) extension bases records around a core event (rather than an occurrence, as in standard Darwin Core), allowing additional measurement variables to be associated with the occurrence data.

2.3.2.1 Add Event ID and Occurrence ID variables to dataset

As this dataset will be updated annually, rather than using surrogate keys (i.e. autogenerated, for example with the uuid package) for event and occurrence IDs, here we will use natural keys made up of a concatenation of survey date, transect location, observation distance, and sample ID (for occurrenceID, when a sample is present).

# create and populate eventID variable
## currently only event is used, but additional surveys and abiotic data
## are associated with parent events that may be included at a later date
seagrass$eventID <- seagrass$hakai_id

# create and populate occurrenceID; combine eventID with transect_dist 
# and field_uid
## in the event of <NA> field_uid, no sample was collected, but
## measurements and occurrence are still taken; no further subsamples
## are associated with <NA> field_uids
seagrass$occurrenceID <- 
  with(seagrass, 
       paste(eventID, transect_dist, field_uid, sep = ":"))
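
The merged dataset contains some duplicated rows stemming from database conflicts (they are removed with distinct() in the table-building steps below). A quick count (a sketch) makes the scale of that known issue visible before continuing:

# Number of fully duplicated rows that distinct() will drop below
sum(duplicated(seagrass))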

2.3.2.2 Create Event, Occurrence, and eMoF tables

Now that we’ve created eventIDs and occurrenceIDs to connect all the variables together, we can begin to create the Event, Occurrence, and extended Measurement or Fact tables necessary for a Darwin Core-compliant dataset.

2.3.2.2.1 Event Table
# subset seagrass to create event table
seagrassEvent <-
  seagrass %>%
  distinct %>%  # some duplicates in data stemming from database conflicts
  select(date,
         Decimal.Lat, Decimal.Long, transect_dist,
         depth_m, eventID) %>%
  rename(eventDate                     = date,
         decimalLatitude               = Decimal.Lat,
         decimalLongitude              = Decimal.Long,
         coordinateUncertaintyInMeters = transect_dist,
         maximumDepthInMeters          = depth_m) %>%
  mutate(minimumDepthInMeters = maximumDepthInMeters,  # one depth per point
         geodeticDatum  = "WGS84",
         samplingEffort = "30 metre transect") %T>% glimpse
## Rows: 3,659
## Columns: 9
## $ eventDate                     <date> 2017-11-22, 2017-05-19, 2017-05-19, 201…
## $ decimalLatitude               <dbl> 51.67482, 51.67882, 51.67493, 51.64532, …
## $ decimalLongitude              <dbl> -128.1195, -128.1148, -128.1237, -128.11…
## $ coordinateUncertaintyInMeters <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ maximumDepthInMeters          <chr> "9.2", "3.4", "4.8", "2.4", "5.3", "5.6"…
## $ eventID                       <chr> "2017-11-22:HAKAI:CALVERT:CHOKED_PASS_IN…
## $ minimumDepthInMeters          <chr> "9.2", "3.4", "4.8", "2.4", "5.3", "5.6"…
## $ geodeticDatum                 <chr> "WGS84", "WGS84", "WGS84", "WGS84", "WGS…
## $ samplingEffort                <chr> "30 metre transect", "30 metre transect"…
# save event table to csv
write.csv(seagrassEvent, "processed_data/hakaiSeagrassDwcEvent.csv",
          row.names = FALSE)  # drop row numbers so no stray X column is written
2.3.2.2.2 Occurrence Table
# subset seagrass to create occurrence table
seagrassOccurrence <-
  seagrass %>%
  distinct %>%  # some duplicates in data stemming from database conflicts
  select(eventID, occurrenceID) %>%
  mutate(basisOfRecord = "HumanObservation",
         scientificName   = "Zostera subg. Zostera marina",
         occurrenceStatus = "present")

# Taxonomic name matching
# in addition to the above metadata, Darwin Core requires further
# taxonomic data that can be acquired through the WoRMS register.
## Load taxonomic info, downloaded via WoRMS tool
# zmWorms <- 
#   read.delim("raw_data/zmworms_matched.txt",
#              header = TRUE,
#              nrows  = 1)

zmWorms <- wm_record(id = 145795)
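# note (aside): wm_record() from the worrms package fetches a single record by
# AphiaID; a name-based lookup (which may return multiple matches to filter)
# would be e.g.
#   zmWorms <- worrms::wm_records_names("Zostera marina")[[1]]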

# join WoRMS name with seagrassOccurrence created above
seagrassOccurrence <- 
  full_join(seagrassOccurrence, zmWorms, 
            by = c("scientificName" = "scientificname")) %>%
  select(eventID, occurrenceID, basisOfRecord, scientificName, occurrenceStatus, AphiaID,
         url, authority, status, unacceptreason, taxonRankID, rank,
         valid_AphiaID, valid_name, valid_authority, parentNameUsageID,
         kingdom, phylum, class, order, family, genus, citation, lsid,
         isMarine, match_type, modified) %T>%
  glimpse
## Rows: 3,659
## Columns: 27
## $ eventID           <chr> "2017-11-22:HAKAI:CALVERT:CHOKED_PASS_INTERIOR6:0", …
## $ occurrenceID      <chr> "2017-11-22:HAKAI:CALVERT:CHOKED_PASS_INTERIOR6:0:0:…
## $ basisOfRecord     <chr> "HumanObservation", "HumanObservation", "HumanObserv…
## $ scientificName    <chr> "Zostera subg. Zostera marina", "Zostera subg. Zoste…
## $ occurrenceStatus  <chr> "present", "present", "present", "present", "present…
## $ AphiaID           <int> 145795, 145795, 145795, 145795, 145795, 145795, 1457…
## $ url               <chr> "https://www.marinespecies.org/aphia.php?p=taxdetail…
## $ authority         <chr> "Linnaeus, 1753", "Linnaeus, 1753", "Linnaeus, 1753"…
## $ status            <chr> "accepted", "accepted", "accepted", "accepted", "acc…
## $ unacceptreason    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ taxonRankID       <int> 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 22…
## $ rank              <chr> "Species", "Species", "Species", "Species", "Species…
## $ valid_AphiaID     <int> 145795, 145795, 145795, 145795, 145795, 145795, 1457…
## $ valid_name        <chr> "Zostera subg. Zostera marina", "Zostera subg. Zoste…
## $ valid_authority   <chr> "Linnaeus, 1753", "Linnaeus, 1753", "Linnaeus, 1753"…
## $ parentNameUsageID <int> 370435, 370435, 370435, 370435, 370435, 370435, 3704…
## $ kingdom           <chr> "Plantae", "Plantae", "Plantae", "Plantae", "Plantae…
## $ phylum            <chr> "Tracheophyta", "Tracheophyta", "Tracheophyta", "Tra…
## $ class             <chr> "Magnoliopsida", "Magnoliopsida", "Magnoliopsida", "…
## $ order             <chr> "Alismatales", "Alismatales", "Alismatales", "Alisma…
## $ family            <chr> "Zosteraceae", "Zosteraceae", "Zosteraceae", "Zoster…
## $ genus             <chr> "Zostera", "Zostera", "Zostera", "Zostera", "Zostera…
## $ citation          <chr> "WoRMS (2023). Zostera subg. Zostera marina Linnaeus…
## $ lsid              <chr> "urn:lsid:marinespecies.org:taxname:145795", "urn:ls…
## $ isMarine          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ match_type        <chr> "exact", "exact", "exact", "exact", "exact", "exact"…
## $ modified          <chr> "2008-12-09T10:03:16.140Z", "2008-12-09T10:03:16.140…
# save occurrence table to csv
write.csv(seagrassOccurrence, "processed_data/hakaiSeagrassDwcOccurrence.csv")
2.3.2.2.3 Extended MeasurementOrFact table
seagrassMof <-
  seagrass %>%
  # select variables for eMoF table
  select(date,
         eventID, survey, site_id, transect_dist,
         substrate, patchiness, adj_habitat_1, adj_habitat_2,
         vegetation_1, vegetation_2,
         density_msq, canopy_height_cm, flowering_shoots) %>%
  # split substrate into two variables (currently holds two substrate types in the same variable)
  separate(substrate, sep = ",", into = c("substrate_1", "substrate_2")) %>%
  # change variable names to match the NERC database (or to be more descriptive where none exist)
  rename(measurementDeterminedDate   = date,
         SubstrateTypeA              = substrate_1,
         SubstrateTypeB              = substrate_2,
         BarePatchLengthWithinSeagrass = patchiness,
         PrimaryAdjacentHabitat      = adj_habitat_1,
         SecondaryAdjacentHabitat    = adj_habitat_2,
         PrimaryAlgaeSp              = vegetation_1,
         SecondaryAlgaeSp            = vegetation_2,
         BedAbund                    = density_msq,
         CanopyHeight                = canopy_height_cm,
         FloweringBedAbund           = flowering_shoots) %>%  
  # reformat variables into DwC MeasurementOrFact format
  # (single values variable, with measurement type, unit, etc. variables)
  pivot_longer(-c(measurementDeterminedDate, eventID, survey, site_id, transect_dist),
               names_to  = "measurementType",
               values_to = "measurementValue",
               # the source columns mix numeric and character types, so they
               # must be coerced to a single type when lengthening
               values_transform = list(measurementValue = as.character)) %>% 
  # use measurement type to fill in remainder of variables relating to 
  # NERC vocabulary and metadata fields
  mutate(
    measurementTypeID = case_when(
      measurementType == "BedAbund" ~ "http://vocab.nerc.ac.uk/collection/P01/current/SDBIOL02/",
      measurementType == "CanopyHeight" ~ "http://vocab.nerc.ac.uk/collection/P01/current/OBSMAXLX/",
      # measurementType == "BarePatchWithinSeagrass" ~ "",
      measurementType == "FloweringBedAbund" ~ "http://vocab.nerc.ac.uk/collection/P01/current/SDBIOL02/"),
    measurementUnit = case_when(
      measurementType == "BedAbund" ~ "Number per square metre",
      measurementType == "CanopyHeight" ~ "Centimetres",
      measurementType == "BarePatchhLengthWithinSeagrass" ~ "Metres",
      measurementType == "FloweringBedAbund" ~ "Number per square metre"),
    measurementUnitID = case_when(
      measurementType == "BedAbund" ~ "http://vocab.nerc.ac.uk/collection/P06/current/UPMS/",
      measurementType == "CanopyHeight" ~ "http://vocab.nerc.ac.uk/collection/P06/current/ULCM/",
      measurementType == "BarePatchhLengthWithinSeagrass" ~ "http://vocab.nerc.ac.uk/collection/P06/current/ULAA/2/",
      measurementType == "FloweringBedAbund" ~ "http://vocab.nerc.ac.uk/collection/P06/current/UPMS/"),
    measurementAccuracy = case_when(
      measurementType == "CanopyHeight" ~ 5),
    measurementMethod = case_when(
      measurementType == "BedAbund" ~ "25cmx25cm quadrat count",
      measurementType == "CanopyHeight" ~ "in situ with ruler",
      measurementType == "BarePatchhLengthWithinSeagrass" ~ "estimated along transect line",
      measurementType == "FloweringBedAbund" ~ "25cmx25cm quadrat count")) %>%
  select(eventID, measurementDeterminedDate, measurementType, measurementValue,
         measurementTypeID, measurementUnit, measurementUnitID, measurementAccuracy,
         measurementMethod) %T>%
#  select(!c(survey, site_id, transect_dist)) %T>%
  glimpse()
## Rows: 37,430
## Columns: 9
## $ eventID                   <chr> "2017-11-22:HAKAI:CALVERT:CHOKED_PASS_INTERI…
## $ measurementDeterminedDate <date> 2017-11-22, 2017-11-22, 2017-11-22, 2017-11…
## $ measurementType           <chr> "SubstrateTypeA", "SubstrateTypeB", "BarePat…
## $ measurementValue          <chr> "sand", "shell hash", "< 1", "seagrass", NA,…
## $ measurementTypeID         <chr> NA, NA, NA, NA, NA, NA, NA, "http://vocab.ne…
## $ measurementUnit           <chr> NA, NA, NA, NA, NA, NA, NA, "Number per squa…
## $ measurementUnitID         <chr> NA, NA, NA, NA, NA, NA, NA, "http://vocab.ne…
## $ measurementAccuracy       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 5, NA, NA, N…
## $ measurementMethod         <chr> NA, NA, NA, NA, NA, NA, NA, "25cmx25cm quadr…
# save eMoF table to csv
write.csv(seagrassMof, "processed_data/hakaiSeagrassDwcEmof.csv")
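One small note on the exports: by default write.csv() prepends a column of R row names, which shows up as an unnamed extra column in the Darwin Core tables. If that causes trouble downstream, passing row.names = FALSE keeps the output to the mapped columns only (an optional tweak, shown here for the eMoF table):

# optional: suppress R's row-name column in the exported table
write.csv(seagrassMof, "processed_data/hakaiSeagrassDwcEmof.csv", row.names = FALSE)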

2.3.3 Session Info

Print session information below in case it is needed for future reference.

# Print Session Info for future reference
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] worrms_0.4.3    magrittr_2.0.3  knitr_1.42      here_1.0.1     
##  [5] lubridate_1.9.3 forcats_1.0.0   stringr_1.5.0   dplyr_1.1.4    
##  [9] purrr_1.0.2     readr_2.1.4     tidyr_1.3.0     tibble_3.2.1   
## [13] ggplot2_3.4.4   tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.0  xfun_0.39         bslib_0.4.2       colorspace_2.1-0 
##  [5] vctrs_0.6.5       generics_0.1.3    htmltools_0.5.7   yaml_2.3.7       
##  [9] utf8_1.2.3        rlang_1.1.1       jquerylib_0.1.4   pillar_1.9.0     
## [13] httpcode_0.3.0    glue_1.6.2        withr_2.5.0       bit64_4.0.5      
## [17] readxl_1.4.3      lifecycle_1.0.3   munsell_0.5.0     gtable_0.3.4     
## [21] cellranger_1.1.0  evaluate_0.21     tzdb_0.3.0        fastmap_1.1.1    
## [25] curl_5.2.0        parallel_4.1.1    fansi_1.0.4       triebeard_0.4.1  
## [29] urltools_1.7.3    Rcpp_1.0.11       scales_1.3.0      cachem_1.0.8     
## [33] vroom_1.6.3       jsonlite_1.8.4    bit_4.0.5         hms_1.1.3        
## [37] digest_0.6.31     stringi_1.7.12    bookdown_0.37     grid_4.1.1       
## [41] rprojroot_2.0.4   cli_3.6.1         tools_4.1.1       sass_0.4.6       
## [45] crul_1.4.0        crayon_1.5.2      pkgconfig_2.0.3   timechange_0.2.0 
## [49] rmarkdown_2.21    rstudioapi_0.15.0 R6_2.5.1          compiler_4.1.1

2.4 Trawl Data

One of the more common datasets that can be standardized to Darwin Core and integrated within OBIS is catch data from e.g. a trawl sampling event or a zooplankton net tow. Of special concern here are datasets that include both a total (species-specific) catch weight and individual measurements (for a subset of the overall data). In this case, through our standardization to Darwin Core, we want to ensure that data users understand that the individual measurements are part of, or a subset of, the overall (species-specific) record, while at the same time ensuring that data providers do not duplicate occurrence records in OBIS.

The GitHub issue related to this application can be found here

2.4.1 Workflow Overview

In our current setup, this relationship between the overall catch data and subsetted information is provided in the resourceRelationship extension. This extension cannot currently be harvested by GBIF. The required terms for this extension are resourceID, relatedResourceID, resourceRelationshipID and relationshipOfResource. The relatedResourceID here refers to the object of the relationship, whereas the resourceID refers to the subject of the relationship:

  • resourceRelationshipID: a unique identifier for the relationship between one resource (the subject) and another (relatedResource, object).
  • resourceID: a unique identifier for the resource that is the subject of the relationship.
  • relatedResourceID: a unique identifier for the resource that is the object of the relationship.
  • relationshipOfResource: The relationship of the subject (identified by the resourceID) to the object (relatedResourceID). The relationshipOfResource is a free text field.
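To make these four terms concrete, a single relationship record might look like the following (illustrative values only, built with the ID conventions constructed later in this section):

resourceRelationshipID: IYS:GoA2019:Stn1:trawl:occ:1:rr:001
resourceID:             IYS:GoA2019:Stn1:trawl:sample1:occ
relationshipOfResource: is a subset of
relatedResourceID:      IYS:GoA2019:Stn1:trawl:occ:1

Here the specimen occurrence (the subject) is declared a subset of the total catch occurrence (the object).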

A few resources have been published to OBIS that contain the resourceRelationship extension (examples). Here, I’ll lay out the process and coding used for the Trawl Catch and Species Abundance from the 2019 Gulf of Alaska International Year of the Salmon Expedition. In the following code chunks some details are omitted to improve the readability - the overall code to standardize the catch data can be found here. This dataset includes species-specific total catch data at multiple stations (sampling events). From each catch, individual measurements were also taken. Depending on the number of individuals caught in the trawl, these measurements covered either all individuals of a species or only a subset (when large numbers of individuals were caught).

In this specific data record, we created a single Event Core with three extensions: an occurrence extension, measurement or fact extension, and the resourceRelationship extension. However, in this walk-through I’ll only touch on the Event Core, occurrence extension and resourceRelationship extension.

The trawl data is part of a larger project collecting various data types related to salmon ocean ecology. Therefore, in our Event Core we nested information related to the sampling event in the appropriate layer (TO DO: include a visual representation of the schema). Prior to creating the Event Core, we ensured that e.g. dates and times followed the ISO 8601 standard and were converted to the correct time zone.

# Time is recorded numerically (1037 instead of 10:37), so need to change these columns:
trawl2019$END_DEPLOYMENT_TIME <- substr(as.POSIXct(sprintf("%04.0f", trawl2019$END_DEPLOYMENT_TIME), format = "%H%M"), 12, 16)
trawl2019$BEGIN_RETRIEVAL_TIME <- substr(as.POSIXct(sprintf("%04.0f", trawl2019$BEGIN_RETRIEVAL_TIME), format = "%H%M"), 12, 16)
# Additionally, the vessel time is recorded in 'Vladivostok' according to the metadata tab. This has to be converted to UTC.  
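# (format_iso_8601() below comes from the parsedate package; str_replace() from stringr)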
trawl2019 <- trawl2019 %>%
  mutate(eventDate_start = format_iso_8601(as.POSIXct(paste(EVENT_DATE_START, END_DEPLOYMENT_TIME),
                                                      tz = "Asia/Vladivostok")),
         eventDate_start = str_replace(eventDate_start, "\\+00:00", "Z"),
         eventDate_finish = format_iso_8601(as.POSIXct(paste(EVENT_DATE_FINISH, BEGIN_RETRIEVAL_TIME),
                                                       tz = "Asia/Vladivostok")),
         eventDate_finish = str_replace(eventDate_finish, "\\+00:00", "Z"),
         eventDate = paste(eventDate_start, eventDate_finish, sep = "/"),
         project = "IYS",
         cruise = paste(project, "GoA2019", sep = ":"), 
         station = paste(cruise, TOW_NUMBER, sep=":Stn"),
         trawl = paste(station, "trawl", sep=":"))
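The resulting eventDate values are ISO 8601 intervals in UTC; a quick way to eyeball one (the timestamp shown is illustrative):

# e.g. "2019-02-20T01:37:00Z/2019-02-20T04:12:00Z" (illustrative)
head(trawl2019$eventDate, 1)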

Then we created the various layers of our Event Core. These layers/data frames are built from two separate source datasets: one containing the overall catch data, and one containing the specimen data:

trawl2019_allCatch <- read_excel(here("Trawl", "2019", "raw_data", 
                                      "2019_GoA_Fish_Trawl_catchdata.xlsx"), sheet = "CATCH_FINAL") %>%
  mutate(project = "IYS",
         cruise = paste(project, "GoA2019", sep = ":"),
         station = paste(cruise, `TOW_NUMBER (number)`, sep = ":Stn"),
         trawl = paste(station, "trawl", sep = ":"))

trawl2019_specimen <- read_excel(here("Trawl", "2019", "raw_data", "2019_GoA_Fish_Specimen_data.xlsx"), 
                                 sheet = "SPECIMEN_FINAL") %>%
  mutate(project = "IYS",
         cruise = paste(project, "GoA2019", sep = ":"),
         station = paste(cruise, TOW_NUMBER, sep = ":Stn"),
         trawl = paste(station, "trawl", sep = ":"),
         sample = paste(trawl, "sample", sep = ":"),
         sample = paste(sample, row_number(), sep = ""))

Next we created the Event Core, ensuring that we connect the data to the right layer (i.e. date and time should be connected to the layer associated with the sampling event). Please note that because we are creating multiple layers and nesting information, and later combining the different tables, some cells end up populated with NA. These have to be removed prior to publishing the Event Core through the IPT.

trawl2019_project <- trawl2019 %>%
  select(eventID = project) %>%
  distinct(eventID) %>%
  mutate(type = "project")

trawl2019_cruise <- trawl2019 %>% 
  select(eventID = cruise,
         parentEventID = project) %>% 
  distinct(eventID, .keep_all = TRUE) %>%
  mutate(type = "cruise")

trawl2019_station <- trawl2019 %>% 
  select(eventID = station,
         parentEventID = cruise) %>% 
  distinct(eventID, .keep_all = TRUE) %>%
  mutate(type = "station")

# The coordinates associated to the trawl need to be presented in a LINESTRING.
# END_LONGITUDE_DD needs to be inverted (has to be between -180 and 180, inclusive). 
trawl2019_coordinates <- trawl2019 %>%
  select(eventID = trawl,
         START_LATITUDE_DD,
         longitude,
         END_LATITUDE_DD,
         END_LONGITUDE_DD) %>%
  mutate(END_LONGITUDE_DD = END_LONGITUDE_DD * -1,
         footprintWKT = paste("LINESTRING (", longitude, START_LATITUDE_DD, ",", 
                              END_LONGITUDE_DD, END_LATITUDE_DD, ")")) 
trawl2019_linestring <- obistools::calculate_centroid(trawl2019_coordinates$footprintWKT)
trawl2019_linestring <- cbind(trawl2019_coordinates, trawl2019_linestring) %>%
  select(eventID, footprintWKT, decimalLatitude, decimalLongitude, coordinateUncertaintyInMeters)

trawl2019_trawl <- trawl2019 %>% 
  select(eventID = trawl,
         parentEventID = station,
         eventDate,
         year,
         month,
         day) %>%
  mutate(minimumDepthInMeters = 0, # headrope was at the surface
         maximumDepthInMeters = trawl2019$MOUTH_OPENING_HEIGHT,
         samplingProtocol = "midwater trawl", # when available add DOI to paper here
         locality = case_when(
           trawl2019$EVENT_SUB_TYPE == "Can EEZ" ~ "Canadian EEZ"),
         locationID = case_when(
           trawl2019$EVENT_SUB_TYPE == "Can EEZ" ~ "http://marineregions.org/mrgid/8493")) %>%
  left_join(trawl2019_linestring, by = "eventID") %>% 
  distinct(eventID, .keep_all = TRUE) %>%
    mutate(type = "midwater trawl")
  
trawl2019_sample <- trawl2019_specimen %>%
  select(eventID = sample,
         parentEventID = trawl) %>%
  distinct(eventID, .keep_all = TRUE) %>%
  mutate(type = "individual sample")

trawl2019_event <- bind_rows(trawl2019_project, 
                             trawl2019_cruise,
                             trawl2019_station,
                             trawl2019_trawl,
                             trawl2019_sample) 

# Remove NAs from the Event Core:
trawl2019_event <- sapply(trawl2019_event, as.character)
trawl2019_event[is.na(trawl2019_event)] <- ""
trawl2019_event <- as.data.frame(trawl2019_event)
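An equivalent, more pipe-friendly way to blank out the NAs (a sketch with the same effect, keeping everything as character without the intermediate matrix):

# alternative: blank out NAs while staying in a data frame
trawl2019_event <- trawl2019_event %>%
  mutate(across(everything(), ~ tidyr::replace_na(as.character(.x), "")))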

TO DO: Add visual of e.g. the top 10 rows of the Event Core.

Now that we have created the Event Core, we can create the occurrence extension. To do this, we create two separate occurrence data tables: one with the occurrence data for the total catch, and one with the specimen data. The Occurrence extension is then created by combining these two data frames. Personally, I prefer to re-order it so it makes visual sense to me (nesting the specimen occurrence records under their respective overall catch data).

# wm_records_names() returns a list of data frames (one per name), so
# bind_rows() collapses it into a single data frame before joining
trawl2019_allCatch_worms <- worrms::wm_records_names(unique(trawl2019_allCatch$scientificname)) %>% bind_rows()
trawl2019_occ <- left_join(trawl2019_allCatch, trawl2019_allCatch_worms, by = "scientificname") %>%
  rename(eventID = trawl,
         specificEpithet = species,
         scientificNameAuthorship = authority,
         taxonomicStatus = status,
         taxonRank = rank,
         scientificName = scientificname,
         scientificNameID = lsid,
         individualCount = `CATCH_COUNT (pieces)(**includes Russian expansion for some species)`,
         occurrenceRemarks = COMMENTS) %>%
  mutate(occurrenceID = paste(eventID, "occ", sep = ":"),
         occurrenceID = paste(occurrenceID, row_number(), sep = ":"),
         occurrenceStatus = "present",
         sex = "")

trawl2019_catch_ind_worms <- worrms::wm_records_names(unique(trawl2019_catch_ind$scientificname)) %>% bind_rows()
trawl2019_catch_ind_occ <- left_join(trawl2019_catch_ind, trawl2019_catch_ind_worms, by = "scientificname") %>%
  rename(scientificNameAuthorship = authority,
         taxonomicStatus = status,
         taxonRank = rank,
         scientificName = scientificname,
         scientificNameID = lsid) %>%
  mutate(occurrenceID = paste(eventID, "occ", sep = ":"),
         occurrenceStatus = "present",
         individualCount = 1)

# Combine the two occurrence data frames (the *_fnl objects are the finalized
# versions of trawl2019_occ and trawl2019_catch_ind_occ from steps omitted here):
trawl2019_occ_ext <- dplyr::bind_rows(trawl2019_occ_fnl, trawl2019_catch_ind_fnl)

# To re-order the occurrenceID, use following code:
order <- stringr::str_sort(trawl2019_occ_ext$occurrenceID, numeric=TRUE)
trawl2019_occ_ext <- trawl2019_occ_ext[match(order, trawl2019_occ_ext$occurrenceID),] %>%
  mutate(basisOfRecord = "HumanObservation")

TO DO: Add visual of e.g. the top 10 rows of the Occurrence extension.

Please note that in the overall species-specific occurrence data frame, individualCount was not included. This term should not be used for abundance studies; moreover, to avoid confusion and the appearance that the specimen records are additional observations on top of the overall catch record, individualCount was left blank for the overall catch data.

A resourceRelationship extension is created to further highlight that the individual samples in the occurrence extension are part of a larger overall catch that is also listed in the occurrence extension. In this extension, we highlight that the specimen occurrence records are a subset of the overall catch data through the field relationshipOfResource. Each of these relationships gets a unique resourceRelationshipID.

trawl_resourceRelationship <- trawl2019_occ_ext %>%
  select(eventID, occurrenceID, scientificName) %>%
  mutate(resourceID = ifelse(grepl("sample", trawl2019_occ_ext$occurrenceID), trawl2019_occ_ext$occurrenceID, NA)) %>%
  mutate(eventID = gsub(":sample.*", "", trawl2019_occ_ext$eventID)) %>%
  group_by(eventID, scientificName) %>%
  filter(n() != 1) %>%
  ungroup()

trawl_resourceRelationship <- trawl_resourceRelationship %>%
  mutate(relatedResourceID = ifelse(grepl("sample", trawl_resourceRelationship$occurrenceID), NA, trawl_resourceRelationship$occurrenceID)) %>%
  mutate(relationshipOfResource = ifelse(!is.na(resourceID), "is a subset of", NA)) %>%
  dplyr::arrange(eventID, scientificName) %>%
  fill(relatedResourceID) %>%
  filter(!is.na(resourceID))

order <- stringr::str_sort(trawl_resourceRelationship$resourceID, numeric = TRUE)
trawl_resourceRelationship <- trawl_resourceRelationship[match(order, trawl_resourceRelationship$resourceID),]

trawl_resourceRelationship <- trawl_resourceRelationship %>%
  mutate(resourceRelationshipID = paste(relatedResourceID, "rr", sep = ":"),
         ID = sprintf("%03d", row_number()),
         resourceRelationshipID = paste(resourceRelationshipID, ID, sep = ":")) %>%
  select(eventID, resourceRelationshipID, resourceID, relationshipOfResource, relatedResourceID)

TO DO: Add visual of e.g. the top 10 rows of the ResourceRelationship extension.

2.4.2 FAQ

Q1. Why not use the terms associatedOccurrence or associatedTaxa?

A. There seems to be a movement away from the term associatedOccurrence, as the resourceRelationship extension has a much broader use case; some issues raised on GitHub exemplify this, see e.g. here. associatedTaxa is used to provide identifiers or names of taxa and the associations of an Occurrence with them. This term is not apt for establishing relationships between taxa, only between specific Occurrences of an organism and other taxa. As noted on the TDWG website, “[…] Note that the ResourceRelationship class is an alternative means of representing associations, and with more detail.” See also e.g. this issue.

2.5 dataset-edna

By Diana LaScala-Gruenewald


2.5.1 Introduction

Rationale:

DNA derived data are increasingly being used to document taxon occurrences. To ensure these data are useful to the broadest possible community, GBIF published a guide entitled “Publishing DNA-derived data through biodiversity data platforms.” This guide is supported by the DNA derived data extension for Darwin Core, which incorporates MIxS terms into the Darwin Core standard.

This use case draws on both the guide and the extension to illustrate how to incorporate a DNA derived data extension file into a Darwin Core archive.
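In practice the conversion produces two tables keyed on occurrenceID: an Occurrence file and a DNA Derived Data extension file. A minimal sketch of that linkage in R (all values illustrative; DNA_sequence and target_gene are terms from the extension):

# illustrative linkage between the Occurrence core and the DNA Derived Data
# extension: the shared occurrenceID ties an ASV detection to its sequence
occurrence <- data.frame(
  occurrenceID   = "sample1_asv1",        # hypothetical key
  basisOfRecord  = "MaterialSample",
  scientificName = "Eukaryota"            # placeholder taxon match
)
dna_extension <- data.frame(
  occurrenceID = "sample1_asv1",          # same key as the core record
  DNA_sequence = "GCTACTACCGATTGAATG",    # illustrative sequence fragment
  target_gene  = "18S"
)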

For further information on this use case and the DNA Derived data extension in general, see the recording of the OBIS Webinar on Genetic Data.

Project abstract:

The example data employed in this use case are from marine filtered seawater samples collected at a nearshore station in Monterey Bay, California, USA. They were collected by CTD rosette and filtered by a peristaltic pump system. Subsequently, they underwent metabarcoding for the 18S V9 region. The resulting ASVs, their assigned taxonomy, and the metadata associated with their collection are the input data for the conversion scripts presented here.

A selection of samples from this collection were included in the publication “Environmental DNA reveals seasonal shifts and potential interactions in a marine community” which was published with open access in Nature Communications in 2020.

Contacts:

  • Francisco Chavez - Principal Investigator (chfr@mbari.org)
  • Kathleen Pitz - Research Associate (kpitz@mbari.org)
  • Diana LaScala-Gruenewald - Point of Contact (dianalg@mbari.org)

2.5.2 Published data

2.5.3 Repo structure

.
+-- README.md                   :Description of this repository
+-- LICENSE                     :Repository license
+-- .gitignore                  :Files and directories to be ignored by git
+-- environment.yml             :Conda environment configuration file for Binder
|
+-- raw
|   +-- asv_table.csv           :Source data containing ASV sequences and number of reads
|   +-- taxa_table.csv          :Source data containing taxon matches for each ASV
|   +-- metadata_table.csv      :Source data containing metadata about samples (e.g. collection information)
|
+-- src
|   +-- conversion_code.py      :Darwin Core mapping script
|   +-- conversion_code.ipynb   :Darwin Core mapping Jupyter Notebook
|   +-- WoRMS.py                :Functions for querying the World Register of Marine Species
|
+-- processed
|   +-- occurrence.csv          :Occurrence file, generated by conversion_code
|   +-- dna_extension.csv       :DNA Derived Data Extension file, generated by conversion_code

2.6 Passive Acoustic Monitoring

This is the process by which the SanctSound passive acoustic monitoring data are translated to Darwin Core.

2.6.1 Background

The information contained here describes the process by which the passive acoustic monitoring data, collected by the National Marine Sanctuaries Sanctuary Soundscape Monitoring Project (SanctSound), were mobilized to OBIS-USA for standardization and sharing to the global repositories OBIS and GBIF. Below we set the stage by clearly defining what an occurrence is in this specific case, then we perform the data translation to Darwin Core, including some data QC, for submission to OBIS-USA. Finally, we provide links to the data once they have been mobilized to OBIS-USA, OBIS, and GBIF.

2.6.2 Source datasets

The data used in this process were sourced from the following locations:

  • sound production: ERDDAP (https://coastwatch.pfeg.noaa.gov/erddap/search/index.html?page=1&itemsPerPage=1000&searchFor=sanctsound+%22Sound+Production%22)
  • sound propagation: Google Cloud Storage (https://console.cloud.google.com/storage/browser/noaa-passive-bioacoustic/sanctsound/products/sound_propagation_models;tab=objects)

2.6.3 Definitions

  • occurrence: Species x made an acoustic sound at y location on z day. All presence detections were binned to daily occurrences, so if an animal made multiple sounds during one day, it is recorded as one occurrence.
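To make the daily binning concrete, a minimal sketch (hypothetical detections table and column names) that collapses multiple detections per species, site, and day into a single occurrence record:

# hypothetical 'detections' table: one row per acoustic detection
daily_occurrences <- detections %>%
  dplyr::mutate(eventDate = as.Date(detection_time)) %>%  # bin detections to day
  dplyr::distinct(species, site, eventDate) %>%           # one row per species/site/day
  dplyr::mutate(occurrenceStatus = "present")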

2.6.4 File descriptions

  • 01_sound_production_to_presence.ipynb: Notebook to collect and combine the acoustically present records on Coastwatch ERDDAP. Merges in the WoRMS mapping table and creates an eventDate column. Writes out data/sanctsound_presence.zip.
  • 02_presence_to_occurrence.ipynb: Notebook to convert the presence file to an occurrence file. Creates occurrenceID. Reads in data/sanctsound_presence.zip. Writes out data/occurrence.zip.
  • 03_occurrence_coordinateUncertainty.ipynb: Notebook that gathers sound propagation modeling data from Google Cloud Storage and adds them to the occurrence table. Also does some initial QA/QC and creates an emof table. Reads in data/occurrence.zip and the sound propagation model data. Writes out data/emof.zip and data/occurrence_w_coordinateUncertainty.zip.
  • 04_data_fixes.ipynb: Notebook that fixes some errors found once the data were loaded to GBIF. Reads in and writes out data/occurrence_w_coordinateUncertainty.zip.
  • SanctSound_SpeciesLookupTable.csv: Species and frequency lookup table for occurrence records.
  • data/: Intermediary and final data files sent to OBIS-USA.

2.6.6 Key takeaways

  • For passive acoustic monitoring data, it is really important to define a coordinateUncertaintyInMeters. We defined this using the sound propagation modeling data, matching on frequency, site, and season (see the sketch at the end of this section).
  • The sound propagation data were massive and hard to work with on Google Cloud Storage, but gsutil made it easy to download all of the necessary files.
  • The metadata available from the ERDDAP source and the netCDF sound propagation data were immensely helpful for mapping to Darwin Core.
  • Extracting information from file names is a relatively easy process and can be beneficial when aligning to Darwin Core.
  • Performing Darwin Core checks before sending data to OBIS-USA helps reduce the amount of iteration with the node manager.
  • You can see the entire conversation for mobilizing these data at https://github.com/ioos/bio_data_guide/issues/147.
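As referenced in the first takeaway, here is a sketch of how a modeled detection radius can be attached as coordinateUncertaintyInMeters (hypothetical table and column names):

# hypothetical 'propagation' table: one modeled detection radius (in metres)
# per frequency/site/season combination
occurrence <- occurrence %>%
  dplyr::left_join(propagation, by = c("frequency", "site", "season")) %>%
  dplyr::mutate(coordinateUncertaintyInMeters = detection_radius_m)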