Chapter 2 Applications
This chapter contains a series of example applications to convert source data to the Darwin Core standard. You can find these examples (and more!) in the GitHub repository under the datasets/ directory.
2.1 Aligning Data to Darwin Core - Event Core with Extended Measurement or Fact
Abby Benson
January 9, 2022
2.1.1 General information about this notebook
Script to process the Texas Parks and Wildlife Department (TPWD) Aransas Bay bag seine data from the format used by the Houston Advanced Research Center (HARC) for bays in Texas. Taxonomy was processed using a separate script (TPWD_Taxonomy.R) using a taxa list pulled from the pdf “2009 Resource Monitoring Operations Manual”. All original data, processed data and scripts are stored on an item in USGS ScienceBase.
# Load some of the libraries
library(reshape2)
library(tidyverse)
library(readr)
# Load the data
BagSeine <- read.csv("https://www.sciencebase.gov/catalog/file/get/53a887f4e4b075096c60cfdd?f=__disk__6e%2F6a%2F67%2F6e6a678c41cf928e025fd30339789cc8b893a815&allowOpen=true", stringsAsFactors=FALSE, strip.white = TRUE)
Note that, if you have not already done so, you’ll need to either run the TPWD_Taxonomy.R script to prepare the taxaList file or upload the taxonomy file to the World Register of Marine Species Taxon Match Tool: https://www.marinespecies.org/aphia.php?p=match
2.1.2 Event file
To start we will create the Darwin Core Event file. This is the file that will have all the information about the sampling event, such as date, location, depth, and sampling protocol. Basically anything about the cruise or the way the sampling was done will go in this file. You can see all the Darwin Core terms that are part of the event file here: http://tools.gbif.org/dwca-validator/extension.do?id=dwc:Event.
The original format for these TPWD HARC files has all of the information associated with the event in the first approximately 50 columns, followed by all of the information about the occurrences as one column per species. We will need to start by limiting the data to the event information only.
event <- BagSeine[,1:47]
Next, several pieces of information need to be 1) added, like the geodeticDatum, 2) pieced together from multiple columns, like the datasetID, or 3) changed slightly, like the minimum and maximum depth.
event <- event %>%
  mutate(type = "Event",
         modified = lubridate::today(),
         language = "en",
         license = "http://creativecommons.org/publicdomain/zero/1.0/legalcode",
         institutionCode = "TPWD",
         ownerInstitutionCode = "HARC",
         coordinateUncertaintyInMeters = "100",
         geodeticDatum = "WGS84",
         georeferenceProtocol = "Handheld GPS",
         country = "United States",
         countryCode = "US",
         stateProvince = "Texas",
         datasetID = gsub(" ", "_", paste("TPWD_HARC_Texas", event$Bay, event$Gear_Type)),
         eventID = paste("Station", event$station_code, "Date", event$completion_dttm, sep = "_"),
         sampleSizeUnit = "hectares",
         CompDate = lubridate::mdy_hms(event$CompDate, tz="America/Chicago"),
         StartDate = lubridate::mdy_hms(event$StartDate, tz="America/Chicago"),
         minimumDepthInMeters = ifelse(start_shallow_water_depth_num < start_deep_water_depth_num,
                                       start_shallow_water_depth_num, start_deep_water_depth_num),
         maximumDepthInMeters = ifelse(start_deep_water_depth_num > start_shallow_water_depth_num,
                                       start_deep_water_depth_num, start_shallow_water_depth_num))
head(event[,48:64], n = 10)
type modified language license institutionCode
1 Event 2022-01-09 en http://creativecommons.org/publicdomain/zero/1.0/legalcode TPWD
2 Event 2022-01-09 en http://creativecommons.org/publicdomain/zero/1.0/legalcode TPWD
3 Event 2022-01-09 en http://creativecommons.org/publicdomain/zero/1.0/legalcode TPWD
4 Event 2022-01-09 en http://creativecommons.org/publicdomain/zero/1.0/legalcode TPWD
5 Event 2022-01-09 en http://creativecommons.org/publicdomain/zero/1.0/legalcode TPWD
6 Event 2022-01-09 en http://creativecommons.org/publicdomain/zero/1.0/legalcode TPWD
7 Event 2022-01-09 en http://creativecommons.org/publicdomain/zero/1.0/legalcode TPWD
8 Event 2022-01-09 en http://creativecommons.org/publicdomain/zero/1.0/legalcode TPWD
9 Event 2022-01-09 en http://creativecommons.org/publicdomain/zero/1.0/legalcode TPWD
10 Event 2022-01-09 en http://creativecommons.org/publicdomain/zero/1.0/legalcode TPWD
ownerInstitutionCode coordinateUncertaintyInMeters geodeticDatum georeferenceProtocol country
1 HARC 100 WGS84 Handheld GPS United States
2 HARC 100 WGS84 Handheld GPS United States
3 HARC 100 WGS84 Handheld GPS United States
4 HARC 100 WGS84 Handheld GPS United States
5 HARC 100 WGS84 Handheld GPS United States
6 HARC 100 WGS84 Handheld GPS United States
7 HARC 100 WGS84 Handheld GPS United States
8 HARC 100 WGS84 Handheld GPS United States
9 HARC 100 WGS84 Handheld GPS United States
10 HARC 100 WGS84 Handheld GPS United States
countryCode stateProvince datasetID eventID
1 US Texas TPWD_HARC_Texas_Aransas_Bay_Bag_Seine Station_95_Date_09JAN1997:14:35:00.000
2 US Texas TPWD_HARC_Texas_Aransas_Bay_Bag_Seine Station_95_Date_18AUG2000:11:02:00.000
3 US Texas TPWD_HARC_Texas_Aransas_Bay_Bag_Seine Station_96_Date_28JUN2005:08:41:00.000
4 US Texas TPWD_HARC_Texas_Aransas_Bay_Bag_Seine Station_96_Date_23AUG2006:11:47:00.000
5 US Texas TPWD_HARC_Texas_Aransas_Bay_Bag_Seine Station_96_Date_17OCT2006:14:23:00.000
6 US Texas TPWD_HARC_Texas_Aransas_Bay_Bag_Seine Station_96_Date_19FEB1996:10:27:00.000
7 US Texas TPWD_HARC_Texas_Aransas_Bay_Bag_Seine Station_96_Date_11JUN2001:14:12:00.000
8 US Texas TPWD_HARC_Texas_Aransas_Bay_Bag_Seine Station_96_Date_16MAR1992:09:46:00.000
9 US Texas TPWD_HARC_Texas_Aransas_Bay_Bag_Seine Station_96_Date_25SEP1996:11:28:00.000
10 US Texas TPWD_HARC_Texas_Aransas_Bay_Bag_Seine Station_96_Date_08MAY1997:13:20:00.000
sampleSizeUnit minimumDepthInMeters maximumDepthInMeters
1 hectares 0.0 0.6
2 hectares 0.1 0.5
3 hectares 0.4 0.6
4 hectares 0.2 0.4
5 hectares 0.7 0.8
6 hectares 0.1 0.3
7 hectares 0.4 0.5
8 hectares 0.0 0.4
9 hectares 0.3 0.7
10 hectares 0.4 0.6
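The minimum and maximum depth logic can also be sketched with base R's element-wise pmin() and pmax(). This is only an illustration on made-up depth values, assuming two numeric vectors that stand in for start_shallow_water_depth_num and start_deep_water_depth_num; it is not the code used to produce the output above.

```r
# Made-up depths standing in for start_shallow_water_depth_num and
# start_deep_water_depth_num (illustrative values only).
shallow <- c(0.0, 0.5, 0.4, NA)
deep    <- c(0.6, 0.1, 0.6, 0.3)

# pmin()/pmax() compare the two vectors element by element.
# With the default na.rm = FALSE a missing value on either side gives NA.
minimumDepthInMeters <- pmin(shallow, deep)
maximumDepthInMeters <- pmax(shallow, deep)

depths <- data.frame(shallow, deep, minimumDepthInMeters, maximumDepthInMeters)
print(depths)
```

Whichever column holds the smaller number ends up in minimumDepthInMeters, so the result does not depend on the two source columns always being in the expected order.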
For this dataset there was a start timestamp and end timestamp that we can use to identify the sampling effort which can be really valuable information for downstream users when trying to reuse data from multiple projects.
## Calculate duration of bag seine event
event$samplingEffort <- ""
for (i in 1:nrow(event)){
  event[i,]$samplingEffort <- abs(lubridate::as.duration(event[i,]$CompDate - event[i,]$StartDate))
}
event$samplingEffort <- paste(event$samplingEffort, "seconds", sep = " ")
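For larger tables the same duration can be computed without a loop. The sketch below uses base R's difftime() on two made-up timestamp vectors (start and end are hypothetical stand-ins for StartDate and CompDate); it assumes both are already parsed as date-times.

```r
# Made-up start and completion times (stand-ins for StartDate and CompDate).
start <- as.POSIXct(c("1997-01-09 14:33:00", "2000-08-18 11:00:00"),
                    tz = "America/Chicago")
end   <- as.POSIXct(c("1997-01-09 14:35:00", "2000-08-18 11:02:00"),
                    tz = "America/Chicago")

# difftime() is vectorised; units = "secs" fixes the unit so the number
# always means seconds, and abs() guards against swapped timestamps.
effort_seconds <- abs(as.numeric(difftime(end, start, units = "secs")))

# Attach the unit as text, as in the samplingEffort column.
samplingEffort <- paste(effort_seconds, "seconds")
print(samplingEffort)
```

Fixing the unit explicitly matters because difftime() otherwise picks seconds, minutes, hours or days depending on the size of the difference.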
Finally there were a few columns that were a direct match to a Darwin Core term and therefore just need to be renamed to follow the standard.
event <- event %>%
  rename(samplingProtocol = Gear_Type,
         locality = Estuary,
         waterBody = SubBay,
         decimalLatitude = Latitude,
         decimalLongitude = Longitude,
         sampleSizeValue = surface_area_num,
         eventDate = CompDate)
2.1.3 Occurrence file
The next file we need to create is the Occurrence file. This file includes all the information about the species that were observed. An occurrence in Darwin Core is the intersection of an organism at a time and a place. We have already done the work to identify the time and place in the event file so we don’t need to do that again here. What we do need to do is identify all the information about the organisms. Another piece of information that goes in here is basisOfRecord, which is a required field and has a controlled vocabulary. For the data we work with you’ll usually put HumanObservation or MachineObservation. If it’s eDNA data you’ll use MaterialSample. If your data are part of a museum collection you’ll use PreservedSpecimen.
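Because basisOfRecord has a controlled vocabulary, it can be worth checking the values before publishing. The sketch below is a minimal check on a made-up vector; allowed_basis lists only the four values mentioned above (the full Darwin Core vocabulary contains more), and all object names are hypothetical.

```r
# Only the four values discussed in the text; the full controlled
# vocabulary is longer, so extend this list as needed.
allowed_basis <- c("HumanObservation", "MachineObservation",
                   "MaterialSample", "PreservedSpecimen")

# Made-up column with one value in the wrong case.
basisOfRecord <- c("HumanObservation", "humanobservation", "PreservedSpecimen")

# Values that are present in the data but not in the allowed list.
bad_values <- setdiff(unique(basisOfRecord), allowed_basis)

if (length(bad_values) > 0) {
  message("Unexpected basisOfRecord values: ", paste(bad_values, collapse = ", "))
}
```

The comparison is case sensitive, which is why humanobservation is reported here.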
It is important to note that there is overlap in the Darwin Core terms that are “allowed” to be in the event file and in the occurrence file. This is because data can be submitted as “Occurrence Only”, where you don’t have a separate event file. In that case, the location and date information will need to be included in the occurrence file. Since we are formatting this dataset as a sampling event we will not include location and date information in the occurrence file. To see all the Darwin Core terms that can go in the occurrence file go here: https://tools.gbif.org/dwca-validator/extension.do?id=dwc:occurrence.
This dataset in its original format is in “wide format”. All that means is that data that we would expect to be encoded as values in the rows are instead column headers. We have to pull all the scientific names out of the column headers and turn them into actual values in the data.
occurrence <- melt(BagSeine, id=1:47, measure=48:109, variable.name="vernacularName", value.name="relativeAbundance")
You’ll notice when we did that step we went from 5481 obs (or rows) in the data to 334341 obs. We went from wide to long.
dim(BagSeine)
[1] 5481 109
dim(occurrence)
[1] 334341 49
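To make the wide-to-long idea concrete, here is a small sketch on a made-up two-station table using only base R. The column names mimic the species columns of the real file, but the values are invented, and stack() is used here in place of melt() purely to keep the example free of package dependencies.

```r
# Made-up wide table: one row per station, one column per species.
wide <- data.frame(station_code  = c(95, 96),
                   Alligator.gar = c(0, 0.2),
                   Blue.crab     = c(1.5, 0))

species_cols <- c("Alligator.gar", "Blue.crab")

# stack() puts all species values in one column (values) and the
# originating column name in another (ind), one species after the other.
stacked <- stack(wide[species_cols])

# Repeat the station identifiers once per species so rows stay aligned.
long <- data.frame(
  station_code      = rep(wide$station_code, times = length(species_cols)),
  vernacularName    = as.character(stacked$ind),
  relativeAbundance = stacked$values,
  stringsAsFactors  = FALSE
)

print(dim(wide))
print(dim(long))
print(long)
```

The long table has one row per station and species combination, so its row count is the number of stations multiplied by the number of species columns.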
Now as with the event file we have several pieces of information that need to be added or changed to make sure the data are following Darwin Core. We always want to include as much information as possible to make the data as reusable as possible.
occurrence <- occurrence %>%
  mutate(vernacularName = gsub("\\.",' ', vernacularName),
         eventID = paste("Station", station_code, "Date", completion_dttm, sep = "_"),
         occurrenceStatus = ifelse(relativeAbundance == 0, "Absent", "Present"),
         basisOfRecord = "HumanObservation",
         organismQuantityType = "Relative Abundance",
         collectionCode = paste(Bay, Gear_Type, sep = " "))
We will match the taxa list with our occurrence file data to bring in the taxonomic information that we pulled from WoRMS. To save time you’ll just import the processed taxa list which includes the taxonomic hierarchy and the required term scientificNameID which is one of the most important pieces of information to include for OBIS.
taxaList <- read.csv("https://www.sciencebase.gov/catalog/file/get/53a887f4e4b075096c60cfdd?f=__disk__49%2F0a%2F73%2F490a7337fa94039715809496b22f5d003b8a79a2&allowOpen=true", stringsAsFactors = FALSE)
## Merge taxaList with occurrence
occurrence <- merge(occurrence, taxaList, by = "vernacularName", all.x = TRUE)
## Test that all the vernacularNames found a match in taxaList_updated
Hmisc::describe(occurrence$scientificNameID)
n missing distinct
334341 0 61
lowest : urn:lsid:marinespecies.org:taxname:105792 urn:lsid:marinespecies.org:taxname:107034 urn:lsid:marinespecies.org:taxname:107379 urn:lsid:marinespecies.org:taxname:126983 urn:lsid:marinespecies.org:taxname:127089
highest: urn:lsid:marinespecies.org:taxname:367528 urn:lsid:marinespecies.org:taxname:396707 urn:lsid:marinespecies.org:taxname:421784 urn:lsid:marinespecies.org:taxname:422069 urn:lsid:marinespecies.org:taxname:443955
For that last line of code we are expecting to see no missing values for scientificNameID. Every row in the file should have a value in scientificNameID, which should be a WoRMS LSID that looks like this: urn:lsid:marinespecies.org:taxname:144531
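If some rows did come back without a scientificNameID, it helps to list which names failed to match. The sketch below shows one way to do that on made-up data; toy_occ and toy_taxa are hypothetical stand-ins for occurrence and taxaList, "Mystery fish" is an invented name, and the one LSID shown is the Alligator gar identifier that appears in this dataset.

```r
# Made-up occurrence rows; "Mystery fish" has no entry in the taxa list.
toy_occ <- data.frame(
  vernacularName = c("Alligator gar", "Alligator gar", "Mystery fish"),
  stringsAsFactors = FALSE
)

# Made-up one-row taxa list.
toy_taxa <- data.frame(
  vernacularName   = "Alligator gar",
  scientificNameID = "urn:lsid:marinespecies.org:taxname:279822",
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every occurrence row, filling NA where no match exists.
merged <- merge(toy_occ, toy_taxa, by = "vernacularName", all.x = TRUE)

# Names that did not pick up a scientificNameID.
unmatched <- unique(merged$vernacularName[is.na(merged$scientificNameID)])
print(unmatched)
```

An empty result means every vernacular name found a match; anything listed needs to be fixed in the taxa list before publishing.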
We need to create a unique ID for each row in the occurrence file. This is known as the occurrenceID and is a required term. The occurrenceID needs to be globally unique and needs to be permanent and kept in place if any updates to the dataset are made. You should not create brand new occurrenceIDs when you update a dataset. To facilitate this I like to build the occurrenceID from pieces of information available in the dataset to create a unique ID for each row in the occurrence file. For this dataset I used the eventID (Station + Date) plus the scientific name. This only works if there is only one scientific name per station per date, so if you have different ages or sexes of species at the same station and date this method of creating the occurrenceID won’t work for you.
occurrence$occurrenceID <- paste(occurrence$eventID, gsub(" ", "_",occurrence$scientificName), sep = "_")
occurrence[1,]$occurrenceID
[1] "Station_95_Date_09JAN1997:14:35:00.000_Atractosteus_spatula"
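Since this way of building the occurrenceID relies on each scientific name appearing only once per station and date, a quick duplicate check is a useful safeguard. The sketch below runs on a made-up vector of IDs in which the first and third entries are deliberately identical.

```r
# Made-up occurrenceIDs; the first and third are deliberately the same.
occurrenceID <- c(
  "Station_95_Date_09JAN1997:14:35:00.000_Atractosteus_spatula",
  "Station_95_Date_18AUG2000:11:02:00.000_Atractosteus_spatula",
  "Station_95_Date_09JAN1997:14:35:00.000_Atractosteus_spatula"
)

# duplicated() flags the second and later appearances of a value.
n_dups  <- sum(duplicated(occurrenceID))
dup_ids <- unique(occurrenceID[duplicated(occurrenceID)])

if (n_dups > 0) {
  message(n_dups, " duplicated occurrenceID(s) found")
}
```

If dup_ids is not empty, another distinguishing piece of information (for example life stage or sex) would need to be added to the ID.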
For the occurrence file we only have one column to rename. We could have avoided this step if we had named it organismQuantity up above, but I kept this to remind me what the data providers had called this.
occurrence <- occurrence %>%
  rename(organismQuantity = relativeAbundance)
2.1.4 Extended Measurement or Fact extension file
The final file we are going to create is the Extended Measurement or Fact extension (emof). This is a bit like a catch-all for any measurements or facts that are not captured in Darwin Core. Darwin Core does not have terms for things like temperature, salinity, gear type, cruise number, length, weight, etc. We are going to create a long format file where each of these is a set of rows in the extended measurement or fact file. You can find all the terms in this extension here: https://tools.gbif.org/dwca-validator/extension.do?id=http://rs.iobis.org/obis/terms/ExtendedMeasurementOrFact.
OBIS uses the BODC NERC Vocabulary Server to provide explicit definitions for each of the measurements https://vocab.nerc.ac.uk/search_nvs/.
For this dataset I was only able to find code definitions provided by the data providers for some of the measurements. I included the measurements for which I could find code definitions and left out the rest. The ones I was able to find code definitions for were Total.Of.Samples_Count, gear_size, start_wind_speed_num, start_barometric_pressure_num, start_temperature_num, start_salinity_num, and start_dissolved_oxygen_num.
totalOfSamples <- event[c("Total.Of.Samples_Count", "eventID")]
totalOfSamples <- totalOfSamples[which(!is.na(totalOfSamples$Total.Of.Samples_Count)),]
totalOfSamples <- totalOfSamples %>%
  mutate(measurementType = "Total number of samples used to calculate relative abundance",
         measurementUnit = "",
         measurementTypeID = "",
         measurementUnitID = "",
         occurrenceID = "") %>%
  rename(measurementValue = Total.Of.Samples_Count)
gear_size <- event[c("gear_size", "eventID")]
gear_size <- gear_size[which(!is.na(gear_size$gear_size)),]
gear_size <- gear_size %>%
  mutate(measurementType = "gear size",
         measurementUnit = "meters",
         measurementTypeID = "http://vocab.nerc.ac.uk/collection/P01/current/MTHAREA1/",
         measurementUnitID = "http://vocab.nerc.ac.uk/collection/P06/current/ULAA/",
         occurrenceID = "") %>%
  rename(measurementValue = gear_size)
start_wind_speed_num <- event[c("start_wind_speed_num", "eventID")]
start_wind_speed_num <- start_wind_speed_num[which(!is.na(start_wind_speed_num$start_wind_speed_num)),]
start_wind_speed_num <- start_wind_speed_num %>%
  mutate(measurementType = "wind speed",
         measurementUnit = "not provided",
         measurementTypeID = "http://vocab.nerc.ac.uk/collection/P01/current/EWSBZZ01/",
         measurementUnitID = "",
         occurrenceID = "") %>%
  rename(measurementValue = start_wind_speed_num)
start_barometric_pressure_num <- event[c("start_barometric_pressure_num", "eventID")]
start_barometric_pressure_num <- start_barometric_pressure_num[which(!is.na(start_barometric_pressure_num$start_barometric_pressure_num)),]
start_barometric_pressure_num <- start_barometric_pressure_num %>%
  mutate(measurementType = "barometric pressure",
         measurementUnit = "not provided",
         measurementTypeID = "http://vocab.nerc.ac.uk/collection/P07/current/CFSN0015/",
         measurementUnitID = "",
         occurrenceID = "") %>%
  rename(measurementValue = start_barometric_pressure_num)
start_temperature_num <- event[c("start_temperature_num", "eventID")]
start_temperature_num <- start_temperature_num[which(!is.na(start_temperature_num$start_temperature_num)),]
start_temperature_num <- start_temperature_num %>%
  mutate(measurementType = "water temperature",
         measurementUnit = "Celsius",
         measurementTypeID = "http://vocab.nerc.ac.uk/collection/P01/current/TEMPPR01/",
         measurementUnitID = "http://vocab.nerc.ac.uk/collection/P06/current/UPAA/",
         occurrenceID = "") %>%
  rename(measurementValue = start_temperature_num)
start_salinity_num <- event[c("start_salinity_num", "eventID")]
start_salinity_num <- start_salinity_num[which(!is.na(start_salinity_num$start_salinity_num)),]
start_salinity_num <- start_salinity_num %>%
  mutate(measurementType = "salinity",
         measurementUnit = "ppt",
         measurementTypeID = "http://vocab.nerc.ac.uk/collection/P01/current/ODSDM021/",
         measurementUnitID = "http://vocab.nerc.ac.uk/collection/P06/current/UPPT/",
         occurrenceID = "") %>%
  rename(measurementValue = start_salinity_num)
start_dissolved_oxygen_num <- event[c("start_dissolved_oxygen_num", "eventID")]
start_dissolved_oxygen_num <- start_dissolved_oxygen_num[which(!is.na(start_dissolved_oxygen_num$start_dissolved_oxygen_num)),]
start_dissolved_oxygen_num <- start_dissolved_oxygen_num %>%
  mutate(measurementType = "dissolved oxygen",
         measurementUnit = "ppm",
         measurementTypeID = "http://vocab.nerc.ac.uk/collection/P09/current/DOX2/",
         measurementUnitID = "http://vocab.nerc.ac.uk/collection/P06/current/UPPM/",
         occurrenceID = "") %>%
  rename(measurementValue = start_dissolved_oxygen_num)
alternate_station_code <- event[c("alternate_station_code", "eventID")]
alternate_station_code <- alternate_station_code[which(!is.na(alternate_station_code$alternate_station_code)),]
alternate_station_code <- alternate_station_code %>%
  mutate(measurementType = "alternate station code",
         measurementUnit = "",
         measurementTypeID = "",
         measurementUnitID = "",
         occurrenceID = "") %>%
  rename(measurementValue = alternate_station_code)
organismQuantity <- occurrence[c("organismQuantity", "eventID", "occurrenceID")]
organismQuantity <- organismQuantity[which(!is.na(organismQuantity$organismQuantity)),]
organismQuantity <- organismQuantity %>%
  mutate(measurementType = "relative abundance",
         measurementUnit = "",
         measurementTypeID = "http://vocab.nerc.ac.uk/collection/S06/current/S0600020/",
         measurementUnitID = "") %>%
  rename(measurementValue = organismQuantity)
# Bind the separate measurements together into one file
mof <- rbind(totalOfSamples, start_barometric_pressure_num, start_dissolved_oxygen_num,
             start_salinity_num, start_temperature_num, start_wind_speed_num, gear_size,
             alternate_station_code, organismQuantity)
head(mof)
measurementValue eventID
1 18 Station_95_Date_09JAN1997:14:35:00.000
2 103 Station_95_Date_18AUG2000:11:02:00.000
3 401 Station_96_Date_28JUN2005:08:41:00.000
4 35 Station_96_Date_23AUG2006:11:47:00.000
5 57 Station_96_Date_17OCT2006:14:23:00.000
6 5 Station_96_Date_19FEB1996:10:27:00.000
measurementType measurementUnit measurementTypeID
1 Total number of samples used to calculate relative abundance
2 Total number of samples used to calculate relative abundance
3 Total number of samples used to calculate relative abundance
4 Total number of samples used to calculate relative abundance
5 Total number of samples used to calculate relative abundance
6 Total number of samples used to calculate relative abundance
measurementUnitID occurrenceID
1
2
3
4
5
6
tail(mof)
measurementValue eventID measurementType measurementUnit
334336 0.0000000 Station_217_Date_03APR2003:13:28:00.000 relative abundance
334337 0.0000000 Station_217_Date_24FEB2006:10:12:00.000 relative abundance
334338 0.1428571 Station_217_Date_23JUN2001:12:28:00.000 relative abundance
334339 0.0000000 Station_212_Date_23MAY1990:10:43:00.000 relative abundance
334340 0.1224490 Station_212_Date_24JUL1990:09:34:00.000 relative abundance
334341 0.0000000 Station_212_Date_21MAR2001:11:52:00.000 relative abundance
measurementTypeID measurementUnitID
334336 http://vocab.nerc.ac.uk/collection/S06/current/S0600020/
334337 http://vocab.nerc.ac.uk/collection/S06/current/S0600020/
334338 http://vocab.nerc.ac.uk/collection/S06/current/S0600020/
334339 http://vocab.nerc.ac.uk/collection/S06/current/S0600020/
334340 http://vocab.nerc.ac.uk/collection/S06/current/S0600020/
334341 http://vocab.nerc.ac.uk/collection/S06/current/S0600020/
occurrenceID
334336 Station_217_Date_03APR2003:13:28:00.000_Litopenaeus_setiferus
334337 Station_217_Date_24FEB2006:10:12:00.000_Litopenaeus_setiferus
334338 Station_217_Date_23JUN2001:12:28:00.000_Litopenaeus_setiferus
334339 Station_212_Date_23MAY1990:10:43:00.000_Litopenaeus_setiferus
334340 Station_212_Date_24JUL1990:09:34:00.000_Litopenaeus_setiferus
334341 Station_212_Date_21MAR2001:11:52:00.000_Litopenaeus_setiferus
# Write out the file
write.csv(mof, file = (paste0(event[1,]$datasetID, "_mof_", lubridate::today(),".csv")), fileEncoding = "UTF-8", row.names = F, na = "")
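The per-measurement blocks in this section all follow the same pattern, so they could be condensed with a small helper function. The sketch below is one possible way to do that, not the code used to produce the file above: make_emof() and toy_event are hypothetical names, the measurement values are made up, and the vocabulary URIs are the ones already used for temperature and salinity in this section.

```r
# Hypothetical helper: build the emof rows for one event-level column.
make_emof <- function(df, column, type, unit = "", type_id = "", unit_id = "") {
  keep <- !is.na(df[[column]])   # drop events with no measurement
  n <- sum(keep)
  data.frame(
    measurementValue  = df[[column]][keep],
    eventID           = df$eventID[keep],
    measurementType   = rep(type, n),
    measurementUnit   = rep(unit, n),
    measurementTypeID = rep(type_id, n),
    measurementUnitID = rep(unit_id, n),
    occurrenceID      = rep("", n),
    stringsAsFactors  = FALSE
  )
}

# Made-up event table with two measurement columns.
toy_event <- data.frame(
  eventID               = c("Station_A", "Station_B", "Station_C"),
  start_temperature_num = c(18.2, NA, 25.1),
  start_salinity_num    = c(30, 28, NA),
  stringsAsFactors      = FALSE
)

toy_mof <- rbind(
  make_emof(toy_event, "start_temperature_num", "water temperature", "Celsius",
            "http://vocab.nerc.ac.uk/collection/P01/current/TEMPPR01/",
            "http://vocab.nerc.ac.uk/collection/P06/current/UPAA/"),
  make_emof(toy_event, "start_salinity_num", "salinity", "ppt",
            "http://vocab.nerc.ac.uk/collection/P01/current/ODSDM021/",
            "http://vocab.nerc.ac.uk/collection/P06/current/UPPT/")
)
print(toy_mof)
```

Keeping the measurement metadata as function arguments means each new measurement is a single call, which makes it harder for the column names to drift between blocks.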
2.1.5 Cleaning up Event and Occurrence files
Now that we have all of our files created we can clean up the Event and Occurrence files to remove the columns that are not following Darwin Core. We had to leave the extra bits in before because we needed them to create the emof file above.
event <- event[c("samplingProtocol","locality","waterBody","decimalLatitude","decimalLongitude",
                 "eventDate","sampleSizeValue","minimumDepthInMeters",
                 "maximumDepthInMeters","type","modified","language","license","institutionCode",
                 "ownerInstitutionCode","coordinateUncertaintyInMeters",
                 "geodeticDatum", "georeferenceProtocol","country","countryCode","stateProvince",
                 "datasetID","eventID","sampleSizeUnit","samplingEffort")]
head(event)
samplingProtocol locality waterBody decimalLatitude decimalLongitude
1 Bag Seine Mission-Aransas Estuary Aransas Bay 28.13472 -97.00833
2 Bag Seine Mission-Aransas Estuary Aransas Bay 28.13528 -97.00722
3 Bag Seine Mission-Aransas Estuary Aransas Bay 28.13444 -96.99611
4 Bag Seine Mission-Aransas Estuary Aransas Bay 28.13444 -96.99611
5 Bag Seine Mission-Aransas Estuary Aransas Bay 28.13444 -96.99611
6 Bag Seine Mission-Aransas Estuary Aransas Bay 28.13472 -96.99583
eventDate sampleSizeValue minimumDepthInMeters maximumDepthInMeters type modified language
1 1997-01-09 14:35:00 0.03 0.0 0.6 Event 2022-01-09 en
2 2000-08-18 11:02:00 0.03 0.1 0.5 Event 2022-01-09 en
3 2005-06-28 08:41:00 0.03 0.4 0.6 Event 2022-01-09 en
4 2006-08-23 11:47:00 0.03 0.2 0.4 Event 2022-01-09 en
5 2006-10-17 14:23:00 0.03 0.7 0.8 Event 2022-01-09 en
6 1996-02-19 10:27:00 0.03 0.1 0.3 Event 2022-01-09 en
license institutionCode ownerInstitutionCode
1 http://creativecommons.org/publicdomain/zero/1.0/legalcode TPWD HARC
2 http://creativecommons.org/publicdomain/zero/1.0/legalcode TPWD HARC
3 http://creativecommons.org/publicdomain/zero/1.0/legalcode TPWD HARC
4 http://creativecommons.org/publicdomain/zero/1.0/legalcode TPWD HARC
5 http://creativecommons.org/publicdomain/zero/1.0/legalcode TPWD HARC
6 http://creativecommons.org/publicdomain/zero/1.0/legalcode TPWD HARC
coordinateUncertaintyInMeters geodeticDatum georeferenceProtocol country countryCode stateProvince
1 100 WGS84 Handheld GPS United States US Texas
2 100 WGS84 Handheld GPS United States US Texas
3 100 WGS84 Handheld GPS United States US Texas
4 100 WGS84 Handheld GPS United States US Texas
5 100 WGS84 Handheld GPS United States US Texas
6 100 WGS84 Handheld GPS United States US Texas
datasetID eventID sampleSizeUnit
1 TPWD_HARC_Texas_Aransas_Bay_Bag_Seine Station_95_Date_09JAN1997:14:35:00.000 hectares
2 TPWD_HARC_Texas_Aransas_Bay_Bag_Seine Station_95_Date_18AUG2000:11:02:00.000 hectares
3 TPWD_HARC_Texas_Aransas_Bay_Bag_Seine Station_96_Date_28JUN2005:08:41:00.000 hectares
4 TPWD_HARC_Texas_Aransas_Bay_Bag_Seine Station_96_Date_23AUG2006:11:47:00.000 hectares
5 TPWD_HARC_Texas_Aransas_Bay_Bag_Seine Station_96_Date_17OCT2006:14:23:00.000 hectares
6 TPWD_HARC_Texas_Aransas_Bay_Bag_Seine Station_96_Date_19FEB1996:10:27:00.000 hectares
samplingEffort
1 120 seconds
2 120 seconds
3 120 seconds
4 120 seconds
5 120 seconds
6 120 seconds
write.csv(event, file = paste0(event[1,]$datasetID, "_event_", lubridate::today(),".csv"), fileEncoding = "UTF-8", row.names = F, na = "")
occurrence <- occurrence[c("vernacularName","eventID","occurrenceStatus","basisOfRecord",
                           "scientificName","scientificNameID","kingdom","phylum","class",
                           "order","family","genus",
                           "scientificNameAuthorship","taxonRank", "organismQuantity",
                           "organismQuantityType", "occurrenceID","collectionCode")]
head(occurrence)
vernacularName eventID occurrenceStatus basisOfRecord
1 Alligator gar Station_95_Date_09JAN1997:14:35:00.000 Absent HumanObservation
2 Alligator gar Station_95_Date_18AUG2000:11:02:00.000 Absent HumanObservation
3 Alligator gar Station_96_Date_28JUN2005:08:41:00.000 Absent HumanObservation
4 Alligator gar Station_96_Date_23AUG2006:11:47:00.000 Absent HumanObservation
5 Alligator gar Station_96_Date_17OCT2006:14:23:00.000 Absent HumanObservation
6 Alligator gar Station_96_Date_19FEB1996:10:27:00.000 Absent HumanObservation
scientificName scientificNameID kingdom phylum class
1 Atractosteus spatula urn:lsid:marinespecies.org:taxname:279822 Animalia Chordata Actinopteri
2 Atractosteus spatula urn:lsid:marinespecies.org:taxname:279822 Animalia Chordata Actinopteri
3 Atractosteus spatula urn:lsid:marinespecies.org:taxname:279822 Animalia Chordata Actinopteri
4 Atractosteus spatula urn:lsid:marinespecies.org:taxname:279822 Animalia Chordata Actinopteri
5 Atractosteus spatula urn:lsid:marinespecies.org:taxname:279822 Animalia Chordata Actinopteri
6 Atractosteus spatula urn:lsid:marinespecies.org:taxname:279822 Animalia Chordata Actinopteri
order family genus scientificNameAuthorship taxonRank organismQuantity
1 Lepisosteiformes Lepisosteidae Atractosteus (Lacepède, 1803) Species 0
2 Lepisosteiformes Lepisosteidae Atractosteus (Lacepède, 1803) Species 0
3 Lepisosteiformes Lepisosteidae Atractosteus (Lacepède, 1803) Species 0
4 Lepisosteiformes Lepisosteidae Atractosteus (Lacepède, 1803) Species 0
5 Lepisosteiformes Lepisosteidae Atractosteus (Lacepède, 1803) Species 0
6 Lepisosteiformes Lepisosteidae Atractosteus (Lacepède, 1803) Species 0
organismQuantityType occurrenceID collectionCode
1 Relative Abundance Station_95_Date_09JAN1997:14:35:00.000_Atractosteus_spatula Aransas Bay Bag Seine
2 Relative Abundance Station_95_Date_18AUG2000:11:02:00.000_Atractosteus_spatula Aransas Bay Bag Seine
3 Relative Abundance Station_96_Date_28JUN2005:08:41:00.000_Atractosteus_spatula Aransas Bay Bag Seine
4 Relative Abundance Station_96_Date_23AUG2006:11:47:00.000_Atractosteus_spatula Aransas Bay Bag Seine
5 Relative Abundance Station_96_Date_17OCT2006:14:23:00.000_Atractosteus_spatula Aransas Bay Bag Seine
6 Relative Abundance Station_96_Date_19FEB1996:10:27:00.000_Atractosteus_spatula Aransas Bay Bag Seine
write.csv(occurrence, file = paste0(event[1,]$datasetID, "_occurrence_",lubridate::today(),".csv"), fileEncoding = "UTF-8", row.names = F, na = "")
2.2 Salmon Ocean Ecology Data
2.2.1 Intro
One of the goals of the Hakai Institute and the Canadian Integrated Ocean Observing System (CIOOS) is to facilitate Open Science and FAIR (findable, accessible, interoperable, reusable) ecological and oceanographic data. In a concerted effort to establish how best to do that, several Hakai and CIOOS staff attended an Integrated Ocean Observing System (IOOS) Code Sprint in Ann Arbor, Michigan between October 7–11, 2019, to discuss how to implement FAIR data principles for biological data collected in the marine environment.
Darwin Core is a highly structured data format that standardizes data table relations and vocabularies, and defines field names. Darwin Core defines three table types: event, occurrence, and measurementOrFact. This intuitively captures the way most ecologists conduct their research. Typically, a survey (event) is conducted and measurements, counts, or observations (collectively measurementOrFacts) are made regarding a specific habitat or species (occurrence).
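To make the relationship between the three table types concrete, here is a minimal sketch with made-up rows. The identifiers, date and measurement values are invented for illustration; the point is only that eventID links occurrences to events and occurrenceID links measurements to occurrences.

```r
# One made-up seine set (event).
toy_event <- data.frame(eventID = "seine-1", eventDate = "2019-06-01",
                        stringsAsFactors = FALSE)

# Two made-up fish (occurrences) caught in that set.
toy_occurrence <- data.frame(
  occurrenceID   = c("fish-1", "fish-2"),
  eventID        = c("seine-1", "seine-1"),
  scientificName = c("Oncorhynchus nerka", "Oncorhynchus keta"),
  stringsAsFactors = FALSE
)

# Three made-up measurements taken on those fish.
toy_mof <- data.frame(
  occurrenceID     = c("fish-1", "fish-1", "fish-2"),
  measurementType  = c("fork length", "weight", "fork length"),
  measurementValue = c(102, 11.5, 98),
  stringsAsFactors = FALSE
)

# Follow the keys back up: measurement -> occurrence -> event.
joined <- merge(merge(toy_mof, toy_occurrence, by = "occurrenceID"),
                toy_event, by = "eventID")
print(joined)
```

Every measurement row can be traced to the fish it was taken from and to the seine set that caught it, without repeating the event details in each table.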
In the following script I demonstrate how I go about converting a subset of the data collected from the Hakai Institute Juvenile Salmon Program and discuss challenges, solutions, pros and cons, and when and what’s worthwhile to convert to Darwin Core.
The conversion of a dataset to Darwin Core is much easier if your data are already tidy (normalized), meaning that you represent your data in separate tables that reflect the hierarchical and related nature of your observations. If your data are not already in a consistent and structured format, the conversion would likely be very arduous and not intuitive.
2.2.2 event
The first step is to consider what you will define as an event in your data set. I defined the capture of fish using a purse seine net as the event. Therefore, each row in the event table is one deployment of a seine net and is assigned a unique eventID.
My process for conversion was to make a new table called event and map the standard Darwin Core column names to pre-existing columns that serve the same purpose in my original seine_data table and populate the other required fields.
event <- tibble(eventID = survey_seines$seine_id,
                eventDate = date(survey_seines$survey_date),
                decimalLatitude = survey_seines$lat,
                decimalLongitude = survey_seines$long,
                geodeticDatum = "EPSG:4326 WGS84",
                minimumDepthInMeters = 0,
                maximumDepthInMeters = 9, # seine depth is 9 m
                samplingProtocol = "http://dx.doi.org/10.21966/1.566666" # This is the DOI for the Hakai Salmon Data Package that contains the sampling protocol, as well as the complete data package
)
write_csv(event, here::here("datasets", "hakai_salmon_data", "event.csv"))
2.2.3 occurrence
Next you’ll want to determine what constitutes an occurrence for your data set. Because each event captures fish, I consider each fish to be an occurrence. Therefore, the unit of observation (each row) in the occurrence table is a fish. To link each occurrence to an event you need to include the eventID column for every occurrence so that you know what seine (event) each fish (occurrence) came from. You must also provide a globally unique identifier for each occurrence. I already have a locally unique identifier for each fish in the original fish_data table called ufn. To make it globally unique I prepend the organization and research program metadata to the ufn column.
#TODO: Include bycatch data as well
## make table long first
seines_total_long <- survey_seines %>%
  select(seine_id, so_total, pi_total, cu_total, co_total, he_total, ck_total) %>%
  pivot_longer(-seine_id, names_to = "scientificName", values_to = "n")
seines_total_long$scientificName <- recode(seines_total_long$scientificName, so_total = "Oncorhynchus nerka", pi_total = "Oncorhynchus gorbuscha", cu_total = "Oncorhynchus keta", co_total = "Oncorhynchus kisutch", ck_total = "Oncorhynchus tshawytscha", he_total = "Clupea pallasii")

seines_taken_long <- survey_seines %>%
  select(seine_id, so_taken, pi_taken, cu_taken, co_taken, he_taken, ck_taken) %>%
  pivot_longer(-seine_id, names_to = "scientificName", values_to = "n_taken")
seines_taken_long$scientificName <- recode(seines_taken_long$scientificName, so_taken = "Oncorhynchus nerka", pi_taken = "Oncorhynchus gorbuscha", cu_taken = "Oncorhynchus keta", co_taken = "Oncorhynchus kisutch", ck_taken = "Oncorhynchus tshawytscha", he_taken = "Clupea pallasii")

## remove records that have already been assigned an ID
seines_long <- full_join(seines_total_long, seines_taken_long, by = c("seine_id", "scientificName")) %>%
  drop_na() %>%
  mutate(n_not_taken = n - n_taken) %>% #so_total includes the number taken so I subtract n_taken to get n_not_taken
  select(-n_taken, -n) %>%
  filter(n_not_taken > 0)

## repeat each row once per fish that was caught but not retained
all_fish_caught <-
  seines_long[rep(seq.int(1, nrow(seines_long)), seines_long$n_not_taken), 1:3] %>%
  select(-n_not_taken) %>%
  mutate(prefix = "hakai-jsp-",
         suffix = 1:nrow(.),
         occurrenceID = paste0(prefix, suffix)
  ) %>%
  select(-prefix, -suffix)

#
# Change species names to full Scientific names
latin <- fct_recode(fish_data$species, "Oncorhynchus nerka" = "SO", "Oncorhynchus gorbuscha" = "PI", "Oncorhynchus keta" = "CU", "Oncorhynchus kisutch" = "CO", "Clupea pallasii" = "HE", "Oncorhynchus tshawytscha" = "CK") %>%
  as.character()

fish_retained_data <- fish_data %>%
  mutate(scientificName = latin) %>%
  select(-species) %>%
  mutate(prefix = "hakai-jsp-",
         occurrenceID = paste0(prefix, ufn)) %>%
  select(-semsp_id, -prefix, -ufn, -fork_length_field, -fork_length, -weight, -weight_field)

occurrence <- bind_rows(all_fish_caught, fish_retained_data) %>%
  mutate(basisOfRecord = "HumanObservation",
         occurenceStatus = "present") %>%
  rename(eventID = seine_id)
For each occurrence of the six different fish species that I caught, I need to match the species name that I provide with the official scientificName that is part of the World Register of Marine Species (WoRMS) database http://www.marinespecies.org/
# I went directly to the WoRMS website (http://www.marinespecies.org/) to download the full taxonomic levels for the salmon species I have and put the WoRMS output (species_matched.xls) table in this project directory which is read in below and joined with the occurrence table
species_matched <- readxl::read_excel(here::here("datasets", "hakai_salmon_data", "raw_data", "species_matched.xls"))
occurrence <- left_join(occurrence, species_matched, by = c("scientificName" = "ScientificName")) %>%
  select(occurrenceID, basisOfRecord, scientificName, eventID, occurrenceStatus = occurenceStatus, Kingdom, Phylum, Class, Order, Family, Genus, Species)
write_csv(occurrence, here::here("datasets", "hakai_salmon_data", "occurrence.csv"))
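The row-expansion step above, which turns a per-species count into one record per fish, can also be sketched with tidyr::uncount(). This is a minimal illustration on made-up data rather than the code used for the published dataset; the seines_long_demo table and the demo-jsp- prefix are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy counts: fish caught but not retained, per seine and species
seines_long_demo <- tibble(
  seine_id       = c("S1", "S1", "S2"),
  scientificName = c("Oncorhynchus nerka", "Clupea pallasii", "Oncorhynchus keta"),
  n_not_taken    = c(2L, 1L, 3L)
)

# uncount() repeats each row n_not_taken times and drops the count column
all_fish_demo <- seines_long_demo %>%
  uncount(n_not_taken) %>%
  mutate(occurrenceID = paste0("demo-jsp-", row_number()))

all_fish_demo
```

Each input row is repeated n_not_taken times, so the three count rows become six occurrence records with sequential identifiers.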
2.2.4 measurementOrFact
To convert your measurements or facts from their original format to Darwin Core, you essentially need to put all your measurements into one column called measurementType, with a corresponding column called measurementValue. This standardizes the column names in the measurementOrFact table. There are a number of predefined measurementTypes listed on the NERC database that should be used where possible. I found it difficult to navigate this page to find the correct measurementType.
Here I convert the length and weight measurements that relate to an event and an occurrence; in the code below these measurementTypes are labelled fork length and mass.
# prefer the lab measurement and fall back to the field measurement
fish_data$weight <- coalesce(fish_data$weight, fish_data$weight_field)
fish_data$fork_length <- coalesce(fish_data$fork_length, fish_data$fork_length_field)
fish_length <- fish_data %>%
  mutate(occurrenceID = paste0("hakai-jsp-", ufn)) %>%
  select(occurrenceID, eventID = seine_id, fork_length, weight) %>%
  mutate(measurementType = "fork length", measurementValue = fork_length) %>%
  select(eventID, occurrenceID, measurementType, measurementValue) %>%
  mutate(measurementUnit = "millimeters",
         measurementUnitID = "http://vocab.nerc.ac.uk/collection/P06/current/UXMM/")
fish_weight <- fish_data %>%
  mutate(occurrenceID = paste0("hakai-jsp-", ufn)) %>%
  select(occurrenceID, eventID = seine_id, fork_length, weight) %>%
  mutate(measurementType = "mass", measurementValue = weight) %>%
  select(eventID, occurrenceID, measurementType, measurementValue) %>%
  mutate(measurementUnit = "grams",
         measurementUnitID = "http://vocab.nerc.ac.uk/collection/P06/current/UGRM/")
measurementOrFact <- bind_rows(fish_length, fish_weight) %>%
  drop_na(measurementValue)
rm(fish_length, fish_weight)
write_csv(measurementOrFact, here::here("datasets", "hakai_salmon_data", "measurementOrFact.csv"))
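The same wide-to-long reshaping can be sketched in a single step with tidyr::pivot_longer(), which may be easier to extend when there are more than two measurement types. This is an illustrative alternative on made-up data, not the approach used above; fish_demo and the demo-jsp- prefix are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two fish, one with a missing fork length
fish_demo <- tibble(
  seine_id    = c("S1", "S1"),
  ufn         = c("U001", "U002"),
  fork_length = c(102, NA),
  weight      = c(11.4, 9.8)
)

mof_demo <- fish_demo %>%
  transmute(eventID = seine_id,
            occurrenceID = paste0("demo-jsp-", ufn),
            `fork length` = fork_length,
            mass = weight) %>%
  # one row per measurement; measurements with no value are dropped
  pivot_longer(c(`fork length`, mass),
               names_to = "measurementType",
               values_to = "measurementValue",
               values_drop_na = TRUE) %>%
  mutate(measurementUnit = if_else(measurementType == "fork length",
                                   "millimeters", "grams"))

mof_demo
```

The column names chosen in transmute() become the measurementType values, so every measurement ends up in the same two columns regardless of how many types are added.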
2.3 Hakai Seagrass
2.3.1 Setup
This section clears the workspace, checks the working directory, installs packages (if required), and loads the packages and datasets needed for the rest of the example.
library("knitr")
# Knitr global chunk options
opts_chunk$set(message = FALSE,
               warning = FALSE,
               error = FALSE)
2.3.1.1 Load Data
First load the seagrass density survey data, set variable classes, and have a quick look
# Load density data
seagrassDensity <-
  read.csv("raw_data/seagrass_density_survey.csv",
           colClasses = "character") %>%
mutate(date = ymd(date),
depth = as.numeric(depth),
transect_dist = factor(transect_dist),
collected_start = ymd_hms(collected_start),
collected_end = ymd_hms(collected_end),
density = as.numeric(density),
density_msq = as.numeric(density_msq),
canopy_height_cm = as.numeric(canopy_height_cm),
flowering_shoots = as.numeric(flowering_shoots)) %T>%
glimpse()
## Rows: 3,031
## Columns: 22
## $ X <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "1…
## $ organization <chr> "HAKAI", "HAKAI", "HAKAI", "HAKAI", "HAKAI", "HAKAI",…
## $ work_area <chr> "CALVERT", "CALVERT", "CALVERT", "CALVERT", "CALVERT"…
## $ project <chr> "MARINEGEO", "MARINEGEO", "MARINEGEO", "MARINEGEO", "…
## $ survey <chr> "PRUTH_BAY", "PRUTH_BAY", "PRUTH_BAY", "PRUTH_BAY", "…
## $ site_id <chr> "PRUTH_BAY_INTERIOR4", "PRUTH_BAY_INTERIOR4", "PRUTH_…
## $ date <date> 2016-05-13, 2016-05-13, 2016-05-13, 2016-05-13, 2016…
## $ sampling_bout <chr> "4", "4", "4", "4", "4", "4", "4", "6", "6", "6", "6"…
## $ dive_supervisor <chr> "Zach", "Zach", "Zach", "Zach", "Zach", "Zach", "Zach…
## $ collector <chr> "Derek", "Derek", "Derek", "Derek", "Derek", "Derek",…
## $ hakai_id <chr> "2016-05-13_PRUTH_BAY_INTERIOR4_0", "2016-05-13_PRUTH…
## $ sample_type <chr> "seagrass_density", "seagrass_density", "seagrass_den…
## $ depth <dbl> 6.0, 6.0, 6.0, 6.0, 5.0, 6.0, 6.0, 9.1, 9.0, 8.9, 9.0…
## $ transect_dist <fct> 0, 5, 10, 15, 20, 25, 30, 10, 15, 20, 25, 30, 0, 5, 1…
## $ collected_start <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ collected_end <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ density <dbl> 13, 10, 18, 22, 16, 31, 9, 5, 6, 6, 6, 3, 13, 30, 23,…
## $ density_msq <dbl> 208, 160, 288, 352, 256, 496, 144, 80, 96, 96, 96, 48…
## $ canopy_height_cm <dbl> 60, 63, 80, 54, 55, 50, 63, 85, 80, 90, 95, 75, 60, 6…
## $ flowering_shoots <dbl> NA, NA, NA, NA, NA, NA, NA, 0, 0, 0, 0, 0, NA, NA, NA…
## $ comments <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ quality_log <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
Next, load the habitat survey data, and same as above, set variable classes as necessary, and have a quick look.
# load habitat data, set variable classes, have a quick look
seagrassHabitat <-
  read.csv("raw_data/seagrass_habitat_survey.csv",
           colClasses = "character") %>%
mutate(date = ymd(date),
depth = as.numeric(depth),
hakai_id = str_pad(hakai_id, 5, pad = "0"),
transect_dist = factor(transect_dist),
collected_start = ymd_hms(collected_start),
collected_end = ymd_hms(collected_end)) %T>%
glimpse()
## Rows: 2,052
## Columns: 28
## $ X <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "1…
## $ organization <chr> "HAKAI", "HAKAI", "HAKAI", "HAKAI", "HAKAI", "HAKAI",…
## $ work_area <chr> "CALVERT", "CALVERT", "CALVERT", "CALVERT", "CALVERT"…
## $ project <chr> "MARINEGEO", "MARINEGEO", "MARINEGEO", "MARINEGEO", "…
## $ survey <chr> "CHOKED_PASS", "CHOKED_PASS", "CHOKED_PASS", "CHOKED_…
## $ site_id <chr> "CHOKED_PASS_INTERIOR6", "CHOKED_PASS_INTERIOR6", "CH…
## $ date <date> 2017-11-22, 2017-11-22, 2017-11-22, 2017-11-22, 2017…
## $ sampling_bout <chr> "6", "6", "6", "6", "6", "6", "1", "1", "1", "1", "1"…
## $ dive_supervisor <chr> "gillian", "gillian", "gillian", "gillian", "gillian"…
## $ collector <chr> "zach", "zach", "zach", "zach", "zach", "zach", "kyle…
## $ hakai_id <chr> "10883", "2017-11-22_CHOKED_PASS_INTERIOR6_5 - 10", "…
## $ sample_type <chr> "seagrass_habitat", "seagrass_habitat", "seagrass_hab…
## $ depth <dbl> 9.2, 9.4, 9.3, 9.0, 9.2, 9.2, 3.4, 3.4, 3.4, 3.4, 3.4…
## $ transect_dist <fct> 0 - 5, 10-May, 15-Oct, 15 - 20, 20 - 25, 25 - 30, 0 -…
## $ collected_start <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ collected_end <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ bag_uid <chr> "10883", NA, NA, "11094", NA, "11182", "7119", NA, "7…
## $ bag_number <chr> "3557", NA, NA, "3520", NA, "903", "800", NA, "318", …
## $ density_range <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ substrate <chr> "sand,shell hash", "sand,shell hash", "sand,shell has…
## $ patchiness <chr> "< 1", "< 1", "02-Jan", "< 1", "< 1", "< 1", "< 1", "…
## $ adj_habitat_1 <chr> "seagrass", "seagrass", "seagrass", "seagrass", "seag…
## $ adj_habitat_2 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ sample_collected <chr> "TRUE", "FALSE", "FALSE", "TRUE", "FALSE", "TRUE", "T…
## $ vegetation_1 <chr> NA, NA, NA, NA, NA, NA, "des", NA, "des", NA, NA, NA,…
## $ vegetation_2 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ comments <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ quality_log <chr> "1: Flowering shoots 0 for entire transects", NA, NA,…
Finally, load coordinate data for surveys, and subset necessary variables
coordinates <-
  read.csv("raw_data/seagrassCoordinates.csv",
           colClasses = c("Point.Name" = "character")) %>%
select(Point.Name, Decimal.Lat, Decimal.Long) %T>%
glimpse()
## Rows: 70
## Columns: 3
## $ Point.Name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ Decimal.Lat <dbl> 52.06200, 52.05200, 51.92270, 51.92500, 51.80900, 51.8090…
## $ Decimal.Long <dbl> -128.4120, -128.4030, -128.4648, -128.4540, -128.2360, -1…
2.3.1.2 Merge Datasets
Now that all the datasets have been loaded and briefly formatted, we’ll join together the habitat and density surveys and their coordinates.
The seagrass density surveys collect data at discrete points (e.g. 5 metres) along the transects, while the habitat surveys collect data over sections (e.g. 0 - 5 metres) along the transects. To fit these two surveys together, we’ll narrow the habitat surveys from a range to a point so the locations match. Based on how the habitat data are collected, the point each habitat survey is applied to will be the distance at the end of the swath (e.g. 10 - 15 m becomes 15 m). Because no swath precedes the 0 m point, the 0 m distance will use the 0 - 5 m section of the survey.
First, we’ll make the necessary transformations to the habitat dataset.
# Reformat seagrassHabitat to merge with seagrassDensity
## replicate 0 - 5m transect dist to match with 0m in density survey;
## rest of habitat bins can map one to one with density (ie. 5 - 10m -> 10m)
seagrass0tmp <-
  seagrassHabitat %>%
  filter(transect_dist %in% c("0 - 5", "0 - 2.5")) %>%
  mutate(transect_dist = factor(0))
## collapse various levels to match with seagrassDensity transect_dist
seagrassHabitat$transect_dist <-
  fct_collapse(seagrassHabitat$transect_dist,
               "5" = c("0 - 5", "2.5 - 7.5"),
               "10" = c("5 - 10", "7.5 - 12.5"),
               "15" = c("10 - 15", "12.5 - 17.5"),
               "20" = c("15 - 20", "17.5 - 22.5"),
               "25" = c("20 - 25", "22.5 - 27.5"),
               "30" = c("25 - 30", "27.5 - 30"))
## merge seagrass0tmp into seagrassHabitat to account for 0m samples,
## set class for date, datetime variables
seagrassHabitatFull <-
  rbind(seagrass0tmp, seagrassHabitat) %>%
  filter(transect_dist != "0 - 2.5") %>% # already captured in seagrass0tmp
  droplevels(.) # remove now unused factor levels
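To make the range-to-point rule concrete, here is a small sketch on made-up data. It assumes only three swaths and a single site; the habitat_demo table and its values are hypothetical and exist only to show how the 0 - 5 m swath is reused for the 0 m point while every swath is relabelled by its end distance.

```r
library(dplyr)
library(forcats)

# Hypothetical toy habitat survey with three swaths along one transect
habitat_demo <- tibble(
  site_id       = "SITE_A",
  transect_dist = factor(c("0 - 5", "5 - 10", "10 - 15")),
  substrate     = c("sand", "sand,shell hash", "mud")
)

# The 0 - 5 m swath is copied and reused for the 0 m density point
zero_demo <- habitat_demo %>%
  filter(transect_dist == "0 - 5") %>%
  mutate(transect_dist = factor("0"))

# Every swath is relabelled by the distance at its end
habitat_demo$transect_dist <- fct_collapse(habitat_demo$transect_dist,
                                           "5"  = "0 - 5",
                                           "10" = "5 - 10",
                                           "15" = "10 - 15")

habitat_points <- bind_rows(zero_demo, habitat_demo)
as.character(habitat_points$transect_dist)
```

After this step the habitat table has one row per density point (0, 5, 10 and 15 m), with the 0 m and 5 m rows sharing the substrate recorded for the first swath.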
With the distances of the habitat and density surveys now corresponding, we can merge these two datasets and their coordinates, combine redundant fields, and remove unnecessary fields.
# Merge seagrassHabitatFull with seagrassDensity, then coordinates
seagrass <-
  full_join(seagrassHabitatFull, seagrassDensity,
            by = c("organization",
                   "work_area",
                   "project",
                   "survey",
                   "site_id",
                   "date",
                   "transect_dist")) %>%
  # merge hakai_id.x and hakai_id.y into single variable field;
  # use combination of date, site_id, transect_dist, and field uid (hakai_id
  # when present)
  mutate(field_uid = ifelse(sample_collected == TRUE, hakai_id.x, "NA"),
         hakai_id = paste(date, "HAKAI:CALVERT", site_id, transect_dist, sep = ":"),
         # below, aggregate metadata that didn't merge naturally (ie. due to minor
         # differences in watch time or depth gauges)
         dive_supervisor = dive_supervisor.x,
         collected_start = ymd_hms(ifelse(is.na(collected_start.x),
                                          collected_start.y,
                                          collected_start.x)),
         collected_end = ymd_hms(ifelse(is.na(collected_end.x),
                                        collected_end.y,
                                        collected_end.x)),
         depth_m = ifelse(is.na(depth.x), depth.y, depth.x),
         sampling_bout = sampling_bout.x) %>%
  left_join(., coordinates, # add coordinates
            by = c("site_id" = "Point.Name")) %>%
  select( - c(X.x, X.y, hakai_id.x, hakai_id.y, # remove unnecessary variables
              dive_supervisor.x, dive_supervisor.y,
              collected_start.x, collected_start.y,
              collected_end.x, collected_end.y,
              depth.x, depth.y,
              sampling_bout.x, sampling_bout.y)) %>%
  mutate(density_msq = as.character(density_msq),
         canopy_height_cm = as.character(canopy_height_cm),
         flowering_shoots = as.character(flowering_shoots),
         depth_m = as.character(depth_m)) %T>%
  glimpse()
## Rows: 3,743
## Columns: 38
## $ organization <chr> "HAKAI", "HAKAI", "HAKAI", "HAKAI", "HAKAI", "HAKAI",…
## $ work_area <chr> "CALVERT", "CALVERT", "CALVERT", "CALVERT", "CALVERT"…
## $ project <chr> "MARINEGEO", "MARINEGEO", "MARINEGEO", "MARINEGEO", "…
## $ survey <chr> "CHOKED_PASS", "CHOKED_PASS", "CHOKED_PASS", "PRUTH_B…
## $ site_id <chr> "CHOKED_PASS_INTERIOR6", "CHOKED_PASS_EDGE1", "CHOKED…
## $ date <date> 2017-11-22, 2017-05-19, 2017-05-19, 2017-07-03, 2017…
## $ collector.x <chr> "zach", "kyle", NA, "tanya", "zach", "zach", "zach", …
## $ sample_type.x <chr> "seagrass_habitat", "seagrass_habitat", "seagrass_hab…
## $ transect_dist <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bag_uid <chr> "10883", "7119", "7031", "2352", "10255", "10023", "1…
## $ bag_number <chr> "3557", "800", "301", "324", "3506", "3555", "3534", …
## $ density_range <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ substrate <chr> "sand,shell hash", "sand,shell hash", "sand,shell has…
## $ patchiness <chr> "< 1", "< 1", "< 1", "< 1", "< 1", "05-Apr", "04-Mar"…
## $ adj_habitat_1 <chr> "seagrass", "sand", "standing kelp", "seagrass", "sea…
## $ adj_habitat_2 <chr> NA, NA, NA, NA, NA, NA, "standing kelp", NA, NA, NA, …
## $ sample_collected <chr> "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE…
## $ vegetation_1 <chr> NA, "des", "des", "zm", "des", NA, NA, NA, NA, NA, NA…
## $ vegetation_2 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "…
## $ comments.x <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ quality_log.x <chr> "1: Flowering shoots 0 for entire transects", NA, NA,…
## $ collector.y <chr> "derek", "ondine", "ondine", "derek", "derek", "derek…
## $ sample_type.y <chr> "seagrass_density", "seagrass_density", "seagrass_den…
## $ density <dbl> 4, 10, 6, 13, 6, 1, 2, 6, 21, 3, 7, 4, 3, 14, 17, 11,…
## $ density_msq <chr> "64", "160", "96", "208", "96", "16", "32", "96", "33…
## $ canopy_height_cm <chr> "80", "80", "110", "60", "125", "100", "100", "125", …
## $ flowering_shoots <chr> "0", NA, NA, NA, NA, NA, NA, "0", NA, NA, NA, "0", NA…
## $ comments.y <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ quality_log.y <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "…
## $ field_uid <chr> "10883", "07119", "07031", "02352", "10255", "10023",…
## $ hakai_id <chr> "2017-11-22:HAKAI:CALVERT:CHOKED_PASS_INTERIOR6:0", "…
## $ dive_supervisor <chr> "gillian", "gillian,gillian.sadlierbrown", "gillian,g…
## $ collected_start <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ collected_end <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ depth_m <chr> "9.2", "3.4", "4.8", "2.4", "5.3", "5.6", "4.4", "2.5…
## $ sampling_bout <chr> "6", "1", "3", "5", "5", "3", "5", "2", "1", "2", "6"…
## $ Decimal.Lat <dbl> 51.67482, 51.67882, 51.67493, 51.64532, 51.67349, 51.…
## $ Decimal.Long <dbl> -128.1195, -128.1148, -128.1237, -128.1193, -128.1180…
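When a join leaves paired .x and .y columns, as it does for depth in the merge above, dplyr::coalesce() is one compact way to keep the first non-missing value. The sketch below uses made-up tables (habitat_demo and density_demo are hypothetical) and is offered as an alternative to the ifelse(is.na(...)) pattern, not as the code used for this dataset.

```r
library(dplyr)

# Hypothetical toy tables sharing a key and a depth column
habitat_demo <- tibble(site_id = c("A", "B"), depth = c(9.2, NA))
density_demo <- tibble(site_id = c("A", "B", "C"), depth = c(9.0, 3.4, 5.1))

merged_demo <- full_join(habitat_demo, density_demo, by = "site_id") %>%
  # take the habitat depth when present, otherwise the density depth
  mutate(depth_m = coalesce(depth.x, depth.y)) %>%
  select(-depth.x, -depth.y)

merged_demo
```

One practical difference is that coalesce() returns a vector of the same type as its inputs, whereas ifelse() drops attributes such as the date-time class.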
2.3.2 Convert Data to Darwin Core - Extended Measurement or Fact format
The Darwin Core ExtendedMeasurementOrFact (eMoF) extension bases records around a core event (rather than occurrence as in standard Darwin Core), allowing for additional measurement variables to be associated with occurrence data.
2.3.2.1 Add Event ID and Occurrence ID variables to dataset
As this dataset will be updated annually, rather than using surrogate keys (i.e. autogenerated with the uuid package) for event and occurrence IDs, here we will use natural keys made up of a concatenation of survey date, transect location, observation distance, and sample ID (for occurrenceID, when a sample is present).
# create and populate eventID variable
## currently only event is used, but additional surveys and abiotic data
## are associated with parent events that may be included at a later date
seagrass$eventID <- seagrass$hakai_id
# create and populate occurrenceID; combine eventID with transect_dist
# and field_uid
## in the event of <NA> field_uid, no sample was collected, but
## measurements and occurrence are still taken; no further subsamples
## are associated with <NA> field_uids
seagrass$occurrenceID <-
  with(seagrass,
       paste(eventID, transect_dist, field_uid, sep = ":"))
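Because these identifiers are built from data values rather than generated, it can be worth confirming that they are actually unique before going further. The following is a minimal sketch on made-up rows; the demo data frame and the SITE_A name are hypothetical.

```r
# Hypothetical toy rows mimicking the columns used to build the identifiers
demo <- data.frame(
  eventID       = c("2017-11-22:HAKAI:CALVERT:SITE_A:0",
                    "2017-11-22:HAKAI:CALVERT:SITE_A:5"),
  transect_dist = c("0", "5"),
  field_uid     = c("10883", "NA"),
  stringsAsFactors = FALSE
)

# Same concatenation as above: eventID, distance, then field sample ID
demo$occurrenceID <- with(demo,
                          paste(eventID, transect_dist, field_uid, sep = ":"))

# Collect any identifier that appears more than once
dup_ids <- demo$occurrenceID[duplicated(demo$occurrenceID)]
length(dup_ids)
```

A non-empty dup_ids would point to rows that need deduplicating, or an extra component in the key, before the tables are published.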
2.3.2.2 Create Event, Occurrence, and eMoF tables
Now that we’ve created eventIDs and occurrenceIDs to connect all the variables together, we can begin to create the Event, Occurrence, and extended Measurement or Fact tables necessary for Darwin Core-compliant datasets.
2.3.2.2.1 Event Table
# subset seagrass to create event table
seagrassEvent <-
  seagrass %>%
  distinct %>% # some duplicates in data stemming from database conflicts
  select(date,
         Decimal.Lat, Decimal.Long, transect_dist,
         depth_m, eventID) %>%
  # note: depth_m is given two new names here, and only the last one
  # (maximumDepthInMeters) is kept, as the output below shows
  rename(eventDate = date,
         decimalLatitude = Decimal.Lat,
         decimalLongitude = Decimal.Long,
         coordinateUncertaintyInMeters = transect_dist,
         minimumDepthInMeters = depth_m,
         maximumDepthInMeters = depth_m) %>%
  mutate(geodeticDatum = "WGS84",
         samplingEffort = "30 metre transect") %T>% glimpse
## Rows: 3,659
## Columns: 8
## $ eventDate <date> 2017-11-22, 2017-05-19, 2017-05-19, 201…
## $ decimalLatitude <dbl> 51.67482, 51.67882, 51.67493, 51.64532, …
## $ decimalLongitude <dbl> -128.1195, -128.1148, -128.1237, -128.11…
## $ coordinateUncertaintyInMeters <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ maximumDepthInMeters <chr> "9.2", "3.4", "4.8", "2.4", "5.3", "5.6"…
## $ eventID <chr> "2017-11-22:HAKAI:CALVERT:CHOKED_PASS_IN…
## $ geodeticDatum <chr> "WGS84", "WGS84", "WGS84", "WGS84", "WGS…
## $ samplingEffort <chr> "30 metre transect", "30 metre transect"…
# save event table to csv
write.csv(seagrassEvent, "processed_data/hakaiSeagrassDwcEvent.csv")
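If a single depth reading should populate both minimumDepthInMeters and maximumDepthInMeters, one option is to copy the column explicitly instead of renaming it. This is a hedged sketch on made-up data, not the code that produced the table above; seagrass_demo and its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy rows with a single depth reading per event
seagrass_demo <- tibble(
  eventID = c("E1", "E2"),
  date    = as.Date(c("2017-11-22", "2017-05-19")),
  depth_m = c("9.2", "3.4")
)

event_demo <- seagrass_demo %>%
  transmute(eventID,
            eventDate = date,
            # the same reading is used for both depth terms
            minimumDepthInMeters = as.numeric(depth_m),
            maximumDepthInMeters = as.numeric(depth_m))

event_demo
```

Here transmute() keeps only the listed columns, so the original depth_m column does not carry over into the event table.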
2.3.2.2.2 Occurrence Table
# subset seagrass to create occurrence table
seagrassOccurrence <-
  seagrass %>%
  distinct %>% # some duplicates in data stemming from database conflicts
  select(eventID, occurrenceID) %>%
  mutate(basisOfRecord = "HumanObservation",
         scientificName = "Zostera subg. Zostera marina",
         occurrenceStatus = "present")
# Taxonomic name matching
# in addition to the above metadata, DarwinCore format requires further
# taxonomic data that can be acquired through the WoRMS register.
## Load taxonomic info, downloaded via WoRMS tool
# zmWorms <-
# read.delim("raw_data/zmworms_matched.txt",
# header = TRUE,
# nrows = 1)
zmWorms <- wm_record(id = 145795)
# join WoRMS name with seagrassOccurrence created above
seagrassOccurrence <-
  full_join(seagrassOccurrence, zmWorms,
            by = c("scientificName" = "scientificname")) %>%
  select(eventID, occurrenceID, basisOfRecord, scientificName, occurrenceStatus, AphiaID,
         url, authority, status, unacceptreason, taxonRankID, rank,
         valid_AphiaID, valid_name, valid_authority, parentNameUsageID,
         kingdom, phylum, class, order, family, genus, citation, lsid,
         isMarine, match_type, modified) %T>%
  glimpse
## Rows: 3,659
## Columns: 27
## $ eventID <chr> "2017-11-22:HAKAI:CALVERT:CHOKED_PASS_INTERIOR6:0", …
## $ occurrenceID <chr> "2017-11-22:HAKAI:CALVERT:CHOKED_PASS_INTERIOR6:0:0:…
## $ basisOfRecord <chr> "HumanObservation", "HumanObservation", "HumanObserv…
## $ scientificName <chr> "Zostera subg. Zostera marina", "Zostera subg. Zoste…
## $ occurrenceStatus <chr> "present", "present", "present", "present", "present…
## $ AphiaID <int> 145795, 145795, 145795, 145795, 145795, 145795, 1457…
## $ url <chr> "https://www.marinespecies.org/aphia.php?p=taxdetail…
## $ authority <chr> "Linnaeus, 1753", "Linnaeus, 1753", "Linnaeus, 1753"…
## $ status <chr> "accepted", "accepted", "accepted", "accepted", "acc…
## $ unacceptreason <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ taxonRankID <int> 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 22…
## $ rank <chr> "Species", "Species", "Species", "Species", "Species…
## $ valid_AphiaID <int> 145795, 145795, 145795, 145795, 145795, 145795, 1457…
## $ valid_name <chr> "Zostera subg. Zostera marina", "Zostera subg. Zoste…
## $ valid_authority <chr> "Linnaeus, 1753", "Linnaeus, 1753", "Linnaeus, 1753"…
## $ parentNameUsageID <int> 370435, 370435, 370435, 370435, 370435, 370435, 3704…
## $ kingdom <chr> "Plantae", "Plantae", "Plantae", "Plantae", "Plantae…
## $ phylum <chr> "Tracheophyta", "Tracheophyta", "Tracheophyta", "Tra…
## $ class <chr> "Magnoliopsida", "Magnoliopsida", "Magnoliopsida", "…
## $ order <chr> "Alismatales", "Alismatales", "Alismatales", "Alisma…
## $ family <chr> "Zosteraceae", "Zosteraceae", "Zosteraceae", "Zoster…
## $ genus <chr> "Zostera", "Zostera", "Zostera", "Zostera", "Zostera…
## $ citation <chr> "WoRMS (2023). Zostera subg. Zostera marina Linnaeus…
## $ lsid <chr> "urn:lsid:marinespecies.org:taxname:145795", "urn:ls…
## $ isMarine <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ match_type <chr> "exact", "exact", "exact", "exact", "exact", "exact"…
## $ modified <chr> "2008-12-09T10:03:16.140Z", "2008-12-09T10:03:16.140…
# save occurrence table to csv
write.csv(seagrassOccurrence, "processed_data/hakaiSeagrassDwcOccurrence.csv")
2.3.2.2.3 Extended MeasurementOrFact table
seagrassMof <-
  seagrass %>%
  # select variables for eMoF table
  select(date,
         eventID, survey, site_id, transect_dist,
         substrate, patchiness, adj_habitat_1, adj_habitat_2,
         vegetation_1, vegetation_2,
         density_msq, canopy_height_cm, flowering_shoots) %>%
  # split substrate into two variables (currently holds two substrate types in same variable)
  separate(substrate, sep = ",", into = c("substrate_1", "substrate_2")) %>%
  # change variables names to match NERC database (or to be more descriptive where none exist)
  rename(measurementDeterminedDate = date,
         SubstrateTypeA = substrate_1,
         SubstrateTypeB = substrate_2,
         BarePatchLengthWithinSeagrass = patchiness,
         PrimaryAdjacentHabitat = adj_habitat_1,
         SecondaryAdjacentHabitat = adj_habitat_2,
         PrimaryAlgaeSp = vegetation_1,
         SecondaryAlgaeSp = vegetation_2,
         BedAbund = density_msq,
         CanopyHeight = canopy_height_cm,
         FloweringBedAbund = flowering_shoots) %>%
  # reformat variables into DwC MeasurementOrFact format
  # (single values variable, with measurement type, unit, etc. variables)
  pivot_longer( - c(measurementDeterminedDate, eventID, survey, site_id, transect_dist),
               names_to = "measurementType",
               values_to = "measurementValue",
               values_ptypes = list(measurementValue = "character")) %>%
  # use measurement type to fill in remainder of variables relating to
  # NERC vocabulary and metadata fields
  mutate(
    measurementTypeID = case_when(
      measurementType == "BedAbund" ~ "http://vocab.nerc.ac.uk/collection/P01/current/SDBIOL02/",
      measurementType == "CanopyHeight" ~ "http://vocab.nerc.ac.uk/collection/P01/current/OBSMAXLX/",
      # measurementType == "BarePatchWithinSeagrass" ~ "",
      measurementType == "FloweringBedAbund" ~ "http://vocab.nerc.ac.uk/collection/P01/current/SDBIOL02/"),
    measurementUnit = case_when(
      measurementType == "BedAbund" ~ "Number per square metre",
      measurementType == "CanopyHeight" ~ "Centimetres",
      measurementType == "BarePatchLengthWithinSeagrass" ~ "Metres",
      measurementType == "FloweringBedAbund" ~ "Number per square metre"),
    measurementUnitID = case_when(
      measurementType == "BedAbund" ~ "http://vocab.nerc.ac.uk/collection/P06/current/UPMS/",
      measurementType == "CanopyHeight" ~ "http://vocab.nerc.ac.uk/collection/P06/current/ULCM/",
      measurementType == "BarePatchLengthWithinSeagrass" ~ "http://vocab.nerc.ac.uk/collection/P06/current/ULAA/2/",
      measurementType == "FloweringBedAbund" ~ "http://vocab.nerc.ac.uk/collection/P06/current/UPMS/"),
    measurementAccuracy = case_when(
      measurementType == "CanopyHeight" ~ 5),
    measurementMethod = case_when(
      measurementType == "BedAbund" ~ "25cmx25cm quadrat count",
      measurementType == "CanopyHeight" ~ "in situ with ruler",
      measurementType == "BarePatchLengthWithinSeagrass" ~ "estimated along transect line",
      measurementType == "FloweringBedAbund" ~ "25cmx25cm quadrat count")) %>%
  select(eventID, measurementDeterminedDate, measurementType, measurementValue,
         measurementTypeID, measurementUnit, measurementUnitID, measurementAccuracy,
         measurementMethod) %T>%
  # select(!c(survey, site_id, transect_dist)) %T>%
  glimpse()
## Rows: 37,430
## Columns: 9
## $ eventID <chr> "2017-11-22:HAKAI:CALVERT:CHOKED_PASS_INTERI…
## $ measurementDeterminedDate <date> 2017-11-22, 2017-11-22, 2017-11-22, 2017-11…
## $ measurementType <chr> "SubstrateTypeA", "SubstrateTypeB", "BarePat…
## $ measurementValue <chr> "sand", "shell hash", "< 1", "seagrass", NA,…
## $ measurementTypeID <chr> NA, NA, NA, NA, NA, NA, NA, "http://vocab.ne…
## $ measurementUnit <chr> NA, NA, NA, NA, NA, NA, NA, "Number per squa…
## $ measurementUnitID <chr> NA, NA, NA, NA, NA, NA, NA, "http://vocab.ne…
## $ measurementAccuracy <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 5, NA, NA, N…
## $ measurementMethod <chr> NA, NA, NA, NA, NA, NA, NA, "25cmx25cm quadr…
# save eMoF table to csv
write.csv(seagrassMof, "processed_data/hakaiSeagrassDwcEmof.csv")
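As the number of measurement types grows, the repeated case_when() branches above can become hard to keep in sync. One alternative, sketched here on made-up data, is to keep the vocabulary in a small lookup table and join it on measurementType. The mof_demo and type_lookup tables are hypothetical; the unit labels and unit URIs are the ones used in the chunk above.

```r
library(dplyr)

# Hypothetical long-format measurements for one event
mof_demo <- tibble(
  eventID          = "E1",
  measurementType  = c("BedAbund", "CanopyHeight", "SubstrateTypeA"),
  measurementValue = c("64", "80", "sand")
)

# One row per measurement type that has a unit in the vocabulary
type_lookup <- tibble(
  measurementType   = c("BedAbund", "CanopyHeight"),
  measurementUnit   = c("Number per square metre", "Centimetres"),
  measurementUnitID = c("http://vocab.nerc.ac.uk/collection/P06/current/UPMS/",
                        "http://vocab.nerc.ac.uk/collection/P06/current/ULCM/")
)

mof_joined <- left_join(mof_demo, type_lookup, by = "measurementType")

# Types without a lookup entry are easy to list, which helps catch misspellings
missing_types <- mof_joined %>%
  filter(is.na(measurementUnit)) %>%
  distinct(measurementType)

missing_types
```

With this layout each measurement type is spelled once in the lookup table, so a mismatch between the renamed column and the vocabulary shows up as a row in missing_types.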
2.3.3 Session Info
Print session information below in case necessary for future reference
# Print Session Info for future reference
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] worrms_0.4.3 magrittr_2.0.3 knitr_1.42 here_1.0.1
## [5] lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0 dplyr_1.1.2
## [9] purrr_1.0.1 readr_2.1.4 tidyr_1.3.0 tibble_3.2.1
## [13] ggplot2_3.4.2 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.0 xfun_0.39 bslib_0.4.2 colorspace_2.1-0
## [5] vctrs_0.6.2 generics_0.1.3 htmltools_0.5.5 yaml_2.3.7
## [9] utf8_1.2.3 rlang_1.1.1 jquerylib_0.1.4 pillar_1.9.0
## [13] httpcode_0.3.0 glue_1.6.2 withr_2.5.0 bit64_4.0.5
## [17] readxl_1.4.3 lifecycle_1.0.3 munsell_0.5.0 gtable_0.3.3
## [21] cellranger_1.1.0 evaluate_0.21 tzdb_0.3.0 fastmap_1.1.1
## [25] curl_5.0.1 parallel_4.1.1 fansi_1.0.4 triebeard_0.4.1
## [29] urltools_1.7.3 Rcpp_1.0.11 scales_1.2.1 cachem_1.0.8
## [33] vroom_1.6.3 jsonlite_1.8.4 bit_4.0.5 hms_1.1.3
## [37] digest_0.6.31 stringi_1.7.12 bookdown_0.34 grid_4.1.1
## [41] rprojroot_2.0.3 cli_3.6.1 tools_4.1.1 sass_0.4.6
## [45] crul_1.4.0 crayon_1.5.2 pkgconfig_2.0.3 timechange_0.2.0
## [49] rmarkdown_2.21 rstudioapi_0.15.0 R6_2.5.1 compiler_4.1.1
2.4 Trawl Data
One of the more common datasets that can be standardized to Darwin Core and integrated within OBIS is catch data from e.g. a trawl sampling event, or a zooplankton net tow. Of special concern here are datasets that include both a total (species-specific) catch weight, in addition to individual measurements (for a subset of the overall data). In this case, through our standardization to Darwin Core, we want to ensure that data users understand that the individual measurements are a part of, or subset of, the overall (species-specific) record, whilst at the same time ensure that data providers are not duplicating occurrence records to OBIS.
The GitHub issue related to this application can be found here.
2.4.1 Workflow Overview
In our current setup, this relationship between the overall catch data and subsetted information is provided in the resourceRelationship extension. This extension cannot currently be harvested by GBIF. The required terms for this extension are resourceID, relatedResourceID, resourceRelationshipID and relationshipOfResource. The relatedResourceID here refers to the object of the relationship, whereas the resourceID refers to the subject of the relationship:
- resourceRelationshipID: a unique identifier for the relationship between one resource (the subject) and another (relatedResource, object).
- resourceID: a unique identifier for the resource that is the subject of the relationship.
- relatedResourceID: a unique identifier for the resource that is the object of the relationship.
- relationshipOfResource: The relationship of the subject (identified by the resourceID) to the object (relatedResourceID). The relationshipOfResource is a free text field.
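A small table built from the four terms above might look like the sketch below. All identifiers are made up for illustration, and the relationship text "is a subset of" is an assumed example of the free-text value, not necessarily the wording used in the published dataset.

```r
library(dplyr)

# Hypothetical occurrence IDs: one total-catch record and two measured fish
catch_id     <- "IYS:GoA2019:Stn1:trawl:occ-1"
specimen_ids <- c("IYS:GoA2019:Stn1:trawl:sample1:occ-1",
                  "IYS:GoA2019:Stn1:trawl:sample2:occ-1")

resource_relationship_demo <- tibble(
  resourceID             = specimen_ids,     # subject: the individual fish
  relatedResourceID      = catch_id,         # object: the overall catch record
  relationshipOfResource = "is a subset of"  # free text (assumed wording)
) %>%
  mutate(resourceRelationshipID = paste(relatedResourceID, "rr",
                                        row_number(), sep = ":")) %>%
  select(resourceRelationshipID, resourceID,
         relatedResourceID, relationshipOfResource)

resource_relationship_demo
```

Each row reads as a sentence: the resource (subject) "is a subset of" the related resource (object), which is how a data user can tell that the individual records are already counted in the total catch.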
A few resources have been published to OBIS that contain the resourceRelationship extension (examples). Here, I’ll lay out the process and coding used for the Trawl Catch and Species Abundance from the 2019 Gulf of Alaska International Year of the Salmon Expedition. In the following code chunks some details are omitted to improve the readability - the overall code to standardize the catch data can be found here. This dataset includes species-specific total catch data at multiple stations (sampling events). From each catch, individual measurements were also taken. Depending on the number of individuals caught in the trawl, these measurements cover either all individuals of a species, or only a subset (when large numbers of individuals were caught).
In this specific data record, we created a single Event Core with three extensions: an occurrence extension, measurement or fact extension, and the resourceRelationship extension. However, in this walk-through I’ll only touch on the Event Core, occurrence extension and resourceRelationship extension.
The trawl data is part of a larger project collecting various data types related to salmon ocean ecology. Therefore, in our Event Core we nested information related to the sampling event in the specific layer. (include a visual representation of the schema). Prior to creating the Event Core, we ensured that e.g. dates and times followed the correct ISO-8601 standards, and converted to the correct time zone.
# Time is recorded numerically (1037 instead of 10:37), so need to change these columns:
trawl2019$END_DEPLOYMENT_TIME <- substr(as.POSIXct(sprintf("%04.0f", trawl2019$END_DEPLOYMENT_TIME), format = "%H%M"), 12, 16)
trawl2019$BEGIN_RETRIEVAL_TIME <- substr(as.POSIXct(sprintf("%04.0f", trawl2019$BEGIN_RETRIEVAL_TIME), format = "%H%M"), 12, 16)
# Additionally, the vessel time is recorded in 'Vladivostok' according to the metadata tab. This has to be converted to UTC.
trawl2019 <- trawl2019 %>%
  mutate(eventDate_start = format_iso_8601(as.POSIXct(paste(EVENT_DATE_START, END_DEPLOYMENT_TIME),
                                                      tz = "Asia/Vladivostok")),
         eventDate_start = str_replace(eventDate_start, "\\+00:00", "Z"),
         eventDate_finish = format_iso_8601(as.POSIXct(paste(EVENT_DATE_FINISH, BEGIN_RETRIEVAL_TIME),
                                                       tz = "Asia/Vladivostok")),
         eventDate_finish = str_replace(eventDate_finish, "\\+00:00", "Z"),
         eventDate = paste(eventDate_start, eventDate_finish, sep = "/"),
         project = "IYS",
         cruise = paste(project, "GoA2019", sep = ":"),
         station = paste(cruise, TOW_NUMBER, sep=":Stn"),
         trawl = paste(station, "trawl", sep=":"))
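The time handling above can be hard to follow in one pass, so here is a step-by-step sketch on made-up values using only base R. It swaps the format_iso_8601() call for base format() with tz = "UTC"; raw_time and raw_date are hypothetical inputs, and the time zone name is the one used in the chunk above.

```r
# Hypothetical toy values: times logged as numbers, dates as text
raw_time <- c(1037, 905)
raw_date <- c("2019-02-19", "2019-02-20")

# Zero-pad to four digits, then insert the colon (905 becomes "09:05")
hhmm  <- sprintf("%04.0f", raw_time)
clock <- paste0(substr(hhmm, 1, 2), ":", substr(hhmm, 3, 4))

# Interpret the local date and time in the vessel's time zone
local_time <- as.POSIXct(paste(raw_date, clock),
                         format = "%Y-%m-%d %H:%M",
                         tz = "Asia/Vladivostok")

# Print the same instants in UTC with an ISO 8601 style "Z" suffix
event_date_utc <- format(local_time, "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
event_date_utc
```

Start and end timestamps produced this way can then be pasted together with "/" to form the interval used for eventDate.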
Then we created the various layers of our Event Core. We created these layers/data frames from two separate datasets that data are pulled from - one dataset that contains the overall catch data, and one dataset that contains the specimen data:
trawl2019_allCatch <- read_excel(here("Trawl", "2019", "raw_data",
                                      "2019_GoA_Fish_Trawl_catchdata.xlsx"), sheet = "CATCH_FINAL") %>%
  mutate(project = "IYS",
         cruise = paste(project, "GoA2019", sep = ":"),
         station = paste(cruise, `TOW_NUMBER (number)`, sep = ":Stn"),
         trawl = paste(station, "trawl", sep = ":"))
trawl2019_specimen <- read_excel(here("Trawl", "2019", "raw_data", "2019_GoA_Fish_Specimen_data.xlsx"),
                                 sheet = "SPECIMEN_FINAL") %>%
  mutate(project = "IYS",
         cruise = paste(project, "GoA2019", sep = ":"),
         station = paste(cruise, TOW_NUMBER, sep = ":Stn"),
         trawl = paste(station, "trawl", sep = ":"),
         sample = paste(trawl, "sample", sep = ":"),
         sample = paste(sample, row_number(), sep = ""))
Next we created the Event Core, ensuring that we connect the data to the right layer (i.e. date and time should be connected to the layer associated with the sampling event). Please note that because we are creating multiple layers and nesting information, and then at a later stage combining different tables, this results in cells being populated with NA. These have to be removed prior to publishing the Event Core through the IPT.
trawl2019_project <- trawl2019 %>%
  select(eventID = project) %>%
  distinct(eventID) %>%
  mutate(type = "project")

trawl2019_cruise <- trawl2019 %>%
  select(eventID = cruise,
         parentEventID = project) %>%
  distinct(eventID, .keep_all = TRUE) %>%
  mutate(type = "cruise")

trawl2019_station <- trawl2019 %>%
  select(eventID = station,
         parentEventID = cruise) %>%
  distinct(eventID, .keep_all = TRUE) %>%
  mutate(type = "station")

# The coordinates associated to the trawl need to be presented in a LINESTRING.
# END_LONGITUDE_DD needs to be inverted (has to be between -180 and 180, inclusive).
trawl2019_coordinates <- trawl2019 %>%
  select(eventID = trawl,
         START_LATITUDE_DD,
         longitude,
         END_LATITUDE_DD,
         END_LONGITUDE_DD) %>%
  mutate(END_LONGITUDE_DD = END_LONGITUDE_DD * -1,
         footprintWKT = paste("LINESTRING (", longitude, START_LATITUDE_DD, ",",
                              END_LONGITUDE_DD, END_LATITUDE_DD, ")"))
trawl2019_linestring <- obistools::calculate_centroid(trawl2019_coordinates$footprintWKT)
trawl2019_linestring <- cbind(trawl2019_coordinates, trawl2019_linestring) %>%
  select(eventID, footprintWKT, decimalLatitude, decimalLongitude, coordinateUncertaintyInMeters)
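
The footprint step can be illustrated with a small Python sketch. It is not the obistools implementation: the helper names `trawl_footprint` and `midpoint_and_radius` are hypothetical, the coordinates are made up, and the midpoint is a plain average that assumes a short tow that does not cross the antimeridian. The radius is taken as half the great-circle length of the tow.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, an approximation


def trawl_footprint(start_lon, start_lat, end_lon, end_lat):
    """Build a WKT LINESTRING; WKT expects longitude first, then latitude."""
    return f"LINESTRING ({start_lon} {start_lat}, {end_lon} {end_lat})"


def midpoint_and_radius(start_lon, start_lat, end_lon, end_lat):
    """Return (latitude, longitude, radius in meters) for a short tow."""
    mid_lat = (start_lat + end_lat) / 2
    mid_lon = (start_lon + end_lon) / 2
    # Haversine distance between the two end points.
    phi1, phi2 = math.radians(start_lat), math.radians(end_lat)
    d_phi = phi2 - phi1
    d_lambda = math.radians(end_lon - start_lon)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    length_m = 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))
    return mid_lat, mid_lon, length_m / 2


# Hypothetical tow: the end longitude was recorded as a positive number of
# degrees west, so its sign is flipped before the footprint is built.
start_lon, start_lat = -145.0, 50.0
end_lon, end_lat = -1 * 145.1, 50.0

print(trawl_footprint(start_lon, start_lat, end_lon, end_lat))
print(midpoint_and_radius(start_lon, start_lat, end_lon, end_lat))
```
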
trawl2019_trawl <- trawl2019 %>%
  select(eventID = trawl,
         parentEventID = station,
         eventDate,
         year,
         month,
         day) %>%
  mutate(minimumDepthInMeters = 0, # headrope was at the surface
         maximumDepthInMeters = trawl2019$MOUTH_OPENING_HEIGHT,
         samplingProtocol = "midwater trawl", # when available add DOI to paper here
         locality = case_when(
           trawl2019$EVENT_SUB_TYPE == "Can EEZ" ~ "Canadian EEZ"),
         locationID = case_when(
           trawl2019$EVENT_SUB_TYPE == "Can EEZ" ~ "http://marineregions.org/mrgid/8493")) %>%
  left_join(trawl2019_linestring, by = "eventID") %>%
  distinct(eventID, .keep_all = TRUE) %>%
  mutate(type = "midwater trawl")

trawl2019_sample <- trawl2019_specimen %>%
  select(eventID = sample,
         parentEventID = trawl) %>%
  distinct(eventID, .keep_all = TRUE) %>%
  mutate(type = "individual sample")

trawl2019_event <- bind_rows(trawl2019_project,
                             trawl2019_cruise,
                             trawl2019_station,
                             trawl2019_trawl,
                             trawl2019_sample)

# Remove NAs from the Event Core:
trawl2019_event <- sapply(trawl2019_event, as.character)
trawl2019_event[is.na(trawl2019_event)] <- ""
trawl2019_event <- as.data.frame(trawl2019_event)
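
The effect of stacking layers that carry different columns, and then blanking the resulting gaps, can be shown with a small Python sketch. The function name `stack_layers` and the example rows are hypothetical; the eventID values only mimic the naming pattern used above.

```python
def stack_layers(*layers):
    """Stack lists of row dictionaries, filling absent or None values with ""."""
    # Collect the union of column names in first-seen order.
    columns = []
    for layer in layers:
        for row in layer:
            for key in row:
                if key not in columns:
                    columns.append(key)
    # Rebuild every row with all columns, as text, with blanks for missing cells.
    stacked = []
    for layer in layers:
        for row in layer:
            stacked.append(
                {col: "" if row.get(col) is None else str(row[col]) for col in columns}
            )
    return stacked


project = [{"eventID": "IYS", "type": "project"}]
cruise = [{"eventID": "IYS:GoA2019", "parentEventID": "IYS", "type": "cruise"}]
station = [{"eventID": "IYS:GoA2019:Stn1", "parentEventID": "IYS:GoA2019", "type": "station"}]
trawl = [{"eventID": "IYS:GoA2019:Stn1:trawl", "parentEventID": "IYS:GoA2019:Stn1",
          "type": "midwater trawl", "minimumDepthInMeters": 0}]

event_core = stack_layers(project, cruise, station, trawl)
for row in event_core:
    print(row)
```

The project row has no parent and no depth, so both cells come out as empty strings rather than a missing-value marker, which is the state the Event Core needs to be in before publishing.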
TO DO: Add visual of e.g. the top 10 rows of the Event Core.
Now that we have created the Event Core, we create the Occurrence extension. To do this, we create two separate occurrence data tables: one that includes the occurrence data for the total catch, and one for the specimen data. The Occurrence extension is then created by combining these two data frames. Personally, I prefer to re-order it so it makes visual sense to me (nesting the specimen occurrence records under their respective overall catch data).
# wm_records_names() returns a list with one data frame per name, so bind the rows before joining.
trawl2019_allCatch_worms <- worrms::wm_records_names(unique(trawl2019_allCatch$scientificname)) %>% bind_rows()
trawl2019_occ <- left_join(trawl2019_allCatch, trawl2019_allCatch_worms, by = "scientificname") %>%
  rename(eventID = trawl,
         specificEpithet = species,
         scientificNameAuthorship = authority,
         taxonomicStatus = status,
         taxonRank = rank,
         scientificName = scientificname,
         scientificNameID = lsid,
         individualCount = `CATCH_COUNT (pieces)(**includes Russian expansion for some species)`,
         occurrenceRemarks = COMMENTS) %>%
  mutate(occurrenceID = paste(eventID, "occ", sep = ":"),
         occurrenceID = paste(occurrenceID, row_number(), sep = ":"),
         occurrenceStatus = "present",
         sex = "")

trawl2019_catch_ind_worms <- worrms::wm_records_names(unique(trawl2019_catch_ind$scientificname)) %>% bind_rows()
trawl2019_catch_ind_occ <- left_join(trawl2019_catch_ind, trawl2019_catch_ind_worms, by = "scientificname") %>%
  rename(scientificNameAuthorship = authority,
         taxonomicStatus = status,
         taxonRank = rank,
         scientificName = scientificname,
         scientificNameID = lsid) %>%
  mutate(occurrenceID = paste(eventID, "occ", sep = ":"),
         occurrenceStatus = "present",
         individualCount = 1)

# Combine the two occurrence data frames:
trawl2019_occ_ext <- dplyr::bind_rows(trawl2019_occ_fnl, trawl2019_catch_ind_fnl)

# To re-order the occurrenceID, use following code:
order <- stringr::str_sort(trawl2019_occ_ext$occurrenceID, numeric = TRUE)
trawl2019_occ_ext <- trawl2019_occ_ext[match(order, trawl2019_occ_ext$occurrenceID),] %>%
  mutate(basisOfRecord = "HumanObservation")
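
The re-ordering relies on a numeric-aware sort, so that Stn2 comes before Stn10. A rough Python equivalent is sketched below. The helper `natural_key` is hypothetical and only approximates what `str_sort(numeric = TRUE)` does, and the identifiers are invented examples that follow the pattern above.

```python
import re


def natural_key(identifier):
    """Split an identifier into text and integer parts so digits compare as numbers."""
    # re.split with a capturing group alternates text, digits, text, ...
    # so parts at the same position always have the same type.
    return [int(part) if part.isdigit() else part for part in re.split(r"(\d+)", identifier)]


occurrence_ids = [
    "IYS:GoA2019:Stn1:trawl:occ:10",
    "IYS:GoA2019:Stn10:trawl:occ:1",
    "IYS:GoA2019:Stn1:trawl:occ:2",
    "IYS:GoA2019:Stn2:trawl:occ:1",
]

ordered = sorted(occurrence_ids, key=natural_key)
for occurrence_id in ordered:
    print(occurrence_id)
# IYS:GoA2019:Stn1:trawl:occ:2
# IYS:GoA2019:Stn1:trawl:occ:10
# IYS:GoA2019:Stn2:trawl:occ:1
# IYS:GoA2019:Stn10:trawl:occ:1
```

A plain alphabetical sort would place Stn10 before Stn2 and occ:10 before occ:2, which breaks the visual nesting of records.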
TO DO: Add visual of e.g. the top 10 rows of the Occurrence extension.
Please note that in the overall species-specific occurrence data frame, individualCount was not included. This term should not be used for abundance studies, but to avoid confusion and the appearance that the specimen records are an additional observation on top of the overall catch record, the individualCount term was left blank for the overall catch data.
A resource relationship extension is created to further highlight that the individual samples in the occurrence extension are part of a larger overall catch that was also listed in the occurrence extension. In this extension, we wanted to make sure to highlight that the specimen occurrence records are a subset of the overall catch data through the field relationshipOfResource. Each of these relationships gets a unique resourceRelationshipID.
trawl_resourceRelationship <- trawl2019_occ_ext %>%
  select(eventID, occurrenceID, scientificName) %>%
  mutate(resourceID = ifelse(grepl("sample", trawl2019_occ_ext$occurrenceID), trawl2019_occ_ext$occurrenceID, NA)) %>%
  mutate(eventID = gsub(":sample.*", "", trawl2019_occ_ext$eventID)) %>%
  group_by(eventID, scientificName) %>%
  filter(n() != 1) %>%
  ungroup()

trawl_resourceRelationship <- trawl_resourceRelationship %>%
  mutate(relatedResourceID = ifelse(grepl("sample", trawl_resourceRelationship$occurrenceID), NA, trawl_resourceRelationship$occurrenceID)) %>%
  mutate(relationshipOfResource = ifelse(!is.na(resourceID), "is a subset of", NA)) %>%
  dplyr::arrange(eventID, scientificName) %>%
  fill(relatedResourceID) %>%
  filter(!is.na(resourceID))

order <- stringr::str_sort(trawl_resourceRelationship$resourceID, numeric = TRUE)
trawl_resourceRelationship <- trawl_resourceRelationship[match(order, trawl_resourceRelationship$resourceID),]

trawl_resourceRelationship <- trawl_resourceRelationship %>%
  mutate(resourceRelationshipID = paste(relatedResourceID, "rr", sep = ":"),
         ID = sprintf("%03d", row_number()),
         resourceRelationshipID = paste(resourceRelationshipID, ID, sep = ":")) %>%
  select(eventID, resourceRelationshipID, resourceID, relationshipOfResource, relatedResourceID)
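
The linking logic can be summarised in a short Python sketch: each specimen occurrence is matched to the catch occurrence that shares its trawl event and scientific name. This is an illustration under stated assumptions, not a port of the R code above. The function name `build_resource_relationships` and the example records are hypothetical, and it uses a dictionary lookup instead of the group, arrange, and fill steps.

```python
def build_resource_relationships(occurrences):
    """Link specimen occurrences to the catch occurrence of the same trawl and taxon."""
    # Index catch-level occurrences by (trawl eventID, scientificName).
    catch_index = {}
    for occ in occurrences:
        if ":sample" not in occ["occurrenceID"]:
            catch_index[(occ["eventID"], occ["scientificName"])] = occ["occurrenceID"]

    relationships = []
    for occ in occurrences:
        if ":sample" not in occ["occurrenceID"]:
            continue  # only specimen records become resourceID
        trawl_event = occ["eventID"].split(":sample")[0]
        related = catch_index.get((trawl_event, occ["scientificName"]))
        if related is None:
            continue  # no matching catch record, so nothing to relate to
        number = len(relationships) + 1
        relationships.append({
            "eventID": trawl_event,
            "resourceRelationshipID": f"{related}:rr:{number:03d}",
            "resourceID": occ["occurrenceID"],
            "relationshipOfResource": "is a subset of",
            "relatedResourceID": related,
        })
    return relationships


trawl = "IYS:GoA2019:Stn1:trawl"
occurrences = [
    {"eventID": trawl, "occurrenceID": trawl + ":occ:1", "scientificName": "Oncorhynchus keta"},
    {"eventID": trawl + ":sample1", "occurrenceID": trawl + ":sample1:occ", "scientificName": "Oncorhynchus keta"},
    {"eventID": trawl + ":sample2", "occurrenceID": trawl + ":sample2:occ", "scientificName": "Oncorhynchus keta"},
    {"eventID": trawl + ":sample3", "occurrenceID": trawl + ":sample3:occ", "scientificName": "Oncorhynchus nerka"},
]

relationships = build_resource_relationships(occurrences)
for rel in relationships:
    print(rel["resourceRelationshipID"], rel["resourceID"], rel["relationshipOfResource"], rel["relatedResourceID"])
```

In this example the third specimen has no catch record for its taxon in that trawl, so it produces no relationship row.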
TO DO: Add visual of e.g. the top 10 rows of the ResourceRelationship extension.
2.4.2 FAQ
Q1. Why not use the terms associatedOccurrence or associatedTaxa?
A. There seems to be a movement away from the term associatedOccurrence, as the resourceRelationship extension has a much broader use case. Some issues that were raised on GitHub exemplify this, see e.g. here. associatedTaxa is used to provide identifiers or names of taxa and the associations of an Occurrence with them. This term is not apt for establishing relationships between taxa, only between specific Occurrences of an organism with other taxa. As noted on the TDWG website, […] Note that the ResourceRelationship class is an alternative means of representing associations, and with more detail. See also e.g. this issue.
2.5 dataset-edna
2.5.1 Introduction
Rationale:
DNA derived data are increasingly being used to document taxon occurrences. To ensure these data are useful to the broadest possible community, GBIF published a guide entitled “Publishing DNA-derived data through biodiversity data platforms.” This guide is supported by the DNA derived data extension for Darwin Core, which incorporates MIxS terms into the Darwin Core standard.
This use case draws on both the guide and the extension to illustrate how to incorporate a DNA derived data extension file into a Darwin Core archive.
For further information on this use case and the DNA Derived data extension in general, see the recording of the OBIS Webinar on Genetic Data.
Project abstract:
The example data employed in this use case are from marine filtered seawater samples collected at a nearshore station in Monterey Bay, California, USA. They were collected by CTD rosette and filtered by a peristaltic pump system. Subsequently, they underwent metabarcoding for the 18S V9 region. The resulting ASVs, their assigned taxonomy, and the metadata associated with their collection are the input data for the conversion scripts presented here.
A selection of samples from this collection were included in the publication “Environmental DNA reveals seasonal shifts and potential interactions in a marine community” which was published with open access in Nature Communications in 2020.
Contacts:
- Francisco Chavez - Principal Investigator (chfr@mbari.org)
- Kathleen Pitz - Research Associate (kpitz@mbari.org)
- Diana LaScala-Gruenewald - Point of Contact (dianalg@mbari.org)
2.5.3 Repo structure
.
+-- README.md :Description of this repository
+-- LICENSE :Repository license
+-- .gitignore :Files and directories to be ignored by git
+-- environment.yml :Conda environment configuration file for Binder
|
+-- raw
| +-- asv_table.csv :Source data containing ASV sequences and number of reads
| +-- taxa_table.csv :Source data containing taxon matches for each ASV
| +-- metadata_table.csv :Source data containing metadata about samples (e.g. collection information)
|
+-- src
| +-- conversion_code.py :Darwin Core mapping script
| +-- conversion_code.ipynb :Darwin Core mapping Jupyter Notebook
| +-- WoRMS.py :Functions for querying the World Register of Marine Species
|
+-- processed
| +-- occurrence.csv :Occurrence file, generated by conversion_code
| +-- dna_extension.csv :DNA Derived Data Extension file, generated by conversion_code