Aligning Data to Darwin Core

Creating event core with an occurrence and extended measurement or fact extension using Python#

Created: 2020-12-08

Caution: This notebook was created for the IOOS DMAC Code Sprint Biological Data Session. The data in this notebook were created specifically as an example and are meant solely to illustrate the process of aligning data to the biological data standard, Darwin Core. These data should not be considered actual occurrences of species, and any measurements are also contrived. This notebook is meant to provide a step-by-step process for taking original data and aligning it to Darwin Core. It has been adapted from the R markdown notebook created by Abby Benson, IOOS_DMAC_DataToDWC_Notebook_event.md.

First let’s bring in the appropriate libraries to work with the tabular data files and generate the appropriate content for the Darwin Core requirements.

import csv
import pprint
import uuid

import numpy as np
import pandas as pd
import pyworms

Now we need to read in the raw data file using pandas.read_csv(). Here we display the first five rows of data to give the user an idea of what observations are contained in the raw file.

url = (
    "https://raw.githubusercontent.com/ioos/ioos_code_lab/main/"
    "jupyterbook/content/code_gallery/data/"
)
file = "MadeUpDataForBiologicalDataTraining.csv"
df = pd.read_csv(url + file, header=[0])
df.head()

	date	lat	lon	region	station	transect	scientific name	percent cover	depth	bottom type	rugosity	temperature
0	7/16/2004	18.29788	-64.79451	St. John	250	1	Acropora cervicornis	0	25	shallow reef flat	0.295833	25.2
1	7/16/2004	18.29788	-64.79451	St. John	250	1	Madracis auretenra	5	25	shallow reef flat	0.295833	25.2
2	7/16/2004	18.29788	-64.79451	St. John	250	1	Mussa angulosa	15	25	shallow reef flat	0.295833	25.2
3	7/16/2004	18.29788	-64.79451	St. John	250	1	Siderastrea radians	0	25	shallow reef flat	0.295833	25.2
4	7/16/2004	18.29788	-64.79451	St. John	250	2	Acropora cervicornis	0	35	complex back reef	0.364583	24.8

First we need to decide if we will build an occurrence-only version of the data or an event core with occurrence and extended measurement or fact (eMoF) extensions.

Occurrence only:
- Easier to create.
- It’s only one file to produce.
- However, several pieces of information will be left out if we choose that option.

Sampling event with occurrence and extended measurement or fact (eMoF):
- More difficult to create.
- Composed of several files.
- Can capture all of the data in the file creating a lossless version.

Here we choose the second option, a sampling event with occurrence and extended measurement or fact (eMoF), to include as much information as we can.

First let’s create the eventID and occurrenceID in the original file so that information can be reused for all necessary files down the line.

df["eventID"] = df[["region", "station", "transect"]].apply(
    lambda x: "_".join(x.astype(str)), axis=1
)
df["occurrenceID"] = uuid.uuid4()
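Note that uuid.uuid4() is evaluated once in the assignment above, so pandas broadcasts that single value to every row; this is why the sample outputs later in this notebook show the same occurrenceID on each record. If a distinct identifier per occurrence is wanted, one possible sketch is to generate one UUID per row. The DataFrame `toy` below is stand-in data for illustration, not the notebook's `df`.

```python
import uuid

import pandas as pd

# Stand-in data for illustration only; the column name follows the notebook.
toy = pd.DataFrame(
    {"eventID": ["St. John_250_1", "St. John_250_1", "St. John_250_2"]}
)

# Call uuid.uuid4() once per row and store the result as a string
# so that it serializes cleanly to CSV.
toy["occurrenceID"] = [str(uuid.uuid4()) for _ in range(len(toy))]

all_unique = toy["occurrenceID"].is_unique
```

The same list comprehension could be applied to `df` in place of the single assignment, at the cost of identifiers that change on every run.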

We will need to create three separate files to comply with the sampling event format. We’ll start with the event file but we only need to include the columns that are relevant to the event file.

Event file#

More information on the event category in Darwin Core can be found at https://dwc.tdwg.org/terms/#event.

Let’s first make a copy of the DataFrame we pulled in, using only the data fields of interest for the event file.

event = df[
    [
        "date",
        "lat",
        "lon",
        "region",
        "station",
        "transect",
        "depth",
        "bottom type",
        "eventID",
    ]
].copy()

Next we need to copy each column that maps directly to Darwin Core into a new column named after the matching term.

event["decimalLatitude"] = event["lat"]
event["decimalLongitude"] = event["lon"]
event["minimumDepthInMeters"] = event["depth"]
event["maximumDepthInMeters"] = event["depth"]
event["habitat"] = event["bottom type"]
event["island"] = event["region"]

We need to parse the date field so we can export it in ISO format. We also add any required fields that are missing.

event["eventDate"] = pd.to_datetime(event["date"], format="%m/%d/%Y")
event["basisOfRecord"] = "HumanObservation"
event["geodeticDatum"] = "EPSG:4326 WGS84"
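As a small illustration of the date handling, the sketch below parses one of the raw month/day/year strings and renders it as an ISO 8601 date using only the standard library. The helper name `to_iso_date` is hypothetical and not part of the notebook, which relies on pd.to_datetime() and the date_format argument of to_csv() instead.

```python
from datetime import datetime


def to_iso_date(raw: str) -> str:
    """Convert a month/day/year string such as '7/16/2004' to 'YYYY-MM-DD'."""
    return datetime.strptime(raw, "%m/%d/%Y").date().isoformat()


iso = to_iso_date("7/16/2004")
print(iso)  # 2004-07-16
```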

Then we’ll remove any fields that we no longer need to clean things up a bit.

event.drop(
    columns=[
        "date",
        "lat",
        "lon",
        "region",
        "station",
        "transect",
        "depth",
        "bottom type",
    ],
    inplace=True,
)

We have too many repeating rows of information. We can pare this down using eventID which is a unique identifier for each sampling event in the data.

event.drop_duplicates(subset="eventID", inplace=True)
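drop_duplicates() keeps the first row for each eventID, so it quietly assumes that every event-level field is constant within an event. A minimal sketch of checking that assumption before dropping rows, using stand-in data rather than the notebook's `event` DataFrame:

```python
import pandas as pd

# Stand-in data for illustration only.
toy_event = pd.DataFrame(
    {
        "eventID": ["St. John_250_1", "St. John_250_1", "St. John_250_2"],
        "maximumDepthInMeters": [25, 25, 35],
    }
)

# Every non-key column should hold exactly one distinct value per eventID.
consistent = bool(toy_event.groupby("eventID").nunique().max().max() == 1)

deduplicated = toy_event.drop_duplicates(subset="eventID")
print(len(toy_event), "->", len(deduplicated))  # 3 -> 2
```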

Finally, we write out the event file, specifying the ISO date format. We’ve printed five random rows of the DataFrame to give an example of what the resultant file will look like.

# Write the event file to the current working directory.
file = "MadeUpData_event.csv"

event.to_csv(file, header=True, index=False, date_format="%Y-%m-%d")

event.sample(n=5).sort_index()

	eventID	decimalLatitude	decimalLongitude	minimumDepthInMeters	maximumDepthInMeters	habitat	island	eventDate	basisOfRecord	geodeticDatum
4	St. John_250_2	18.29788	-64.79451	35	35	complex back reef	St. John	2004-07-16	HumanObservation	EPSG:4326 WGS84
8	St. John_250_3	18.29788	-64.79451	85	85	deep reef	St. John	2004-07-16	HumanObservation	EPSG:4326 WGS84
12	St. John_356_1	18.27609	-64.75740	28	28	complex back reef	St. John	2004-07-17	HumanObservation	EPSG:4326 WGS84
16	St. John_356_2	18.27609	-64.75740	16	16	shallow reef flat	St. John	2004-07-17	HumanObservation	EPSG:4326 WGS84
20	St. John_356_3	18.27609	-64.75740	90	90	deep reef	St. John	2004-07-17	HumanObservation	EPSG:4326 WGS84

Occurrence file#

More information on the occurrence category in Darwin Core can be found at https://dwc.tdwg.org/terms/#occurrence.

For creating the occurrence file, we start by creating the DataFrame and renaming the fields that align directly with Darwin Core. Then, we’ll add the required information that is missing.

occurrence = df[["scientific name", "eventID", "occurrenceID", "percent cover"]].copy()
occurrence["scientificName"] = occurrence["scientific name"]
occurrence["occurrenceStatus"] = np.where(
    occurrence["percent cover"] == 0, "absent", "present"
)
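One detail of the np.where() call is worth spelling out: a missing percent cover compares unequal to 0 and would therefore be labelled "present". The sketch below shows that behavior on stand-in values and one possible guard; whether missing cover should be left blank is an assumption here, not something the original notebook decides.

```python
import numpy as np
import pandas as pd

# Stand-in values: absent, present, and a missing observation.
cover = pd.Series([0, 5, np.nan])

# Same rule as the notebook: NaN == 0 is False, so NaN maps to "present".
naive = np.where(cover == 0, "absent", "present").tolist()

# Possible guard: leave the status empty when the cover value is missing.
guarded = [
    None if pd.isna(value) else ("absent" if value == 0 else "present")
    for value in cover
]

print(naive)    # ['absent', 'present', 'present']
print(guarded)  # ['absent', 'present', None]
```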

Taxonomic Name Matching#

A requirement for OBIS is that all scientific names match to the World Register of Marine Species (WoRMS) and that a scientificNameID is included. A scientificNameID looks like urn:lsid:marinespecies.org:taxname:275730, where the digits after the last colon are the WoRMS AphiaID. We’ll need to query WoRMS for this information, so we create a lookup table of the unique scientific names found in the occurrence data we created above.

lut_worms = pd.DataFrame(
    columns=["scientificName"], data=occurrence["scientificName"].unique()
)
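As an aside on the scientificNameID format described above, the relationship between an AphiaID and its LSID is plain string composition. The two helper names below are hypothetical, shown only to make the format concrete; the notebook itself takes the `lsid` value straight from the WoRMS response.

```python
LSID_PREFIX = "urn:lsid:marinespecies.org:taxname:"


def lsid_from_aphia_id(aphia_id: int) -> str:
    """Build a WoRMS-style LSID string from a numeric AphiaID."""
    return f"{LSID_PREFIX}{aphia_id}"


def aphia_id_from_lsid(lsid: str) -> int:
    """Extract the numeric AphiaID from a WoRMS-style LSID string."""
    if not lsid.startswith(LSID_PREFIX):
        raise ValueError(f"Not a WoRMS taxname LSID: {lsid!r}")
    return int(lsid[len(LSID_PREFIX):])


example_lsid = lsid_from_aphia_id(275730)
print(example_lsid)  # urn:lsid:marinespecies.org:taxname:275730
print(aphia_id_from_lsid(example_lsid))  # 275730
```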

Next, we add the columns that we can populate from WoRMS, including the required scientificNameID, and fill them with empty strings to initialize the lookup table for population later.

headers = [
    "acceptedname",
    "acceptedID",
    "scientificNameID",
    "kingdom",
    "phylum",
    "class",
    "order",
    "family",
    "genus",
    "scientificNameAuthorship",
    "taxonRank",
]

for head in headers:
    lut_worms[head] = ""

Next, we perform a taxonomic lookup with the pyworms library, using the function pyworms.aphiaRecordsByMatchNames() to collect the information and populate the lookup table.

Here we print the scientific name of the species we are looking up and the matching response from WoRMS with the detailed species information.

for index, row in lut_worms.iterrows():
    print(f"\n**Searching for scientific name = {row['scientificName']}**")
    resp = pyworms.aphiaRecordsByMatchNames(row["scientificName"])[0][0]
    pprint.pprint(resp)
    lut_worms.loc[index, "acceptedname"] = resp["valid_name"]
    lut_worms.loc[index, "acceptedID"] = resp["valid_AphiaID"]
    lut_worms.loc[index, "scientificNameID"] = resp["lsid"]
    lut_worms.loc[index, "kingdom"] = resp["kingdom"]
    lut_worms.loc[index, "phylum"] = resp["phylum"]
    lut_worms.loc[index, "class"] = resp["class"]
    lut_worms.loc[index, "order"] = resp["order"]
    lut_worms.loc[index, "family"] = resp["family"]
    lut_worms.loc[index, "genus"] = resp["genus"]
    lut_worms.loc[index, "scientificNameAuthorship"] = resp["authority"]
    lut_worms.loc[index, "taxonRank"] = resp["rank"]

**Searching for scientific name = Acropora cervicornis**
{'AphiaID': 206989,
 'authority': '(Lamarck, 1816)',
 'citation': 'Hoeksema, B. W.; Cairns, S. (2021). World List of Scleractinia. '
             'Acropora cervicornis (Lamarck, 1816). Accessed through: World '
             'Register of Marine Species at: '
             'http://www.marinespecies.org/aphia.php?p=taxdetails&id=206989 on '
             '2021-08-30',
 'class': 'Anthozoa',
 'family': 'Acroporidae',
 'genus': 'Acropora',
 'isBrackish': 0,
 'isExtinct': None,
 'isFreshwater': 0,
 'isMarine': 1,
 'isTerrestrial': 0,
 'kingdom': 'Animalia',
 'lsid': 'urn:lsid:marinespecies.org:taxname:206989',
 'match_type': 'exact',
 'modified': '2018-08-27T16:36:11.490Z',
 'order': 'Scleractinia',
 'parentNameUsageID': 205469,
 'phylum': 'Cnidaria',
 'rank': 'Species',
 'scientificname': 'Acropora cervicornis',
 'status': 'accepted',
 'taxonRankID': 220,
 'unacceptreason': None,
 'url': 'http://www.marinespecies.org/aphia.php?p=taxdetails&id=206989',
 'valid_AphiaID': 206989,
 'valid_authority': '(Lamarck, 1816)',
 'valid_name': 'Acropora cervicornis'}

**Searching for scientific name = Madracis auretenra**
{'AphiaID': 430664,
 'authority': 'Locke, Weil & Coates, 2007',
 'citation': 'Hoeksema, B. W.; Cairns, S. (2021). World List of Scleractinia. '
             'Madracis auretenra Locke, Weil & Coates, 2007. Accessed through: '
             'World Register of Marine Species at: '
             'http://www.marinespecies.org/aphia.php?p=taxdetails&id=430664 on '
             '2021-08-30',
 'class': 'Anthozoa',
 'family': 'Pocilloporidae',
 'genus': 'Madracis',
 'isBrackish': 0,
 'isExtinct': None,
 'isFreshwater': 0,
 'isMarine': 1,
 'isTerrestrial': 0,
 'kingdom': 'Animalia',
 'lsid': 'urn:lsid:marinespecies.org:taxname:430664',
 'match_type': 'exact',
 'modified': '2020-04-10T07:30:40.497Z',
 'order': 'Scleractinia',
 'parentNameUsageID': 135125,
 'phylum': 'Cnidaria',
 'rank': 'Species',
 'scientificname': 'Madracis auretenra',
 'status': 'accepted',
 'taxonRankID': 220,
 'unacceptreason': None,
 'url': 'http://www.marinespecies.org/aphia.php?p=taxdetails&id=430664',
 'valid_AphiaID': 430664,
 'valid_authority': 'Locke, Weil & Coates, 2007',
 'valid_name': 'Madracis auretenra'}

**Searching for scientific name = Mussa angulosa**
{'AphiaID': 216135,
 'authority': '(Pallas, 1766)',
 'citation': 'Hoeksema, B. W.; Cairns, S. (2021). World List of Scleractinia. '
             'Mussa angulosa (Pallas, 1766). Accessed through: World Register '
             'of Marine Species at: '
             'http://www.marinespecies.org/aphia.php?p=taxdetails&id=216135 on '
             '2021-08-30',
 'class': 'Anthozoa',
 'family': 'Faviidae',
 'genus': 'Mussa',
 'isBrackish': 0,
 'isExtinct': 0,
 'isFreshwater': 0,
 'isMarine': 1,
 'isTerrestrial': 0,
 'kingdom': 'Animalia',
 'lsid': 'urn:lsid:marinespecies.org:taxname:216135',
 'match_type': 'exact',
 'modified': '2020-06-28T17:27:59.150Z',
 'order': 'Scleractinia',
 'parentNameUsageID': 206306,
 'phylum': 'Cnidaria',
 'rank': 'Species',
 'scientificname': 'Mussa angulosa',
 'status': 'accepted',
 'taxonRankID': 220,
 'unacceptreason': None,
 'url': 'http://www.marinespecies.org/aphia.php?p=taxdetails&id=216135',
 'valid_AphiaID': 216135,
 'valid_authority': '(Pallas, 1766)',
 'valid_name': 'Mussa angulosa'}

**Searching for scientific name = Siderastrea radians**
{'AphiaID': 207517,
 'authority': '(Pallas, 1766)',
 'citation': 'Hoeksema, B. W.; Cairns, S. (2021). World List of Scleractinia. '
             'Siderastrea radians (Pallas, 1766). Accessed through: World '
             'Register of Marine Species at: '
             'http://www.marinespecies.org/aphia.php?p=taxdetails&id=207517 on '
             '2021-08-30',
 'class': 'Anthozoa',
 'family': 'Siderastreidae',
 'genus': 'Siderastrea',
 'isBrackish': 0,
 'isExtinct': None,
 'isFreshwater': 0,
 'isMarine': 1,
 'isTerrestrial': 0,
 'kingdom': 'Animalia',
 'lsid': 'urn:lsid:marinespecies.org:taxname:207517',
 'match_type': 'exact',
 'modified': '2014-06-02T10:15:47.813Z',
 'order': 'Scleractinia',
 'parentNameUsageID': 204291,
 'phylum': 'Cnidaria',
 'rank': 'Species',
 'scientificname': 'Siderastrea radians',
 'status': 'accepted',
 'taxonRankID': 220,
 'unacceptreason': None,
 'url': 'http://www.marinespecies.org/aphia.php?p=taxdetails&id=207517',
 'valid_AphiaID': 207517,
 'valid_authority': '(Pallas, 1766)',
 'valid_name': 'Siderastrea radians'}
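The loop above indexes the response with [0][0], which raises an IndexError if a name has no match. A minimal sketch of a more defensive accessor follows. The helper name `first_match` is hypothetical, and the assumption that an unmatched name yields an empty list at either level is about the response shape and has not been verified against pyworms here; the stand-in responses are built by hand, so no network call is made.

```python
from typing import Optional


def first_match(response: list) -> Optional[dict]:
    """Return the first match for the first queried name, or None if absent."""
    if not response or not response[0]:
        return None
    return response[0][0]


# Hand-built stand-ins; the matched values are copied from the output above.
matched = [
    [
        {
            "valid_name": "Mussa angulosa",
            "valid_AphiaID": 216135,
            "lsid": "urn:lsid:marinespecies.org:taxname:216135",
        }
    ]
]
unmatched = [[]]

print(first_match(matched)["lsid"])
print(first_match(unmatched))  # None
```

With such a helper, unmatched names could be collected for manual review instead of stopping the loop.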

We then merge the lookup table of unique scientific names back into the occurrence data, matching on the field scientificName. Then, we remove any unnecessary columns to clean up the DataFrame for writing.

occurrence = pd.merge(occurrence, lut_worms, how="left", on="scientificName")

occurrence.drop(columns=["scientific name", "percent cover"], inplace=True)
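A left merge like the one above should neither add nor drop occurrence rows. The sketch below, on stand-in data, uses the `validate` argument of pd.merge() to assert that the lookup table has one row per name, and then counts names that received no scientificNameID. The LSIDs are copied from the WoRMS output shown earlier.

```python
import pandas as pd

# Stand-in occurrence rows and lookup table for illustration only.
occ = pd.DataFrame(
    {
        "scientificName": [
            "Mussa angulosa",
            "Mussa angulosa",
            "Acropora cervicornis",
        ]
    }
)
lut = pd.DataFrame(
    {
        "scientificName": ["Mussa angulosa", "Acropora cervicornis"],
        "scientificNameID": [
            "urn:lsid:marinespecies.org:taxname:216135",
            "urn:lsid:marinespecies.org:taxname:206989",
        ],
    }
)

# validate="many_to_one" raises MergeError if a name repeats in the lookup.
merged = pd.merge(
    occ, lut, how="left", on="scientificName", validate="many_to_one"
)

# Rows whose name was not found in the lookup table carry a missing ID.
unmatched_count = int(merged["scientificNameID"].isna().sum())
print(len(merged), unmatched_count)  # 3 0
```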

Finally, we write out the occurrence file. We’ve printed ten random rows of the DataFrame to give an example of what the resultant file will look like.

# sort the rows by scientificName
occurrence.sort_values("scientificName", inplace=True)

# reorganize column order to be consistent with R example:
columns = [
    "scientificName",
    "eventID",
    "occurrenceID",
    "occurrenceStatus",
    "acceptedname",
    "acceptedID",
    "scientificNameID",
    "kingdom",
    "phylum",
    "class",
    "order",
    "family",
    "genus",
    "scientificNameAuthorship",
    "taxonRank",
]


# Write the occurrence file to the current working directory.
file = "MadeUpData_Occurrence.csv"

occurrence.to_csv(
    file, header=True, index=False, quoting=csv.QUOTE_ALL, columns=columns
)

occurrence.sample(n=10).sort_index()

	eventID	occurrenceID	scientificName	occurrenceStatus	acceptedname	acceptedID	scientificNameID	kingdom	phylum	class	order	family	genus	scientificNameAuthorship	taxonRank
4	St. John_250_2	f470068c-998a-4e9b-b026-02bf02118de7	Acropora cervicornis	absent	Acropora cervicornis	206989	urn:lsid:marinespecies.org:taxname:206989	Animalia	Cnidaria	Anthozoa	Scleractinia	Acroporidae	Acropora	(Lamarck, 1816)	Species
5	St. John_250_2	f470068c-998a-4e9b-b026-02bf02118de7	Madracis auretenra	present	Madracis auretenra	430664	urn:lsid:marinespecies.org:taxname:430664	Animalia	Cnidaria	Anthozoa	Scleractinia	Pocilloporidae	Madracis	Locke, Weil & Coates, 2007	Species
7	St. John_250_2	f470068c-998a-4e9b-b026-02bf02118de7	Siderastrea radians	absent	Siderastrea radians	207517	urn:lsid:marinespecies.org:taxname:207517	Animalia	Cnidaria	Anthozoa	Scleractinia	Siderastreidae	Siderastrea	(Pallas, 1766)	Species
10	St. John_250_3	f470068c-998a-4e9b-b026-02bf02118de7	Mussa angulosa	present	Mussa angulosa	216135	urn:lsid:marinespecies.org:taxname:216135	Animalia	Cnidaria	Anthozoa	Scleractinia	Faviidae	Mussa	(Pallas, 1766)	Species
12	St. John_356_1	f470068c-998a-4e9b-b026-02bf02118de7	Acropora cervicornis	present	Acropora cervicornis	206989	urn:lsid:marinespecies.org:taxname:206989	Animalia	Cnidaria	Anthozoa	Scleractinia	Acroporidae	Acropora	(Lamarck, 1816)	Species
13	St. John_356_1	f470068c-998a-4e9b-b026-02bf02118de7	Madracis auretenra	present	Madracis auretenra	430664	urn:lsid:marinespecies.org:taxname:430664	Animalia	Cnidaria	Anthozoa	Scleractinia	Pocilloporidae	Madracis	Locke, Weil & Coates, 2007	Species
19	St. John_356_2	f470068c-998a-4e9b-b026-02bf02118de7	Siderastrea radians	present	Siderastrea radians	207517	urn:lsid:marinespecies.org:taxname:207517	Animalia	Cnidaria	Anthozoa	Scleractinia	Siderastreidae	Siderastrea	(Pallas, 1766)	Species
21	St. John_356_3	f470068c-998a-4e9b-b026-02bf02118de7	Madracis auretenra	absent	Madracis auretenra	430664	urn:lsid:marinespecies.org:taxname:430664	Animalia	Cnidaria	Anthozoa	Scleractinia	Pocilloporidae	Madracis	Locke, Weil & Coates, 2007	Species
22	St. John_356_3	f470068c-998a-4e9b-b026-02bf02118de7	Mussa angulosa	absent	Mussa angulosa	216135	urn:lsid:marinespecies.org:taxname:216135	Animalia	Cnidaria	Anthozoa	Scleractinia	Faviidae	Mussa	(Pallas, 1766)	Species
23	St. John_356_3	f470068c-998a-4e9b-b026-02bf02118de7	Siderastrea radians	present	Siderastrea radians	207517	urn:lsid:marinespecies.org:taxname:207517	Animalia	Cnidaria	Anthozoa	Scleractinia	Siderastreidae	Siderastrea	(Pallas, 1766)	Species
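The occurrence file is written with quoting=csv.QUOTE_ALL, which matters here because authorship strings such as "Locke, Weil & Coates, 2007" contain commas. The sketch below uses the standard library csv.writer on an in-memory buffer as a stand-in for to_csv(), to show that a quoted field survives a round trip intact.

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerow(["scientificName", "scientificNameAuthorship"])
writer.writerow(["Madracis auretenra", "Locke, Weil & Coates, 2007"])

text = buffer.getvalue()
print(text)

# Reading the text back yields two fields per row, commas included.
rows = list(csv.reader(io.StringIO(text)))
```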

Extended Measurement Or Fact (eMoF)#

The last file we need to create is the extended measurement or fact (eMoF) file. It includes measurements or facts about the event (temperature, salinity, etc.) as well as about the occurrence (percent cover, abundance, weight, length, etc.). They are linked to the events using eventID and to the occurrences using occurrenceID. Extended measurements or facts are any other generic observations associated with resources that are described using Darwin Core (e.g. water temperature observations). See the DwC implementation guide for more information.

For the various TypeID fields (e.g. measurementTypeID), include URIs from the BODC NERC vocabulary or another nearly permanent source where possible. For example, the BODC NERC vocabulary URI for water temperature is http://vocab.nerc.ac.uk/collection/P25/current/WTEMP/.
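One way to keep these URIs in a single place is a small lookup keyed by measurement type. This is only a sketch: the dictionary `VOCAB` and the helper `vocab_uri` are hypothetical names, the URIs are the ones used in the code later in this notebook, and an empty string stands for "no term chosen", as the notebook does for rugosity.

```python
# URIs copied from this notebook; an empty string marks "no term chosen".
VOCAB = {
    "temperature": {
        "measurementTypeID": "http://vocab.nerc.ac.uk/collection/P25/current/WTEMP/",
        "measurementUnitID": "http://vocab.nerc.ac.uk/collection/P06/current/UPAA/",
    },
    "Percent Cover": {
        "measurementTypeID": "http://vocab.nerc.ac.uk/collection/P01/current/SDBIOL10/",
        "measurementUnitID": "",
    },
}


def vocab_uri(measurement_type: str, field: str) -> str:
    """Return the URI for a measurement type and field, or '' if unknown."""
    return VOCAB.get(measurement_type, {}).get(field, "")


print(vocab_uri("temperature", "measurementTypeID"))
print(repr(vocab_uri("rugosity", "measurementTypeID")))  # ''
```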

We then populate the appropriate fields with the information we have available. The measurementValue field is populated with the observed values of the measurement described in the measurementType and measurementUnit fields.

For measurements or facts of the occurrence (e.g. percent cover, length, density, biomass, etc.), we want to be sure to include the occurrenceID from the occurrence record, as those observations are measurements of or from the organism. Other observations are tied to the event via the eventID (e.g. water temperature, rugosity, etc.).

Below we walk through creating three independent DataFrames for temperature, rugosity, and percent cover, populating each with all of the information we have available and removing duplicative fields. Finally, we concatenate all the extended measurements or facts together into one DataFrame.

temperature = df[["eventID", "temperature", "date"]].copy()
temperature["occurrenceID"] = ""
temperature["measurementType"] = "temperature"
temperature["measurementTypeID"] = (
    "http://vocab.nerc.ac.uk/collection/P25/current/WTEMP/"
)
temperature["measurementValue"] = temperature["temperature"]
temperature["measurementUnit"] = "Celsius"
temperature["measurementUnitID"] = (
    "http://vocab.nerc.ac.uk/collection/P06/current/UPAA/"
)
temperature["measurementAccuracy"] = 3
temperature["measurementDeterminedDate"] = pd.to_datetime(
    temperature["date"], format="%m/%d/%Y"
)
temperature["measurementMethod"] = ""
temperature.drop(columns=["temperature", "date"], inplace=True)

rugosity = df[["eventID", "rugosity", "date"]].copy()
rugosity["occurrenceID"] = ""
rugosity["measurementType"] = "rugosity"
rugosity["measurementTypeID"] = ""
rugosity["measurementValue"] = rugosity["rugosity"].map("{:,.6f}".format)
rugosity["measurementUnit"] = ""
rugosity["measurementUnitID"] = ""
rugosity["measurementAccuracy"] = ""
rugosity["measurementDeterminedDate"] = pd.to_datetime(
    rugosity["date"], format="%m/%d/%Y"
)
rugosity["measurementMethod"] = ""
rugosity.drop(columns=["rugosity", "date"], inplace=True)

percent_cover = df[["eventID", "occurrenceID", "percent cover", "date"]].copy()
percent_cover["measurementType"] = "Percent Cover"
percent_cover["measurementTypeID"] = (
    "http://vocab.nerc.ac.uk/collection/P01/current/SDBIOL10/"
)
percent_cover["measurementValue"] = percent_cover["percent cover"]
percent_cover["measurementUnit"] = "Percent/100m^2"
percent_cover["measurementUnitID"] = ""
percent_cover["measurementAccuracy"] = 5
percent_cover["measurementDeterminedDate"] = pd.to_datetime(
    percent_cover["date"], format="%m/%d/%Y"
)
percent_cover["measurementMethod"] = ""
percent_cover.drop(columns=["percent cover", "date"], inplace=True)

measurementorfact = pd.concat([temperature, rugosity, percent_cover])
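Because temperature and rugosity are taken from every row of `df`, each event-level value appears once per species record, which is visible as repeated rows in the sample output. If one row per event is preferred, a possible sketch on stand-in data follows. Whether to deduplicate is a judgment call that the original notebook does not make, and the occurrenceID values here are hypothetical placeholders.

```python
import pandas as pd

# Stand-in event-level measurements, repeated once per species record.
temperature = pd.DataFrame(
    {
        "eventID": ["St. John_250_1", "St. John_250_1", "St. John_250_2"],
        "occurrenceID": ["", "", ""],
        "measurementType": ["temperature"] * 3,
        "measurementValue": [25.2, 25.2, 24.8],
    }
)

# Stand-in occurrence-level measurements; identifiers are hypothetical.
percent_cover = pd.DataFrame(
    {
        "eventID": ["St. John_250_1", "St. John_250_1", "St. John_250_2"],
        "occurrenceID": ["occ-1", "occ-2", "occ-3"],
        "measurementType": ["Percent Cover"] * 3,
        "measurementValue": [0, 5, 0],
    }
)

# Keep one row per event and measurement type for event-level values.
event_level = temperature.drop_duplicates(subset=["eventID", "measurementType"])

# ignore_index=True gives the combined table a fresh, non-repeating index.
emof = pd.concat([event_level, percent_cover], ignore_index=True)
print(len(emof))  # 5
```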

Finally, we write the measurement or fact file, again specifying the ISO date format. We’ve printed ten random rows of the DataFrame to give an example of what the resultant file will look like.

# Write the measurement or fact file to the current working directory.
file = "MadeUpData_mof.csv"

measurementorfact.to_csv(file, index=False, header=True, date_format="%Y-%m-%d")
measurementorfact.sample(n=10)

	eventID	occurrenceID	measurementType	measurementTypeID	measurementValue	measurementUnit	measurementUnitID	measurementAccuracy	measurementDeterminedDate
6	St. John_250_2		temperature	http://vocab.nerc.ac.uk/collection/P25/current...	24.8	Celsius	http://vocab.nerc.ac.uk/collection/P06/current...	3	2004-07-16
18	St. John_356_2		rugosity		0.158489				2004-07-17
4	St. John_250_2		temperature	http://vocab.nerc.ac.uk/collection/P25/current...	24.8	Celsius	http://vocab.nerc.ac.uk/collection/P06/current...	3	2004-07-16
11	St. John_250_3		temperature	http://vocab.nerc.ac.uk/collection/P25/current...	23.1	Celsius	http://vocab.nerc.ac.uk/collection/P06/current...	3	2004-07-16
6	St. John_250_2	f470068c-998a-4e9b-b026-02bf02118de7	Percent Cover	http://vocab.nerc.ac.uk/collection/P01/current...	0	Percent/100m^2		5	2004-07-16
4	St. John_250_2		rugosity		0.364583				2004-07-16
4	St. John_250_2	f470068c-998a-4e9b-b026-02bf02118de7	Percent Cover	http://vocab.nerc.ac.uk/collection/P01/current...	0	Percent/100m^2		5	2004-07-16
20	St. John_356_3		rugosity		0.489574				2004-07-17
2	St. John_250_1	f470068c-998a-4e9b-b026-02bf02118de7	Percent Cover	http://vocab.nerc.ac.uk/collection/P01/current...	15	Percent/100m^2		5	2004-07-16
2	St. John_250_1		temperature	http://vocab.nerc.ac.uk/collection/P25/current...	25.2	Celsius	http://vocab.nerc.ac.uk/collection/P06/current...	3	2004-07-16
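Since the three files are linked through eventID, a final sanity check is that every eventID in the eMoF table also exists in the event table. A minimal sketch on stand-in data, where "St. John_999_9" is a deliberately invented orphan used to show what a failure looks like:

```python
import pandas as pd

# Stand-in tables for illustration only.
event = pd.DataFrame({"eventID": ["St. John_250_1", "St. John_250_2"]})
emof = pd.DataFrame(
    {
        "eventID": ["St. John_250_1", "St. John_250_2", "St. John_999_9"],
        "measurementType": ["temperature", "temperature", "temperature"],
    }
)

# eMoF rows whose eventID has no counterpart in the event table.
orphans = emof.loc[~emof["eventID"].isin(event["eventID"]), "eventID"].tolist()
print(orphans)  # ['St. John_999_9']
```

The same pattern applies to occurrenceID between the eMoF and occurrence tables, skipping rows where occurrenceID is empty.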

Author: Mathew Biddle