Aligning Data to Darwin Core#

Creating event core with an occurrence and extended measurement or fact extension using Python#

Created: 2020-12-08

Caution: This notebook was created for the IOOS DMAC Code Sprint Biological Data Session. The data in this notebook were created specifically as an example and meant solely to be illustrative of the process for aligning data to the biological data standard - Darwin Core. These data should not be considered actual occurrences of species and any measurements are also contrived. This notebook is meant to provide a step by step process for taking original data and aligning it to Darwin Core. It has been adapted from the R markdown notebook created by Abby Benson IOOS_DMAC_DataToDWC_Notebook_event.md.

First let’s bring in the appropriate libraries to work with the tabular data files and generate the appropriate content for the Darwin Core requirements.

import csv
import pprint
import uuid

import numpy as np
import pandas as pd
import pyworms

Now we need to read in the raw data file using pandas.read_csv(). Here we display the first five rows of data to give the user an idea of what observations are contained in the raw file.

url = (
    "https://raw.githubusercontent.com/ioos/ioos_code_lab/main/"
    "jupyterbook/content/code_gallery/data/"
)
file = "MadeUpDataForBiologicalDataTraining.csv"
df = pd.read_csv(url + file, header=[0])
df.head()
date lat lon region station transect scientific name percent cover depth bottom type rugosity temperature
0 7/16/2004 18.29788 -64.79451 St. John 250 1 Acropora cervicornis 0 25 shallow reef flat 0.295833 25.2
1 7/16/2004 18.29788 -64.79451 St. John 250 1 Madracis auretenra 5 25 shallow reef flat 0.295833 25.2
2 7/16/2004 18.29788 -64.79451 St. John 250 1 Mussa angulosa 15 25 shallow reef flat 0.295833 25.2
3 7/16/2004 18.29788 -64.79451 St. John 250 1 Siderastrea radians 0 25 shallow reef flat 0.295833 25.2
4 7/16/2004 18.29788 -64.79451 St. John 250 2 Acropora cervicornis 0 35 complex back reef 0.364583 24.8

First we need to decide if we will build an occurrence-only version of the data or an event core with an occurrence and extended measurement or fact (eMoF) extension version of the data.

Here we decide to use the second option, extended measurement or fact (eMoF), to include as much information as we can.

First let’s create the eventID and occurrenceID in the original file so that information can be reused for all necessary files down the line.

df["eventID"] = df[["region", "station", "transect"]].apply(
    lambda x: "_".join(x.astype(str)), axis=1
)
df["occurrenceID"] = uuid.uuid4()
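
Note that assigning the result of a single uuid.uuid4() call to a column broadcasts that one value to every row, which is why the sample output further down shows the same occurrenceID throughout. If a distinct identifier per occurrence is wanted, a minimal sketch (using a small made-up DataFrame rather than the real file) could look like this:

```python
import uuid

import pandas as pd

# Toy stand-in for the raw data (values are illustrative only).
df = pd.DataFrame(
    {
        "region": ["St. John", "St. John", "St. John"],
        "station": [250, 250, 356],
        "transect": [1, 2, 1],
    }
)

# Same eventID pattern as above: region_station_transect.
df["eventID"] = (
    df["region"] + "_" + df["station"].astype(str) + "_" + df["transect"].astype(str)
)

# One freshly generated UUID per row, stored as a string.
df["occurrenceID"] = [str(uuid.uuid4()) for _ in range(len(df))]

print(df[["eventID", "occurrenceID"]])
```

Because uuid4 values are random, the identifiers change on every run; they would need to be saved once and reused if the files are regenerated.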

We will need to create three separate files to comply with the sampling event format. We’ll start with the event file but we only need to include the columns that are relevant to the event file.

Event file#

More information on the event category in Darwin Core can be found at https://dwc.tdwg.org/terms/#event.

Let’s first make a copy of the DataFrame we pulled in, keeping only the data fields of interest for the event file.

event = df[
    [
        "date",
        "lat",
        "lon",
        "region",
        "station",
        "transect",
        "depth",
        "bottom type",
        "eventID",
    ]
].copy()

Next we copy the columns that map directly onto Darwin Core terms into new columns named after those terms. The single depth value fills both the minimum and maximum depth.

event["decimalLatitude"] = event["lat"]
event["decimalLongitude"] = event["lon"]
event["minimumDepthInMeters"] = event["depth"]
event["maximumDepthInMeters"] = event["depth"]
event["habitat"] = event["bottom type"]
event["island"] = event["region"]
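
An alternative to copying and later dropping columns is DataFrame.rename with a mapping. The sketch below uses a one-row made-up DataFrame, and the name dwc_names is only an illustrative choice:

```python
import pandas as pd

# Toy event row (illustrative values only).
event = pd.DataFrame(
    {
        "lat": [18.29788],
        "lon": [-64.79451],
        "depth": [25],
        "bottom type": ["shallow reef flat"],
        "region": ["St. John"],
    }
)

# Hypothetical mapping from raw column names to Darwin Core terms.
dwc_names = {
    "lat": "decimalLatitude",
    "lon": "decimalLongitude",
    "bottom type": "habitat",
    "region": "island",
}
event = event.rename(columns=dwc_names)

# A single depth reading fills both depth bounds.
event["minimumDepthInMeters"] = event["depth"]
event["maximumDepthInMeters"] = event["depth"]
event = event.drop(columns="depth")

print(sorted(event.columns))
```

With rename, the one-to-one mappings do not leave the original columns behind, so only depth, which feeds two terms, has to be dropped explicitly.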

We need to parse the date field properly so that we can export it in ISO format. We also add any required fields that are missing.

event["eventDate"] = pd.to_datetime(event["date"], format="%m/%d/%Y")
event["basisOfRecord"] = "HumanObservation"
event["geodeticDatum"] = "EPSG:4326 WGS84"
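
To see what the date conversion does in isolation, here is a small sketch on made-up strings in the same month/day/year layout as the raw file. The errors="coerce" option is shown only as one possible way to surface malformed dates:

```python
import pandas as pd

dates = pd.Series(["7/16/2004", "7/17/2004"])

# Parse month/day/year strings into datetime values.
parsed = pd.to_datetime(dates, format="%m/%d/%Y")

# Render as ISO 8601 calendar dates, the form written to the output files.
iso = parsed.dt.strftime("%Y-%m-%d")
print(iso.tolist())  # ['2004-07-16', '2004-07-17']

# A day-first string does not fit the format; with errors="coerce" it becomes NaT.
bad = pd.to_datetime(pd.Series(["16/7/2004"]), format="%m/%d/%Y", errors="coerce")
print(bad.isna().tolist())  # [True]
```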

Then we’ll remove any fields that we no longer need to clean things up a bit.

event.drop(
    columns=[
        "date",
        "lat",
        "lon",
        "region",
        "station",
        "transect",
        "depth",
        "bottom type",
    ],
    inplace=True,
)

We have too many repeating rows of information. We can pare this down using eventID which is a unique identifier for each sampling event in the data.

event.drop_duplicates(subset="eventID", inplace=True)
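
Dropping duplicates on eventID keeps the first row for each event, so it quietly assumes that all rows sharing an eventID carry the same event-level values. A hedged sketch of a check for that assumption, on made-up rows:

```python
import pandas as pd

# Toy event table with one repeated event (illustrative values only).
event = pd.DataFrame(
    {
        "eventID": ["St. John_250_1", "St. John_250_1", "St. John_250_2"],
        "decimalLatitude": [18.29788, 18.29788, 18.29788],
        "habitat": ["shallow reef flat", "shallow reef flat", "complex back reef"],
    }
)

# Count distinct values per eventID; more than one means the rows disagree.
conflicts = event.groupby("eventID").nunique().gt(1).any(axis=1)
assert not conflicts.any(), "event-level fields differ within an eventID"

event = event.drop_duplicates(subset="eventID")
assert event["eventID"].is_unique
print(len(event))  # 2
```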

Finally, we write out the event file, specifying the ISO date format. We’ve printed five random rows of the DataFrame to give an example of what the resultant file will look like.

# Write to a local path; to_csv cannot write to a read-only HTTPS URL.
file = "MadeUpData_event.csv"

event.to_csv(file, header=True, index=False, date_format="%Y-%m-%d")

event.sample(n=5).sort_index()
eventID decimalLatitude decimalLongitude minimumDepthInMeters maximumDepthInMeters habitat island eventDate basisOfRecord geodeticDatum
4 St. John_250_2 18.29788 -64.79451 35 35 complex back reef St. John 2004-07-16 HumanObservation EPSG:4326 WGS84
8 St. John_250_3 18.29788 -64.79451 85 85 deep reef St. John 2004-07-16 HumanObservation EPSG:4326 WGS84
12 St. John_356_1 18.27609 -64.75740 28 28 complex back reef St. John 2004-07-17 HumanObservation EPSG:4326 WGS84
16 St. John_356_2 18.27609 -64.75740 16 16 shallow reef flat St. John 2004-07-17 HumanObservation EPSG:4326 WGS84
20 St. John_356_3 18.27609 -64.75740 90 90 deep reef St. John 2004-07-17 HumanObservation EPSG:4326 WGS84

Occurrence file#

More information on the occurrence category in Darwin Core can be found at https://dwc.tdwg.org/terms/#occurrence.

For creating the occurrence file, we start by creating the DataFrame and renaming the fields that align directly with Darwin Core. Then, we’ll add the required information that is missing.

occurrence = df[["scientific name", "eventID", "occurrenceID", "percent cover"]].copy()
occurrence["scientificName"] = occurrence["scientific name"]
occurrence["occurrenceStatus"] = np.where(
    occurrence["percent cover"] == 0, "absent", "present"
)
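
The np.where call maps a percent cover of exactly zero to "absent" and everything else to "present". A small sketch on made-up values shows one edge case worth knowing about: a missing value is not equal to zero, so it would be labelled "present" unless handled separately. The masking step is an assumption about how one might treat that case, not part of the original workflow:

```python
import numpy as np
import pandas as pd

percent_cover = pd.Series([0, 5, 15, np.nan])

# Zero cover means absent; any other value (including NaN) falls through to present.
status = np.where(percent_cover == 0, "absent", "present")
print(status.tolist())  # ['absent', 'present', 'present', 'present']

# Optional: blank out the status where the measurement itself is missing.
status_checked = pd.Series(status).mask(percent_cover.isna())
print(status_checked.isna().tolist())  # [False, False, False, True]
```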

Taxonomic Name Matching#

A requirement for OBIS is that all scientific names match to the World Register of Marine Species (WoRMS) and that a scientificNameID is included. A scientificNameID looks like urn:lsid:marinespecies.org:taxname:275730, where the digits after the last colon are the WoRMS AphiaID. We’ll need to query WoRMS to get this information, so we create a lookup table of the unique scientific names found in the occurrence data we created above.
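
The relationship between an AphiaID and its scientificNameID can be sketched with two small helper functions. Their names are hypothetical and they only do string handling; they do not contact WoRMS or check that the ID exists:

```python
LSID_PREFIX = "urn:lsid:marinespecies.org:taxname:"


def aphia_id_to_lsid(aphia_id: int) -> str:
    """Build the WoRMS LSID string for a given AphiaID."""
    return f"{LSID_PREFIX}{aphia_id}"


def lsid_to_aphia_id(lsid: str) -> int:
    """Extract the AphiaID (the digits after the last colon) from an LSID."""
    if not lsid.startswith(LSID_PREFIX):
        raise ValueError(f"not a WoRMS taxname LSID: {lsid!r}")
    return int(lsid.rsplit(":", 1)[1])


print(aphia_id_to_lsid(275730))  # urn:lsid:marinespecies.org:taxname:275730
print(lsid_to_aphia_id("urn:lsid:marinespecies.org:taxname:275730"))  # 275730
```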

lut_worms = pd.DataFrame(
    columns=["scientificName"], data=occurrence["scientificName"].unique()
)

Next, we add the columns we can fill from WoRMS, including the required scientificNameID, and populate them with empty values so that the lookup table is initialized for the lookup step.

headers = [
    "acceptedname",
    "acceptedID",
    "scientificNameID",
    "kingdom",
    "phylum",
    "class",
    "order",
    "family",
    "genus",
    "scientificNameAuthorship",
    "taxonRank",
]

for head in headers:
    lut_worms[head] = ""

Next, we perform a taxonomic lookup using the pyworms library. The function pyworms.aphiaRecordsByMatchNames() collects the information, which we use to populate the lookup table.

Here we print the scientific name of the species we are looking up and the matching response from WoRMS with the detailed species information.

for index, row in lut_worms.iterrows():
    print(f"\n**Searching for scientific name = {row['scientificName']}**")
    resp = pyworms.aphiaRecordsByMatchNames(row["scientificName"])[0][0]
    pprint.pprint(resp)
    lut_worms.loc[index, "acceptedname"] = resp["valid_name"]
    lut_worms.loc[index, "acceptedID"] = resp["valid_AphiaID"]
    lut_worms.loc[index, "scientificNameID"] = resp["lsid"]
    lut_worms.loc[index, "kingdom"] = resp["kingdom"]
    lut_worms.loc[index, "phylum"] = resp["phylum"]
    lut_worms.loc[index, "class"] = resp["class"]
    lut_worms.loc[index, "order"] = resp["order"]
    lut_worms.loc[index, "family"] = resp["family"]
    lut_worms.loc[index, "genus"] = resp["genus"]
    lut_worms.loc[index, "scientificNameAuthorship"] = resp["authority"]
    lut_worms.loc[index, "taxonRank"] = resp["rank"]
**Searching for scientific name = Acropora cervicornis**
{'AphiaID': 206989,
 'authority': '(Lamarck, 1816)',
 'citation': 'Hoeksema, B. W.; Cairns, S. (2021). World List of Scleractinia. '
             'Acropora cervicornis (Lamarck, 1816). Accessed through: World '
             'Register of Marine Species at: '
             'http://www.marinespecies.org/aphia.php?p=taxdetails&id=206989 on '
             '2021-08-30',
 'class': 'Anthozoa',
 'family': 'Acroporidae',
 'genus': 'Acropora',
 'isBrackish': 0,
 'isExtinct': None,
 'isFreshwater': 0,
 'isMarine': 1,
 'isTerrestrial': 0,
 'kingdom': 'Animalia',
 'lsid': 'urn:lsid:marinespecies.org:taxname:206989',
 'match_type': 'exact',
 'modified': '2018-08-27T16:36:11.490Z',
 'order': 'Scleractinia',
 'parentNameUsageID': 205469,
 'phylum': 'Cnidaria',
 'rank': 'Species',
 'scientificname': 'Acropora cervicornis',
 'status': 'accepted',
 'taxonRankID': 220,
 'unacceptreason': None,
 'url': 'http://www.marinespecies.org/aphia.php?p=taxdetails&id=206989',
 'valid_AphiaID': 206989,
 'valid_authority': '(Lamarck, 1816)',
 'valid_name': 'Acropora cervicornis'}

**Searching for scientific name = Madracis auretenra**
{'AphiaID': 430664,
 'authority': 'Locke, Weil & Coates, 2007',
 'citation': 'Hoeksema, B. W.; Cairns, S. (2021). World List of Scleractinia. '
             'Madracis auretenra Locke, Weil & Coates, 2007. Accessed through: '
             'World Register of Marine Species at: '
             'http://www.marinespecies.org/aphia.php?p=taxdetails&id=430664 on '
             '2021-08-30',
 'class': 'Anthozoa',
 'family': 'Pocilloporidae',
 'genus': 'Madracis',
 'isBrackish': 0,
 'isExtinct': None,
 'isFreshwater': 0,
 'isMarine': 1,
 'isTerrestrial': 0,
 'kingdom': 'Animalia',
 'lsid': 'urn:lsid:marinespecies.org:taxname:430664',
 'match_type': 'exact',
 'modified': '2020-04-10T07:30:40.497Z',
 'order': 'Scleractinia',
 'parentNameUsageID': 135125,
 'phylum': 'Cnidaria',
 'rank': 'Species',
 'scientificname': 'Madracis auretenra',
 'status': 'accepted',
 'taxonRankID': 220,
 'unacceptreason': None,
 'url': 'http://www.marinespecies.org/aphia.php?p=taxdetails&id=430664',
 'valid_AphiaID': 430664,
 'valid_authority': 'Locke, Weil & Coates, 2007',
 'valid_name': 'Madracis auretenra'}

**Searching for scientific name = Mussa angulosa**
{'AphiaID': 216135,
 'authority': '(Pallas, 1766)',
 'citation': 'Hoeksema, B. W.; Cairns, S. (2021). World List of Scleractinia. '
             'Mussa angulosa (Pallas, 1766). Accessed through: World Register '
             'of Marine Species at: '
             'http://www.marinespecies.org/aphia.php?p=taxdetails&id=216135 on '
             '2021-08-30',
 'class': 'Anthozoa',
 'family': 'Faviidae',
 'genus': 'Mussa',
 'isBrackish': 0,
 'isExtinct': 0,
 'isFreshwater': 0,
 'isMarine': 1,
 'isTerrestrial': 0,
 'kingdom': 'Animalia',
 'lsid': 'urn:lsid:marinespecies.org:taxname:216135',
 'match_type': 'exact',
 'modified': '2020-06-28T17:27:59.150Z',
 'order': 'Scleractinia',
 'parentNameUsageID': 206306,
 'phylum': 'Cnidaria',
 'rank': 'Species',
 'scientificname': 'Mussa angulosa',
 'status': 'accepted',
 'taxonRankID': 220,
 'unacceptreason': None,
 'url': 'http://www.marinespecies.org/aphia.php?p=taxdetails&id=216135',
 'valid_AphiaID': 216135,
 'valid_authority': '(Pallas, 1766)',
 'valid_name': 'Mussa angulosa'}

**Searching for scientific name = Siderastrea radians**
{'AphiaID': 207517,
 'authority': '(Pallas, 1766)',
 'citation': 'Hoeksema, B. W.; Cairns, S. (2021). World List of Scleractinia. '
             'Siderastrea radians (Pallas, 1766). Accessed through: World '
             'Register of Marine Species at: '
             'http://www.marinespecies.org/aphia.php?p=taxdetails&id=207517 on '
             '2021-08-30',
 'class': 'Anthozoa',
 'family': 'Siderastreidae',
 'genus': 'Siderastrea',
 'isBrackish': 0,
 'isExtinct': None,
 'isFreshwater': 0,
 'isMarine': 1,
 'isTerrestrial': 0,
 'kingdom': 'Animalia',
 'lsid': 'urn:lsid:marinespecies.org:taxname:207517',
 'match_type': 'exact',
 'modified': '2014-06-02T10:15:47.813Z',
 'order': 'Scleractinia',
 'parentNameUsageID': 204291,
 'phylum': 'Cnidaria',
 'rank': 'Species',
 'scientificname': 'Siderastrea radians',
 'status': 'accepted',
 'taxonRankID': 220,
 'unacceptreason': None,
 'url': 'http://www.marinespecies.org/aphia.php?p=taxdetails&id=207517',
 'valid_AphiaID': 207517,
 'valid_authority': '(Pallas, 1766)',
 'valid_name': 'Siderastrea radians'}
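
The loop above indexes the response with [0][0], which works here because every name has an exact match. The sketch below shows one way to guard against a name with no match. It runs on hand-written responses instead of calling WoRMS, the helper first_match is hypothetical, and the empty inner list for an unmatched name is an assumption about the response shape:

```python
def first_match(matches):
    """Return the first record from one name's list of matches, or None if empty."""
    return matches[0] if matches else None


# Hand-written stand-ins: one inner list of records per queried name.
names = ["Acropora cervicornis", "Not a real taxon"]
responses = [
    [
        {
            "valid_name": "Acropora cervicornis",
            "lsid": "urn:lsid:marinespecies.org:taxname:206989",
        }
    ],
    [],  # assumed shape when nothing matches
]

lookup = {}
unmatched = []
for name, matches in zip(names, responses):
    record = first_match(matches)
    if record is None:
        # Keep a list of names to review by hand instead of raising IndexError.
        unmatched.append(name)
        continue
    lookup[name] = record["lsid"]

print(lookup)
print(unmatched)  # ['Not a real taxon']
```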

We then merge the lookup table of unique scientific names back into the occurrence data, matching on the field scientificName. Then, we remove any unnecessary columns to clean up the DataFrame for writing.

occurrence = pd.merge(occurrence, lut_worms, how="left", on="scientificName")

occurrence.drop(columns=["scientific name", "percent cover"], inplace=True)
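
When merging a lookup table back in, it can help to confirm that the lookup has one row per name and that every occurrence found a match. A minimal sketch on made-up rows, using the validate and indicator options of pd.merge:

```python
import pandas as pd

# Toy occurrence rows and lookup table (illustrative values only).
occurrence = pd.DataFrame(
    {
        "scientificName": [
            "Acropora cervicornis",
            "Mussa angulosa",
            "Acropora cervicornis",
        ],
        "eventID": ["St. John_250_1", "St. John_250_1", "St. John_250_2"],
    }
)
lut_worms = pd.DataFrame(
    {
        "scientificName": ["Acropora cervicornis", "Mussa angulosa"],
        "scientificNameID": [
            "urn:lsid:marinespecies.org:taxname:206989",
            "urn:lsid:marinespecies.org:taxname:216135",
        ],
    }
)

# validate raises if the lookup has duplicate names; indicator records match status.
merged = pd.merge(
    occurrence,
    lut_worms,
    how="left",
    on="scientificName",
    validate="many_to_one",
    indicator=True,
)

# Names present in the occurrence data but absent from the lookup table.
unmatched = merged.loc[merged["_merge"] == "left_only", "scientificName"].unique()
merged = merged.drop(columns="_merge")

print(len(merged), len(unmatched))  # 3 0
```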

Finally, we write out the occurrence file. We’ve printed ten random rows of the DataFrame to give an example of what the resultant file will look like.

# sort the rows by scientificName
occurrence.sort_values("scientificName", inplace=True)

# reorganize column order to be consistent with R example:
columns = [
    "scientificName",
    "eventID",
    "occurrenceID",
    "occurrenceStatus",
    "acceptedname",
    "acceptedID",
    "scientificNameID",
    "kingdom",
    "phylum",
    "class",
    "order",
    "family",
    "genus",
    "scientificNameAuthorship",
    "taxonRank",
]


# Write to a local path; to_csv cannot write to a read-only HTTPS URL.
file = "MadeUpData_Occurrence.csv"

occurrence.to_csv(
    file, header=True, index=False, quoting=csv.QUOTE_ALL, columns=columns
)

occurrence.sample(n=10).sort_index()
eventID occurrenceID scientificName occurrenceStatus acceptedname acceptedID scientificNameID kingdom phylum class order family genus scientificNameAuthorship taxonRank
4 St. John_250_2 f470068c-998a-4e9b-b026-02bf02118de7 Acropora cervicornis absent Acropora cervicornis 206989 urn:lsid:marinespecies.org:taxname:206989 Animalia Cnidaria Anthozoa Scleractinia Acroporidae Acropora (Lamarck, 1816) Species
5 St. John_250_2 f470068c-998a-4e9b-b026-02bf02118de7 Madracis auretenra present Madracis auretenra 430664 urn:lsid:marinespecies.org:taxname:430664 Animalia Cnidaria Anthozoa Scleractinia Pocilloporidae Madracis Locke, Weil & Coates, 2007 Species
7 St. John_250_2 f470068c-998a-4e9b-b026-02bf02118de7 Siderastrea radians absent Siderastrea radians 207517 urn:lsid:marinespecies.org:taxname:207517 Animalia Cnidaria Anthozoa Scleractinia Siderastreidae Siderastrea (Pallas, 1766) Species
10 St. John_250_3 f470068c-998a-4e9b-b026-02bf02118de7 Mussa angulosa present Mussa angulosa 216135 urn:lsid:marinespecies.org:taxname:216135 Animalia Cnidaria Anthozoa Scleractinia Faviidae Mussa (Pallas, 1766) Species
12 St. John_356_1 f470068c-998a-4e9b-b026-02bf02118de7 Acropora cervicornis present Acropora cervicornis 206989 urn:lsid:marinespecies.org:taxname:206989 Animalia Cnidaria Anthozoa Scleractinia Acroporidae Acropora (Lamarck, 1816) Species
13 St. John_356_1 f470068c-998a-4e9b-b026-02bf02118de7 Madracis auretenra present Madracis auretenra 430664 urn:lsid:marinespecies.org:taxname:430664 Animalia Cnidaria Anthozoa Scleractinia Pocilloporidae Madracis Locke, Weil & Coates, 2007 Species
19 St. John_356_2 f470068c-998a-4e9b-b026-02bf02118de7 Siderastrea radians present Siderastrea radians 207517 urn:lsid:marinespecies.org:taxname:207517 Animalia Cnidaria Anthozoa Scleractinia Siderastreidae Siderastrea (Pallas, 1766) Species
21 St. John_356_3 f470068c-998a-4e9b-b026-02bf02118de7 Madracis auretenra absent Madracis auretenra 430664 urn:lsid:marinespecies.org:taxname:430664 Animalia Cnidaria Anthozoa Scleractinia Pocilloporidae Madracis Locke, Weil & Coates, 2007 Species
22 St. John_356_3 f470068c-998a-4e9b-b026-02bf02118de7 Mussa angulosa absent Mussa angulosa 216135 urn:lsid:marinespecies.org:taxname:216135 Animalia Cnidaria Anthozoa Scleractinia Faviidae Mussa (Pallas, 1766) Species
23 St. John_356_3 f470068c-998a-4e9b-b026-02bf02118de7 Siderastrea radians present Siderastrea radians 207517 urn:lsid:marinespecies.org:taxname:207517 Animalia Cnidaria Anthozoa Scleractinia Siderastreidae Siderastrea (Pallas, 1766) Species
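
The occurrence file is written with quoting=csv.QUOTE_ALL. One reason quoting matters for this table is that authorship strings such as "Locke, Weil & Coates, 2007" contain commas. The sketch below writes a one-row made-up table to an in-memory buffer (not to disk) and reads it back:

```python
import csv
import io

import pandas as pd

occurrence = pd.DataFrame(
    {
        "scientificName": ["Madracis auretenra"],
        "scientificNameAuthorship": ["Locke, Weil & Coates, 2007"],
    }
)

# Write to an in-memory buffer with every field wrapped in double quotes.
buffer = io.StringIO()
occurrence.to_csv(buffer, index=False, quoting=csv.QUOTE_ALL)
text = buffer.getvalue()
print(text)

# Reading the text back recovers the authorship string as a single field.
round_trip = pd.read_csv(io.StringIO(text))
print(round_trip.loc[0, "scientificNameAuthorship"])
```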

Extended Measurement Or Fact (eMoF)#

The last file we need to create is the extended measurement or fact (eMoF) file. It holds measurements and facts about the event (temperature, salinity, etc.) as well as about the occurrence (percent cover, abundance, weight, length, etc.). Records are linked to the events using eventID and to the occurrences using occurrenceID. More generally, extended measurements or facts are any other generic observations associated with resources that are described using Darwin Core (e.g. water temperature observations). See the DwC implementation guide for more information.

For the various TypeID fields (e.g. measurementTypeID), include URIs from the BODC NERC vocabulary or another nearly permanent source, where possible. For example, the BODC NERC vocabulary URI for water temperature is http://vocab.nerc.ac.uk/collection/P25/current/WTEMP/.

We then populate the appropriate fields with the information we have available. The measurementValue field is populated with the observed values of the measurement described in the measurementType and measurementUnit fields.

For measurements or facts of the occurrence (e.g. percent cover, length, density, biomass, etc.), we want to be sure to include the occurrenceID from the occurrence record, as those observations are measurements of or from the organism. Other observations (e.g. water temperature, rugosity, etc.) are tied to the event via the eventID.
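
The linkage rule described here can be checked mechanically. The sketch below uses tiny made-up tables and placeholder occurrence IDs such as "occ-1"; it tests that every eMoF row points at a known event and that every non-empty occurrenceID points at a known occurrence:

```python
import pandas as pd

# Toy tables (illustrative values; "occ-1" is a placeholder identifier).
event = pd.DataFrame({"eventID": ["St. John_250_1"]})
occurrence = pd.DataFrame(
    {"eventID": ["St. John_250_1"], "occurrenceID": ["occ-1"]}
)
emof = pd.DataFrame(
    {
        "eventID": ["St. John_250_1", "St. John_250_1"],
        "occurrenceID": ["", "occ-1"],  # empty for event-level measurements
        "measurementType": ["temperature", "Percent Cover"],
    }
)

# Every measurement must belong to an event in the event file.
missing_events = ~emof["eventID"].isin(event["eventID"])

# Occurrence-level measurements must reference an existing occurrence.
has_occurrence = emof["occurrenceID"] != ""
missing_occurrences = has_occurrence & ~emof["occurrenceID"].isin(
    occurrence["occurrenceID"]
)

print(int(missing_events.sum()), int(missing_occurrences.sum()))  # 0 0
```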

Below we walk through creating three independent DataFrames for temperature, rugosity, and percent cover, populating each with all of the information we have available and removing duplicative fields. Finally, we concatenate all the extended measurements or facts into one DataFrame.

temperature = df[["eventID", "temperature", "date"]].copy()
temperature["occurrenceID"] = ""
temperature["measurementType"] = "temperature"
temperature["measurementTypeID"] = (
    "http://vocab.nerc.ac.uk/collection/P25/current/WTEMP/"
)
temperature["measurementValue"] = temperature["temperature"]
temperature["measurementUnit"] = "Celsius"
temperature["measurementUnitID"] = (
    "http://vocab.nerc.ac.uk/collection/P06/current/UPAA/"
)
temperature["measurementAccuracy"] = 3
temperature["measurementDeterminedDate"] = pd.to_datetime(
    temperature["date"], format="%m/%d/%Y"
)
temperature["measurementMethod"] = ""
temperature.drop(columns=["temperature", "date"], inplace=True)

rugosity = df[["eventID", "rugosity", "date"]].copy()
rugosity["occurrenceID"] = ""
rugosity["measurementType"] = "rugosity"
rugosity["measurementTypeID"] = ""
rugosity["measurementValue"] = rugosity["rugosity"].map("{:,.6f}".format)
rugosity["measurementUnit"] = ""
rugosity["measurementUnitID"] = ""
rugosity["measurementAccuracy"] = ""
rugosity["measurementDeterminedDate"] = pd.to_datetime(
    rugosity["date"], format="%m/%d/%Y"
)
rugosity["measurementMethod"] = ""
rugosity.drop(columns=["rugosity", "date"], inplace=True)

percent_cover = df[["eventID", "occurrenceID", "percent cover", "date"]].copy()
percent_cover["measurementType"] = "Percent Cover"
percent_cover["measurementTypeID"] = (
    "http://vocab.nerc.ac.uk/collection/P01/current/SDBIOL10/"
)
percent_cover["measurementValue"] = percent_cover["percent cover"]
percent_cover["measurementUnit"] = "Percent/100m^2"
percent_cover["measurementUnitID"] = ""
percent_cover["measurementAccuracy"] = 5
percent_cover["measurementDeterminedDate"] = pd.to_datetime(
    percent_cover["date"], format="%m/%d/%Y"
)
percent_cover["measurementMethod"] = ""
percent_cover.drop(columns=["percent cover", "date"], inplace=True)

measurementorfact = pd.concat([temperature, rugosity, percent_cover])
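
The three blocks above repeat the same pattern, so the construction could also be wrapped in a function. The sketch below is one possible refactoring, not the approach used in this notebook: make_emof and its parameters are hypothetical names, and the input rows and occurrence IDs are made up.

```python
import pandas as pd

EMOF_COLUMNS = [
    "eventID",
    "occurrenceID",
    "measurementType",
    "measurementTypeID",
    "measurementValue",
    "measurementUnit",
    "measurementUnitID",
    "measurementDeterminedDate",
]


def make_emof(df, value_column, measurement_type, *, occurrence_level=False,
              type_id="", unit="", unit_id=""):
    """Reshape one raw column into long-format measurement-or-fact rows."""
    out = pd.DataFrame(
        {
            "eventID": df["eventID"],
            # Only occurrence-level measurements carry an occurrenceID.
            "occurrenceID": df["occurrenceID"] if occurrence_level else "",
            "measurementType": measurement_type,
            "measurementTypeID": type_id,
            "measurementValue": df[value_column],
            "measurementUnit": unit,
            "measurementUnitID": unit_id,
            "measurementDeterminedDate": pd.to_datetime(
                df["date"], format="%m/%d/%Y"
            ),
        }
    )
    return out[EMOF_COLUMNS]


# Toy raw rows (illustrative values; occurrence IDs are placeholders).
raw = pd.DataFrame(
    {
        "eventID": ["St. John_250_1", "St. John_250_2"],
        "occurrenceID": ["occ-1", "occ-2"],
        "date": ["7/16/2004", "7/16/2004"],
        "temperature": [25.2, 24.8],
        "percent cover": [0, 5],
    }
)

emof = pd.concat(
    [
        make_emof(raw, "temperature", "temperature", unit="Celsius"),
        make_emof(raw, "percent cover", "Percent Cover", occurrence_level=True),
    ],
    ignore_index=True,
)
print(emof[["eventID", "occurrenceID", "measurementType", "measurementValue"]])
```

Passing ignore_index=True gives the combined table a fresh row index, which avoids the repeated index labels that appear when the frames are concatenated as they are.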

Finally, we write the measurement or fact file, again specifying the ISO date format. We’ve printed ten random rows of the DataFrame to give an example of what the resultant file will look like.

# Write to a local path; to_csv cannot write to a read-only HTTPS URL.
file = "MadeUpData_mof.csv"

measurementorfact.to_csv(file, index=False, header=True, date_format="%Y-%m-%d")
measurementorfact.sample(n=10)
eventID occurrenceID measurementType measurementTypeID measurementValue measurementUnit measurementUnitID measurementAccuracy measurementDeterminedDate measurementMethod
6 St. John_250_2 temperature http://vocab.nerc.ac.uk/collection/P25/current... 24.8 Celsius http://vocab.nerc.ac.uk/collection/P06/current... 3 2004-07-16
18 St. John_356_2 rugosity 0.158489 2004-07-17
4 St. John_250_2 temperature http://vocab.nerc.ac.uk/collection/P25/current... 24.8 Celsius http://vocab.nerc.ac.uk/collection/P06/current... 3 2004-07-16
11 St. John_250_3 temperature http://vocab.nerc.ac.uk/collection/P25/current... 23.1 Celsius http://vocab.nerc.ac.uk/collection/P06/current... 3 2004-07-16
6 St. John_250_2 f470068c-998a-4e9b-b026-02bf02118de7 Percent Cover http://vocab.nerc.ac.uk/collection/P01/current... 0 Percent/100m^2 5 2004-07-16
4 St. John_250_2 rugosity 0.364583 2004-07-16
4 St. John_250_2 f470068c-998a-4e9b-b026-02bf02118de7 Percent Cover http://vocab.nerc.ac.uk/collection/P01/current... 0 Percent/100m^2 5 2004-07-16
20 St. John_356_3 rugosity 0.489574 2004-07-17
2 St. John_250_1 f470068c-998a-4e9b-b026-02bf02118de7 Percent Cover http://vocab.nerc.ac.uk/collection/P01/current... 15 Percent/100m^2 5 2004-07-16
2 St. John_250_1 temperature http://vocab.nerc.ac.uk/collection/P25/current... 25.2 Celsius http://vocab.nerc.ac.uk/collection/P06/current... 3 2004-07-16

Author: Mathew Biddle