QA/QC
Last updated on 2025-01-07
Overview
Questions
- How can I QC my data?
Objectives
- “Data enhancement and quality control”
Data enhancement and quality control
OBIS performs a number of quality checks on the data it receives. Red quality flags are attached to occurrence records when errors are encountered, and records may be rejected outright if they do not meet minimum requirements. The checks that OBIS performs are documented here and a Python implementation is available here. It is therefore important to perform quality control on your standardized data before publishing it to OBIS and/or GBIF. Doing so helps identify outliers or “faulty” data, and helps ensure that your data is compatible and interoperable with other datasets published to OBIS. The obistools R package provides numerous functions for identifying outliers, inspecting quality, and checking that the dataset structure fits the required format for both the Event and Occurrence tables.
📌 Recommended initial checks on your data
- Check that all the required Darwin Core terms are present and contain the correct information.
- Make a map from your data to ensure the coordinates are valid and within your expected range.
- Run basic statistics on each column of numeric data (min, max, mean, std. dev., etc.) to identify potential issues.
- Look at the unique values of columns containing string entries to identify potential issues (e.g. spelling).
- Check for uniqueness of the occurrenceID field.
- Check for uniqueness of eventID for each event, if applicable.
- Check that dates follow ISO-8601.
- Check that the scientificNameID is/are valid.
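Several of these checks can be sketched in base R before reaching for a dedicated package. The example below is a minimal illustration, not an OBIS-endorsed procedure: the `occ` data frame is made up, the `required_terms` list is an assumption that should be checked against the current OBIS requirements, and a simple coordinate range test stands in for plotting a map.

```r
# Toy occurrence table (hypothetical values, for illustration only)
occ <- data.frame(
  occurrenceID     = c("occ_1", "occ_2", "occ_2"),
  scientificName   = c("Atractosteus spatula", "Anchoa mitchilli", "Anchoa mitchili"),
  decimalLatitude  = c(28.1, 28.2, 128.3),
  decimalLongitude = c(-96.9, -97.0, -96.8),
  basisOfRecord    = "HumanObservation",
  stringsAsFactors = FALSE
)

# Required terms: which expected columns are absent?
# (assumed list; verify against the OBIS documentation)
required_terms <- c("occurrenceID", "scientificName", "scientificNameID",
                    "eventDate", "decimalLatitude", "decimalLongitude",
                    "occurrenceStatus", "basisOfRecord")
missing_terms <- setdiff(required_terms, names(occ))

# Uniqueness: occurrenceID values that appear more than once
duplicated_ids <- unique(occ$occurrenceID[duplicated(occ$occurrenceID)])

# Coordinates: rows outside the valid latitude/longitude range
bad_coords <- which(abs(occ$decimalLatitude) > 90 |
                    abs(occ$decimalLongitude) > 180)

# String columns: list distinct values to spot spelling variants
sort(unique(occ$scientificName))

# Numeric columns: basic statistics (min, max, mean, quartiles)
summary(occ[, c("decimalLatitude", "decimalLongitude")])
```

Here `missing_terms`, `duplicated_ids` and `bad_coords` each collect the items that need attention, so an empty result means that particular check passed. The misspelled name in the toy data shows up as an extra entry in the sorted list of unique values.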
One method for reviewing your data is to use the describe function from the R package Hmisc. The example below, which uses output from this notebook, shows how it works.
R
# pull in the occurrence file from https://www.sciencebase.gov/catalog/item/53a887f4e4b075096c60cfdd
url <- "https://www.sciencebase.gov/catalog/file/get/53a887f4e4b075096c60cfdd?f=__disk__32%2F24%2F80%2F322480c9bcbad19030e29c9ec5e2caeb54cb4a08&allowOpen=true"
occurrence <- read.csv(url)
head(occurrence,n=1)
vernacularName eventID occurrenceStatus
1 Alligator gar Station_95_Date_09JAN1997:14:35:00.000 Absent
basisOfRecord scientificName
1 HumanObservation Atractosteus spatula
scientificNameID kingdom phylum class
1 urn:lsid:marinespecies.org:taxname:279822 Animalia Chordata Actinopteri
order family genus scientificNameAuthorship
1 Lepisosteiformes Lepisosteidae Atractosteus (Lacepède, 1803)
taxonRank organismQuantity organismQuantityType
1 Species 0 Relative Abundance
occurrenceID
1 Station_95_Date_09JAN1997:14:35:00.000_Atractosteus_spatula
collectionCode
1 Aransas Bay Bag Seine
Hmisc::describe(occurrence)
OUTPUT
occurrence
18 Variables 334341 Observations
--------------------------------------------------------------------------------
vernacularName
n missing distinct
334341 0 61
lowest : Alligator gar Arrow shrimp Atlantic brief squid Atlantic bumper Atlantic croaker
highest: Striped mullet Thinstripe hermit Threadfin shad White mullet White shrimp
--------------------------------------------------------------------------------
eventID
n missing distinct
334341 0 5481
lowest : Station_10_Date_04DEC1991:13:59:00.000 Station_10_Date_04SEP2002:13:17:00.000 Station_10_Date_05JUN1991:15:20:00.000 Station_10_Date_07APR1995:12:54:00.000 Station_10_Date_07APR2000:11:16:00.000
highest: Station_99_Date_21APR1998:18:24:00.000 Station_99_Date_22OCT2001:13:12:00.000 Station_99_Date_25JUN1990:13:48:00.000 Station_99_Date_25NOV2003:11:11:00.000 Station_99_Date_27JUN1988:12:45:00.000
--------------------------------------------------------------------------------
occurrenceStatus
n missing distinct
334341 0 2
Value Absent Present
Frequency 294469 39872
Proportion 0.881 0.119
--------------------------------------------------------------------------------
basisOfRecord
n missing distinct value
334341 0 1 HumanObservation
Value HumanObservation
Frequency 334341
Proportion 1
--------------------------------------------------------------------------------
scientificName
n missing distinct
334341 0 61
lowest : Adinia xenica Anchoa mitchilli Archosargus probatocephalus Ariopsis felis Atractosteus spatula
highest: Stomatopoda Stomolophus meleagris Syngnathus scovelli Tozeuma carolinense Trichiurus lepturus
--------------------------------------------------------------------------------
scientificNameID
n missing distinct
334341 0 61
lowest : urn:lsid:marinespecies.org:taxname:105792 urn:lsid:marinespecies.org:taxname:107034 urn:lsid:marinespecies.org:taxname:107379 urn:lsid:marinespecies.org:taxname:126983 urn:lsid:marinespecies.org:taxname:127089
highest: urn:lsid:marinespecies.org:taxname:367528 urn:lsid:marinespecies.org:taxname:396707 urn:lsid:marinespecies.org:taxname:421784 urn:lsid:marinespecies.org:taxname:422069 urn:lsid:marinespecies.org:taxname:443955
--------------------------------------------------------------------------------
kingdom
n missing distinct value
334341 0 1 Animalia
Value Animalia
Frequency 334341
Proportion 1
--------------------------------------------------------------------------------
phylum
n missing distinct
328860 5481 4
Value Arthropoda Chordata Cnidaria Mollusca
Frequency 71253 246645 5481 5481
Proportion 0.217 0.750 0.017 0.017
--------------------------------------------------------------------------------
class
n missing distinct
328860 5481 5
lowest : Actinopteri Cephalopoda Elasmobranchii Malacostraca Scyphozoa
highest: Actinopteri Cephalopoda Elasmobranchii Malacostraca Scyphozoa
Value Actinopteri Cephalopoda Elasmobranchii Malacostraca
Frequency 235683 5481 10962 71253
Proportion 0.717 0.017 0.033 0.217
Value Scyphozoa
Frequency 5481
Proportion 0.017
--------------------------------------------------------------------------------
order
n missing distinct
328860 5481 22
lowest : Atheriniformes Batrachoidiformes Carangaria incertae sedis Carangiformes Carcharhiniformes
highest: Rhizostomeae Scombriformes Siluriformes Syngnathiformes Tetraodontiformes
--------------------------------------------------------------------------------
family
n missing distinct
328860 5481 36
lowest : Ariidae Atherinopsidae Batrachoididae Carangidae Carcharhinidae
highest: Stromateidae Syngnathidae Tetraodontidae Trichiuridae Triglidae
--------------------------------------------------------------------------------
genus
n missing distinct
328860 5481 52
lowest : Adinia Anchoa Archosargus Ariopsis Atractosteus
highest: Sphoeroides Stomolophus Syngnathus Tozeuma Trichiurus
--------------------------------------------------------------------------------
scientificNameAuthorship
n missing distinct
328860 5481 52
lowest : (Baird & Girard, 1853) (Baird & Girard, 1855) (Blainville, 1823) (Bosc, 1801) (Burkenroad, 1939)
highest: Rathbun, 1896 Say, 1817 [in Say, 1817-1818] Shipp & Yerger, 1969 Valenciennes, 1836 Winchell, 1864
--------------------------------------------------------------------------------
taxonRank
n missing distinct
334341 0 3
Value Genus Order Species
Frequency 5481 5481 323379
Proportion 0.016 0.016 0.967
--------------------------------------------------------------------------------
organismQuantity
n missing distinct Info Mean Gmd .05 .10
334341 0 8696 0.317 0.01639 0.03141 0.00000 0.00000
.25 .50 .75 .90 .95
0.00000 0.00000 0.00000 0.01005 0.07407
lowest : 0.0000000000 0.0000917684 0.0001835370 0.0002136300 0.0002241650
highest: 0.9969931270 0.9974226800 0.9981570220 0.9982300880 1.0000000000
--------------------------------------------------------------------------------
organismQuantityType
n missing distinct value
334341 0 1 Relative Abundance
Value Relative Abundance
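Frequency 334341
Proportion 1
--------------------------------------------------------------------------------
collectionCode
n missing distinct value
334341 0 1 Aransas Bay Bag Seine
Value Aransas Bay Bag Seine
Frequency 334341
Proportion 1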
--------------------------------------------------------------------------------
Exercise
Perform the following minimal quality assurance and control checks:
- Run a diagnostics report for the data quality.
- Ensure that the eventIDs are unique.
- Make sure that the eventDates follow ISO-8601 standards.
- Determine whether reported depths are accurate.
The event core data used in the checks below can be found in this Excel file.
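As a starting point for the exercise, the eventID, eventDate and depth checks can be approximated in base R. This is a minimal sketch under stated assumptions: the `event` data frame is invented (it is not the Excel file referenced above), the date pattern only accepts `YYYY-MM-DD` with an optional time part and so rejects other valid ISO-8601 forms such as `YYYY-MM` or date intervals, and the depth check only tests internal consistency rather than comparing depths against bathymetry.

```r
# Hypothetical event table standing in for the real event core
event <- data.frame(
  eventID   = c("cruise_1:st_1", "cruise_1:st_2", "cruise_1:st_2", "cruise_1:st_3"),
  eventDate = c("2019-06-01", "2019-06-02T08:30:00Z", "02/06/2019", "2019-6-3"),
  minimumDepthInMeters = c(0, 5, 10, -2),
  maximumDepthInMeters = c(10, 3, 20, 15),
  stringsAsFactors = FALSE
)

# eventIDs that occur more than once
dup_event_ids <- unique(event$eventID[duplicated(event$eventID)])

# Simplified ISO-8601 pattern: date, optional time, optional zone
iso_pattern <- paste0(
  "^[0-9]{4}-[0-9]{2}-[0-9]{2}",
  "(T[0-9]{2}:[0-9]{2}(:[0-9]{2})?(Z|[+-][0-9]{2}:?[0-9]{2})?)?$"
)
bad_date_rows <- which(!grepl(iso_pattern, event$eventDate))

# Depths: flag negative values and minimum greater than maximum for review
bad_depth_rows <- which(
  event$minimumDepthInMeters < 0 |
  event$maximumDepthInMeters < 0 |
  event$minimumDepthInMeters > event$maximumDepthInMeters
)

# Inspect the flagged rows
event[bad_date_rows, c("eventID", "eventDate")]
event[bad_depth_rows, ]
```

Flagged rows are candidates for review rather than confirmed errors; a negative depth, for instance, may be legitimate for some sampling designs.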
📌 Tip
- In some cases you’ll want to ensure the values are representative of the entity you are reporting.
- For example, individualCount should be an integer, so it is worth checking that column for non-integer values.
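One way to run that check in base R is to compare each value with its rounded version. The `individualCount` values below are made up for illustration, and treating missing or negative counts as problems is an assumption you may want to relax for your own data.

```r
# Hypothetical counts, including one non-integer and one missing value
individualCount <- c(3, 12, 4.5, NA, 0)

# TRUE where a value is present and is a whole, non-negative number
is_whole <- !is.na(individualCount) &
  individualCount == round(individualCount) &
  individualCount >= 0

# Positions that need a closer look (non-integer, negative, or missing)
which(!is_whole)
```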
Key Points
- Several packages (e.g. obistools, Hmisc, pandas) can be used to QA/QC data.