Using NCEI geoportal REST API to collect information about IOOS Regional Association archived data

IOOS Regional Associations archive their non-federal observational data with NOAA’s National Centers for Environmental Information (NCEI). In this notebook we use the RESTful services of the NCEI geoportal to collect metadata from the archive packages found in the NCEI archives. The metadata are stored in ISO 19115-2 XML files, which the NCEI geoportal uses for discovery of Archival Information Packages (AIPs). This example uses the ISO metadata records to display publication information and to plot the time coverage of each AIP at NCEI that meets the search criteria.

First we update the namespaces dictionary from owslib to include the appropriate namespace reference for gmi and gml.

For more information on ISO Namespaces see: https://geo-ide.noaa.gov/wiki/index.php?title=ISO_Namespaces

from owslib.iso import namespaces

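# Add the gmi and gml namespaces to the namespaces dictionary.
namespaces.update({"gmi": "http://www.isotc211.org/2005/gmi"})
namespaces.update({"gml": "http://www.opengis.net/gml/3.2"})
# Drop the default (None) entry if present; a None key is not a usable prefix
# in the ElementTree path lookups below. pop() keeps the cell safe to re-run.
namespaces.pop(None, None)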
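To illustrate why the prefix map matters, here is a minimal sketch of how ElementTree resolves prefixed paths against such a dictionary. The XML fragment is a hypothetical, heavily trimmed stand-in (not a valid ISO record), and `ns` is a hand-written subset of the owslib dictionary so that the example runs without owslib installed.

```python
import xml.etree.ElementTree as ET

# Hand-written subset of the prefix map (assumption: mirrors the owslib entries).
ns = {
    "gco": "http://www.isotc211.org/2005/gco",
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gmi": "http://www.isotc211.org/2005/gmi",
    "gml": "http://www.opengis.net/gml/3.2",
}

# Hypothetical fragment for illustration only.
SAMPLE_XML = """<gmi:MI_Metadata
    xmlns:gco="http://www.isotc211.org/2005/gco"
    xmlns:gmd="http://www.isotc211.org/2005/gmd"
    xmlns:gmi="http://www.isotc211.org/2005/gmi"
    xmlns:gml="http://www.opengis.net/gml/3.2">
  <gmd:fileIdentifier>
    <gco:CharacterString>gov.noaa.nodc:0163740</gco:CharacterString>
  </gmd:fileIdentifier>
  <gml:TimePeriod>
    <gml:beginPosition>2009-06-09</gml:beginPosition>
  </gml:TimePeriod>
</gmi:MI_Metadata>"""

root = ET.fromstring(SAMPLE_XML)

# Each "prefix:tag" step in the path is expanded using the ns dictionary.
file_id = root.find("gmd:fileIdentifier/gco:CharacterString", ns).text
begin = root.find(".//gml:beginPosition", ns).text

print(file_id)
print(begin)
```

Without the gmi and gml entries, paths that use those prefixes cannot be resolved, which is why the dictionary is extended before any record is parsed.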

Now we select a Regional Association and platform

This is where the user identifies the Regional Association (RA) and the platform type they are interested in. Change the RA acronym to the RA of interest. To collect metadata for all IOOS non-federal observation data archived through the NCEI-IOOS pipeline, set the RA to None.

The options for platform include: "HF Radar", "Glider", and "FIXED PLATFORM". Use None to search across all platform types. Note that the platform string keeps its own double quotes so that it is searched as an exact phrase.

# Select RA, this will be the acronym for the RA or None if you want to search across all RAs
ra = 'CARICOOS'

# Identify the platform.
platform = '"FIXED PLATFORM"' # Options include: None, "HF Radar", "Glider", "FIXED PLATFORM"

Next we generate a geoportal query and request a CSV response

To find more information about how to compile a geoportal query, have a look at REST API Syntax and the NCEI Search Tips for the NCEI geoportal. The example provided is specific to the NCEI-IOOS data pipeline project and only searches for non-federal timeseries data collected by each Regional Association.

The query developed here can be adapted to search for any Archival Information Packages at NCEI. To do so, develop the appropriate query using the NCEI Geoportal, then update this portion of the code with the resulting REST API query.

try:
    from urllib.parse import quote
except ImportError:
    from urllib import quote

# Generate the geoportal query and request a csv response.

# Base geoportal url.
baseurl = "https://www.ncei.noaa.gov/metadata/geoportal/opensearch?q="

# Identify the Regional Association
if ra is None:
    reg_assoc = ''
else:
    RAs = {
        "AOOS": "Alaska Ocean Observing System",
        "CARICOOS": "Caribbean Coastal Ocean Observing System",
        "CeNCOOS": "Central and Northern California Coastal Ocean Observing System",
        "GCOOS": "Gulf of Mexico Coastal Ocean Observing System",
        "GLOS": "Great Lakes Observing System",
        "MARACOOS": "Mid-Atlantic Regional Association Coastal Ocean Observing System",
        "NANOOS": "Northwest Association of Networked Ocean Observing Systems",
        "NERACOOS": "Northeastern Regional Association of Coastal Ocean Observing System",
        "PacIOOS": "Pacific Islands Ocean Observing System",
        "SCCOOS": "Southern California Coastal Ocean Observing System",
        "SECOORA": "Southeast Coastal Ocean Observing Regional Association",
        }
    reg_assoc = '(dataThemeinstitutions_s:"%s" dataThemeprojects_s:"%s (%s)")'%(RAs[ra], RAs[ra], ra)

# Identify the project.
project = '"Integrated Ocean Observing System Data Assembly Centers Data Stewardship Program"'

# Identify the number of records to return: records 1 to 1010.
records = "&start=1&num=1010"

# Identify the format of the response: csv.
response_format = "&f=csv"

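# Join only the search terms that are set, so that omitting the RA
# does not leave a dangling AND at the start of the query.
search_terms = [term for term in (reg_assoc, platform) if term]
reg_assoc_plat = quote(' AND'.join(search_terms))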

# Combine the URL.
url = "{}{}&filter=dataThemeprojects_s:{}{}{}".format(
    baseurl, reg_assoc_plat, quote(project), records, response_format
)

print("Identified response format:\n{}".format(url))
print(
    "\nSearch page response:\n{}".format(url.replace(response_format, "&f=searchPage"))
)
Identified response format:
https://www.ncei.noaa.gov/metadata/geoportal/opensearch?q=%28dataThemeinstitutions_s%3A%22Caribbean%20Coastal%20Ocean%20Observing%20System%22%20dataThemeprojects_s%3A%22Caribbean%20Coastal%20Ocean%20Observing%20System%20%28CARICOOS%29%22%29%20AND%22FIXED%20PLATFORM%22&filter=dataThemeprojects_s:%22Integrated%20Ocean%20Observing%20System%20Data%20Assembly%20Centers%20Data%20Stewardship%20Program%22&start=1&num=1010&f=csv

Search page response:
https://www.ncei.noaa.gov/metadata/geoportal/opensearch?q=%28dataThemeinstitutions_s%3A%22Caribbean%20Coastal%20Ocean%20Observing%20System%22%20dataThemeprojects_s%3A%22Caribbean%20Coastal%20Ocean%20Observing%20System%20%28CARICOOS%29%22%29%20AND%22FIXED%20PLATFORM%22&filter=dataThemeprojects_s:%22Integrated%20Ocean%20Observing%20System%20Data%20Assembly%20Centers%20Data%20Stewardship%20Program%22&start=1&num=1010&f=searchPage
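The URL assembly above can also be wrapped in a small function, which makes it easier to request several RAs or platforms in a loop. This is only a sketch: `build_query_url` and its parameters are hypothetical names, not part of any NCEI or IOOS library, and the query fields are simply the ones used in this notebook.

```python
from urllib.parse import quote, unquote

BASE_URL = "https://www.ncei.noaa.gov/metadata/geoportal/opensearch?q="
PROJECT = '"Integrated Ocean Observing System Data Assembly Centers Data Stewardship Program"'


def build_query_url(ra_name=None, ra_acronym=None, platform=None, fmt="csv", num=1010):
    """Hypothetical helper: assemble the geoportal search URL used in this notebook."""
    terms = []
    if ra_name and ra_acronym:
        terms.append(
            '(dataThemeinstitutions_s:"{0}" dataThemeprojects_s:"{0} ({1})")'.format(
                ra_name, ra_acronym
            )
        )
    if platform:
        terms.append(platform)
    # Joined with ' AND' (no trailing space) to reproduce the URL printed above.
    query = quote(" AND".join(terms))
    return "{}{}&filter=dataThemeprojects_s:{}&start=1&num={}&f={}".format(
        BASE_URL, query, quote(PROJECT), num, fmt
    )


url_csv = build_query_url(
    "Caribbean Coastal Ocean Observing System", "CARICOOS", '"FIXED PLATFORM"'
)
url_page = build_query_url(
    "Caribbean Coastal Ocean Observing System", "CARICOOS", '"FIXED PLATFORM"',
    fmt="searchPage",
)
url_all_gliders = build_query_url(platform='"Glider"')

# Decoding the URL is a convenient way to check the query before sending it.
print(unquote(url_csv))
```

Printing the decoded form shows the search expression as it would be typed into the geoportal search box, which helps when debugging a query that returns no records.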

Time to query the portal and parse out the csv response

Here we open the REST API URL built above and, because we requested the CSV format, read the response directly into a DataFrame with the pandas package. We also split the Data_Date_Range column into two columns, data_start_date and data_end_date, so that the coverage dates are available as datetimes.

import pandas as pd
import numpy as np

df = pd.read_csv(url)

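# Split the "start to end" range into two datetime columns.
df[['data_start_date', 'data_end_date']] = df['Data_Date_Range'].str.split(' to ', expand=True)
df['data_start_date'] = pd.to_datetime(df['data_start_date'])
# The end stamp stops at 23:59:59.999, so add 1 ms to land on the next midnight.
df['data_end_date'] = pd.to_datetime(df['data_end_date']) + pd.Timedelta(milliseconds=1)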

df.head()
Field            Value
(index)          0
Id               gov.noaa.nodc:0163740
Title            Oceanographic and surface meteorological data ...
Description      NaN
West             -66.5321
South            17.8628
East             -66.5212
North            17.8686
Link_Xml         http://www.ncei.noaa.gov/metadata/geoportal/re...
Link_1           NaN
Link_2           NaN
Link_3           NaN
Link_4           NaN
Data_Date_Range  2009-06-09T00:00:00Z to 2020-10-14T23:59:59.999Z
Date_Published   2017-06-27T00:00:00Z
data_start_date  2009-06-09 00:00:00+00:00
data_end_date    2020-10-15 00:00:00+00:00
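To make the date handling concrete, the following standard-library sketch applies the same steps to the single Data_Date_Range value shown above: split on " to ", parse both stamps as UTC, and add one millisecond to the end. It is an illustration of the logic only, not the pandas code used in the notebook, and `parse_utc` is a hypothetical helper name.

```python
from datetime import datetime, timedelta


def parse_utc(stamp):
    """Parse an ISO 8601 stamp ending in 'Z' into a timezone-aware datetime."""
    return datetime.fromisoformat(stamp.replace("Z", "+00:00"))


date_range = "2009-06-09T00:00:00Z to 2020-10-14T23:59:59.999Z"

# Same separator as in the str.split(' to ', expand=True) call.
start_text, end_text = date_range.split(" to ")

start = parse_utc(start_text)
# Adding 1 ms turns 23:59:59.999 into midnight of the following day.
end = parse_utc(end_text) + timedelta(milliseconds=1)

coverage_days = (end - start).days

print(start.isoformat())
print(end.isoformat())
print(coverage_days)
```

Rounding the end stamp up to midnight gives the coverage a whole number of days, which also makes the bars in the later plot end on a day boundary.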

Now, let’s pull out all the ISO metadata record links and print them so the user can browse to each metadata record and look for the items they might be interested in.

# Print the metadata links for each record in the csv response.

print("Found %i record(s)" % len(df))
for index, row in df.iterrows():
    print('ISO19115-2 record:',row['Link_Xml'])  # URL to ISO19115-2 record.
    print('NCEI dataset metadata page: https://www.ncei.noaa.gov/access/metadata/landing-page/bin/iso?id=' + row['Id'] )
    print('\n')
Found 1 record(s)
ISO19115-2 record: http://www.ncei.noaa.gov/metadata/geoportal/rest/metadata/item/gov.noaa.nodc%3A0163740/xml
NCEI dataset metadata page: https://www.ncei.noaa.gov/access/metadata/landing-page/bin/iso?id=gov.noaa.nodc:0163740
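The identifiers printed above follow a visible pattern, sketched below for the one record returned here. Treat the patterns as assumptions inferred from this output rather than a documented NCEI contract: the accession number appears to be the part of the Id after the colon, and the ISO XML link appears to embed the percent-encoded Id.

```python
from urllib.parse import quote

record_id = "gov.noaa.nodc:0163740"

# For this record, the text after the colon matches the NCEI accession number.
accession = record_id.split(":")[-1]

# Landing page pattern taken from the print statement above.
landing_page = (
    "https://www.ncei.noaa.gov/access/metadata/landing-page/bin/iso?id=" + record_id
)

# Assumption: the ISO XML link follows the pattern seen in the Link_Xml value,
# with the Id percent-encoded (":" becomes "%3A").
iso_xml = "http://www.ncei.noaa.gov/metadata/geoportal/rest/metadata/item/{}/xml".format(
    quote(record_id)
)

print(accession)
print(landing_page)
print(iso_xml)
```

In practice it is safer to read Link_Xml from the CSV response, as the notebook does, and use a reconstruction like this only as a cross-check.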

Let’s collect what we have found

Now that we have all the ISO metadata records we are interested in, it’s time to do something fun with them. In this example we want to generate a timeseries plot of the data coverage for the “Caribbean Coastal Ocean Observing System” stations we have archived at NCEI.

First we need to collect some information. We loop through each ISO record to collect metadata about each package. The example here shows how to collect the following items:

  1. NCEI Archival Information Package (AIP) Accession ID (7-digit Accession Number)

  2. The first date the archive package was published.

  3. The platform code identified from the provider.

  4. The version number and date it was published.

  5. The current AIP size, in MB.

There are plenty of other metadata elements to collect from the ISO records, so we recommend browsing to one of the records and having a look at the items of interest to your community.
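As a starting point for collecting other keyword-based elements, the sketch below gathers every keyword in a record into a dictionary keyed by thesaurus title. The XML is a hypothetical, trimmed fragment modelled on the element paths used in this notebook, and `keywords_by_thesaurus` is an illustrative helper name, not an owslib function.

```python
import xml.etree.ElementTree as ET

NS = {
    "gco": "http://www.isotc211.org/2005/gco",
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gmx": "http://www.isotc211.org/2005/gmx",
}

# Hypothetical fragment for illustration only.
SAMPLE_XML = """<gmd:MD_DataIdentification
    xmlns:gco="http://www.isotc211.org/2005/gco"
    xmlns:gmd="http://www.isotc211.org/2005/gmd"
    xmlns:gmx="http://www.isotc211.org/2005/gmx">
  <gmd:descriptiveKeywords><gmd:MD_Keywords>
    <gmd:keyword><gco:CharacterString>PR1 (CarICOOS Data Buoy A)</gco:CharacterString></gmd:keyword>
    <gmd:thesaurusName><gmd:CI_Citation><gmd:title>
      <gco:CharacterString>Provider Platform Names</gco:CharacterString>
    </gmd:title></gmd:CI_Citation></gmd:thesaurusName>
  </gmd:MD_Keywords></gmd:descriptiveKeywords>
  <gmd:descriptiveKeywords><gmd:MD_Keywords>
    <gmd:keyword><gmx:Anchor>0163740</gmx:Anchor></gmd:keyword>
    <gmd:thesaurusName><gmd:CI_Citation><gmd:title>
      <gco:CharacterString>NCEI ACCESSION NUMBER</gco:CharacterString>
    </gmd:title></gmd:CI_Citation></gmd:thesaurusName>
  </gmd:MD_Keywords></gmd:descriptiveKeywords>
</gmd:MD_DataIdentification>"""


def keywords_by_thesaurus(root):
    """Map each thesaurus title to the list of keyword strings filed under it."""
    title_path = "gmd:thesaurusName/gmd:CI_Citation/gmd:title/gco:CharacterString"
    result = {}
    for block in root.iterfind(".//gmd:descriptiveKeywords/gmd:MD_Keywords", NS):
        title = block.find(title_path, NS)
        if title is None or not title.text:
            continue
        for keyword in block.iterfind("gmd:keyword", NS):
            # A keyword value may be a gco:CharacterString or a gmx:Anchor child.
            for node in keyword:
                if node.text and node.text.strip():
                    result.setdefault(title.text.strip(), []).append(node.text.strip())
    return result


keywords = keywords_by_thesaurus(ET.fromstring(SAMPLE_XML))
for thesaurus, values in keywords.items():
    print(thesaurus, "->", values)
```

Listing the thesaurus titles found in one real record is a quick way to see which keyword groups are available before deciding what to add to the DataFrame.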

# Process each iso record.
%matplotlib inline

import xml.etree.ElementTree as ET
from urllib.request import urlopen

from owslib import util


df[['provider_platform_name','NCEI_accession_number','package_size_mb','submitter']] = ''

# For each accession in response.
for url in df['Link_Xml']:

    iso = urlopen(url)
    iso_tree = ET.parse(iso)
    root = iso_tree.getroot()

    vers_dict = dict()
    
    # Collect Publication date information.
    date_path = (
        ".//"
        "gmd:identificationInfo/"
        "gmd:MD_DataIdentification/"
        "gmd:citation/"
        "gmd:CI_Citation/"
        "gmd:date/"
        "gmd:CI_Date/"
        "gmd:date/gco:Date"
    )
    # First published date.
    pubdate = root.find(date_path, namespaces)
    print("\nFirst published date = %s" % util.testXMLValue(pubdate))
    
    # Data Temporal Coverage.
    temporal_extent_path = (
        ".//"
        "gmd:temporalElement/"
        "gmd:EX_TemporalExtent/"
        "gmd:extent/"
        "gml:TimePeriod"
    
    )
    
    beginPosition = root.find(temporal_extent_path + '/gml:beginPosition', namespaces).text
    endPosition = root.find(temporal_extent_path + '/gml:endPosition', namespaces).text
    
    print("Data time coverage: %s to %s" % (beginPosition, endPosition))

    # Collect keyword terms of interest.
    for MD_keywords in root.iterfind('.//gmd:descriptiveKeywords/gmd:MD_Keywords', namespaces):

        for thesaurus_name in MD_keywords.iterfind('.//gmd:thesaurusName/gmd:CI_Citation/gmd:title/gco:CharacterString', namespaces):
            
            if thesaurus_name.text == "Provider Platform Names":

                plat_name = MD_keywords.find('.//gmd:keyword/gco:CharacterString', namespaces).text
                print("Provider Platform Code = %s" % plat_name)
                df.loc[df.Link_Xml == url, ['provider_platform_name']] = plat_name
                break
                
            elif thesaurus_name.text == "NCEI ACCESSION NUMBER":
                acce_no = MD_keywords.find('.//gmd:keyword/gmx:Anchor', namespaces).text
                print("Accession:",acce_no)
                df.loc[df.Link_Xml == url, ['NCEI_accession_number']] = acce_no
                break
            
            elif thesaurus_name.text == "NODC SUBMITTING INSTITUTION NAMES THESAURUS":
                submitter = MD_keywords.find('.//gmd:keyword/gmx:Anchor', namespaces).text
                print("Submitter:", submitter)
                df.loc[df.Link_Xml == url, ['submitter']] = submitter
            
    # Pull out the version information.
    # Iterate through each processing step which is an NCEI version.
    for process_step in root.iterfind(".//gmd:processStep", namespaces):
        # Only parse gco:DateTime and gmd:title/gco:CharacterString.
        vers_title = (
            ".//"
            "gmi:LE_ProcessStep/"
            "gmi:output/"
            "gmi:LE_Source/"
            "gmd:sourceCitation/"
            "gmd:CI_Citation/"
            "gmd:title/"
            "gco:CharacterString"
        )
        vers_date = (
            ".//" 
            "gmi:LE_ProcessStep/" 
            "gmd:dateTime/"
            "gco:DateTime"
        )
        if process_step.findall(vers_date, namespaces) and process_step.findall(vers_title, namespaces):
            # Extract the date and time of each version.
            version_date = pd.to_datetime(process_step.find(vers_date, namespaces).text)

            # Extract the version number (the last word of the title).
            version = process_step.find(vers_title, namespaces).text.split(" ")[-1]
            print("{} = {}".format(version, version_date))
            vers_dict[version] = version_date
            df.loc[df.Link_Xml == url, ['version_info']] = [vers_dict]
    
    # Collect package size information.
    # Use the first transfer size node that carries a value.
    for trans_size in root.iterfind(".//gmd:transferSize", namespaces):
        size_node = trans_size.find(".//gco:Real", namespaces)

        if size_node is not None and size_node.text:
            sizes = size_node.text
            print("Current AIP Size = %s MB" % sizes)

            df.loc[df.Link_Xml == url, ['package_size_mb']] = float(sizes)
            break
    
First published date = 2017-06-27
Data time coverage: 2009-06-09 to 2020-10-14
Accession: 0163740
Submitter: Caribbean Coastal Ocean Observing System
Provider Platform Code = PR1 (CarICOOS Data Buoy A)
v1.1 = 2017-06-27 14:48:08+00:00
v2.2 = 2021-01-07 23:08:26+00:00
Current AIP Size = 93.228 MB
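The version dictionary built in the loop can be used to answer simple questions such as which version is the most recent. The sketch below uses the two versions printed above. The title strings and the "Z"-suffixed timestamp format are hypothetical placeholders, since only the trailing version token and the parsed dates appear in the output.

```python
from datetime import datetime

# Hypothetical (title, timestamp) pairs; only the last word of each title is used.
process_steps = [
    ("NCEI Accession 0163740 v1.1", "2017-06-27T14:48:08Z"),
    ("NCEI Accession 0163740 v2.2", "2021-01-07T23:08:26Z"),
]

vers = {}
for title, stamp in process_steps:
    # Same rule as in the loop above: the version is the last space-separated token.
    version = title.split(" ")[-1]
    vers[version] = datetime.fromisoformat(stamp.replace("Z", "+00:00"))

# Pick the versions with the earliest and latest publication times.
first_version = min(vers, key=vers.get)
latest_version = max(vers, key=vers.get)
days_between = (vers[latest_version] - vers[first_version]).days

print("first:", first_version, vers[first_version].isoformat())
print("latest:", latest_version, vers[latest_version].isoformat())
print("days between:", days_between)
```

Comparing by timestamp rather than by version string avoids having to interpret the version numbering scheme.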

Create a timeseries plot of data coverage

Now that we have a DataFrame with all the information we’re interested in, let’s make a time coverage plot for all the AIPs at NCEI.

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
    
ypos = range(len(df))
fig, ax = plt.subplots(figsize=(15, 12))

# Plot the data
ax.barh(ypos, mdates.date2num(df['data_end_date']) - mdates.date2num(df['data_start_date']), 
        left = mdates.date2num(df['data_start_date']), 
        height = 0.5, 
        align = 'center')

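# Pad the x-axis by about one month (30 days) on each side. A fixed number of days
# avoids the calendar-month unit ("M"), which newer pandas versions reject in Timedelta.
xlim = ( mdates.date2num(df['data_start_date'].min() - pd.Timedelta(days=30)),
         mdates.date2num(df['data_end_date'].max() + pd.Timedelta(days=30)) )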

ax.set_xlim(xlim)
ax.set(yticks = np.arange(0, len(df)))
ax.tick_params(which="both", direction="out")
ax.set_ylabel("NCEI Accession Number")
ax.set_yticklabels(df['NCEI_accession_number'])
ax.set_title('NCEI archive package time coverage')

ax.xaxis_date()
ax.set_xlabel('Date')

plt.grid(axis='x', linestyle='--')
[Figure: horizontal bar chart titled "NCEI archive package time coverage", with Date on the x-axis and NCEI Accession Number on the y-axis.]

This procedure has been developed as an example of how to use NCEI’s geoportal REST APIs to collect information about packages that have been archived at NCEI. The intention is to provide some guidance and ways to collect this information without having to request it directly from NCEI. NCEI makes a large number of metadata elements available through its ISO metadata records. Anyone interested in collecting other information from the records at NCEI should therefore look at the ISO metadata records, determine which items are of interest to their community, and then update the example code provided to collect that information.

Author: Mathew Biddle