system-test

IOOS DMAC System Integration Test project

What information is available?

A common first task is to find out what information is available, so we can decide what to investigate further.

We can programmatically build a list of strings to query common data catalogs and find out what services are available. This post shows how to query for numerical model strings and tries to answer the question: how many services are available in each catalog?

To answer that question we will start by building a list of known catalog services.

(This post is part of Theme 1 - Scenario A.)

In [3]:
known_csw_servers = ['http://data.nodc.noaa.gov/geoportal/csw',
                     'http://cwic.csiss.gmu.edu/cwicv1/discovery',
                     'http://geoport.whoi.edu/geoportal/csw',
                     'https://edg.epa.gov/metadata/csw',
                     'http://www.ngdc.noaa.gov/geoportal/csw',
                     'http://cmgds.marine.usgs.gov/geonetwork/srv/en/csw',
                     'http://www.nodc.noaa.gov/geoportal/csw',
                     'http://cida.usgs.gov/gdp/geonetwork/srv/en/csw',
                     'http://geodiscover.cgdi.ca/wes/serviceManagerCSW/csw',
                     'http://geoport.whoi.edu/gi-cat/services/cswiso',
                     'https://data.noaa.gov/csw']

And a list of known model strings to query.

In [4]:
known_model_strings = ['roms', 'selfe', 'adcirc', 'ncom',
                       'hycom', 'fvcom', 'pom', 'wrams', 'wrf']
In [5]:
from owslib import fes

model_name_filters = []
for model in known_model_strings:
    kw = dict(literal='*%s*' % model, wildCard='*')
    title_filter = fes.PropertyIsLike(propertyname='apiso:Title', **kw)
    subject_filter = fes.PropertyIsLike(propertyname='apiso:Subject', **kw)
    model_name_filters.append(fes.Or([title_filter, subject_filter]))

The FES filter built above is simpler than what we did before: it only looks for records whose Title or Subject contains one of the model strings.
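To make the wildcard semantics concrete, here is a minimal local sketch of what a `*roms*` match on Title or Subject means. It uses Python's `fnmatch` on plain dictionaries. The `record_matches` helper and the sample records are hypothetical, and real catalogs evaluate `PropertyIsLike` on the server, where case sensitivity and indexing may differ.

```python
from fnmatch import fnmatchcase


def record_matches(record, model):
    """Return True if `*model*` matches the record's title or any subject.

    Hypothetical local emulation of the Or(Title, Subject) filter; both
    sides are lowercased here, which a real server may or may not do.
    """
    pattern = '*%s*' % model.lower()
    fields = [record.get('title', '')] + list(record.get('subjects', []))
    return any(fnmatchcase(field.lower(), pattern) for field in fields)


# Made-up records, for illustration only.
records = [
    {'title': 'ROMS forecast for the coastal ocean',
     'subjects': ['ocean model']},
    {'title': 'Sea surface temperature',
     'subjects': ['satellite', 'HYCOM']},
    {'title': 'Tide gauge observations',
     'subjects': ['water level']},
]

matched = [r['title'] for r in records if record_matches(r, 'roms')]
print(matched)
```

Because the match is a plain substring test, short strings such as `pom` can also match unrelated words, so the counts that follow are best read as an upper bound.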

In [6]:
from owslib.csw import CatalogueServiceWeb

model_results = []

for model_name, single_model_filter in zip(known_model_strings,
                                           model_name_filters):
    for url in known_csw_servers:
        try:
            csw = CatalogueServiceWeb(url, timeout=20)
            csw.getrecords2(constraints=[single_model_filter],
                            maxrecords=1000, esn='full')
            # Each record may advertise several service endpoints.
            for item in csw.records.values():
                for d in item.references:
                    result = dict(model=model_name,
                                  scheme=d['scheme'],
                                  url=d['url'],
                                  server=url)
                    model_results.append(result)
        # Catch Exception rather than BaseException so Ctrl-C still works.
        except Exception as e:
            print("- FAILED: {} - {}".format(url, e))
- FAILED: http://cwic.csiss.gmu.edu/cwicv1/discovery - ('Connection aborted.', error(111, 'Connection refused'))
- FAILED: https://edg.epa.gov/metadata/csw - ('Connection aborted.', error(104, 'Connection reset by peer'))
- FAILED: http://cida.usgs.gov/gdp/geonetwork/srv/en/csw - Space required after the Public Identifier, line 1, column 50
- FAILED: http://geodiscover.cgdi.ca/wes/serviceManagerCSW/csw - 404 Client Error: Not Found
- FAILED: http://geoport.whoi.edu/gi-cat/services/cswiso - Space required after the Public Identifier, line 1, column 50
- FAILED: http://cwic.csiss.gmu.edu/cwicv1/discovery - ('Connection aborted.', error(111, 'Connection refused'))
- FAILED: https://edg.epa.gov/metadata/csw - ('Connection aborted.', error(104, 'Connection reset by peer'))
- FAILED: http://cida.usgs.gov/gdp/geonetwork/srv/en/csw - Space required after the Public Identifier, line 1, column 50
- FAILED: http://geodiscover.cgdi.ca/wes/serviceManagerCSW/csw - 404 Client Error: Not Found
- FAILED: http://geoport.whoi.edu/gi-cat/services/cswiso - Space required after the Public Identifier, line 1, column 50
- FAILED: http://cwic.csiss.gmu.edu/cwicv1/discovery - ('Connection aborted.', error(111, 'Connection refused'))
- FAILED: https://edg.epa.gov/metadata/csw - ('Connection aborted.', error(104, 'Connection reset by peer'))
- FAILED: http://cida.usgs.gov/gdp/geonetwork/srv/en/csw - Space required after the Public Identifier, line 1, column 50
- FAILED: http://geodiscover.cgdi.ca/wes/serviceManagerCSW/csw - 404 Client Error: Not Found
- FAILED: http://geoport.whoi.edu/gi-cat/services/cswiso - Space required after the Public Identifier, line 1, column 50
- FAILED: http://cwic.csiss.gmu.edu/cwicv1/discovery - ('Connection aborted.', error(111, 'Connection refused'))
- FAILED: https://edg.epa.gov/metadata/csw - ('Connection aborted.', error(104, 'Connection reset by peer'))
- FAILED: http://cida.usgs.gov/gdp/geonetwork/srv/en/csw - Space required after the Public Identifier, line 1, column 50
- FAILED: http://geodiscover.cgdi.ca/wes/serviceManagerCSW/csw - 404 Client Error: Not Found
- FAILED: http://geoport.whoi.edu/gi-cat/services/cswiso - Space required after the Public Identifier, line 1, column 50
- FAILED: http://data.nodc.noaa.gov/geoportal/csw - ('Connection aborted.', gaierror(-2, 'Name or service not known'))
- FAILED: http://cwic.csiss.gmu.edu/cwicv1/discovery - ('Connection aborted.', error(111, 'Connection refused'))
- FAILED: https://edg.epa.gov/metadata/csw - ('Connection aborted.', error(104, 'Connection reset by peer'))
- FAILED: http://cida.usgs.gov/gdp/geonetwork/srv/en/csw - Space required after the Public Identifier, line 1, column 50
- FAILED: http://geodiscover.cgdi.ca/wes/serviceManagerCSW/csw - 404 Client Error: Not Found
- FAILED: http://geoport.whoi.edu/gi-cat/services/cswiso - Space required after the Public Identifier, line 1, column 50
- FAILED: http://cwic.csiss.gmu.edu/cwicv1/discovery - ('Connection aborted.', error(111, 'Connection refused'))
- FAILED: https://edg.epa.gov/metadata/csw - ('Connection aborted.', error(104, 'Connection reset by peer'))
- FAILED: http://cida.usgs.gov/gdp/geonetwork/srv/en/csw - Space required after the Public Identifier, line 1, column 50
- FAILED: http://geodiscover.cgdi.ca/wes/serviceManagerCSW/csw - 404 Client Error: Not Found
- FAILED: http://geoport.whoi.edu/gi-cat/services/cswiso - Space required after the Public Identifier, line 1, column 50
- FAILED: http://cwic.csiss.gmu.edu/cwicv1/discovery - ('Connection aborted.', error(111, 'Connection refused'))
- FAILED: https://edg.epa.gov/metadata/csw - ('Connection aborted.', error(104, 'Connection reset by peer'))
- FAILED: http://cida.usgs.gov/gdp/geonetwork/srv/en/csw - Space required after the Public Identifier, line 1, column 50
- FAILED: http://geodiscover.cgdi.ca/wes/serviceManagerCSW/csw - 404 Client Error: Not Found
- FAILED: http://geoport.whoi.edu/gi-cat/services/cswiso - Space required after the Public Identifier, line 1, column 50
- FAILED: http://cwic.csiss.gmu.edu/cwicv1/discovery - ('Connection aborted.', error(111, 'Connection refused'))
- FAILED: https://edg.epa.gov/metadata/csw - ('Connection aborted.', error(104, 'Connection reset by peer'))
- FAILED: http://cida.usgs.gov/gdp/geonetwork/srv/en/csw - Space required after the Public Identifier, line 1, column 50
- FAILED: http://geodiscover.cgdi.ca/wes/serviceManagerCSW/csw - 404 Client Error: Not Found
- FAILED: http://geoport.whoi.edu/gi-cat/services/cswiso - Space required after the Public Identifier, line 1, column 50
- FAILED: http://cwic.csiss.gmu.edu/cwicv1/discovery - ('Connection aborted.', error(111, 'Connection refused'))
- FAILED: https://edg.epa.gov/metadata/csw - ('Connection aborted.', error(104, 'Connection reset by peer'))
- FAILED: http://cida.usgs.gov/gdp/geonetwork/srv/en/csw - Space required after the Public Identifier, line 1, column 50
- FAILED: http://geodiscover.cgdi.ca/wes/serviceManagerCSW/csw - 404 Client Error: Not Found
- FAILED: http://geoport.whoi.edu/gi-cat/services/cswiso - Space required after the Public Identifier, line 1, column 50

Note that some servers limit the number of records that can be retrieved in a single request, so our query fails on them. (See https://github.com/ioos/system-test/issues/126.)
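One possible way around a per-request record limit is to ask for smaller pages and follow the server's `nextrecord` pointer. The sketch below is not what this notebook does. It assumes an object that behaves like owslib's `CatalogueServiceWeb`: `getrecords2` accepts `startposition` and `maxrecords`, replaces `csw.records` on every call, and sets `csw.results['nextrecord']` to 0 when nothing is left. `fetch_all_records` and `FakeCSW` are hypothetical names, and the fake exists only so the example runs without a network.

```python
def fetch_all_records(csw, constraints, page_size=100, max_pages=50):
    """Collect records page by page from a CSW-like object."""
    collected = {}
    start = 0  # Mirrors owslib's default `startposition`.
    for _ in range(max_pages):
        csw.getrecords2(constraints=constraints, startposition=start,
                        maxrecords=page_size, esn='full')
        collected.update(csw.records)
        next_record = csw.results.get('nextrecord', 0)
        matches = csw.results.get('matches', 0)
        # Stop when the page is empty or the server reports no next page.
        if not csw.records or next_record == 0 or next_record > matches:
            break
        start = next_record
    return collected


class FakeCSW:
    """Stand-in catalog holding `total` records, used only for this demo."""

    def __init__(self, total=5):
        self.total = total
        self.records = {}
        self.results = {}

    def getrecords2(self, constraints, startposition, maxrecords, esn):
        first = max(startposition, 1)
        last = min(first + maxrecords - 1, self.total)
        self.records = {'id-%d' % i: object()
                        for i in range(first, last + 1)}
        self.results = {'matches': self.total,
                        'returned': len(self.records),
                        'nextrecord': last + 1 if last < self.total else 0}


all_records = fetch_all_records(FakeCSW(total=5), constraints=[], page_size=2)
print(len(all_records))
```

The `max_pages` cap is a safety net in case a server never reports `nextrecord` as 0.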

Let's get the data as a pandas.DataFrame.

In [7]:
from pandas import DataFrame

df = DataFrame(model_results)
df = df.drop_duplicates()

And now that we have the results, what do they mean?

First let's plot the total number of services available.

In [8]:
total_services = DataFrame(df.groupby("scheme").size(),
                           columns=(["Number of services"]))

# DataFrame.sort was removed from pandas; sort_values is its replacement.
ax = total_services.sort_values('Number of services',
                                ascending=False).plot(kind="barh", figsize=(10, 8))
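When a plot is not convenient, the same tally can be sketched with the standard library alone. The entries below are made up and only mimic the shape of the dictionaries appended to `model_results`. The set-based de-duplication plays the role of `drop_duplicates`.

```python
from collections import Counter

# Made-up entries shaped like the items in `model_results`.
sample_results = [
    {'model': 'roms', 'scheme': 'OPeNDAP:OPeNDAP',
     'url': 'http://example.com/dodsC/a', 'server': 'http://example.com/csw'},
    {'model': 'roms', 'scheme': 'OGC:WMS',
     'url': 'http://example.com/wms/a', 'server': 'http://example.com/csw'},
    {'model': 'hycom', 'scheme': 'OPeNDAP:OPeNDAP',
     'url': 'http://example.com/dodsC/b', 'server': 'http://example.com/csw'},
    # Exact duplicate of the first entry; it should be counted once.
    {'model': 'roms', 'scheme': 'OPeNDAP:OPeNDAP',
     'url': 'http://example.com/dodsC/a', 'server': 'http://example.com/csw'},
]

# Dictionaries are not hashable, so compare them as sorted item tuples.
unique_results = {tuple(sorted(r.items())) for r in sample_results}

scheme_counts = Counter(dict(r)['scheme'] for r in unique_results)
for scheme, count in scheme_counts.most_common():
    print('{:<20} {}'.format(scheme, count))
```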

Note that some identical service types are being identified by different URNs!

There should be a consistent way of representing each service, or a mapping needs to be made available.

We can try to work around the same service being identified differently by relying on the "Scheme" metadata field and reducing each URN to its last component.

In [9]:
def normalize_service_urn(urn):
    urns = urn.split(':')
    if urns[-1].lower() == "url":
        del urns[-1]
    return urns[-1].lower()


urns = df.copy(deep=True)
urns["urn"] = urns["scheme"].map(normalize_service_urn)
In [10]:
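# Group by the normalized "urn" column; grouping by "scheme" again would
# just reproduce the previous plot.
urns_summary = DataFrame(urns.groupby("urn").size(),
                         columns=(["Number of services"]))

ax = urns_summary.sort_values('Number of services',
                              ascending=False).plot(kind="barh", figsize=(10, 6))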

A little better, but still not ideal.
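Part of the reason is that taking the last URN component still leaves synonyms such as `odp` and `opendap` apart. A minimal sketch of the mapping idea mentioned earlier follows. The `SERVICE_ALIASES` table, the `canonical_service` helper, and the sample scheme strings are all assumptions for illustration, not an agreed vocabulary.

```python
from collections import Counter


def normalize_service_urn(urn):
    # Same logic as the notebook's helper: drop a trailing "url" part and
    # keep the last remaining component, lowercased.
    parts = urn.split(':')
    if parts[-1].lower() == 'url':
        del parts[-1]
    return parts[-1].lower()


# Hypothetical alias table mapping short names to one canonical label.
SERVICE_ALIASES = {
    'odp': 'opendap',
    'dods': 'opendap',
}


def canonical_service(urn):
    """Normalize a scheme string, then fold known aliases together."""
    short = normalize_service_urn(urn)
    return SERVICE_ALIASES.get(short, short)


# Illustrative scheme strings, not taken from the query results.
schemes = [
    'urn:x-esri:specification:ServiceType:odp:url',
    'OPeNDAP:OPeNDAP',
    'urn:x-esri:specification:ServiceType:wms:url',
    'OGC:WMS',
    'WWW:LINK',
]

counts = Counter(canonical_service(s) for s in schemes)
print(counts)
```

Unknown names fall through unchanged, so the table can grow as new variants show up in the catalogs.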

Let's move forward and plot the number of services available for the list of model strings we requested.

Models per CSW server:

In [11]:
records_per_csw = DataFrame(urns.groupby(["model", "server"]).size(),
                            columns=(["Number of services"]))

model_csw_plotter = records_per_csw.unstack("model")

ax = model_csw_plotter['Number of services'].plot(kind='barh', figsize=(10, 8))

Services available per CSW server:

In [12]:
records_per_csw = DataFrame(urns.groupby(["scheme", "server"]).size(),
                            columns=(["Number of services"]))

model_csw_plotter = records_per_csw.unstack("server")
ax = model_csw_plotter.plot(kind='barh', subplots=True,
                            figsize=(12, 30), sharey=True)

Querying several catalogs as we did in this notebook is very slow. This approach is best used only to determine which catalog to use, once we know what type of data and service we need.
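Some of that time goes to servers that fail the same way for every model string. One way to trim it, sketched below under assumptions, is to remember which servers have already failed and skip them for the remaining queries. `query_servers` and `fake_run_query` are hypothetical names. In the notebook, the `run_query` role would be played by the `CatalogueServiceWeb` request.

```python
def query_servers(servers, queries, run_query):
    """Run every query against every server, skipping servers that failed.

    `run_query(server, query)` is supplied by the caller; it returns a list
    of results or raises an exception on failure.
    """
    failed = {}   # server -> first error message seen
    results = []
    for query in queries:
        for server in servers:
            if server in failed:
                continue  # Do not wait again on a server that already failed.
            try:
                results.extend(run_query(server, query))
            except Exception as e:
                failed[server] = str(e)
    return results, failed


calls = []


def fake_run_query(server, query):
    """Demo stand-in: one server always refuses the connection."""
    calls.append((server, query))
    if server == 'http://down.example.com/csw':
        raise IOError('Connection refused')
    return ['{}@{}'.format(query, server)]


results, failed = query_servers(
    ['http://up.example.com/csw', 'http://down.example.com/csw'],
    ['roms', 'hycom', 'wrf'],
    fake_run_query)
print(len(results), failed)
```

The trade-off is that a server with a transient error is also skipped for the rest of the run, so this suits exploratory surveys better than exhaustive harvesting.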

You can see the original IOOS System Test notebook here.


This post was written as an IPython notebook. It is available for download. You can also try an interactive version on binder.
