A common task is to find out what information is available for further research later on.
We can programmatically build a list of strings to query common data catalogs and find out what services are available. This post will show how to perform a query for numerical model strings and try to answer the question: how many services are available in each catalog?
To answer that question we will start by building a list of known catalog services.
known_csw_servers = ['http://data.nodc.noaa.gov/geoportal/csw',
                     'http://cwic.csiss.gmu.edu/cwicv1/discovery',
                     'http://geoport.whoi.edu/geoportal/csw',
                     'https://edg.epa.gov/metadata/csw',
                     'http://www.ngdc.noaa.gov/geoportal/csw',
                     'http://cmgds.marine.usgs.gov/geonetwork/srv/en/csw',
                     'http://www.nodc.noaa.gov/geoportal/csw',
                     'http://cida.usgs.gov/gdp/geonetwork/srv/en/csw',
                     'http://geodiscover.cgdi.ca/wes/serviceManagerCSW/csw',
                     'http://geoport.whoi.edu/gi-cat/services/cswiso',
                     'https://data.noaa.gov/csw']
And a list of known model strings to query.
known_model_strings = ['roms', 'selfe', 'adcirc', 'ncom',
                       'hycom', 'fvcom', 'pom', 'wrams', 'wrf']
from owslib import fes

# Build one filter per model string: match it anywhere in Title OR Subject.
model_name_filters = []
for model in known_model_strings:
    kw = dict(literal='*%s*' % model, wildCard='*')
    title_filter = fes.PropertyIsLike(propertyname='apiso:Title', **kw)
    subject_filter = fes.PropertyIsLike(propertyname='apiso:Subject', **kw)
    model_name_filters.append(fes.Or([title_filter, subject_filter]))
The FES filter we built above is simpler than what we did before. We are only looking for matches in Title or Subject that contain the model strings.
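To make the intent of the filter concrete, the sketch below mimics it locally with the standard library. It is only an approximation: `matches_model` and the sample records are hypothetical, the real matching happens on the CSW server, and case-insensitive comparison is an assumption here since that behavior depends on the server.

```python
from fnmatch import fnmatchcase


def matches_model(record, model):
    """Return True if the title or any subject contains the model string.

    Local approximation of an OR of two wildcard matches ('*<model>*')
    on Title and Subject. Case-insensitive matching is an assumption.
    """
    pattern = '*%s*' % model.lower()
    fields = [record.get('title', '')] + list(record.get('subjects', []))
    return any(fnmatchcase(field.lower(), pattern) for field in fields)


# Hypothetical records, for illustration only.
records = [
    {'title': 'ROMS ESPRESSO forecast', 'subjects': ['ocean model']},
    {'title': 'Coastal winds', 'subjects': ['WRF', 'atmosphere']},
    {'title': 'Tide gauge observations', 'subjects': ['water level']},
]

for model in ['roms', 'wrf']:
    hits = [r['title'] for r in records if matches_model(r, model)]
    print(model, hits)
```

Because the pattern is a plain substring match, short strings such as 'pom' can also match unrelated words that happen to contain those letters, so the counts should be read as an upper bound.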
from owslib.csw import CatalogueServiceWeb

model_results = []
for model_name, single_model_filter in zip(known_model_strings, model_name_filters):
    for url in known_csw_servers:
        try:
            csw = CatalogueServiceWeb(url, timeout=20)
            csw.getrecords2(constraints=[single_model_filter],
                            maxrecords=1000, esn='full')
            # Keep one row per service reference found in each record.
            for record, item in csw.records.items():
                for d in item.references:
                    result = dict(model=model_name,
                                  scheme=d['scheme'],
                                  url=d['url'],
                                  server=url)
                    model_results.append(result)
        # Catch Exception (not BaseException) so Ctrl-C can still stop the loop.
        except Exception as e:
            print("- FAILED: {} - {}".format(url, e))
Note that some servers limit the number of records you can retrieve in a single request, so our query for up to 1000 records fails on them. (See https://github.com/ioos/system-test/issues/126.)
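One way around such limits is to request the records in smaller pages. The sketch below shows the paging loop on its own, without any network access: `fetch_all`, `fetch_page`, and `fake_fetch_page` are hypothetical names, and the convention that a next position of 0 means "no more records" follows the CSW `nextRecord` attribute.

```python
def fetch_all(fetch_page, page_size=100, max_total=1000):
    """Collect records page by page.

    fetch_page(start, size) is a hypothetical callable returning a tuple
    (records, next_start); next_start is 0 or None when nothing remains.
    """
    collected = []
    start = 1  # CSW record positions are 1-based
    while start and len(collected) < max_total:
        size = min(page_size, max_total - len(collected))
        records, next_start = fetch_page(start, size)
        if not records:
            break
        collected.extend(records)
        start = next_start
    return collected


def fake_fetch_page(start, size, total=250):
    """Stand-in for a catalog holding `total` records numbered from 1."""
    stop = min(start + size - 1, total)
    records = list(range(start, stop + 1))
    next_start = stop + 1 if stop < total else 0
    return records, next_start


print(len(fetch_all(fake_fetch_page, page_size=100)))
```

With OWSLib, `fetch_page` could plausibly wrap `csw.getrecords2(..., startposition=start, maxrecords=size)` and read the next position from `csw.results['nextrecord']`; treat that wiring as an assumption to check against the OWSLib version you have installed.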
Let's get the data as a pandas.DataFrame.
from pandas import DataFrame
df = DataFrame(model_results)
df = df.drop_duplicates()
And now that we have the results, what do they mean?
First let's plot the total number of services available.
total_services = DataFrame(df.groupby("scheme").size(),
                           columns=["Number of services"])

# Sort so the most common schemes come first.
ax = total_services.sort_values('Number of services',
                                ascending=False).plot(kind="barh", figsize=(10, 8))
Note that some identical service types are being identified by different URNs!
There should be a consistent way of representing each service, or a mapping needs to be made available.
We can try to get around the issue of the same services being identified differently by normalizing the "Scheme" metadata field: keep only the last component of each URN, ignoring a trailing "url", and lowercase it.
def normalize_service_urn(urn):
    """Return the last URN component, lowercased, ignoring a trailing 'url'."""
    urns = urn.split(':')
    if urns[-1].lower() == "url":
        del urns[-1]
    return urns[-1].lower()


urns = df.copy(deep=True)
urns["urn"] = urns["scheme"].map(normalize_service_urn)

# Group by the normalized "urn" column so equivalent schemes are counted together.
urns_summary = DataFrame(urns.groupby("urn").size(),
                         columns=["Number of services"])

ax = urns_summary.sort_values('Number of services',
                              ascending=False).plot(kind="barh", figsize=(10, 6))
A little better, but still not ideal.
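The mapping mentioned earlier could be layered on top of the normalization. The sketch below is one possible shape for it: the alias table and the sample scheme strings are illustrative guesses, not an official vocabulary, and `canonical_service` is a hypothetical helper.

```python
# Hypothetical alias table: keys and canonical names are illustrative only.
SERVICE_ALIASES = {
    'odp': 'opendap',
    'dap': 'opendap',
}


def normalize_service_urn(urn):
    """Mirror of the function above, with a guard for a bare 'url' string."""
    parts = urn.split(':')
    if len(parts) > 1 and parts[-1].lower() == 'url':
        parts = parts[:-1]
    return parts[-1].lower()


def canonical_service(urn):
    """Normalize the URN, then collapse known aliases to a single name."""
    key = normalize_service_urn(urn)
    return SERVICE_ALIASES.get(key, key)


# Sample scheme strings, for illustration only.
samples = [
    'urn:x-esri:specification:ServiceType:odp:url',
    'OPeNDAP:OPeNDAP',
    'urn:x-esri:specification:ServiceType:wms:url',
]

for sample in samples:
    print(sample, '->', canonical_service(sample))
```

Names that are not in the alias table pass through unchanged, so the table only needs entries for the spellings that actually disagree.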
Let's move forward and plot the number of services available for the list of model strings we requested.
Models per CSW server:
records_per_csw = DataFrame(urns.groupby(["model", "server"]).size(),
                            columns=["Number of services"])

model_csw_plotter = records_per_csw.unstack("model")

ax = model_csw_plotter['Number of services'].plot(kind='barh', figsize=(10, 8))
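For readers less familiar with pandas, the table being plotted can be reproduced in plain Python. This is only a sketch of the idea: the rows and the `example` server URLs are hypothetical stand-ins for the entries of `model_results`.

```python
from collections import Counter, defaultdict

# Hypothetical rows shaped like the entries of model_results.
rows = [
    {'model': 'roms', 'server': 'http://catalog-a.example/csw'},
    {'model': 'roms', 'server': 'http://catalog-a.example/csw'},
    {'model': 'wrf', 'server': 'http://catalog-a.example/csw'},
    {'model': 'roms', 'server': 'http://catalog-b.example/csw'},
]

# Counterpart of groupby(["model", "server"]).size(): count each pair.
counts = Counter((row['model'], row['server']) for row in rows)

# Counterpart of unstack("model"): one row per server, one column per model.
table = defaultdict(dict)
for (model, server), n in counts.items():
    table[server][model] = n

for server, per_model in sorted(table.items()):
    print(server, per_model)
```

Unlike the pandas version, which fills missing model and server combinations with NaN, this sketch simply omits them.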
Services available per CSW server:
records_per_csw = DataFrame(urns.groupby(["scheme", "server"]).size(),
                            columns=["Number of services"])

model_csw_plotter = records_per_csw.unstack("server")

ax = model_csw_plotter.plot(kind='barh', subplots=True,
                            figsize=(12, 30), sharey=True)
Querying several catalogs like we did in this notebook is very slow, because every model string is sent to every server in turn. This approach should be used only to help determine which catalog to use once we know what type of data and service we need.
You can see the original IOOS System Test notebook here.