Programmatically query the IOOS Data Catalog for a specific observation type#
Created: 2024-09-17
Updated: 2024-09-19
Author: Mathew Biddle
In this notebook we highlight the ability to search the IOOS Data Catalog for a specific subset of observations using the CKAN web accessible Application Programming Interface (API).
For this example, we want to look for observations of oxygen in the water column across the IOOS Catalog. As part of the IOOS Metadata Profile, which the US IOOS community uses to publish datasets, we know that each Regional Association and DAC will be following the Climate and Forecast (CF) Conventions and using CF standard_names
to describe their datasets. So, with that assumption, we can search across the IOOS Data catalog for datasets with the CF standard names that contain oxygen
and sea_water
. Then, we can build a simple map to show the geographical distribution of those datasets.
Build CKAN API query base.#
Uses ckan/ckanapi
from ckanapi import RemoteCKAN
ua = "ckanapiioos/1.0 (+https://ioos.us/)"
# ioos_catalog = RemoteCKAN('https://data.ioos.us', user_agent=ua, get_only=True)
ioos_catalog = RemoteCKAN("https://data.ioos.us", user_agent=ua)
ioos_catalog
<ckanapi.remoteckan.RemoteCKAN at 0x704d55f8c500>
What organizations are in the catalog?#
Tell me what organizations are there.
orgs = ioos_catalog.action.organization_list()
print(orgs)
['aoos', 'caricoos', 'cdip', 'cencoos', 'comt', 'gcoos', 'glider-dac', 'glos', 'hf-radar-dac', 'ioos', 'maracoos', 'nanoos', 'neracoos', 'noaa-co-ops', 'noaa-ndbc', 'oceansites', 'pacioos', 'sccoos', 'secoora', 'unidata', 'usgs', 'us-navy']
How many datasets are we searching across?#
Grab all the datasets available and return the count.
# datasets = ioos_catalog.action.package_search(fq='+cf_standard_names:mass_concentration_of_oxygen_in_sea_water', rows=50)
datasets = ioos_catalog.action.package_search()
datasets["count"]
44311
Grab the most recent applicable CF standard names#
Collect CF standard names that contain oxygen
and sea_water
from the CF standard name list.
import pandas as pd
url = "https://cfconventions.org/Data/cf-standard-names/current/src/cf-standard-name-table.xml"
tbl_version = pd.read_xml(url, xpath="./*")["version_number"][0].astype(int)
df = pd.read_xml(url, xpath="entry")
std_names = df.loc[
(df["id"].str.contains("oxygen") & df["id"].str.contains("sea_water"))
]
print(f"CF Standard Name Table: {tbl_version}")
std_names[["id", "description"]]
CF Standard Name Table: 86
id | description | |
---|---|---|
469 | depth_at_shallowest_local_minimum_in_vertical_... | Depth is the vertical distance below the surfa... |
624 | fractional_saturation_of_oxygen_in_sea_water | Fractional saturation is the ratio of some mea... |
1357 | mass_concentration_of_oxygen_in_sea_water | Mass concentration means mass per unit volume ... |
1725 | mole_concentration_of_dissolved_molecular_oxyg... | Mole concentration means number of moles per u... |
1726 | mole_concentration_of_dissolved_molecular_oxyg... | "Mole concentration at saturation" means the m... |
1727 | mole_concentration_of_dissolved_molecular_oxyg... | Mole concentration means number of moles per u... |
1825 | mole_concentration_of_preformed_dissolved_mole... | "Mole concentration" means the number of moles... |
1996 | moles_of_oxygen_per_unit_mass_in_sea_water | moles_of_X_per_unit_mass_inY is also called "m... |
3203 | surface_molecular_oxygen_partial_pressure_diff... | The surface called "surface" means the lower b... |
3700 | temperature_of_sensor_for_oxygen_in_sea_water | Temperature_of_sensor_for_oxygen_in_sea_water ... |
4776 | volume_fraction_of_oxygen_in_sea_water | "Volume fraction" is used in the construction ... |
4780 | volume_mixing_ratio_of_oxygen_at_stp_in_sea_water | "ratio_of_X_to_Y" means X/Y. "stp" means stand... |
Search across IOOS Data Catalog using CKAN API#
Search the IOOS Data Catalog for CF standard names that match those above.
import time
from ckanapi import RemoteCKAN
from ckanapi.errors import CKANAPIError
ua = "ckanapiioos/1.0 (+https://ioos.us/)"
ioos_catalog = RemoteCKAN("https://data.ioos.us", user_agent=ua)
ioos_catalog
df_out = pd.DataFrame()
for std_name in std_names["id"]:
print(std_name)
fq = f"+cf_standard_names:{std_name}"
result_count = 0
while True:
try:
datasets = ioos_catalog.action.package_search(fq=fq, rows=500)
except CKANAPIError:
continue
num_results = datasets["count"]
print(f"num_results: {num_results}, result_count: {result_count}")
for dataset in datasets["results"]:
# print(dataset['title'])
df = pd.DataFrame(
{
"title": [dataset["title"]],
"url": [dataset["resources"][0]["url"]],
"org": [dataset["organization"]["title"]],
"std_name": std_name,
}
)
df_out = pd.concat([df_out, df])
result_count = result_count + 1
time.sleep(1)
if result_count >= num_results:
print(f"num_results: {num_results}, result_count: {result_count}")
break
df_out
depth_at_shallowest_local_minimum_in_vertical_profile_of_mole_concentration_of_dissolved_molecular_oxygen_in_sea_water
num_results: 0, result_count: 0
num_results: 0, result_count: 0
fractional_saturation_of_oxygen_in_sea_water
num_results: 987, result_count: 0
num_results: 987, result_count: 500
num_results: 987, result_count: 1000
mass_concentration_of_oxygen_in_sea_water
num_results: 2735, result_count: 0
num_results: 2735, result_count: 500
num_results: 2735, result_count: 1000
num_results: 2735, result_count: 1500
num_results: 2735, result_count: 2000
num_results: 2735, result_count: 2500
num_results: 2735, result_count: 3000
mole_concentration_of_dissolved_molecular_oxygen_in_sea_water
num_results: 300, result_count: 0
num_results: 300, result_count: 300
mole_concentration_of_dissolved_molecular_oxygen_in_sea_water_at_saturation
num_results: 0, result_count: 0
num_results: 0, result_count: 0
mole_concentration_of_dissolved_molecular_oxygen_in_sea_water_at_shallowest_local_minimum_in_vertical_profile
num_results: 0, result_count: 0
num_results: 0, result_count: 0
mole_concentration_of_preformed_dissolved_molecular_oxygen_in_sea_water
num_results: 0, result_count: 0
num_results: 0, result_count: 0
moles_of_oxygen_per_unit_mass_in_sea_water
num_results: 813, result_count: 0
num_results: 813, result_count: 500
num_results: 813, result_count: 1000
surface_molecular_oxygen_partial_pressure_difference_between_sea_water_and_air
num_results: 0, result_count: 0
num_results: 0, result_count: 0
temperature_of_sensor_for_oxygen_in_sea_water
num_results: 167, result_count: 0
num_results: 167, result_count: 167
volume_fraction_of_oxygen_in_sea_water
num_results: 18, result_count: 0
num_results: 18, result_count: 18
volume_mixing_ratio_of_oxygen_at_stp_in_sea_water
num_results: 0, result_count: 0
num_results: 0, result_count: 0
title | url | org | std_name | |
---|---|---|---|---|
0 | St. Lucie Estuary - South Fork 2 (SLE-SF2) | https://erddap.secoora.org/erddap/tabledap/st-... | SECOORA | fractional_saturation_of_oxygen_in_sea_water |
0 | Neuse River at Marker 15 (ModMon 70, AWS J8903... | https://erddap.secoora.org/erddap/tabledap/neu... | SECOORA | fractional_saturation_of_oxygen_in_sea_water |
0 | Pamlico Sound at PS9 (ModMon) | https://erddap.secoora.org/erddap/tabledap/pam... | SECOORA | fractional_saturation_of_oxygen_in_sea_water |
0 | Indian River Lagoon - Jensen Beach (IRL-JB) | https://erddap.secoora.org/erddap/tabledap/ind... | SECOORA | fractional_saturation_of_oxygen_in_sea_water |
0 | Indian River Lagoon - Sebastian (IRL-SB) | https://erddap.secoora.org/erddap/tabledap/ind... | SECOORA | fractional_saturation_of_oxygen_in_sea_water |
... | ... | ... | ... | ... |
0 | Walton-Smith CTD, WS22215, WS22215_2022_08_Wea... | https://gcoos5.geos.tamu.edu/erddap/tabledap/W... | GCOOS | volume_fraction_of_oxygen_in_sea_water |
0 | Walton-Smith CTD, WS22215, WS22215_2022_08_Wea... | https://gcoos5.geos.tamu.edu/erddap/tabledap/W... | GCOOS | volume_fraction_of_oxygen_in_sea_water |
0 | Walton-Smith CTD, WS22215, WS22215_2022_08_Wea... | https://gcoos5.geos.tamu.edu/erddap/tabledap/W... | GCOOS | volume_fraction_of_oxygen_in_sea_water |
0 | Walton-Smith CTD, WS22215, WS22215_2022_08_Wea... | https://gcoos5.geos.tamu.edu/erddap/tabledap/W... | GCOOS | volume_fraction_of_oxygen_in_sea_water |
0 | Walton-Smith CTD, WS22215, WS22215_2022_08_Wea... | https://gcoos5.geos.tamu.edu/erddap/tabledap/W... | GCOOS | volume_fraction_of_oxygen_in_sea_water |
5485 rows × 4 columns
Do some summarizing of the responses#
The DataFrame of the matching datasets is quite large. I wonder what the distribution of those datasets across organizations looks like? Let’s use pandas.groupby() to generate some statistics about how many datasets are provided, matching our criteria, by which organization.
df_out.groupby(by="org").count()
title | url | std_name | |
---|---|---|---|
org | |||
CeNCOOS | 2 | 2 | 2 |
GCOOS | 2208 | 2208 | 2208 |
Glider DAC | 2625 | 2625 | 2625 |
NERACOOS | 28 | 28 | 28 |
PacIOOS | 16 | 16 | 16 |
SECOORA | 606 | 606 | 606 |
Drop the Glider DAC data#
Glider DAC data are already making it to NCEI, so we can drop those entries.
df_out_no_glider = df_out.loc[~df_out["org"].str.contains("Glider DAC")]
df_out_no_glider.groupby(by="org").count()
title | url | std_name | |
---|---|---|---|
org | |||
CeNCOOS | 2 | 2 | 2 |
GCOOS | 2208 | 2208 | 2208 |
NERACOOS | 28 | 28 | 28 |
PacIOOS | 16 | 16 | 16 |
SECOORA | 606 | 606 | 606 |
Digging into some of the nuances#
There are still quite a lot of datasets from each organization. As our search above looked for each CF standard_name across all the datasets, there might be duplicate datasets which have multiple matching CF standard names. ie. one dataset might have both mass_concentration_of_oxygen_in_sea_water
and fractional_saturation_of_oxygen_in_sea_water
, but we only need to know that it’s one dataset.
As we only need to know about the unique datasets, let’s count how many unique dataset urls we have.
df_out_no_glider.groupby(by="url").count()
title | org | std_name | |
---|---|---|---|
url | |||
http://www.neracoos.org/erddap/tabledap/A01_optode_all | 3 | 3 | 3 |
http://www.neracoos.org/erddap/tabledap/GRBGBWQ_NERRS | 4 | 4 | 4 |
http://www.neracoos.org/erddap/tabledap/GRBLRWQ_NERRS | 4 | 4 | 4 |
http://www.neracoos.org/erddap/tabledap/LOBO_CSV_65 | 1 | 1 | 1 |
http://www.neracoos.org/erddap/tabledap/URI_168-MV_BottomSonde | 16 | 16 | 16 |
... | ... | ... | ... |
https://gcoos5.geos.tamu.edu/erddap/tabledap/deepwater_pe972250_ctd | 6 | 6 | 6 |
https://gcoos5.geos.tamu.edu/erddap/tabledap/deepwater_pe972274_ctd | 6 | 6 | 6 |
https://gcoos5.geos.tamu.edu/erddap/tabledap/deepwater_st093lay_ctd | 6 | 6 | 6 |
https://pae-paha.pacioos.hawaii.edu/erddap/tabledap/hui_water_quality | 8 | 8 | 8 |
https://pae-paha.pacioos.hawaii.edu/erddap/tabledap/nss_012 | 8 | 8 | 8 |
422 rows × 3 columns
Drop duplicate records#
As you can see above, there are a lot of duplicate dataset urls which we can simplify down. We identify duplicates by looking at the URL, which should be unique for each dataset, and drop the duplicates.
df_out_nodups_no_glider = df_out_no_glider.drop_duplicates(subset=["url"], keep="last")
df_out_nodups_no_glider
title | url | org | std_name | |
---|---|---|---|---|
0 | Neuse River near the south shore (ModMon 96) | https://erddap.secoora.org/erddap/tabledap/neu... | SECOORA | fractional_saturation_of_oxygen_in_sea_water |
0 | Great Bay,NH. Lamprey River WQ station | http://www.neracoos.org/erddap/tabledap/GRBLRW... | NERACOOS | fractional_saturation_of_oxygen_in_sea_water |
0 | Great Bay,NH. Great Bay WQ station | http://www.neracoos.org/erddap/tabledap/GRBGBW... | NERACOOS | fractional_saturation_of_oxygen_in_sea_water |
0 | WACCASASSA RIVER NR GULF HAMMOCK, FLA. (USGS 0... | https://erddap.secoora.org/erddap/tabledap/gov... | SECOORA | mass_concentration_of_oxygen_in_sea_water |
0 | LITTLE BACK RIVER AT HOG ISLAND, NEAR SAVANNAH... | https://erddap.secoora.org/erddap/tabledap/gov... | SECOORA | mass_concentration_of_oxygen_in_sea_water |
... | ... | ... | ... | ... |
0 | Walton-Smith CTD, WS22215, WS22215_2022_08_Wea... | https://gcoos5.geos.tamu.edu/erddap/tabledap/W... | GCOOS | volume_fraction_of_oxygen_in_sea_water |
0 | Walton-Smith CTD, WS22215, WS22215_2022_08_Wea... | https://gcoos5.geos.tamu.edu/erddap/tabledap/W... | GCOOS | volume_fraction_of_oxygen_in_sea_water |
0 | Walton-Smith CTD, WS22215, WS22215_2022_08_Wea... | https://gcoos5.geos.tamu.edu/erddap/tabledap/W... | GCOOS | volume_fraction_of_oxygen_in_sea_water |
0 | Walton-Smith CTD, WS22215, WS22215_2022_08_Wea... | https://gcoos5.geos.tamu.edu/erddap/tabledap/W... | GCOOS | volume_fraction_of_oxygen_in_sea_water |
0 | Walton-Smith CTD, WS22215, WS22215_2022_08_Wea... | https://gcoos5.geos.tamu.edu/erddap/tabledap/W... | GCOOS | volume_fraction_of_oxygen_in_sea_water |
422 rows × 4 columns
How many endpoints are not ERDDAP?#
Now we have a unique list of datasets which match our CF standard name criteria. Since we have some background in using ERDDAP to query for data, let’s take a look at what other endpoints each of the datasets are using.
Hint: We know ERDDAP systems typically have erddap
in their urls.
df_out_nodups_no_glider.loc[~df_out_nodups_no_glider["url"].str.contains("erddap")]
title | url | org | std_name |
---|
What’s the remaining distribution?#
This is the distribution of unique datasets found in the IOOS Data Catalog which have a CF Standard Name that contains the work oxygen
and sea_water
. We’ve dropped out the Glider DAC datasets as, theoretically, those are in NCEI already.
df_out_nodups_no_glider.groupby(by="org").count()
title | url | std_name | |
---|---|---|---|
org | |||
CeNCOOS | 2 | 2 | 2 |
GCOOS | 378 | 378 | 378 |
NERACOOS | 5 | 5 | 5 |
PacIOOS | 2 | 2 | 2 |
SECOORA | 35 | 35 | 35 |
Ingest data#
Let’s rip through all of the datasets, grab the data as a table (including units) and make a monster dictionary. This takes a bit.
dict_out_final = {}
for index, row in df_out_nodups_no_glider.iterrows():
dict_out_final["{}".format(row["title"])] = pd.read_csv(
"{}.csvp".format(row["url"]), low_memory=False
)
Let’s take a quick look at one of the DataFrames.
Transpose it when we print, so we can see all the columns.
dict_out_final['"Deepwater CTD - pe972218.ctd.nc - 29.25N, -87.89W - 1997-03-21"'].head(
5
).T
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
profile | NaN | NaN | NaN | NaN | NaN |
time (UTC) | 1968-01-20T03:14:07Z | 1968-01-20T03:14:07Z | 1968-01-20T03:14:07Z | 1968-01-20T03:14:07Z | 1968-01-20T03:14:07Z |
latitude (degrees_north) | 29.2472 | 29.2472 | 29.2472 | 29.2472 | 29.2472 |
longitude (degrees_east) | -87.888901 | -87.888901 | -87.888901 | -87.888901 | -87.888901 |
numberOfLevel | 351 | 351 | 351 | 351 | 351 |
depth (m) | 7.0 | 8.0 | 9.0 | 10.0 | 11.0 |
temperature (degree_C) | 20.823 | 20.832001 | 20.837 | 20.837999 | 20.854 |
salinity (PSU) | 35.321999 | 35.332001 | 35.335999 | 35.338001 | 35.345001 |
oxygen (milliliters per liter) | 4.59 | 4.62 | 4.63 | 4.63 | 4.75 |
nitrite (micromols) | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 |
nitrate (micromols) | 0.08 | 0.08 | 0.11 | 0.11 | 0.1 |
phosphate (micromols) | 0.02 | 0.01 | 0.0 | 0.01 | 0.01 |
silicate (micromols) | 1.33 | 1.18 | 1.13 | 1.05 | 1.41 |
salinity2 (PSU) | 36.103001 | 36.092999 | 36.094002 | 36.096001 | 36.285999 |
qualityFlag | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
instrument | NaN | NaN | NaN | NaN | NaN |
instrument1 | NaN | NaN | NaN | NaN | NaN |
instrument2 | NaN | NaN | NaN | NaN | NaN |
instrument3 | NaN | NaN | NaN | NaN | NaN |
instrument4 | NaN | NaN | NaN | NaN | NaN |
instrument5 | NaN | NaN | NaN | NaN | NaN |
Let’s make a nice map of the distribution of observations#
Below we create a mapping function to plot the unique dataset points on a map. Then, we use that function with our full response. We have to do a little reorganizing of the data to build one DataFrame for all the coordinates.
import cartopy.io.shapereader as shpreader
import geopandas as gpd
import matplotlib.pyplot as plt
def make_map(df):
# initialize an axis
fig, ax = plt.subplots(figsize=(8, 6)) # plot map on axis
shpfilename = shpreader.natural_earth(
resolution="110m",
category="cultural",
name="admin_0_countries",
)
countries = gpd.read_file(shpfilename)
countries[countries["NAME"] == "United States of America"].plot(
color="lightgrey", ax=ax
)
# plot points
df.plot(
x="longitude (degrees_east)",
y="latitude (degrees_north)",
kind="scatter",
ax=ax,
) # add grid
ax.grid(visible=True, alpha=0.5)
return ax
df_coords = pd.DataFrame(
columns=["latitude (degrees_north)", "longitude (degrees_east)"]
)
for key in dict_out_final.keys():
df_coords = pd.concat(
[
df_coords,
dict_out_final[key][
["latitude (degrees_north)", "longitude (degrees_east)"]
],
]
)
# drop all duplicates
df_coords_clean = df_coords.drop_duplicates(ignore_index=True)
# make the map
make_map(df_coords_clean)
/tmp/ipykernel_100331/3055943423.py:6: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
df_coords = pd.concat(
<Axes: xlabel='longitude (degrees_east)', ylabel='latitude (degrees_north)'>
Lets explore those points on an interactive map#
Just for fun, we can us geopandas.explore()
to plot these points on an interactive map to browse around.
gdf = gpd.GeoDataFrame(
df_coords_clean,
geometry=gpd.points_from_xy(
df_coords_clean["longitude (degrees_east)"],
df_coords_clean["latitude (degrees_north)"],
),
crs="EPSG:4326",
)
gdf.explore()
We hope this example demonstrates the flexibility of direct requests to the IOOS Data Catalog CKAN server and all the possibilities it provides. In this notebook we:
Search the IOOS Data Catalog CKAN API with keywords.
Found datasets matching our specified criteria.
Collected all the data from each of the datasets matching our criteria.
Created a simple map of the distribution of datasets which match our criteria.
To take this one step further, since we collected all the data from each of the datasets (in the dictionary dict_out_final
) a user could integrate all of the oxygen observations together and start to build a comprehensive dataset.
Additionally, a user could modify the CKAN query to search for terms outside of the CF standard names to potentially gather more datasets.