Using pyobis to query for known datasets#
Created: 2022-11-23
Updated: 2023-03-24
Author: Mathew Biddle
This notebook uses the pyobis Python package to query the OBIS API for datasets associated with projects funded under the United States Marine Biodiversity Observation Network. The notebook walks through the process for querying the OBIS api for a specific institution, then using the resultant datasets to gather the locations of all the occurrences using the pyobis package.
The US Marine Biodiversity Observation Network (US MBON) is an interagency initiative that seeks to coordinate across sectors and government to characterize biodiversity and understand drivers of change. US MBON represents a broad, collaborative effort to address the need for systematic collection and sharing of marine life information, ensure that information is available for decision-making and management from local to national levels, and document marine biodiversity status and trends in the face of human- and climate-induced change using a range of technologies and approaches. Through the National Oceanographic Partnership Program, NOAA, NASA, Office of Naval Research, and BOEM have invested in US MBON since 2014, most recently announcing new five year projects in 2022.
import pyobis
pyobis.__version__
'1.3.1'
First off, let’s look through the available institutions at OBIS and find the US MBON one.
Hint: We know the name of the institution is United States Marine Biodiversity Observation Network.
import pandas as pd
pd.set_option("max_colwidth", 400)
url = "https://api.obis.org/v3/institute"
institutes = pd.read_json(url)
df_inst = pd.DataFrame.from_records(institutes["results"])
df_inst.dropna(inplace=True)
df_inst["id"] = df_inst["id"].astype(int)
institution = df_inst.loc[
df_inst["name"].str.contains(
"United States Marine Biodiversity Observation", na=False
)
]
institution
id | name | country | records | |
---|---|---|---|---|
61 | 23070 | United States Marine Biodiversity Observation Network | United States of America | 1620917 |
Well that looks like the institution we’re after!
Using the id
we can check it out on the OBIS website: https://obis.org/institute/23070
Yes, that does look like what we want. Now let’s use that id
to query OBIS for all associated datasets.
institution_id = institution["id"].tolist()[0]
institution_id
23070
Collect metadata about all of the datasets#
Here we use the institution_id
collected from OBIS to query the API for metadata on the datasets associated with that id
.
First, we build the API call, then we bring it into a Pandas DataFrame and create additional fields for use later.
import pyobis
query = pyobis.dataset.search(instituteid=institution_id)
df = pd.DataFrame(query.execute())
df_meta = pd.DataFrame.from_records(df["results"])
# create a column for the human readable short name of the datasets
df_meta["short_name"] = df_meta["url"].str.split("=", expand=True)[1]
# create another column to group datasets by, which removes the year from the short names
df_meta["short_name_group"] = df_meta["short_name"].replace("\d", "", regex=True)
df_meta.head(5)
<>:13: SyntaxWarning: invalid escape sequence '\d'
<>:13: SyntaxWarning: invalid escape sequence '\d'
/tmp/ipykernel_72684/2509542522.py:13: SyntaxWarning: invalid escape sequence '\d'
df_meta["short_name_group"] = df_meta["short_name"].replace("\d", "", regex=True)
id | url | archive | published | created | updated | core | extensions | statistics | extent | ... | intellectualrights | feed | institutes | contacts | nodes | keywords | downloads | records | short_name | short_name_group | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | cfceb150-bbe2-4efb-8682-14cfc7167e7c | https://ipt-obis.gbif.us/resource?r=rcca_transects | https://ipt-obis.gbif.us/archive.do?r=rcca_transects | 2023-01-18T20:38:20.000Z | 2020-12-08T01:07:21.790Z | 2023-06-03T12:51:53.436Z | event | [measurementorfact, occurrence] | {'Event': 19375, 'absence': 487162, 'dropped': 0, 'Occurrence': 724871, 'DNADerivedData': 0, 'MeasurementOrFact': 255300} | POLYGON((-124.294724 32.694233000000004,-124.294724 42.045155,-117.264999 42.045155,-117.264999 32.694233000000004,-124.294724 32.694233000000004)) | ... | To the extent possible under law, the publisher has waived all rights to these data and has dedicated them to the Public Domain (CC0 1.0) | {'id': '8112d75b-7e78-4641-9e09-9a2cc7db7375', 'url': 'https://ipt-obis.gbif.us/rss.do'} | [{'name': 'U.S. Geological Survey HQ', 'oceanexpert_id': 12976, 'oceanexpert_parent_id': None}, {'name': 'United States Marine Biodiversity Observation Network', 'oceanexpert_id': 23070, 'oceanexpert_parent_id': None}, {'name': 'Central & Northern California Ocean Observing System', 'oceanexpert_id': 23204, 'oceanexpert_parent_id': None}, {'name': 'Reef Check Foundation', 'oceanexpert_id': 232... | [{'role': None, 'type': 'creator', 'givenname': 'Jan', 'surname': 'Friewald', 'organization': 'Reef Check Foundation', 'position': 'Executive Director', 'email': 'jfreiwald@reefcheck.org', 'url': None, 'organization_oceanexpert_id': 23205, 'type_display': 'Creator'}, {'role': None, 'type': 'contact', 'givenname': 'Jan', 'surname': 'Friewald', 'organization': 'Reef Check Foundation', 'position'... | [{'id': 'b7c47783-a020-4173-b390-7b57c4fa1426', 'name': 'OBIS USA'}] | [{'keyword': 'Samplingevent', 'thesaurus': 'GBIF Dataset Type Vocabulary: http://rs.gbif.org/vocabulary/gbif/dataset_type_2015-07-10.xml'}] | [{'year': 2024, 'downloads': 3849, 'records': 520892938}, {'year': 2023, 'downloads': 3648, 'records': 588494102}, {'year': 2022, 'downloads': 4191, 'records': 426899064}, {'year': 2021, 'downloads': 13776, 'records': 497036713}, {'year': 2020, 'downloads': 747, 'records': 26199396}] | 237709 | rcca_transects | rcca_transects |
1 | d50e0443-4d84-4bd9-a5ad-4d417a7607e2 | https://ipt-obis.gbif.us/resource?r=2009floridakeysrvc | https://ipt-obis.gbif.us/archive.do?r=2009floridakeysrvc | 2023-02-21T17:28:10.000Z | None | 2023-06-06T16:14:55.499Z | event | [measurementorfact, occurrence] | {'Event': 2542, 'absence': 353964, 'dropped': 0, 'Occurrence': 451581, 'DNADerivedData': 0, 'MeasurementOrFact': 905704} | POLYGON((-81.9975333 24.4313,-81.9975333 25.7526,-80.08755 25.7526,-80.08755 24.4313,-81.9975333 24.4313)) | ... | To the extent possible under law, the publisher has waived all rights to these data and has dedicated them to the Public Domain (CC0 1.0) | {'id': '8112d75b-7e78-4641-9e09-9a2cc7db7375', 'url': 'https://ipt-obis.gbif.us/rss.do'} | [{'name': 'NOAA, Southeast Fisheries Science Center, National Marine Fisheries Service Southeast Miami Laboratory', 'oceanexpert_id': 7534, 'oceanexpert_parent_id': 11379}, {'name': 'Rosenstiel School of Marine & Atmospheric Science, University of Miami', 'oceanexpert_id': 8124, 'oceanexpert_parent_id': None}, {'name': 'U.S. Geological Survey HQ', 'oceanexpert_id': 12976, 'oceanexpert_parent_i... | [{'role': None, 'type': 'creator', 'givenname': 'Jerald', 'surname': 'Ault', 'organization': 'University of Miami, Rosenstiel School of Marine & Atmospheric Science', 'position': 'Program Lead', 'email': 'jault@rsmas.miami.edu', 'url': None, 'organization_oceanexpert_id': 8124, 'type_display': 'Creator'}, {'role': None, 'type': 'creator', 'givenname': 'Jim', 'surname': 'Bohnsack', 'organiz... | [{'id': 'b7c47783-a020-4173-b390-7b57c4fa1426', 'name': 'OBIS USA'}] | [{'keyword': 'Samplingevent', 'thesaurus': 'GBIF Dataset Type Vocabulary: http://rs.gbif.org/vocabulary/gbif/dataset_type.xml'}, {'keyword': 'Samplingevent', 'thesaurus': 'GBIF Dataset Type Vocabulary: http://rs.gbif.org/vocabulary/gbif/dataset_type_2015-07-10.xml'}] | [{'year': 2024, 'downloads': 11707, 'records': 216972587}, {'year': 2023, 'downloads': 7679, 'records': 242648438}, {'year': 2022, 'downloads': 7478, 'records': 175113790}, {'year': 2021, 'downloads': 17549, 'records': 204507663}, {'year': 2020, 'downloads': 3443, 'records': 68153160}, {'year': 2019, 'downloads': 293, 'records': 8867930}, {'year': 2018, 'downloads': 1, 'records': 97617}] | 97617 | 2009floridakeysrvc | floridakeysrvc |
2 | b4ac3d44-b4bf-49ef-a6e9-e2c02adff2fa | https://ipt-obis.gbif.us/resource?r=drytortugasreefvisualcensus2012 | https://ipt-obis.gbif.us/archive.do?r=drytortugasreefvisualcensus2012 | 2023-02-21T17:50:22.000Z | None | 2023-06-06T09:35:54.566Z | event | [measurementorfact, occurrence] | {'Event': 2280, 'absence': 297153, 'dropped': 0, 'Occurrence': 384195, 'DNADerivedData': 0, 'MeasurementOrFact': 770670} | POLYGON((-83.1035167 24.5519333,-83.1035167 24.7284333,-82.7729667 24.7284333,-82.7729667 24.5519333,-83.1035167 24.5519333)) | ... | To the extent possible under law, the publisher has waived all rights to these data and has dedicated them to the Public Domain (CC0 1.0) | {'id': '8112d75b-7e78-4641-9e09-9a2cc7db7375', 'url': 'https://ipt-obis.gbif.us/rss.do'} | [{'name': 'NOAA, Southeast Fisheries Science Center, National Marine Fisheries Service Southeast Miami Laboratory', 'oceanexpert_id': 7534, 'oceanexpert_parent_id': 11379}, {'name': 'Rosenstiel School of Marine & Atmospheric Science, University of Miami', 'oceanexpert_id': 8124, 'oceanexpert_parent_id': None}, {'name': 'Gulf of Mexico Coastal Ocean Observing System', 'oceanexpert_id': 18936, '... | [{'role': None, 'type': 'creator', 'givenname': 'Jerald', 'surname': 'Ault', 'organization': 'University of Miami, Rosenstiel School of Marine & Atmospheric Science', 'position': 'Project Lead', 'email': 'jault@rsmas.miami.edu', 'url': None, 'organization_oceanexpert_id': 8124, 'type_display': 'Creator'}, {'role': None, 'type': 'creator', 'givenname': 'Jim', 'surname': 'Bohnsack', 'organiz... | [{'id': 'b7c47783-a020-4173-b390-7b57c4fa1426', 'name': 'OBIS USA'}] | [{'keyword': 'Occurrence', 'thesaurus': 'GBIF Dataset Type Vocabulary: http://rs.gbif.org/vocabulary/gbif/dataset_type.xml'}, {'keyword': 'Observation', 'thesaurus': 'GBIF Dataset Subtype Vocabulary: http://rs.gbif.org/vocabulary/gbif/dataset_subtype.xml'}, {'keyword': 'Samplingevent', 'thesaurus': 'GBIF Dataset Type Vocabulary: http://rs.gbif.org/vocabulary/gbif/dataset_type_2015-07-10.xml'}] | [{'year': 2024, 'downloads': 11800, 'records': 192712914}, {'year': 2023, 'downloads': 7461, 'records': 215815493}, {'year': 2022, 'downloads': 7273, 'records': 158822384}, {'year': 2021, 'downloads': 17801, 'records': 178906115}, {'year': 2020, 'downloads': 3353, 'records': 59225761}, {'year': 2019, 'downloads': 299, 'records': 7648000}, {'year': 2018, 'downloads': 1, 'records': 87042}] | 87042 | drytortugasreefvisualcensus2012 | drytortugasreefvisualcensus |
3 | 208bbf73-5dac-4d20-a017-5fc7d18a5b48 | https://ipt-obis.gbif.us/resource?r=floridakeysreefvisualcensus2011 | https://ipt-obis.gbif.us/archive.do?r=floridakeysreefvisualcensus2011 | 2023-02-21T18:05:15.000Z | None | 2023-06-05T04:39:24.917Z | event | [measurementorfact, occurrence] | {'Event': 2095, 'absence': 276737, 'dropped': 0, 'Occurrence': 353280, 'DNADerivedData': 0, 'MeasurementOrFact': 708655} | POLYGON((-81.988117 24.43305,-81.988117 25.71506,-80.092217 25.71506,-80.092217 24.43305,-81.988117 24.43305)) | ... | To the extent possible under law, the publisher has waived all rights to these data and has dedicated them to the Public Domain (CC0 1.0) | {'id': '8112d75b-7e78-4641-9e09-9a2cc7db7375', 'url': 'https://ipt-obis.gbif.us/rss.do'} | [{'name': 'NOAA, Southeast Fisheries Science Center, National Marine Fisheries Service Southeast Miami Laboratory', 'oceanexpert_id': 7534, 'oceanexpert_parent_id': 11379}, {'name': 'Rosenstiel School of Marine & Atmospheric Science, University of Miami', 'oceanexpert_id': 8124, 'oceanexpert_parent_id': None}, {'name': 'Gulf of Mexico Coastal Ocean Observing System', 'oceanexpert_id': 18936, '... | [{'role': None, 'type': 'creator', 'givenname': 'Jerald', 'surname': 'Ault', 'organization': 'University of Miami, Rosenstiel School of Marine & Atmospheric Science', 'position': 'Project Lead', 'email': 'jault@rsmas.miami.edu', 'url': None, 'organization_oceanexpert_id': 8124, 'type_display': 'Creator'}, {'role': None, 'type': 'creator', 'givenname': 'Jim', 'surname': 'Bohnsack', 'organiz... | [{'id': 'b7c47783-a020-4173-b390-7b57c4fa1426', 'name': 'OBIS USA'}] | [{'keyword': 'Occurrence', 'thesaurus': 'GBIF Dataset Type Vocabulary: http://rs.gbif.org/vocabulary/gbif/dataset_type.xml'}, {'keyword': 'Observation', 'thesaurus': 'GBIF Dataset Subtype Vocabulary: http://rs.gbif.org/vocabulary/gbif/dataset_subtype.xml'}, {'keyword': 'Samplingevent', 'thesaurus': 'GBIF Dataset Type Vocabulary: http://rs.gbif.org/vocabulary/gbif/dataset_type_2015-07-10.xml'}] | [{'year': 2024, 'downloads': 11481, 'records': 169952503}, {'year': 2023, 'downloads': 7513, 'records': 190581549}, {'year': 2022, 'downloads': 7350, 'records': 139710500}, {'year': 2021, 'downloads': 17831, 'records': 160861289}, {'year': 2020, 'downloads': 3556, 'records': 53543342}, {'year': 2019, 'downloads': 322, 'records': 7049786}, {'year': 2018, 'downloads': 1, 'records': 76543}] | 76543 | floridakeysreefvisualcensus2011 | floridakeysreefvisualcensus |
4 | 33f39020-dfb7-4d45-bee3-9e3910792d9d | https://ipt-obis.gbif.us/resource?r=2012floridakeys | https://ipt-obis.gbif.us/archive.do?r=2012floridakeys | 2023-02-21T17:29:53.000Z | None | 2023-06-05T12:51:56.106Z | event | [measurementorfact, occurrence] | {'Event': 2015, 'absence': 290261, 'dropped': 0, 'Occurrence': 366514, 'DNADerivedData': 0, 'MeasurementOrFact': 735043} | POLYGON((-81.9519667 24.4371333,-81.9519667 25.75155,-80.0896667 25.75155,-80.0896667 24.4371333,-81.9519667 24.4371333)) | ... | To the extent possible under law, the publisher has waived all rights to these data and has dedicated them to the Public Domain (CC0 1.0) | {'id': '8112d75b-7e78-4641-9e09-9a2cc7db7375', 'url': 'https://ipt-obis.gbif.us/rss.do'} | [{'name': 'NOAA, Southeast Fisheries Science Center, National Marine Fisheries Service Southeast Miami Laboratory', 'oceanexpert_id': 7534, 'oceanexpert_parent_id': 11379}, {'name': 'Rosenstiel School of Marine & Atmospheric Science, University of Miami', 'oceanexpert_id': 8124, 'oceanexpert_parent_id': None}, {'name': 'U.S. Geological Survey HQ', 'oceanexpert_id': 12976, 'oceanexpert_parent_i... | [{'role': None, 'type': 'creator', 'givenname': 'Jerald', 'surname': 'Ault', 'organization': 'University of Miami, Rosenstiel School of Marine & Atmospheric Science', 'position': 'Project Lead', 'email': 'jault@rsmas.miami.edu', 'url': None, 'organization_oceanexpert_id': 8124, 'type_display': 'Creator'}, {'role': None, 'type': 'creator', 'givenname': 'Jim', 'surname': 'Bohnsack', 'organiz... | [{'id': 'b7c47783-a020-4173-b390-7b57c4fa1426', 'name': 'OBIS USA'}] | [{'keyword': 'Samplingevent', 'thesaurus': 'GBIF Dataset Type Vocabulary: http://rs.gbif.org/vocabulary/gbif/dataset_type.xml'}, {'keyword': 'Samplingevent', 'thesaurus': 'GBIF Dataset Type Vocabulary: http://rs.gbif.org/vocabulary/gbif/dataset_type_2015-07-10.xml'}] | [{'year': 2024, 'downloads': 11368, 'records': 169299730}, {'year': 2023, 'downloads': 7385, 'records': 189896655}, {'year': 2022, 'downloads': 7230, 'records': 138680258}, {'year': 2021, 'downloads': 17532, 'records': 160135207}, {'year': 2020, 'downloads': 3446, 'records': 53299681}, {'year': 2019, 'downloads': 341, 'records': 6924827}, {'year': 2018, 'downloads': 1, 'records': 76253}] | 76253 | 2012floridakeys | floridakeys |
5 rows × 24 columns
df_meta[["short_name", "short_name_group"]].describe()
short_name | short_name_group | |
---|---|---|
count | 41 | 41 |
unique | 41 | 21 |
top | rcca_transects | floridakeysreefvisualcensus |
freq | 1 | 15 |
Read to geopandas and plot the bounding boxes#
Next we want to make a simple map of the coverage each dataset has. We can use the extent
field directly in geopandas to make a simple map.
from datetime import datetime
import cartopy.io.shapereader as shpreader
import geopandas
import matplotlib.pyplot as plt
gdf = geopandas.GeoDataFrame(
df_meta, geometry=geopandas.GeoSeries.from_wkt(df_meta["extent"])
)
gdf.set_crs("epsg:4326", inplace=True)
fig, ax = plt.subplots(
figsize=(20, 15),
)
gdf.plot(
ax=ax,
zorder=10,
alpha=0.5,
column="short_name_group",
legend=True,
cmap="turbo",
)
leg = ax.get_legend()
leg.set_bbox_to_anchor((0.0, 0.0, 1.2, 0.9))
shpfilename = shpreader.natural_earth(
resolution="110m",
category="cultural",
name="admin_0_countries",
)
world = geopandas.read_file(shpfilename)
world = world.to_crs(gdf.crs)
# bound map to US.
world.loc[world["ISO_A3"] == "USA"].plot(
ax=ax, color="lightgrey", edgecolor="black", zorder=1
)
today = datetime.now().strftime("%Y-%m-%d")
ax.set_title(
f"Sampling coverage map\nTotal dataset records contributed to OBIS from the US MBON network as of {today}: {gdf.records.sum() }"
)
ax.set_axis_off();
gdf.explore(column="short_name")
Save as a geojson#
Since we have the data, why don’t we write this out to a geoJson file to interact with on GitHub.
fname = f"US_MBON_bounding_boxes_{today.replace('-','')}.geojson"
# can"t export fields with lists or dictionaries - breaks the format.
gdf[["id", "title", "url", "geometry"]].to_file(fname, driver="GeoJSON")
Query the OBIS aggregated points#
Now that we know the general area for where our observations are made, let’s dig a little deeper into each dataset.
We can use pyobis.occurrences.getpoints()
to return all of the points for each dataset, aggregated to Geohash precision 8.
If we’d like a little more metadata about each of the aggregated points, we can query by dataset to collect the aggregated points and some useful metadata.
combined = pd.DataFrame()
query = pyobis.dataset.search(instituteid=institution_id)
df = pd.DataFrame(query.execute())
df_meta = pd.DataFrame.from_records(df["results"])
for datasetid in df_meta["id"]:
dset = pyobis.occurrences.getpoints(datasetid=datasetid).execute()
meta = pyobis.dataset.get(id=datasetid).execute()["results"][0]
short_name = meta["url"].split("=")[-1]
df = pd.DataFrame(dset)
df["dataset_id"] = datasetid
df["short_name"] = meta["url"].split("=")[-1]
df["short_name_group"] = df["short_name"].replace("\d", "", regex=True)
df["url"] = meta["url"]
df["metadata"] = str(meta)
df[["decimalLongitude", "decimalLatitude"]] = pd.DataFrame(
df["coordinates"].tolist()
)
combined = pd.concat([combined, df], ignore_index=True)
gdf_ghsh = geopandas.GeoDataFrame(
combined[
["dataset_id", "url", "short_name_group", "decimalLatitude", "decimalLongitude"]
],
geometry=geopandas.points_from_xy(
combined.decimalLongitude, combined.decimalLatitude
),
crs="epsg:4326",
)
fig, ax = plt.subplots(figsize=(20, 15))
gdf_ghsh.plot(
ax=ax,
zorder=10,
alpha=0.5,
column="short_name_group",
legend=True,
)
world.loc[world["ISO_A3"] == "USA"].plot(
ax=ax, color="lightgrey", edgecolor="white", zorder=1
)
now = datetime.now()
today = now.strftime("%Y-%m-%d")
ax.set_title(
f"Total aggregated occurrence records contributed to OBIS from the US MBON network as of {today}: {gdf_ghsh.shape[0]}"
)
ax.set_axis_off();
<>:18: SyntaxWarning: invalid escape sequence '\d'
<>:18: SyntaxWarning: invalid escape sequence '\d'
/tmp/ipykernel_72684/2245479231.py:18: SyntaxWarning: invalid escape sequence '\d'
df["short_name_group"] = df["short_name"].replace("\d", "", regex=True)
Let’s explore those points a little more with geopandas.GeoDataFrame.explore().#
This allows you to create an interactive map based on folium/leaflet.jsInteractive map based on GeoPandas and folium/leaflet.js
gdf_ghsh.explore(column="short_name_group")