Calling R libraries from Python

Calling R libraries from Python#

Created: 2017-11-30

In this example we will explore the Coral Reef Evaluation and Monitoring Project (CREMP) data available in the Gulf of Mexico Coastal Ocean Observing System (GCOOS) ERDDAP server.

To access the server we will use the rerddap library and export the data to Python for easier plotting.

The first step is to load the rpy2 extension that will allow us to use the R libraries.

%load_ext rpy2.ipython

The first line below has a %%R to make it an R cell. The code below specify the GCOOS server and fetches the data information for the fk_CREMP_yearly_revisited_DATA_v3_1996 dataset.

For more information on rerddap please see https://docs.ropensci.org/rerddap/.

%%R

library('rerddap')

url <- 'https://gcoos4.tamu.edu/erddap'
data_info <- rerddap::info('fk_CREMP_yearly_revisited_DATA_v3_1996', url=url)

data_info
<ERDDAP info> fk_CREMP_yearly_revisited_DATA_v3_1996 
 Base URL: https://gcoos4.tamu.edu/erddap 
 Dataset Type: tabledap 
 Variables:  
     acceptedNameAuthorship: 
     acceptedNameUsage: 
     acceptedNameUsageID: 
     averageNumberOfPoints: 
         Range: 407, 900 
         Units: count 
     basisOfRecord: 
     bottomType: 
     class: 
     country: 
     crs: 
     datasetID: 
     datasetName: 
     depth: 
         Range: 3.0, 54.0 
         Units: m 
     eventDate: 
         Range: 1996, 1996 
         Units: unknown 
     eventID: 
     family: 
     firstYear: 
         Range: 1996, 1996 
         Units: unknown 
     genus: 
     geodeticDatum: 
     habitat: 
     habitatID: 
     kingdom: 
     language: 
     latitude: 
         Range: 24.4517, 25.2953 
         Units: degrees_north 
     license: 
     locality: 
     longitude: 
         Range: -81.9195, -80.2087 
         Units: degrees_east 
     maximumDepthInMeters: 
         Range: 3.0, 54.0 
         Units: m 
     minimumDepthInMeters: 
         Range: 3.0, 54.0 
         Units: m 
     Observations: 
         Range: 0, 11839 
     occurrenceID: 
     occurrenceStatus: 
     order: 
     organismQuantity: 
         Range: 0.0, 0.9475 
         Units: 0.01percent 
     organismQuantityType: 
     ownerInstitutionCode: 
     phylum: 
     recordedBy: 
     samplingEffort: 
     samplingProtocol: 
     scientificName: 
     scientificNameAuthorship: 
     scientificNameID: 
     siteCode: 
     siteID: 
         Range: 10, 81 
         Units: unknown 
     siteName: 
     specificEpithet: 
     stateProvince: 
     stationNumber: 
         Range: 101, 814 
         Units: unknown 
     subRegionID: 
     taxonomicStatus: 
     taxonRank: 
     time: 
         Range: 8.283168E8, 8.283168E8 
         Units: seconds since 1970-01-01T00:00:00Z 
     transectLengthInMeters: 
         Range: 19.0, 26.0 
         Units: m 
     type: 
     vernacularName: 
     waterBody: 

By inspecting the information above we can find the variables available in the dataset and use the tabledap function to download them.

Note that the %%R -o rdf will export the rdf variable back to the Python workspace.

%%R -o df

fields <- c(
    'depth',
    'longitude',
    'latitude',
    'organismQuantity',
    'genus',
    'habitat'
)

df <- tabledap(
    data_info,
    fields=fields,
    url=url
)
R[write to console]: info() output passed to x; setting base url to: https://gcoos4.tamu.edu/erddap

Now we need to export the R DataFrame to a pandas objects and ensure that all numeric types are numbers and not strings.

import pandas as pd

cols = ["longitude", "latitude", "depth", "organismQuantity"]
df[cols] = df[cols].apply(pd.to_numeric)

df.head()
depth longitude latitude organismQuantity genus habitat
2 6.0 -80.3475 25.1736 0.0 Acropora Hard Bottom
3 6.0 -80.3475 25.1736 0.0 Acropora Hard Bottom
4 6.0 -80.3475 25.1736 0.0 Acropora Hard Bottom
5 6.0 -80.3475 25.1736 0.0 Acropora Hard Bottom
6 9.0 -80.3782 25.1201 0.0 Acropora Hard Bottom

We can navigate to ERDDAP’s info page to find the variables description. Let’s check what is organismQuantity:

The is value of the derived information product, such as the numerical value for biomass. This term does not include units. Mean number of observed fish per species for 5 Minutes

We can see that organismQuantity has a lot of zero values, let’s remove that first to plot the data positions only where something was found.

# Filter invalid values (-999).

cremp_1996 = df.loc[df["organismQuantity"] >= 0]
cremp_1996["genus"] = cremp_1996["genus"].str.strip()
cremp_1996.head()
depth longitude latitude organismQuantity genus habitat
2 6.0 -80.3475 25.1736 0.0 Acropora Hard Bottom
3 6.0 -80.3475 25.1736 0.0 Acropora Hard Bottom
4 6.0 -80.3475 25.1736 0.0 Acropora Hard Bottom
5 6.0 -80.3475 25.1736 0.0 Acropora Hard Bottom
6 9.0 -80.3782 25.1201 0.0 Acropora Hard Bottom

What is the most common genus of Coral observed?

avg = cremp_1996.drop("habitat", axis=1).groupby("genus").mean()

avg
depth longitude latitude organismQuantity
genus
23.2375 -81.046667 24.768958 0.034553
Acropora 23.2375 -81.046667 24.768957 0.006003
Agaricia 23.2375 -81.046667 24.768957 0.000005
Cladocora 23.2375 -81.046667 24.768957 0.000000
Colpophyllia 23.2375 -81.046667 24.768957 0.005079
Dendrogyra 23.2375 -81.046667 24.768957 0.001327
Diadema 23.2375 -81.046667 24.768957 0.000000
Dichocoenia 23.2375 -81.046667 24.768957 0.000591
Dictyota 23.2375 -81.046667 24.768957 0.000000
Diploria 23.2375 -81.046667 24.768957 0.001321
Eusmilia 23.2375 -81.046667 24.768957 0.000042
Favia 23.2375 -81.046667 24.768957 0.000031
Gorgonia 23.2375 -81.046667 24.768957 0.000000
Halimeda 23.2375 -81.046667 24.768957 0.000000
Helioseris 23.2375 -81.046667 24.768957 0.000000
Isophyllia 23.2375 -81.046667 24.768957 0.000000
Lobophora 23.2375 -81.046667 24.768957 0.000000
Lyngbia 23.2375 -81.046667 24.768957 0.000000
Madracis 23.2375 -81.046667 24.768957 0.000018
Manicina 23.2375 -81.046667 24.768957 0.000021
Meandrina 23.2375 -81.046667 24.768957 0.000421
Millepora 23.2375 -81.046667 24.768957 0.006130
Montastraea 23.2375 -81.046667 24.768957 0.013394
Mussa 23.2375 -81.046667 24.768957 0.000051
Mycetophyllia 23.2375 -81.046667 24.768957 0.000133
Oculina 23.2375 -81.046667 24.768957 0.000065
Orbicella 23.2375 -81.046667 24.768957 0.034583
Palythoa 23.2375 -81.046667 24.768957 0.000000
Phyllangia 23.2375 -81.046667 24.768957 0.000000
Porites 23.2375 -81.046667 24.768957 0.003351
Pseudodiploria 23.2375 -81.046667 24.768957 0.001383
Scolymia 23.2375 -81.046667 24.768957 0.000000
Siderastrea 23.2375 -81.046667 24.768957 0.004472
Solenastrea 23.2375 -81.046667 24.768957 0.000051
Stephanocoenia 23.2375 -81.046667 24.768957 0.000271
Undaria 23.2375 -81.046667 24.768957 0.001139
Xestospongia 23.2375 -81.046667 24.768957 0.000000

rerddap’s info request does not have enough metadata about the variables to explain the blank, and most abundant, genus. Checking the sever did not help figure that out. We’ll remove that for now to deal with only those that are identified.

cremp_1996 = cremp_1996.loc[cremp_1996["genus"] != ""]

There are also many genus with zero biomass count. In this example we’ll choose to do a biased analysis of occurrence and eliminate those where nothing was observed.

# Filter zero values (nothing was observed).

cremp_1996 = cremp_1996.loc[cremp_1996["organismQuantity"] > 0]

Now we can check the quantificationValue average by genus.

%matplotlib inline

avg = (
    cremp_1996.drop("habitat", axis=1).groupby("genus").mean()
)  # re-compute the "biased" average.
ax = avg["organismQuantity"].plot(kind="bar")
../../../_images/0e49ed7e0fde7961e3e0e367f89a512650b8d0f38fedc7a88eec7f27b844693e.png

and habitat.

ax = (
    cremp_1996.drop("genus", axis=1)
    .groupby("habitat")
    .mean()["organismQuantity"]
    .plot(kind="bar")
)
../../../_images/b374b5a006fba3fb38af64cfe07a7aa0c28cc630a7745095fb19053371d9fc72.png

It seems the most of the biomass was found around the Dendrogyra genus in Patch Reef habitats. But where are those Coral Reefs? How is the distribution of top three species with more biomass around them? With a pandas DataFrame it is easy to group the data by location and count the genus occurrence based on it.

import cartopy.crs as ccrs
import cartopy.feature as cfeature
import matplotlib.pyplot as plt
import numpy as np
from cartopy.mpl.ticker import LatitudeFormatter, LongitudeFormatter


def make_plot():
    bbox = [-82, -80, 24, 26]
    projection = ccrs.PlateCarree()

    fig, ax = plt.subplots(figsize=(9, 9), subplot_kw=dict(projection=projection))

    ax.set_extent(bbox)

    land = cfeature.NaturalEarthFeature(
        "physical", "land", "10m", edgecolor="face", facecolor=[0.85] * 3
    )

    ax.add_feature(land, zorder=0)
    ax.coastlines("10m", zorder=1)

    ax.set_xticks(np.linspace(bbox[0], bbox[1], 3), crs=projection)
    ax.set_yticks(np.linspace(bbox[2], bbox[3], 3), crs=projection)
    lon_formatter = LongitudeFormatter(zero_direction_label=True)
    lat_formatter = LatitudeFormatter()
    ax.xaxis.set_major_formatter(lon_formatter)
    ax.yaxis.set_major_formatter(lat_formatter)
    return fig, ax
count = (
    cremp_1996.loc[cremp_1996["genus"] == "Acropora"]
    .groupby(["longitude", "latitude"])
    .count()
    .reset_index()
)


fig, ax = make_plot()
c = ax.scatter(
    count["longitude"],
    count["latitude"],
    s=200,
    c=count["genus"],
    alpha=0.5,
    cmap=plt.cm.get_cmap("viridis_r", 6),
    zorder=3,
)
cbar = fig.colorbar(c, shrink=0.75, extend="both")
cbar.ax.set_ylabel("Genus occurrence count")
ax.set_title("Acropora");
/tmp/ipykernel_9861/710676339.py:16: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed two minor releases later. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap(obj)`` instead.
  cmap=plt.cm.get_cmap("viridis_r", 6),
../../../_images/f45a8ba99da5de089ab3e8f7c937e3d86ddbe0e48f1c3c457d700a8f939805f8.png
count = (
    cremp_1996.loc[cremp_1996["genus"] == "Dendrogyra"]
    .groupby(["longitude", "latitude"])
    .count()
    .reset_index()
)


fig, ax = make_plot()
c = ax.scatter(
    count["longitude"],
    count["latitude"],
    s=200,
    c=count["genus"],
    alpha=0.5,
    cmap=plt.cm.get_cmap("viridis_r", 6),
    zorder=3,
)
cbar = fig.colorbar(c, shrink=0.75, extend="both")
cbar.ax.set_ylabel("Genus occurrence count")
ax.set_title("Dendrogyra");
/tmp/ipykernel_9861/1312933967.py:16: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed two minor releases later. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap(obj)`` instead.
  cmap=plt.cm.get_cmap("viridis_r", 6),
../../../_images/a8ad1d4243ed4ec3e540d5cb9e12c88606c6dd92a48c2e23c9493e2daf5254d6.png
count = (
    cremp_1996.loc[cremp_1996["genus"] == "Orbicella"]
    .groupby(["longitude", "latitude"])
    .count()
    .reset_index()
)


fig, ax = make_plot()
c = ax.scatter(
    count["longitude"],
    count["latitude"],
    s=200,
    c=count["genus"],
    alpha=0.5,
    cmap=plt.cm.get_cmap("viridis_r", 6),
    zorder=3,
)
cbar = fig.colorbar(c, shrink=0.75, extend="both")
cbar.ax.set_ylabel("Genus occurrence count")
ax.set_title("Orbicella");
/tmp/ipykernel_9861/1632937538.py:16: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed two minor releases later. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap(obj)`` instead.
  cmap=plt.cm.get_cmap("viridis_r", 6),
../../../_images/b4a75ad12317d9399ae3b62621c3ed521b6b864b0f46a30a42e968a3af6a4207.png

This demonstration showed the power of mixing Python and R to reduce developer time and allow the research to focus on the data and not the programming language.