Using BagIt to tag oceanographic data

Created: 2017-11-01

BagIt is a packaging format that supports storage of arbitrary digital content. The “bag” consists of the arbitrary content, known as the payload, and “tags,” the metadata files that describe it. BagIt packages can be used to facilitate data sharing with federal archive centers, which helps ensure the digital preservation of oceanographic datasets within IOOS and its regional associations. NOAA NCEI supports reading from a Web Accessible Folder (WAF) containing BagIt archives. For an example, please see: http://ncei.axiomdatascience.com/cencoos/

In this notebook we will use the Python interface for BagIt to create a “bag” of time-series profile data. First, let us load our data from a comma-separated values (CSV) file.

import os

import pandas as pd

fname = os.path.join("..", "data", "timeseriesProfile.csv")

df = pd.read_csv(fname, parse_dates=["time"])
df.head()
time lon lat depth station humidity temperature
0 1990-01-01 00:00:00 -76.5 37.5 0.0 Station1 89.708794 15.698009
1 1990-01-01 00:00:00 -76.5 37.5 10.0 Station1 55.789471 10.916656
2 1990-01-01 00:00:00 -76.5 37.5 20.0 Station1 50.176994 15.666663
3 1990-01-01 00:00:00 -76.5 37.5 30.0 Station1 36.855045 1.158752
4 1990-01-01 01:00:00 -76.5 37.5 0.0 Station1 65.016937 31.059647

Instead of “bagging” the CSV file, we will use it to create a metadata-rich netCDF file.

We can convert the table to a DSG, Discrete Sampling Geometry, using pocean.dsg. The first thing we need to do is to create a mapping from the data column names to the netCDF axes.

axes = {"t": "time", "x": "lon", "y": "lat", "z": "depth"}
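Before handing the mapping to pocean, it can help to confirm that every column it names exists in the table. The helper below is a minimal sketch in plain Python; `missing_axes` is a hypothetical name introduced for illustration and is not part of pocean or pandas.

```python
def missing_axes(columns, axes):
    """Return the entries of `axes` whose column name is absent from `columns`."""
    available = set(columns)
    return {axis: name for axis, name in axes.items() if name not in available}


# Column names copied from the table shown above.
columns = ["time", "lon", "lat", "depth", "station", "humidity", "temperature"]
axes = {"t": "time", "x": "lon", "y": "lat", "z": "depth"}

# An empty dict means every axis maps to an existing column.
print(missing_axes(columns, axes))

# A deliberately wrong mapping (there is no "pressure" column) is reported.
print(missing_axes(columns, {**axes, "z": "pressure"}))
```

With the DataFrame loaded earlier, the same check would be `missing_axes(df.columns, axes)`.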

Now we can create an Orthogonal Multidimensional Timeseries Profile object…

import os
import tempfile

from pocean.dsg import OrthogonalMultidimensionalTimeseriesProfile as omtsp

output_fp, output = tempfile.mkstemp()
os.close(output_fp)

ncd = omtsp.from_dataframe(df.reset_index(), output=output, axes=axes, mode="a")

… And add some extra metadata before we close the file.

naming_authority = "ioos"
st_id = "Station1"

ncd.naming_authority = naming_authority
ncd.id = st_id
print(ncd)
ncd.close()
<class 'pocean.dsg.timeseriesProfile.om.OrthogonalMultidimensionalTimeseriesProfile'>
root group (NETCDF4 data model, file format HDF5):
    Conventions: CF-1.6
    date_created: 2021-08-24T23:45:00Z
    featureType: timeSeriesProfile
    cdm_data_type: TimeseriesProfile
    naming_authority: ioos
    id: Station1
    dimensions(sizes): station(1), time(100), depth(4)
    variables(dimensions): <class 'str'> station(station), float64 lat(station), float64 lon(station), int32 crs(), float64 time(time), float64 depth(depth), int32 index(time, depth, station), float64 humidity(time, depth, station), float64 temperature(time, depth, station)
    groups: 
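The dimensions reported above, station(1), time(100), and depth(4), suggest that the table fills a complete 100 × 4 × 1 grid. As far as we understand the CF "orthogonal multidimensional" representation, it assumes that every profile shares the same depth levels and every station the same time axis. The sketch below checks that property on a list of (time, depth, station) tuples; `is_complete_grid` and the miniature table are hypothetical and used only for illustration.

```python
from itertools import product


def is_complete_grid(rows):
    """True if `rows` has exactly one entry per (time, depth, station) combination."""
    rows = list(rows)
    times = {t for t, _, _ in rows}
    depths = {z for _, z, _ in rows}
    stations = {s for _, _, s in rows}
    expected = set(product(times, depths, stations))
    # Same length rules out duplicates; same set rules out gaps.
    return len(rows) == len(expected) and set(rows) == expected


# Hypothetical miniature table: 2 times x 4 depths x 1 station = 8 rows.
rows = [
    (t, z, "Station1")
    for t in ("1990-01-01 00:00", "1990-01-01 01:00")
    for z in (0.0, 10.0, 20.0, 30.0)
]

print(is_complete_grid(rows))       # complete grid
print(is_complete_grid(rows[:-1]))  # one (time, depth) pair is missing
```

For the DataFrame used in this notebook, the rows could be produced with `df[["time", "depth", "station"]].itertuples(index=False, name=None)`.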

Time to create the archive for the file with BagIt. We have to create a folder for the bag.

temp_bagit_folder = tempfile.mkdtemp()
temp_data_folder = os.path.join(temp_bagit_folder, "data")

Now we can create the bag and copy the netCDF file to its data sub-folder, which make_bag creates for us.

import shutil

import bagit

bag = bagit.make_bag(temp_bagit_folder, checksum=["sha256"])

shutil.copy2(output, os.path.join(temp_data_folder, "parameter1.nc"))
'/tmp/tmp30n1un_k/data/parameter1.nc'

Last, but not least, we have to set bag metadata and update the existing bag with it.

urn = f"urn:ioos:station:{naming_authority}:{st_id}"

bag_meta = {
    "Bag-Count": "1 of 1",
    "Bag-Group-Identifier": "ioos_bagit_testing",
    "Contact-Name": "Kyle Wilcox",
    "Contact-Phone": "907-230-0304",
    "Contact-Email": "axiom+ncei@axiomdatascience.com",
    "External-Identifier": urn,
    "External-Description": f"Sensor data from station {urn}",
    "Internal-Sender-Identifier": urn,
    "Internal-Sender-Description": f"Station - URN:{urn}",
    "Organization-Address": "1016 W 6th Ave, Ste. 105, Anchorage, AK 99501, USA",
    "Source-Organization": "Axiom Data Science",
}


bag.info.update(bag_meta)
bag.save(manifests=True, processes=4)
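The metadata ends up in bag-info.txt as plain "Label: value" lines. To make that format concrete, here is a minimal parser sketch that uses only the standard library. `parse_bag_info` is a hypothetical helper, not part of bagit, and the sample text is hand-written from values used in this notebook rather than copied from a real bag.

```python
def parse_bag_info(text):
    """Parse 'Label: value' lines into a list of (label, value) pairs.

    A line that starts with whitespace is treated as a continuation of
    the previous value, since the BagIt specification allows folded values.
    """
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t" and pairs:
            label, value = pairs[-1]
            pairs[-1] = (label, value + " " + line.strip())
        else:
            # Split on the first colon only, so URNs keep their own colons.
            label, _, value = line.partition(":")
            pairs.append((label.strip(), value.strip()))
    return pairs


# Hand-written sample, assumed to resemble what bag-info.txt contains.
sample = (
    "Bag-Count: 1 of 1\n"
    "External-Identifier: urn:ioos:station:ioos:Station1\n"
    "Source-Organization: Axiom\n"
    "  Data Science\n"
)

for label, value in parse_bag_info(sample):
    print(f"{label} -> {value}")
```

A list of pairs is used instead of a dict because the same label may appear more than once in bag-info.txt.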

That is it! Simple and efficient!!

The cell below illustrates the bag directory tree.

(Note that the commands below will not work on Windows, and some *nix systems may require installing the tree command; however, they are only needed for this demonstration.)

!tree $temp_bagit_folder
!cat $temp_bagit_folder/manifest-sha256.txt
/tmp/tmp30n1un_k
├── bag-info.txt
├── bagit.txt
├── data
│   └── parameter1.nc
├── manifest-sha256.txt
└── tagmanifest-sha256.txt

1 directory, 5 files
966f9dda7df28cf50304d5cc67e08084020446e13521b40ee94cce35e5c75ec8  data/parameter1.nc

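We can add more files to the bag as needed. Here we copy the same netCDF file three more times, so all four entries in the updated manifest share one checksum.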

shutil.copy2(output, os.path.join(temp_data_folder, "parameter2.nc"))
shutil.copy2(output, os.path.join(temp_data_folder, "parameter3.nc"))
shutil.copy2(output, os.path.join(temp_data_folder, "parameter4.nc"))

bag.save(manifests=True, processes=4)
!tree $temp_bagit_folder
!cat $temp_bagit_folder/manifest-sha256.txt
/tmp/tmp30n1un_k
├── bag-info.txt
├── bagit.txt
├── data
│   ├── parameter1.nc
│   ├── parameter2.nc
│   ├── parameter3.nc
│   └── parameter4.nc
├── manifest-sha256.txt
└── tagmanifest-sha256.txt

1 directory, 8 files
966f9dda7df28cf50304d5cc67e08084020446e13521b40ee94cce35e5c75ec8  data/parameter1.nc
966f9dda7df28cf50304d5cc67e08084020446e13521b40ee94cce35e5c75ec8  data/parameter2.nc
966f9dda7df28cf50304d5cc67e08084020446e13521b40ee94cce35e5c75ec8  data/parameter3.nc
966f9dda7df28cf50304d5cc67e08084020446e13521b40ee94cce35e5c75ec8  data/parameter4.nc
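Each manifest line pairs a SHA-256 digest with a payload path, so anyone receiving the bag can recompute the digests and compare. The sketch below does this by hand with hashlib to show the idea. `sha256_of` and `check_manifest` are hypothetical helpers, and the demo runs on a throwaway directory instead of the bag created above. In practice the library's own `bag.validate()` is the better tool.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large netCDF files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_manifest(bag_dir, manifest_name="manifest-sha256.txt"):
    """Return the payload paths whose checksum differs from the manifest."""
    mismatches = []
    with open(os.path.join(bag_dir, manifest_name), encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            expected, rel_path = line.rstrip("\n").split(None, 1)
            if sha256_of(os.path.join(bag_dir, rel_path)) != expected:
                mismatches.append(rel_path)
    return mismatches


# Throwaway directory laid out like a bag: one payload file plus a manifest.
demo = tempfile.mkdtemp()
os.mkdir(os.path.join(demo, "data"))
payload = os.path.join(demo, "data", "example.bin")
with open(payload, "wb") as f:
    f.write(b"profile data")
with open(os.path.join(demo, "manifest-sha256.txt"), "w", encoding="utf-8") as f:
    f.write(f"{sha256_of(payload)}  data/example.bin\n")

print(check_manifest(demo))  # nothing to report while the payload is intact

# Tamper with the payload; the manifest now disagrees with the file.
with open(payload, "ab") as f:
    f.write(b"!")
print(check_manifest(demo))
```

This check covers only the payload manifest; the tag manifest (tagmanifest-sha256.txt) protects the metadata files in the same way.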