Using BagIt to tag oceanographic data
Created: 2017-11-01
BagIt is a packaging format that supports storage of arbitrary digital content. A “bag” consists of the content itself (the payload) and “tags,” the metadata files that describe it. BagIt packages can be used to facilitate data sharing with federal archive centers, which supports digital preservation of oceanographic datasets within IOOS and its regional associations. NOAA NCEI supports reading from a Web Accessible Folder (WAF) containing BagIt archives. For an example, please see: http://ncei.axiomdatascience.com/cencoos/
In this notebook we will use the Python interface for BagIt to create a “bag” of time-series profile data. First, let us load our data from a comma-separated values (CSV) file.
import os
import pandas as pd
fname = os.path.join("..", "data", "timeseriesProfile.csv")
df = pd.read_csv(fname, parse_dates=["time"])
df.head()
|   | time | lon | lat | depth | station | humidity | temperature |
|---|---|---|---|---|---|---|---|
| 0 | 1990-01-01 00:00:00 | -76.5 | 37.5 | 0.0 | Station1 | 89.708794 | 15.698009 |
| 1 | 1990-01-01 00:00:00 | -76.5 | 37.5 | 10.0 | Station1 | 55.789471 | 10.916656 |
| 2 | 1990-01-01 00:00:00 | -76.5 | 37.5 | 20.0 | Station1 | 50.176994 | 15.666663 |
| 3 | 1990-01-01 00:00:00 | -76.5 | 37.5 | 30.0 | Station1 | 36.855045 | 1.158752 |
| 4 | 1990-01-01 01:00:00 | -76.5 | 37.5 | 0.0 | Station1 | 65.016937 | 31.059647 |
Instead of “bagging” the CSV file, we will use it to create a metadata-rich netCDF file.
We can convert the table to a DSG (Discrete Sampling Geometry) using pocean.dsg. The first thing we need to do is create a mapping from the data column names to the netCDF axes.
axes = {"t": "time", "x": "lon", "y": "lat", "z": "depth"}
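The orthogonal multidimensional representation in the CF conventions is meant for data where every time step shares the same depth levels, so the table should form a complete time × depth grid for each station. A minimal sketch of that check, using only the standard library and a few hypothetical rows shaped like the table above (the helper `is_orthogonal` is an illustration, not part of pocean):

```python
from itertools import product

# Hypothetical (time, depth) pairs mimicking rows of the table above.
rows = [
    ("1990-01-01 00:00:00", 0.0),
    ("1990-01-01 00:00:00", 10.0),
    ("1990-01-01 01:00:00", 0.0),
    ("1990-01-01 01:00:00", 10.0),
]


def is_orthogonal(pairs):
    """Return True if the pairs cover every time/depth combination exactly once."""
    times = {t for t, _ in pairs}
    depths = {z for _, z in pairs}
    no_duplicates = len(pairs) == len(set(pairs))
    return no_duplicates and set(pairs) == set(product(times, depths))


print(is_orthogonal(rows))       # complete 2 x 2 grid
print(is_orthogonal(rows[:-1]))  # one depth missing at the second time step
```

If the check fails for real data, one of the other discrete sampling geometry layouts may be a better fit than the orthogonal one.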
Now we can create an Orthogonal Multidimensional Timeseries Profile object…
import os
import tempfile
from pocean.dsg import OrthogonalMultidimensionalTimeseriesProfile as omtsp
output_fp, output = tempfile.mkstemp()
os.close(output_fp)
ncd = omtsp.from_dataframe(df.reset_index(), output=output, axes=axes, mode="a")
… And add some extra metadata before we close the file.
naming_authority = "ioos"
st_id = "Station1"
ncd.naming_authority = naming_authority
ncd.id = st_id
print(ncd)
ncd.close()
<class 'pocean.dsg.timeseriesProfile.om.OrthogonalMultidimensionalTimeseriesProfile'>
root group (NETCDF4 data model, file format HDF5):
Conventions: CF-1.6
date_created: 2021-08-24T23:45:00Z
featureType: timeSeriesProfile
cdm_data_type: TimeseriesProfile
naming_authority: ioos
id: Station1
dimensions(sizes): station(1), time(100), depth(4)
variables(dimensions): <class 'str'> station(station), float64 lat(station), float64 lon(station), int32 crs(), float64 time(time), float64 depth(depth), int32 index(time, depth, station), float64 humidity(time, depth, station), float64 temperature(time, depth, station)
groups:
Time to create the archive for the file with BagIt. We have to create a folder for the bag.
temp_bagit_folder = tempfile.mkdtemp()
temp_data_folder = os.path.join(temp_bagit_folder, "data")
Now we can create the bag and copy the netCDF file to a data sub-folder.
import shutil
import bagit
bag = bagit.make_bag(temp_bagit_folder, checksum=["sha256"])
shutil.copy2(output, os.path.join(temp_data_folder, "parameter1.nc"))
'/tmp/tmp30n1un_k/data/parameter1.nc'
Last, but not least, we have to set bag metadata and update the existing bag with it.
urn = f"urn:ioos:station:{naming_authority}:{st_id}"
bag_meta = {
    "Bag-Count": "1 of 1",
    "Bag-Group-Identifier": "ioos_bagit_testing",
    "Contact-Name": "Kyle Wilcox",
    "Contact-Phone": "907-230-0304",
    "Contact-Email": "axiom+ncei@axiomdatascience.com",
    "External-Identifier": urn,
    "External-Description": f"Sensor data from station {urn}",
    "Internal-Sender-Identifier": urn,
    "Internal-Sender-Description": f"Station - URN:{urn}",
    "Organization-Address": "1016 W 6th Ave, Ste. 105, Anchorage, AK 99501, USA",
    "Source-Organization": "Axiom Data Science",
}
bag.info.update(bag_meta)
bag.save(manifests=True, processes=4)
That is it! Simple and efficient!!
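The metadata ends up in bag-info.txt as one `Label: value` line per entry, which is the layout the BagIt specification describes. The sketch below reads such text back with the standard library. `read_bag_info` is a hypothetical helper, not part of the bagit package; it ignores the specification's allowance for repeated labels, and the sample text is made up from the values used above.

```python
def read_bag_info(text):
    """Parse 'Label: value' lines; indented lines continue the previous value."""
    info = {}
    label = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[0] in " \t" and label is not None:
            # Folded line: append it to the value of the current label.
            info[label] += " " + line.strip()
        else:
            # Split on the first colon only, since values may contain colons.
            label, _, value = line.partition(":")
            label = label.strip()
            info[label] = value.strip()
    return info


sample = (
    "Bag-Count: 1 of 1\n"
    "External-Identifier: urn:ioos:station:ioos:Station1\n"
    "Source-Organization: Axiom Data Science\n"
)
info = read_bag_info(sample)
print(info["External-Identifier"])
```

Splitting on the first colon only matters here because the URN itself contains colons.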
The cell below illustrates the bag directory tree.
(Note that the commands below will not work on Windows, and some *nix systems may require installing the tree command; however, they are only needed for this demonstration.)
!tree $temp_bagit_folder
!cat $temp_bagit_folder/manifest-sha256.txt
/tmp/tmp30n1un_k
├── bag-info.txt
├── bagit.txt
├── data
│ └── parameter1.nc
├── manifest-sha256.txt
└── tagmanifest-sha256.txt
1 directory, 5 files
966f9dda7df28cf50304d5cc67e08084020446e13521b40ee94cce35e5c75ec8 data/parameter1.nc
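Each manifest line pairs a checksum with a path relative to the bag root, separated by whitespace. As a rough illustration of where the value comes from, the sketch below hashes a file with `hashlib` and formats the result in the same way. `manifest_line` is a hypothetical helper, and the demo file is a throwaway stand-in for the netCDF payload.

```python
import hashlib
import os
import tempfile


def manifest_line(bag_dir, rel_path, chunk_size=1 << 20):
    """Return '<sha256 hex digest> <relative path>' for one payload file."""
    digest = hashlib.sha256()
    with open(os.path.join(bag_dir, rel_path), "rb") as f:
        # Read in chunks so large payload files do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return f"{digest.hexdigest()} {rel_path}"


# Throwaway directory with a data sub-folder, mimicking the bag layout.
demo_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_dir, "data"))
with open(os.path.join(demo_dir, "data", "demo.bin"), "wb") as f:
    f.write(b"not a real netCDF file")

line = manifest_line(demo_dir, "data/demo.bin")
print(line)
```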
We can add more files to the bag as needed.
shutil.copy2(output, os.path.join(temp_data_folder, "parameter2.nc"))
shutil.copy2(output, os.path.join(temp_data_folder, "parameter3.nc"))
shutil.copy2(output, os.path.join(temp_data_folder, "parameter4.nc"))
bag.save(manifests=True, processes=4)
!tree $temp_bagit_folder
!cat $temp_bagit_folder/manifest-sha256.txt
/tmp/tmp30n1un_k
├── bag-info.txt
├── bagit.txt
├── data
│ ├── parameter1.nc
│ ├── parameter2.nc
│ ├── parameter3.nc
│ └── parameter4.nc
├── manifest-sha256.txt
└── tagmanifest-sha256.txt
1 directory, 8 files
966f9dda7df28cf50304d5cc67e08084020446e13521b40ee94cce35e5c75ec8 data/parameter1.nc
966f9dda7df28cf50304d5cc67e08084020446e13521b40ee94cce35e5c75ec8 data/parameter2.nc
966f9dda7df28cf50304d5cc67e08084020446e13521b40ee94cce35e5c75ec8 data/parameter3.nc
966f9dda7df28cf50304d5cc67e08084020446e13521b40ee94cce35e5c75ec8 data/parameter4.nc
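Because the four payload files are copies of the same netCDF file, their checksums are identical. A receiving archive can recompute those checksums to confirm that nothing changed in transit. The bagit package offers `bag.validate()` for this. The standard-library sketch below only illustrates the idea on a throwaway directory, and `failed_fixity` is a hypothetical helper.

```python
import hashlib
import os
import tempfile


def failed_fixity(bag_dir, manifest="manifest-sha256.txt"):
    """Return the payload paths whose SHA-256 no longer matches the manifest."""
    bad = []
    with open(os.path.join(bag_dir, manifest), encoding="utf-8") as f:
        for entry in f:
            if not entry.strip():
                continue
            # Each entry is '<checksum><whitespace><relative path>'.
            expected, rel_path = entry.strip().split(maxsplit=1)
            with open(os.path.join(bag_dir, rel_path), "rb") as payload:
                actual = hashlib.sha256(payload.read()).hexdigest()
            if actual != expected:
                bad.append(rel_path)
    return bad


# Build a tiny stand-in bag: one payload file plus its manifest.
demo_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_dir, "data"))
payload_path = os.path.join(demo_dir, "data", "demo.bin")
with open(payload_path, "wb") as f:
    f.write(b"original bytes")
checksum = hashlib.sha256(b"original bytes").hexdigest()
with open(os.path.join(demo_dir, "manifest-sha256.txt"), "w", encoding="utf-8") as f:
    f.write(f"{checksum}  data/demo.bin\n")

before = failed_fixity(demo_dir)  # payload untouched so far
print(before)

# Simulate corruption by appending a byte, then check again.
with open(payload_path, "ab") as f:
    f.write(b"!")
after = failed_fixity(demo_dir)
print(after)
```

This sketch only checks fixity of the payload files listed in the manifest; a full validation would also cover the tag files and look for payload files missing from the manifest.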