{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "import subprocess\n", "import sys\n", "\n", "COLAB = \"google.colab\" in sys.modules\n", "\n", "\n", "def _install(package):\n", " if COLAB:\n", " ans = input(f\"Install { package }? [y/n]:\")\n", " if ans.lower() in [\"y\", \"yes\"]:\n", " subprocess.check_call(\n", " [sys.executable, \"-m\", \"pip\", \"install\", \"--quiet\", package]\n", " )\n", " print(f\"{ package } installed!\")\n", "\n", "\n", "def _colab_install_missing_deps(deps):\n", " import importlib\n", "\n", " for dep in deps:\n", " if importlib.util.find_spec(dep) is None:\n", " if dep == \"iris\":\n", " dep = \"scitools-iris\"\n", " _install(dep)\n", "\n", "\n", "deps = [\"bagit\", \"pocean-core\"]\n", "_colab_install_missing_deps(deps)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Using BagIt to tag oceanographic data\n", "\n", "Created: 2017-11-01\n", "\n", "[`BagIt`](https://en.wikipedia.org/wiki/BagIt) is a packaging format that supports storage of arbitrary digital content. The \"bag\" consists of arbitrary content and \"tags,\" the metadata files. `BagIt` packages can be used to facilitate data sharing with federal archive centers - thus ensuring digital preservation of oceanographic datasets within IOOS and its regional associations. NOAA NCEI supports reading from a Web Accessible Folder (WAF) containing bagit archives. For an example please see: http://ncei.axiomdatascience.com/cencoos/\n", "\n", "On this notebook we will use the [python interface](http://libraryofcongress.github.io/bagit-python) for `BagIt` to create a \"bag\" of a time-series profile data. First let us load our data from a comma separated values file (`CSV`)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
timelonlatdepthstationhumiditytemperature
01990-01-01 00:00:00-76.537.50.0Station189.70879415.698009
11990-01-01 00:00:00-76.537.510.0Station155.78947110.916656
21990-01-01 00:00:00-76.537.520.0Station150.17699415.666663
31990-01-01 00:00:00-76.537.530.0Station136.8550451.158752
41990-01-01 01:00:00-76.537.50.0Station165.01693731.059647
\n", "
" ], "text/plain": [ " time lon lat depth station humidity temperature\n", "0 1990-01-01 00:00:00 -76.5 37.5 0.0 Station1 89.708794 15.698009\n", "1 1990-01-01 00:00:00 -76.5 37.5 10.0 Station1 55.789471 10.916656\n", "2 1990-01-01 00:00:00 -76.5 37.5 20.0 Station1 50.176994 15.666663\n", "3 1990-01-01 00:00:00 -76.5 37.5 30.0 Station1 36.855045 1.158752\n", "4 1990-01-01 01:00:00 -76.5 37.5 0.0 Station1 65.016937 31.059647" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import os\n", "\n", "import pandas as pd\n", "\n", "fname = os.path.join(\"..\", \"data\", \"timeseriesProfile.csv\")\n", "\n", "df = pd.read_csv(fname, parse_dates=[\"time\"])\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of \"bagging\" the `CSV` file we will use this create a metadata rich netCDF file.\n", "\n", "We can convert the table to a `DSG`, Discrete Sampling Geometry, using `pocean.dsg`. The first thing we need to do is to create a mapping from the data column names to the netCDF `axes`." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "axes = {\"t\": \"time\", \"x\": \"lon\", \"y\": \"lat\", \"z\": \"depth\"}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can create a [Orthogonal Multidimensional Timeseries Profile](http://cfconventions.org/cf-conventions/v1.6.0/cf-conventions.html#_orthogonal_multidimensional_array_representation_of_time_series) object..." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import os\n", "import tempfile\n", "\n", "from pocean.dsg import OrthogonalMultidimensionalTimeseriesProfile as omtsp\n", "\n", "output_fp, output = tempfile.mkstemp()\n", "os.close(output_fp)\n", "\n", "ncd = omtsp.from_dataframe(df.reset_index(), output=output, axes=axes, mode=\"a\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "... And add some extra metadata before we close the file." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "root group (NETCDF4 data model, file format HDF5):\n", " Conventions: CF-1.6\n", " date_created: 2021-08-24T23:45:00Z\n", " featureType: timeSeriesProfile\n", " cdm_data_type: TimeseriesProfile\n", " naming_authority: ioos\n", " id: Station1\n", " dimensions(sizes): station(1), time(100), depth(4)\n", " variables(dimensions): station(station), float64 lat(station), float64 lon(station), int32 crs(), float64 time(time), float64 depth(depth), int32 index(time, depth, station), float64 humidity(time, depth, station), float64 temperature(time, depth, station)\n", " groups: \n" ] } ], "source": [ "naming_authority = \"ioos\"\n", "st_id = \"Station1\"\n", "\n", "ncd.naming_authority = naming_authority\n", "ncd.id = st_id\n", "print(ncd)\n", "ncd.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Time to create the archive for the file with `BagIt`. We have to create a folder for the bag." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "temp_bagit_folder = tempfile.mkdtemp()\n", "temp_data_folder = os.path.join(temp_bagit_folder, \"data\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can create the bag and copy the netCDF file to a `data` sub-folder." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'/tmp/tmp30n1un_k/data/parameter1.nc'" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import shutil\n", "\n", "import bagit\n", "\n", "bag = bagit.make_bag(temp_bagit_folder, checksum=[\"sha256\"])\n", "\n", "shutil.copy2(output, temp_data_folder + \"/parameter1.nc\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Last, but not least, we have to set bag metadata and update the existing bag with it." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "urn = f\"urn:ioos:station:{naming_authority}:{st_id}\"\n", "\n", "bag_meta = {\n", " \"Bag-Count\": \"1 of 1\",\n", " \"Bag-Group-Identifier\": \"ioos_bagit_testing\",\n", " \"Contact-Name\": \"Kyle Wilcox\",\n", " \"Contact-Phone\": \"907-230-0304\",\n", " \"Contact-Email\": \"axiom+ncei@axiomdatascience.com\",\n", " \"External-Identifier\": urn,\n", " \"External-Description\": f\"Sensor data from station {urn}\",\n", " \"Internal-Sender-Identifier\": urn,\n", " \"Internal-Sender-Description\": f\"Station - URN:{urn}\",\n", " \"Organization-address\": \"1016 W 6th Ave, Ste. 105, Anchorage, AK 99501, USA\",\n", " \"Source-Organization\": \"Axiom Data Science\",\n", "}\n", "\n", "\n", "bag.info.update(bag_meta)\n", "bag.save(manifests=True, processes=4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That is it! Simple and efficient!!\n", "\n", "The cell below illustrates the bag directory tree.\n", "\n", "(Note that the commands below will not work on Windows and some \\*nix systems may require the installation of the command `tree`, however, they are only need for this demonstration.)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[01;34m/tmp/tmp30n1un_k\u001b[00m\n", "├── bag-info.txt\n", "├── bagit.txt\n", "├── \u001b[01;34mdata\u001b[00m\n", "│   └── parameter1.nc\n", "├── manifest-sha256.txt\n", "└── tagmanifest-sha256.txt\n", "\n", "1 directory, 5 files\n", "966f9dda7df28cf50304d5cc67e08084020446e13521b40ee94cce35e5c75ec8 data/parameter1.nc\n" ] } ], "source": [ "!tree $temp_bagit_folder\n", "!cat $temp_bagit_folder/manifest-sha256.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can add more files to the bag as needed." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "shutil.copy2(output, temp_data_folder + \"/parameter2.nc\")\n", "shutil.copy2(output, temp_data_folder + \"/parameter3.nc\")\n", "shutil.copy2(output, temp_data_folder + \"/parameter4.nc\")\n", "\n", "bag.save(manifests=True, processes=4)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[01;34m/tmp/tmp30n1un_k\u001b[00m\n", "├── bag-info.txt\n", "├── bagit.txt\n", "├── \u001b[01;34mdata\u001b[00m\n", "│   ├── parameter1.nc\n", "│   ├── parameter2.nc\n", "│   ├── parameter3.nc\n", "│   └── parameter4.nc\n", "├── manifest-sha256.txt\n", "└── tagmanifest-sha256.txt\n", "\n", "1 directory, 8 files\n", "966f9dda7df28cf50304d5cc67e08084020446e13521b40ee94cce35e5c75ec8 data/parameter1.nc\n", "966f9dda7df28cf50304d5cc67e08084020446e13521b40ee94cce35e5c75ec8 data/parameter2.nc\n", "966f9dda7df28cf50304d5cc67e08084020446e13521b40ee94cce35e5c75ec8 data/parameter3.nc\n", "966f9dda7df28cf50304d5cc67e08084020446e13521b40ee94cce35e5c75ec8 data/parameter4.nc\n" ] } ], "source": [ "!tree $temp_bagit_folder\n", "!cat $temp_bagit_folder/manifest-sha256.txt" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 2 }