{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "import subprocess\n",
    "import sys\n",
    "\n",
    "COLAB = \"google.colab\" in sys.modules\n",
    "\n",
    "\n",
    "def _install(package):\n",
    "    if COLAB:\n",
    "        ans = input(f\"Install { package }? [y/n]:\")\n",
    "        if ans.lower() in [\"y\", \"yes\"]:\n",
    "            subprocess.check_call(\n",
    "                [sys.executable, \"-m\", \"pip\", \"install\", \"--quiet\", package]\n",
    "            )\n",
    "            print(f\"{ package } installed!\")\n",
    "\n",
    "\n",
    "def _colab_install_missing_deps(deps):\n",
    "    import importlib\n",
    "\n",
    "    for dep in deps:\n",
    "        if importlib.util.find_spec(dep) is None:\n",
    "            if dep == \"iris\":\n",
    "                dep = \"scitools-iris\"\n",
    "            _install(dep)\n",
    "\n",
    "\n",
    "deps = [\"bagit\", \"pocean-core\"]\n",
    "_colab_install_missing_deps(deps)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Using BagIt to tag oceanographic data\n",
    "\n",
    "Created: 2017-11-01\n",
    "\n",
    "[`BagIt`](https://en.wikipedia.org/wiki/BagIt) is a packaging format that supports storage of arbitrary digital content. The \"bag\" consists of arbitrary content and \"tags,\" the metadata files. `BagIt` packages can be used to facilitate data sharing with federal archive centers - thus ensuring digital preservation of oceanographic datasets within IOOS and its regional associations. NOAA NCEI supports reading from a Web Accessible Folder (WAF) containing bagit archives. For an example please see: http://ncei.axiomdatascience.com/cencoos/\n",
    "\n",
    "On this notebook we will use the [python interface](http://libraryofcongress.github.io/bagit-python) for `BagIt` to create a \"bag\" of a time-series profile data. First let us load our data from a comma separated values file (`CSV`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>time</th>\n",
       "      <th>lon</th>\n",
       "      <th>lat</th>\n",
       "      <th>depth</th>\n",
       "      <th>station</th>\n",
       "      <th>humidity</th>\n",
       "      <th>temperature</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1990-01-01 00:00:00</td>\n",
       "      <td>-76.5</td>\n",
       "      <td>37.5</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Station1</td>\n",
       "      <td>89.708794</td>\n",
       "      <td>15.698009</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1990-01-01 00:00:00</td>\n",
       "      <td>-76.5</td>\n",
       "      <td>37.5</td>\n",
       "      <td>10.0</td>\n",
       "      <td>Station1</td>\n",
       "      <td>55.789471</td>\n",
       "      <td>10.916656</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1990-01-01 00:00:00</td>\n",
       "      <td>-76.5</td>\n",
       "      <td>37.5</td>\n",
       "      <td>20.0</td>\n",
       "      <td>Station1</td>\n",
       "      <td>50.176994</td>\n",
       "      <td>15.666663</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1990-01-01 00:00:00</td>\n",
       "      <td>-76.5</td>\n",
       "      <td>37.5</td>\n",
       "      <td>30.0</td>\n",
       "      <td>Station1</td>\n",
       "      <td>36.855045</td>\n",
       "      <td>1.158752</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1990-01-01 01:00:00</td>\n",
       "      <td>-76.5</td>\n",
       "      <td>37.5</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Station1</td>\n",
       "      <td>65.016937</td>\n",
       "      <td>31.059647</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                 time   lon   lat  depth   station   humidity  temperature\n",
       "0 1990-01-01 00:00:00 -76.5  37.5    0.0  Station1  89.708794    15.698009\n",
       "1 1990-01-01 00:00:00 -76.5  37.5   10.0  Station1  55.789471    10.916656\n",
       "2 1990-01-01 00:00:00 -76.5  37.5   20.0  Station1  50.176994    15.666663\n",
       "3 1990-01-01 00:00:00 -76.5  37.5   30.0  Station1  36.855045     1.158752\n",
       "4 1990-01-01 01:00:00 -76.5  37.5    0.0  Station1  65.016937    31.059647"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import os\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "fname = os.path.join(\"..\", \"data\", \"timeseriesProfile.csv\")\n",
    "\n",
    "df = pd.read_csv(fname, parse_dates=[\"time\"])\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Instead of \"bagging\" the `CSV` file we will use this create a metadata rich netCDF file.\n",
    "\n",
    "We can convert the table to a `DSG`, Discrete Sampling Geometry, using `pocean.dsg`. The first thing we need to do is to create a mapping from the data column names to the netCDF `axes`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "axes = {\"t\": \"time\", \"x\": \"lon\", \"y\": \"lat\", \"z\": \"depth\"}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can create a [Orthogonal Multidimensional Timeseries Profile](http://cfconventions.org/cf-conventions/v1.6.0/cf-conventions.html#_orthogonal_multidimensional_array_representation_of_time_series) object..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import tempfile\n",
    "\n",
    "from pocean.dsg import OrthogonalMultidimensionalTimeseriesProfile as omtsp\n",
    "\n",
    "output_fp, output = tempfile.mkstemp()\n",
    "os.close(output_fp)\n",
    "\n",
    "ncd = omtsp.from_dataframe(df.reset_index(), output=output, axes=axes, mode=\"a\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "... And add some extra metadata before we close the file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pocean.dsg.timeseriesProfile.om.OrthogonalMultidimensionalTimeseriesProfile'>\n",
      "root group (NETCDF4 data model, file format HDF5):\n",
      "    Conventions: CF-1.6\n",
      "    date_created: 2021-08-24T23:45:00Z\n",
      "    featureType: timeSeriesProfile\n",
      "    cdm_data_type: TimeseriesProfile\n",
      "    naming_authority: ioos\n",
      "    id: Station1\n",
      "    dimensions(sizes): station(1), time(100), depth(4)\n",
      "    variables(dimensions): <class 'str'> station(station), float64 lat(station), float64 lon(station), int32 crs(), float64 time(time), float64 depth(depth), int32 index(time, depth, station), float64 humidity(time, depth, station), float64 temperature(time, depth, station)\n",
      "    groups: \n"
     ]
    }
   ],
   "source": [
    "naming_authority = \"ioos\"\n",
    "st_id = \"Station1\"\n",
    "\n",
    "ncd.naming_authority = naming_authority\n",
    "ncd.id = st_id\n",
    "print(ncd)\n",
    "ncd.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Time to create the archive for the file with `BagIt`. We have to create a folder for the bag."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "temp_bagit_folder = tempfile.mkdtemp()\n",
    "temp_data_folder = os.path.join(temp_bagit_folder, \"data\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can create the bag and copy the netCDF file to a `data` sub-folder."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'/tmp/tmp30n1un_k/data/parameter1.nc'"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import shutil\n",
    "\n",
    "import bagit\n",
    "\n",
    "bag = bagit.make_bag(temp_bagit_folder, checksum=[\"sha256\"])\n",
    "\n",
    "shutil.copy2(output, temp_data_folder + \"/parameter1.nc\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Last, but not least, we have to set bag metadata and update the existing bag with it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "urn = f\"urn:ioos:station:{naming_authority}:{st_id}\"\n",
    "\n",
    "bag_meta = {\n",
    "    \"Bag-Count\": \"1 of 1\",\n",
    "    \"Bag-Group-Identifier\": \"ioos_bagit_testing\",\n",
    "    \"Contact-Name\": \"Kyle Wilcox\",\n",
    "    \"Contact-Phone\": \"907-230-0304\",\n",
    "    \"Contact-Email\": \"axiom+ncei@axiomdatascience.com\",\n",
    "    \"External-Identifier\": urn,\n",
    "    \"External-Description\": f\"Sensor data from station {urn}\",\n",
    "    \"Internal-Sender-Identifier\": urn,\n",
    "    \"Internal-Sender-Description\": f\"Station - URN:{urn}\",\n",
    "    \"Organization-address\": \"1016 W 6th Ave, Ste. 105, Anchorage, AK 99501, USA\",\n",
    "    \"Source-Organization\": \"Axiom Data Science\",\n",
    "}\n",
    "\n",
    "\n",
    "bag.info.update(bag_meta)\n",
    "bag.save(manifests=True, processes=4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That is it! Simple and efficient!!\n",
    "\n",
    "The cell below illustrates the bag directory tree.\n",
    "\n",
    "(Note that the commands below will not work on Windows and some \\*nix systems may require the installation of the command `tree`, however, they are only need for this demonstration.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[01;34m/tmp/tmp30n1un_k\u001b[00m\n",
      "├── bag-info.txt\n",
      "├── bagit.txt\n",
      "├── \u001b[01;34mdata\u001b[00m\n",
      "│   └── parameter1.nc\n",
      "├── manifest-sha256.txt\n",
      "└── tagmanifest-sha256.txt\n",
      "\n",
      "1 directory, 5 files\n",
      "966f9dda7df28cf50304d5cc67e08084020446e13521b40ee94cce35e5c75ec8  data/parameter1.nc\n"
     ]
    }
   ],
   "source": [
    "!tree $temp_bagit_folder\n",
    "!cat $temp_bagit_folder/manifest-sha256.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can add more files to the bag as needed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "shutil.copy2(output, temp_data_folder + \"/parameter2.nc\")\n",
    "shutil.copy2(output, temp_data_folder + \"/parameter3.nc\")\n",
    "shutil.copy2(output, temp_data_folder + \"/parameter4.nc\")\n",
    "\n",
    "bag.save(manifests=True, processes=4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[01;34m/tmp/tmp30n1un_k\u001b[00m\n",
      "├── bag-info.txt\n",
      "├── bagit.txt\n",
      "├── \u001b[01;34mdata\u001b[00m\n",
      "│   ├── parameter1.nc\n",
      "│   ├── parameter2.nc\n",
      "│   ├── parameter3.nc\n",
      "│   └── parameter4.nc\n",
      "├── manifest-sha256.txt\n",
      "└── tagmanifest-sha256.txt\n",
      "\n",
      "1 directory, 8 files\n",
      "966f9dda7df28cf50304d5cc67e08084020446e13521b40ee94cce35e5c75ec8  data/parameter1.nc\n",
      "966f9dda7df28cf50304d5cc67e08084020446e13521b40ee94cce35e5c75ec8  data/parameter2.nc\n",
      "966f9dda7df28cf50304d5cc67e08084020446e13521b40ee94cce35e5c75ec8  data/parameter3.nc\n",
      "966f9dda7df28cf50304d5cc67e08084020446e13521b40ee94cce35e5c75ec8  data/parameter4.nc\n"
     ]
    }
   ],
   "source": [
    "!tree $temp_bagit_folder\n",
    "!cat $temp_bagit_folder/manifest-sha256.txt"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}