catalog_builder.py

USAGE

Generate ESM-intake catalogs for datasets stored using the CMIP6, CESM, and GFDL archive directory and retrieval structures (DRSs).

To run interactively:

> cd MDTF-diagnostics/tools/catalog_builder
> conda activate _MDTF_base
> python3 catalog_builder.py --config [CONFIG FILE NAME].yml

To submit a SLURM batch job:

> sbatch catalog_builder_slurm.csh -config [CONFIG FILE NAME].yml

Input

A YAML file with the configuration used to build the ESM-intake catalog

Output

A CSV file with ESM-intake catalog entries for the target root directory(ies) in the configuration file, and a JSON file with the catalog column headers. Example catalog and header files for a CMIP6 dataset stored on the GFDL uda file system are located in the examples/cmip subdirectory.
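
As a quick check on the two output files, the sketch below reads a catalog CSV and its JSON header with the Python standard library and reports the column names, the number of entries, and the top-level header keys. The file names, the two columns, and the header content are stand-ins invented for this example, and the sketch assumes the header file holds a JSON object; it does not reproduce the files in examples/cmip.

```python
import csv
import json
import tempfile
from pathlib import Path


def summarize_catalog(csv_path, json_path):
    """Return (column names, row count, sorted top-level header keys)."""
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        columns = reader.fieldnames or []
    with open(json_path) as f:
        header = json.load(f)  # assumed to be a JSON object
    return columns, len(rows), sorted(header)


# Stand-in files so the sketch runs on its own (all values are hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "example_catalog.csv"
    json_path = Path(tmp) / "example_catalog.json"
    csv_path.write_text("variable_id,path\ntas,/path/to/tas.nc\npr,/path/to/pr.nc\n")
    json_path.write_text(json.dumps({"description": "stand-in header"}))
    columns, n_rows, header_keys = summarize_catalog(csv_path, json_path)

print(columns, n_rows, header_keys)
```

For analysis work, the JSON file would more typically be opened with intake-esm (for example through intake.open_esm_datastore) rather than parsed by hand.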

Required packages:

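The required packages are included in the _MDTF_base conda environment. Several entries below (datetime, os, pathlib, shutil, sys, time, traceback, typing) are Python standard library modules and need no separate installation: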

  • click

  • dask

  • datetime

  • ecgtools

  • intake

  • os

  • pathlib

  • shutil

  • sys

  • time

  • traceback

  • typing

  • xarray

  • yaml

Configuration file:

The configuration file defines the following parameters to generate the ESM-intake catalog:

  • convention (required): DRS convention to use: cmip (default), gfdl, or cesm

  • data_root_dirs (required): a list of root directory paths with files to query

  • dir_depth (required): the directory depth to traverse in the paths. dir_depth=1 means the files are in the root directory(ies); dir_depth=2 means the files are in one or more subdirectories one level below the root directory(ies); and so on

  • output_dir (required): directory where catalog and header files will be written

  • output_filename (required): base name of the catalog and header files (.csv and .json are appended by the program)

  • num_threads (required): number of CPU threads to use

  • include_patterns (optional): list of patterns to include in search; supports wildcards

  • exclude_patterns (optional): list of patterns to exclude from search; supports wildcards
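
To illustrate how dir_depth and the include/exclude patterns interact, here is a minimal sketch using pathlib and fnmatch. It follows the description in the list above and is not the builder's own implementation; the function name find_files, matching patterns against the full path, and the sample file names are all assumptions made for this example.

```python
import fnmatch
import tempfile
from pathlib import Path


def find_files(root, dir_depth, include_patterns=None, exclude_patterns=None):
    """List files exactly dir_depth levels below root (1 = directly in root)."""
    if dir_depth < 1:
        raise ValueError("dir_depth must be at least 1")
    # Build a glob such as "*" (depth 1) or "*/*" (depth 2).
    glob_pattern = "/".join(["*"] * dir_depth)
    files = sorted(p for p in Path(root).glob(glob_pattern) if p.is_file())
    # Assumption: wildcard patterns are matched against the full path.
    if include_patterns:
        files = [p for p in files
                 if any(fnmatch.fnmatch(p.as_posix(), pat) for pat in include_patterns)]
    if exclude_patterns:
        files = [p for p in files
                 if not any(fnmatch.fnmatch(p.as_posix(), pat) for pat in exclude_patterns)]
    return files


# Small throwaway directory tree (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "top.nc").touch()
    (root / "sub").mkdir()
    (root / "sub" / "tas.nc").touch()
    (root / "sub" / "notes.txt").touch()
    depth1 = [p.name for p in find_files(root, 1)]
    depth2 = [p.name for p in find_files(root, 2, include_patterns=["*.nc"])]

print(depth1)
print(depth2)
```

With this layout, dir_depth=1 picks up only the file in the root directory, while dir_depth=2 combined with an include pattern of *.nc picks up only the netCDF file one level down.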

Templates for the configuration file and a SLURM batch submission script for GFDL PPAN are located in the examples/templates subdirectory.
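
Before running the builder, a configuration adapted from one of these templates could be sanity-checked with a few lines of Python. The sketch below assumes the YAML file has already been loaded into a dictionary (for example with yaml.safe_load). The parameter names come from the list above, but the function check_config, the specific checks, and the example values are assumptions for illustration, not part of catalog_builder.py.

```python
# Parameter names are taken from the configuration list; the validation
# rules themselves are assumptions made for this sketch.
REQUIRED_KEYS = (
    "convention", "data_root_dirs", "dir_depth",
    "output_dir", "output_filename", "num_threads",
)
OPTIONAL_KEYS = ("include_patterns", "exclude_patterns")
CONVENTIONS = ("cmip", "gfdl", "cesm")


def check_config(config):
    """Return a list of problems found in a loaded configuration dict."""
    problems = [f"missing required key: {key}"
                for key in REQUIRED_KEYS if key not in config]
    unknown = set(config) - set(REQUIRED_KEYS) - set(OPTIONAL_KEYS)
    problems += [f"unknown key: {key}" for key in sorted(unknown)]
    if config.get("convention", "cmip") not in CONVENTIONS:
        problems.append(f"convention must be one of {CONVENTIONS}")
    if not isinstance(config.get("data_root_dirs", []), list):
        problems.append("data_root_dirs must be a list of paths")
    for key in ("dir_depth", "num_threads"):
        value = config.get(key, 1)
        if not isinstance(value, int) or value < 1:
            problems.append(f"{key} must be a positive integer")
    return problems


example = {
    "convention": "cmip",
    "data_root_dirs": ["/path/to/data"],  # hypothetical path
    "dir_depth": 2,
    "output_dir": "/path/to/output",      # hypothetical path
    "output_filename": "my_catalog",      # hypothetical base name
    "num_threads": 4,
    "include_patterns": ["*.nc"],         # hypothetical pattern
}
print(check_config(example))
```

An empty list means that none of these checks found a problem; it does not guarantee that the paths exist or that the builder will accept the file.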