catalog_builder.py¶
USAGE¶
Generate ESM-intake catalogs for datasets stored using the CMIP6, CESM, and GFDL archive directory and retrieval structures (DRSs).
To run interactively:
> cd MDTF-diagnostics/tools/catalog_builder
> conda activate _MDTF_base
> python3 catalog_builder.py --config [CONFIG FILE NAME].yml
Submit a SLURM batch job:
> sbatch catalog_builder_slurm.csh -config [CONFIG FILE NAME].yml
Input¶
Yaml file with configuration to build an ESM intake catalog
Output¶
A csv file with ESM-intake catalog entries for the target root directory(ies) in the configuration file, and a json file with the catalog column headers. Example catalog and header files for CMIP6 dataset stored on the GFDL uda file system are located in the examples/cmip subdirectory.
Required packages:¶
The required packages are included in the _MDTF_base conda environment:
click
dask
datetime
ecgtools
intake
os
pathlib
shutil
sys
time
traceback
typing
xarray
yaml
Configuration file:¶
The configuration file defines the following parameters to generate the ESM-intake catalog:
convention (required): DRS convention to use: cmip (default), gfdl, or cesm
data_root_dirs (required): a list of root directory paths with files to query
dir_depth (required): the directory depth to traverse in the paths. A dir_depth=1 means that the files are in the root directory(ies), a dir_depth=2 means the files are in one or more subdirectories one level down from the root directory(ies) and so on
output_dir (required): directory where catalog and header files will be written
output_filename (required): base name of the catalog and header files (.csv and .json are appended by the program)
num_threads (required): number of cpu threads to run with
include_patterns (optional): list of patterns to include in search; supports wildcards
exclude_patterns (optional): list of patterns to exclude from search; supports wildcards
Templates for the configuration file and a slurm batch submission script for GFDL PPAN are located in the examples/templates subdirectory.