Framework configuration and parsing

This section describes the src.cli module, which is responsible for parsing input configuration. Familiarity with the Python argparse module is recommended.

CLI functionality

Overview

Flexibility and extensibility are among the MDTF project’s design goals, which must be accommodated by the package’s configuration logic. Our use case requires the following features:

  • Allow user input to be specified and recorded in a file, both to document the provenance of package runs and to eliminate the need for long strings of CLI flags.

  • Record whether the user has explicitly set an option (to a value which may or may not be the default), or whether the option is unset and its default value is being used.

  • Define “plug-ins” for specific tasks (such as model data retrieval) which can define their own CLI settings. This is necessary to avoid confusing the user with settings that are irrelevant for their specified analysis; e.g. the --version-date flag used by the CMIP6 local file data source would be meaningless for a source of data that didn’t have a revision history.

  • Enable site-specific customizations, which can add to or modify any of the above properties.

  • Define CLIs through configuration files instead of code to streamline the process of defining all of the above.

No third-party CLI package implements all of the above features, so the MDTF package provides its own solution, described here.
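To illustrate the second requirement above (telling an explicitly set option apart from an unset one), here is a minimal sketch using only argparse. It is not the package's implementation: the function name, the sentinel object and the option names are all hypothetical. The idea is to register every option with a sentinel default, so that an option explicitly set to its default value can still be recognized as user-supplied.

```python
import argparse

# Hypothetical sentinel meaning "the user did not supply this option".
_UNSET = object()

def parse_with_provenance(option_defaults, argv):
    """Return (values, explicitly_set) for a small illustrative CLI.

    option_defaults maps option names to their real default values.
    """
    parser = argparse.ArgumentParser(prog="example")
    for name in option_defaults:
        # The sentinel replaces the real default, so "--opt <default>"
        # given on the command line is still detectable as explicit.
        parser.add_argument("--" + name.replace("_", "-"),
                            dest=name, default=_UNSET)
    parsed = vars(parser.parse_args(argv))
    explicitly_set = {k for k, v in parsed.items() if v is not _UNSET}
    # Fill in the real defaults only after recording what was explicit.
    values = {k: (option_defaults[k] if v is _UNSET else v)
              for k, v in parsed.items()}
    return values, explicitly_set

values, explicit = parse_with_provenance(
    {"output_dir": "./out", "log_level": "info"},  # hypothetical options
    ["--log-level", "info"],  # explicitly set to the default value
)
```

Here both options end up with their default values, but only log_level is reported as explicitly set.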

CLI subcommands

Subcommands are used to organize different aspects of a program’s functionality: e.g. git status and git log are both provided by git, but each git subcommand takes its own options and flags. Subcommand parsing is currently implemented in the src.cli module but not used: subcommands are manually dispatched in mdtf_framework.py. Full use of subcommands was deferred to a future release, to avoid excessive changes to the UI.

Currently recognized subcommands are:

  • mdtf (no subcommand; default), or mdtf run: Run analyses on model data.

  • mdtf info: Implemented in the src.mdtf_info module. Displays information on currently installed PODs and the variables needed to run individual diagnostics.

  • mdtf help: display help on command-line options and exit, equivalent to the -h/--help flag.
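For readers unfamiliar with subcommands in argparse, the following sketch shows how run and info could be declared with the standard add_subparsers() mechanism, with run as the default. It is an illustration only and does not reproduce the manual dispatch in mdtf_framework.py; the --dry-run flag and the topic positional are hypothetical.

```python
import argparse

parser = argparse.ArgumentParser(prog="mdtf")
subparsers = parser.add_subparsers(dest="subcommand")

run_parser = subparsers.add_parser("run", help="Run analyses on model data.")
run_parser.add_argument("--dry-run", action="store_true")  # hypothetical flag

info_parser = subparsers.add_parser(
    "info", help="Display information on installed PODs.")
info_parser.add_argument("topic", nargs="?", default=None)  # hypothetical

def parse(argv):
    """Parse argv, treating a missing subcommand as 'run'."""
    args = parser.parse_args(argv)
    if args.subcommand is None:
        # No subcommand given: fall back to the default one.
        args.subcommand = "run"
    return args
```

Each subparser accepts only its own options, which is what keeps the options of one subcommand from cluttering the help text of another.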

In addition, the following subcommands were planned:

  • mdtf verify: User-facing interface to the src.verify_links module as a standalone script. This parses the HTML pages from a completed run of the package and determines if all linked plots exist.

  • mdtf install: This would invoke the src.install module to do initial installation of the package, conda environments and supporting POD and model data. This installer script is currently unused, on the grounds that the manual installation process described in the user-facing documentation is less error-prone.

  • mdtf update: Would invoke a subset of the installer’s functions to ensure that all code, supporting data and third-party dependencies are updated to their current versions.

Additional package manager-like commands could be added to allow users to selectively install the subset of PODs of interest to them (and their corresponding supporting data and conda environments).

CLI Plugins

“Plug-ins” provide different ways to implement the same type of task, following a common API. One example is obtaining model data from different sources: different code is needed for reading the sample model data from a local directory vs. accessing remote data via a catalog interface. In the plug-in system, the code for these two cases would be written as distinct data source plug-ins, and the data retrieval method to use would be selected at runtime by the user via the --data-manager CLI flag. This allows new functionalities to be developed and tested independently of each other, and without requiring changes to the common logic of the framework.

The categories of plug-ins are fixed by the framework. Currently these are data_manager, which retrieves model data, and environment_manager, which sets up each POD’s third-party code dependencies. Two other plug-ins are defined but not exposed to the user through the UI, because only one option is currently implemented for them: runtime_manager, which controls how PODs are executed, and output_manager, which controls how the PODs’ output files are collected and processed.

Allowed values for each of these plug-in categories are defined in the cli_plugins.jsonc files: the “base” one in /src, and optionally one in the site-specific directory selected by the user.

As noted in the overview above, for a manageable interface we need to allow each plug-in to define its own CLI options. These are defined in the cli attribute for each plug-in definition in the cli_plugins.jsonc file, following the syntax described below. When the CLI parser is being configured, the user input is first partially parsed to determine what plug-ins the user has selected, and then their specific CLI options are added to the “full” CLI parser.
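The two-stage parsing described above can be sketched with argparse's parse_known_args(), which ignores options it does not recognize. This is a simplified illustration, not the package's code: the PLUGINS table stands in for the contents of cli_plugins.jsonc, and the plug-in names and the --sample-dir option are hypothetical.

```python
import argparse

# Hypothetical stand-in for the plug-in definitions in cli_plugins.jsonc.
PLUGINS = {
    "data_manager": {
        "local_file": {"cli": [{"name": "--sample-dir", "default": None}]},
        "cmip6": {"cli": [{"name": "--version-date", "default": None}]},
    }
}

def build_parser(argv):
    """Build the full parser after peeking at the plug-in selection."""
    # Stage 1: a minimal parser that only knows the plug-in selector.
    pre = argparse.ArgumentParser(add_help=False)
    pre.add_argument("--data-manager",
                     choices=sorted(PLUGINS["data_manager"]),
                     default="local_file")
    known, _unparsed = pre.parse_known_args(argv)

    # Stage 2: the full parser inherits the selector and adds only the
    # options defined by the selected plug-in.
    full = argparse.ArgumentParser(prog="mdtf", parents=[pre])
    for arg in PLUGINS["data_manager"][known.data_manager]["cli"]:
        full.add_argument(arg["name"], default=arg["default"])
    return full

argv = ["--data-manager", "cmip6", "--version-date", "20200101"]
args = build_parser(argv).parse_args(argv)
```

With this arrangement, --version-date is only recognized (and only listed in the help text) when the plug-in that defines it has been selected.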

File-based CLI definition

The CLI for the package is constructed from a set of JSONC configuration files. The syntax for these files is essentially a direct JSON serialization of the arguments given to ArgumentParser, with a few extensions described below.

Location of configuration files

The top-level configuration files have hard-coded names:

Plugins define their own CLI options in the cli attribute in their entry in the plugins file, using the syntax described below. On the other hand, each subcommand defines its CLI through a separate file, given in the cli_file attribute. Chief among these is

  • src/cli_template.jsonc, which defines the CLI for running the package in the absence of site-specific modifications.

CLI configuration file syntax

A subcommand cli_file is a JSONC struct which may contain:

  • Arguments taken by the constructor for ArgumentParser;

  • An attribute named arguments, containing a list of argument structs not in any argument group;

  • An attribute named argument_groups, containing a list of structs each containing arguments taken by the add_argument_group() method of ArgumentParser, and an arguments attribute.

The arguments attribute referred to above defines a list of CLI options, in the order they’re to be listed in online help (following basic unix convention, the order options are given doesn’t affect their parsing). This is also the syntax used by the cli argument for each CLI plugin.
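As a rough illustration of this structure, the sketch below builds an ArgumentParser from a struct with parser-level settings, an arguments list and an argument_groups list. It is a simplification under stated assumptions: the spec uses plain JSON (the standard json module does not accept JSONC comments), name is passed through to add_argument() unchanged, and the option names and the parser_from_spec() helper are hypothetical.

```python
import argparse
import json

# Plain-JSON stand-in for a cli_file; all option names are hypothetical.
SPEC = json.loads("""
{
  "prog": "example",
  "description": "Illustrative CLI built from a file-like spec.",
  "arguments": [
    {"name": "--verbose", "action": "store_true", "help": "Print more output."}
  ],
  "argument_groups": [
    {
      "title": "PATHS",
      "description": "Locations of input and output.",
      "arguments": [
        {"name": "--output-dir", "default": "./out", "help": "Output location."}
      ]
    }
  ]
}
""")

def _add_arguments(target, arguments):
    """Add each argument struct to a parser or an argument group."""
    for arg in arguments:
        arg = dict(arg)  # copy so the spec itself is not modified
        target.add_argument(arg.pop("name"), **arg)

def parser_from_spec(spec):
    spec = dict(spec)
    arguments = spec.pop("arguments", [])
    groups = spec.pop("argument_groups", [])
    # Whatever remains is passed to the ArgumentParser constructor.
    parser = argparse.ArgumentParser(**spec)
    _add_arguments(parser, arguments)
    for group in groups:
        group = dict(group)
        group_arguments = group.pop("arguments", [])
        # Remaining keys (title, description) go to add_argument_group().
        _add_arguments(parser.add_argument_group(**group), group_arguments)
    return parser

args = parser_from_spec(SPEC).parse_args(["--verbose"])
```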

Attributes of a struct in the arguments list can include:

  • Arguments taken by the add_argument() method of ArgumentParser, in particular:

    • name corresponds to the name_or_flags argument to add_argument(). It can be either a string, or list of strings, all of which will be taken to define the same flag. Initial hyphens (GNU syntax) are added, and underscores are converted to hyphens: name: "hyphen_opt" defines an option that can be set with either --hyphen_opt or --hyphen-opt. If dest is not supplied, the first entry will be taken as the destination variable for the setting.

    • action is one of the allowed values recognized by add_argument, or the fully qualified (module) name of a custom Action subclass, which will be imported if it’s not present in the current namespace.

  • The following extensions to this set of arguments:

    • short_name, optional, is used to define single-letter abbreviated flags for the most commonly used options. These are added to the synonymous flags defined via name. Use of full-word (GNU style) flags is preferred, as it makes the set of arguments more comprehensible.

    • is_positional, default False, is a boolean used to identify positional arguments (as opposed to flag-based arguments, which are identified by their flag rather than their position on the command line.)

    • hidden, default False, is a boolean used to identify options that are recognized by the parser but not displayed to the user in online help.
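The extensions listed above could be translated into add_argument() calls roughly as follows. This is a sketch under assumptions rather than the module's actual logic: add_from_struct() is a hypothetical helper, and hiding an option is done here with argparse.SUPPRESS, the standard argparse way of omitting an option from help output.

```python
import argparse

def add_from_struct(parser, struct):
    """Add one argument described by a dict using the extension keys."""
    struct = dict(struct)  # copy so the caller's dict is not modified
    names = struct.pop("name")
    names = [names] if isinstance(names, str) else list(names)
    short = struct.pop("short_name", None)
    positional = struct.pop("is_positional", False)
    if struct.pop("hidden", False):
        # Still parsed, but omitted from usage and help text.
        struct["help"] = argparse.SUPPRESS
    if positional:
        # Positional arguments take a bare name and no flags.
        return parser.add_argument(names[0], **struct)
    flags = ["-" + short] if short else []
    for name in names:
        # Accept both the underscore and the hyphen spelling.
        for variant in (name, name.replace("_", "-")):
            flag = "--" + variant
            if flag not in flags:
                flags.append(flag)
    # If dest is not supplied, the first name is the destination.
    struct.setdefault("dest", names[0].replace("-", "_"))
    return parser.add_argument(*flags, **struct)

parser = argparse.ArgumentParser(prog="example")
add_from_struct(parser, {"name": "hyphen_opt", "short_name": "y",
                         "default": "a"})
add_from_struct(parser, {"name": "secret", "hidden": True,
                         "action": "store_true"})
```

In this sketch, --hyphen_opt, --hyphen-opt and -y all set the same destination, while --secret is accepted by the parser but absent from the help text.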

Use in the code

The src.cli module defines a hierarchy of classes representing objects in a CLI parser specification, which are instantiated by values from the configuration files. At the root of the hierarchy is CLIConfigManager, a Singleton which reads all the files, begins the object creation process, and stores the results. The other classes in the hierarchy are, in descending order:

  • CLICommand: Dataclass representing a subcommand or a plug-in. This wraps a parser (parser attribute) and objects in the classes below, corresponding to configuration for that parser, which are initialized from the configuration files (cli attribute.) It also implements a call() method for dispatching parsed values to the initialization method of the class implementing the subcommand or plug-in.

  • CLIParser: Dataclass representing arguments passed to the constructor for ArgumentParser. A parser object (next section) is configured with information in objects in the classes below via this class’s configure method.

  • CLIArgumentGroup: Dataclass representing arguments passed to add_argument_group(). This only affects the formatting in the online help.

  • CLIArgument: Dataclass representing arguments passed to add_argument(), as described above.
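The general shape of such a hierarchy can be sketched with standard dataclasses. The classes below are deliberately given different, hypothetical names (ArgumentSpec, GroupSpec, ParserSpec) because they are a simplified illustration and do not reproduce the attributes or methods of the real classes in src.cli.

```python
import argparse
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class ArgumentSpec:
    """Arguments for one add_argument() call."""
    name: str
    kwargs: Dict[str, Any] = field(default_factory=dict)

    def add_to(self, target):
        # target may be a parser or an argument group.
        target.add_argument(self.name, **self.kwargs)

@dataclass
class GroupSpec:
    """Arguments for one add_argument_group() call, plus its members."""
    title: str
    arguments: List[ArgumentSpec] = field(default_factory=list)

    def add_to(self, parser):
        group = parser.add_argument_group(title=self.title)
        for arg in self.arguments:
            arg.add_to(group)

@dataclass
class ParserSpec:
    """Top-level spec: ungrouped arguments and argument groups."""
    prog: str
    arguments: List[ArgumentSpec] = field(default_factory=list)
    argument_groups: List[GroupSpec] = field(default_factory=list)

    def configure(self, parser):
        for arg in self.arguments:
            arg.add_to(parser)
        for group in self.argument_groups:
            group.add_to(parser)
        return parser

spec = ParserSpec(
    prog="example",
    arguments=[ArgumentSpec("--verbose", {"action": "store_true"})],
    argument_groups=[
        GroupSpec("PATHS", [ArgumentSpec("--output-dir", {"default": "./out"})])
    ],
)
parser = spec.configure(argparse.ArgumentParser(prog=spec.prog))
```

Keeping the specification in plain data objects, separate from the parser they configure, is what allows the same definitions to be applied to more than one parser instance.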

CLI parsers

Parser classes

As described above, the CLI used on a specific run of the package depends on the values of some of the CLI arguments: the value of --site and the values chosen for recognized plug-ins. This introduces a chicken-and-egg level of complexity, in which we need to parse some arguments in order to determine how to proceed with the rest of the parsing. The src.cli module does this by defining several parser classes, all of which inherit from ArgumentParser.

Defaults and argument parsing precedence

Long strings of command-line arguments are cumbersome for users. At the same time, provenance and reproducibility of package runs are simplified if all configuration is handled by the same code. For this reason, we implement multiple ways for users to provide CLI arguments:

  1. Options explicitly given on the command line.

  2. Option values defined in a JSONC file and passed with the -f/--input-file flag.

  3. Option values defined in a JSONC file named defaults.jsonc located in the directory of the currently selected site.

  4. Option values defined in a JSONC file named defaults.jsonc located in the /sites directory.

  5. The default value (if any) specified in each CLI argument’s definition.

The value assigned to every option is determined by the lowest-numbered method that explicitly specifies that value: for example, explicit command-line options override values given in a file passed with --input-file, which in turn override the option defaults listed in the online help.
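The precedence rule can be illustrated with collections.ChainMap, which looks a key up in each mapping in turn and returns the first match. This is only a sketch of the rule, not the logic in parse_known_args(): resolve_options() and all option names are hypothetical, and each source is assumed to contain only the options it explicitly sets.

```python
from collections import ChainMap

def resolve_options(cli, input_file, site_defaults, global_defaults,
                    argument_defaults):
    """Merge option sources; earlier arguments take precedence."""
    return dict(ChainMap(cli, input_file, site_defaults, global_defaults,
                         argument_defaults))

resolved = resolve_options(
    cli={"start_year": 1980},                         # 1. command line
    input_file={"start_year": 1990, "case": "exp1"},  # 2. --input-file
    site_defaults={"data_root": "/site/data"},        # 3. site defaults.jsonc
    global_defaults={"data_root": "/data", "verbose": 1},  # 4. /sites file
    argument_defaults={"verbose": 0, "case": None, "end_year": None},  # 5.
)
```

Here start_year comes from the command line, case from the input file, data_root from the site defaults, verbose from the defaults file in /sites, and end_year from the argument's own default.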

The intended use case for these different methods is to enable the user to focus on the settings that matter for each run. Continuing the example above, the user could specify the analysis period and desired PODs with explicit flags, options for data from the experiment being analyzed in an input file, and options describing the paths to POD supporting data and conda environments in a site-specific defaults.jsonc file (see user documentation for site customization.)

File-based input (2, 3 and 4) is read in by the init_user_defaults() method of MDTFTopLevelArgParser. The full precedence logic is implemented in the parse_known_args() method, inherited by MDTFTopLevelArgParser from MDTFArgParser.

Walkthrough of CLI creation and parsing

Building the CLI

  • The mdtf wrapper script activates the _MDTF_base conda environment and calls mdtf_framework.py.

  • mdtf_framework.py manually determines the subcommand from the currently recognized values, and constructs the CLI appropriate to it. In this example, we’re running the package, so the MDTFTopLevelArgParser is initialized and its setup() method is called.

    • This calls init_user_defaults(), which parses the value of --input-file and, if set, reads the file and stores its contents in the user_defaults attribute of CLIConfigManager.

    • It then calls init_site(), which parses the value of the selected site and reads the site-specific defaults files (if any).

    • Now that we know which site we’re using, we know the full set of subcommands and plug-in values (built-in and site-specific). read_subcommands() and read_plugins() read this information and parse it into CLICommand objects stored in the CLIConfigManager.

    • Another MDTFArgPreparser is created to parse the subcommand and plug-in values. The corresponding plugin-specific arguments are added.

  • We’re now ready to build the “real” CLI parser, with configure().

    • This simply sets some options relevant for the help text, and adds the CLI arguments (parsed as CLIArgument objects) to the parser in add_contents(), which calls the configure() method on the CLIParser object for the chosen subcommand.

  • At this point the MDTFTopLevelArgParser is fully configured and ready to parse user input.

Parsing CLI arguments

  • Parsing of user input is done by the dispatch() method of the configured MDTFTopLevelArgParser object.

  • The parsed option values are stored as a dict in the config attribute of the MDTFTopLevelArgParser object. This will be the starting point for further validation of user input done in the MDTFFramework class.

  • The dispatch() method then imports the modules for all selected plug-in objects. We do this import “on demand,” rather than simply always importing everything, because a plug-in may make use of third-party modules that the user hasn’t installed (e.g. if the plug-in is site-specific and the user is at a different site).

  • Finally, dispatch() calls the call() method on the selected subcommand to hand off execution. As noted above, subcommand functionality is implemented but unused, so currently we always hand off to the first (only) subcommand, mdtf run, regardless of input. The corresponding entry point, as specified in src/cli_plugins.jsonc, is the __init__ method of MDTFFramework.
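The “on demand” import mentioned above can be done with the standard importlib module. The sketch below is an assumption about the general technique, not the code used by dispatch(): import_entry_point() is a hypothetical helper that resolves a fully qualified name to an object only when it is actually needed, demonstrated on standard-library names.

```python
import importlib

def import_entry_point(qualified_name):
    """Import 'package.module.attr' lazily and return the attribute."""
    module_name, _, attr_name = qualified_name.rpartition(".")
    # The module is imported only when this function is called, so a
    # plug-in's dependencies are required only if it is selected.
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)

# Demonstrated with standard-library names in place of plug-in classes.
ordered_dict_cls = import_entry_point("collections.OrderedDict")
json_loads = import_entry_point("json.loads")
```

If the module for an unselected plug-in is missing or has unmet dependencies, no ImportError is raised, because import_module() is never called for it.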

Extending the user interface

Currently, the only method for the user to configure a run of the package is the CLI described above, which parses command-line options and configuration files.

In the future it may be desirable to provide additional invocation mechanisms, e.g. from a larger workflow engine or a web-based front end.

Parsing and validation logic is split between the src.cli module and the MDTFFramework class. In order to avoid duplicating logic and ensure that configuration gets parsed consistently across the different methods, the raw user input should be introduced into the chain of methods in the parsing logic (described above) as early as possible.