util subpackage¶
This section summarizes code in the src/util subpackage, which contains utility
functions needed in many places in the code. It’s implemented as a package
because putting all the code in a single module would make it difficult to
navigate. All modules depend only on the Python standard library.
This section is intended to give an introduction and context for the overall code organization, which might be difficult to gain from the complete docstring listing in the Internal code documentation. In particular, we don’t describe every class or function in detail here.
Modules in the package¶
__init__.py¶
The util package contains a non-trivial __init__.py, which lists the “public”
members of each module that are provided when the util package is imported
as a whole. New classes or functions added to a module in the package should be
listed here in order to be usable outside the package.
src.util.basic module¶
This module contains implementations of the simplest data structures and utility functions needed by the framework. In general, code is placed in this module to avoid the need for circular imports (which can be done safely in Python, but which we avoid).
src.util.dataclass module¶
This module contains extensions to the Python standard library’s dataclasses
and re modules. Recall that Python dataclasses
provide an alternative syntax for class definition that’s most useful in
defining “passive” classes that mainly serve as a container for typed data
fields (hence the name), for example, records of a database. The code in this
module extends this functionality to the use case of instantiating dataclasses
from the results of a regex match on a string: our main use case is assembling
a data catalog on the fly by using a regex to parse paths of model data files in
a set directory hierarchy convention.
For a simple example of how the major functionality of this module is used in
the framework, examine the code for src.cmip6.CMIP6_VariantLabel. This
class simply parses a variant label of the type used in the CMIP6 conventions:
given a string of the form r1i2f3, return a CMIP6_VariantLabel object with
realization_index attribute set to 1, etc.
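The idea can be sketched with the standard library alone. The following is an illustration under stated assumptions, not the framework's implementation: the class name VariantLabelSketch, the from_string() constructor, and the exact regex are hypothetical, and only the field name realization_index is taken from the text above.

```python
import dataclasses
import re
from typing import Optional

# Hypothetical pattern: every index is optional, so "r1i2f3" is accepted.
_VARIANT_RE = re.compile(
    r"(r(?P<realization_index>\d+))?"
    r"(i(?P<initialization_index>\d+))?"
    r"(p(?P<physics_index>\d+))?"
    r"(f(?P<forcing_index>\d+))?"
)


@dataclasses.dataclass
class VariantLabelSketch:
    """Container for the integer indices encoded in a variant label."""
    realization_index: Optional[int] = None
    initialization_index: Optional[int] = None
    physics_index: Optional[int] = None
    forcing_index: Optional[int] = None

    @classmethod
    def from_string(cls, label):
        match = _VARIANT_RE.fullmatch(label)
        if match is None:
            raise ValueError(f"Can't parse variant label {label!r}")
        # groupdict() only contains the *named* groups; drop those that
        # didn't participate in the match and convert the rest to int.
        fields = {k: int(v) for k, v in match.groupdict().items() if v is not None}
        return cls(**fields)


label = VariantLabelSketch.from_string("r1i2f3")
print(label.realization_index)
```

In the framework itself, the regex matching and the string-to-int conversion are handled by the decorators described in the following sections rather than by a hand-written constructor.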
Regexes¶
The RegexPattern class extends the functionality of
the standard library’s re.Pattern: it wraps a regex where all
capture groups are named, and provides a dict-like interface to the values
captured by those groups upon a successful match.
A ChainedRegexPattern is instantiated from multiple
RegexPatterns: given an input string, it tries
parsing it with each RegexPattern in order, stopping at the first one that
matches successfully. In other words, this is a convenience wrapper for taking
the logical ‘or’ of the RegexPatterns, which is cumbersome to do at the level of
the regexes themselves.
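The "first match wins" behavior can be sketched as follows. This is an illustration only: the class name ChainedPatternSketch, its match() method, and the two date patterns are assumptions and don't reproduce the interface of the framework's classes.

```python
import re


class ChainedPatternSketch:
    """Try several regexes (all with named groups) in order; first match wins."""

    def __init__(self, *patterns):
        self.patterns = [re.compile(p) for p in patterns]

    def match(self, string):
        for pattern in self.patterns:
            m = pattern.fullmatch(string)
            if m is not None:
                # Return the captured values of the first successful match.
                return m.groupdict()
        raise ValueError(f"No pattern matched {string!r}")


# Hypothetical usage: accept either a range of years or a single year.
dates = ChainedPatternSketch(
    r"(?P<start>\d{4})-(?P<end>\d{4})",
    r"(?P<start>\d{4})",
)
print(dates.match("1990-1999"))
print(dates.match("1990"))
```

Writing the same alternation as a single regex would require repeating the group name start, which the re module doesn't allow within one pattern; that is one reason chaining separate patterns is more convenient.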
Dataclasses¶
We implement the src.util.dataclass.mdtf_dataclass() decorator to smooth
over the following rough edges in the standard library implementation of
dataclasses; otherwise, usage is unchanged from dataclasses.dataclass().
Re-implement checking for mandatory fields. Standard library dataclasses allow for both mandatory and optional fields, but the optional fields must be declared after the mandatory ones, which breaks when dataclasses inherit from other dataclasses (the parent class’s fields are declared first in the auto-generated __init__ method). Our workaround is to always declare fields as optional (in the context of the standard library’s dataclass that we’re wrapping) and to denote those that are meant to be mandatory with a default sentinel value.
Perform type coercion on instance creation (after the class’s __init__ and __post_init__). Python is a dynamically (“duck”) typed language and does not enforce type annotations at runtime, which won’t do for our use case: the field values returned by the regex will all be strings, and we want to coerce these to ints, dates, etc. using the pre-existing dataclass type annotation syntax. The logic for doing so is in _mdtf_dataclass_typecheck(): implementing full type awareness (as done by mypy or similar projects) is far beyond our scope, so this only does coercion on the simplest cases that actually arise in practice and throws a DataclassParseError if it encounters anything it can’t understand.
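Both workarounds can be illustrated with plain standard library dataclasses. This is a minimal sketch under stated assumptions: the MANDATORY sentinel, the check_and_coerce() helper and the example classes are hypothetical, a plain TypeError stands in for the framework's exception, and for brevity the check is run from __post_init__ rather than after it.

```python
import dataclasses

MANDATORY = object()  # hypothetical sentinel marking "no default; must be set"


def check_and_coerce(obj):
    """Reject unset mandatory fields, then coerce values to the annotated type."""
    for f in dataclasses.fields(obj):
        value = getattr(obj, f.name)
        if value is MANDATORY:
            raise TypeError(f"Missing mandatory field {f.name!r}")
        # Only handle the simplest case: the annotation is a plain class.
        if isinstance(f.type, type) and not isinstance(value, f.type):
            setattr(obj, f.name, f.type(value))


@dataclasses.dataclass
class Parent:
    name: str = MANDATORY
    comment: str = ""        # a genuinely optional field

    def __post_init__(self):
        check_and_coerce(self)


@dataclasses.dataclass
class Child(Parent):
    # Declaring this without a default would fail at class creation, because
    # it follows the parent's optional field; the sentinel avoids that.
    year: int = MANDATORY


c = Child(name="tas", year="1990")   # the string is coerced to int
print(c.year + 1)
```

Omitting a mandatory field, as in Child(name="tas"), raises the TypeError from check_and_coerce() instead of silently storing the sentinel.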
“Regex dataclasses”¶
The regex and dataclass functionalities described above are combined using the
regex_dataclass() decorator. It takes a RegexPattern instance as its argument
and decorates an mdtf_dataclass. Its main function is to wrap the
auto-generated __init__ method so that the mdtf_dataclass can be instantiated
by parsing a string with the RegexPattern.
Extra effort is needed to make this work properly under composition (i.e., if
the types of one or more of the fields of the current regex_dataclass are also
regex_dataclasses). This is mainly done in
_regex_dataclass_preprocess_kwargs(): we parse the
constituent regex_dataclasses in depth-first order, and keep track of their
field assignments in a ConsistentDict, which throws an
exception if we try to alter a previously defined value.
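The role of the ConsistentDict can be sketched as a dict subclass that refuses conflicting reassignment. This is an illustration only: the class name, the choice of ValueError, and the decision to allow re-assigning an identical value are assumptions, not a description of the framework's class.

```python
class ConsistentDictSketch(dict):
    """Dict that raises if an existing key is assigned a different value."""

    def __setitem__(self, key, value):
        if key in self and self[key] != value:
            raise ValueError(
                f"Tried to change {key!r} from {self[key]!r} to {value!r}"
            )
        super().__setitem__(key, value)


# Hypothetical field assignments gathered while parsing nested classes.
fields = ConsistentDictSketch()
fields["realization_index"] = 1
fields["realization_index"] = 1      # same value again: accepted
try:
    fields["realization_index"] = 2  # conflicting value: rejected
except ValueError as exc:
    print(exc)
```

Note that only item assignment is guarded here; dict.update() bypasses __setitem__ on dict subclasses, so a fuller version would need to override it as well.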
Other functionality¶
Interoperability between different standard library dataclasses is cumbersome:
e.g. if a dataclass has a field named id, there’s no straightforward way to
relate it to the id field on a different class, even if one inherits from the
other. We implement two functions for this purpose, which are roughly inverses
of each other.
filter_dataclass() returns a dict of the field values
in one dataclass that correspond to field names that are present in a second
dataclass. coerce_to_dataclass() creates an instance
of a given dataclass using field values specified by a second dataclass, or a
dict.
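A rough equivalent of these two operations can be written with dataclasses.fields(). The function names filter_fields() and coerce_to(), their signatures, and the example classes are hypothetical and chosen for this sketch; the framework's functions may take different arguments.

```python
import dataclasses


def filter_fields(source, target_cls):
    """Dict of source's field values whose names are also fields of target_cls."""
    target_names = {f.name for f in dataclasses.fields(target_cls)}
    return {
        f.name: getattr(source, f.name)
        for f in dataclasses.fields(source)
        if f.name in target_names
    }


def coerce_to(source, target_cls):
    """Build a target_cls instance from a dataclass instance or a dict."""
    if dataclasses.is_dataclass(source):
        values = filter_fields(source, target_cls)
    else:
        target_names = {f.name for f in dataclasses.fields(target_cls)}
        values = {k: v for k, v in source.items() if k in target_names}
    return target_cls(**values)


@dataclasses.dataclass
class FileRecord:
    id: str
    path: str
    size: int


@dataclasses.dataclass
class Summary:
    id: str
    size: int = 0


record = FileRecord(id="tas_day", path="/data/tas.nc", size=1024)
print(filter_fields(record, Summary))
print(coerce_to(record, Summary))
```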
src.util.datelabel module¶
This module implements classes for representing the date range of data sets and the frequency with which they are sampled. As the warnings in the module’s docstring should make clear, this is not intended to provide a full implementation of calendar math. The intended use case is parsing date ranges given as parts of filenames (hence “datelabel”) for the purpose of determining whether that data falls within the analysis period.
Date ranges and dates¶
Date ranges are described by the DateRange class.
This stores the two endpoints of the date range as datetime.datetime
objects, as well as a precision attribute specified by the
DatePrecision enum. DateRanges are always
closed intervals; e.g. DateRange('1990-1999') starts at 0:00 on 1 Jan
1990 and ends at 23:59 on 31 Dec 1999. In all cases, the DateRange is defined to
be the maximal range of dates consistent with the input string (i.e., the
precision with which that string was specified).
Because we retain precision information, the Date
class is implemented as a DateRange, rather than the other way around; for
example DateRange('1990') has yearly precision, so it maps to the range of
dates from 0:00 on 1 Jan 1990 to 23:59 on 31 Dec 1990.
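The "maximal range" convention can be sketched for the yearly-precision case only. The function name year_range() is hypothetical, and the sketch handles just the 'YYYY' and 'YYYY-YYYY' forms used in the examples above; it says nothing about how the framework treats finer precisions.

```python
import datetime


def year_range(label):
    """Return (start, end) datetimes spanned by 'YYYY' or 'YYYY-YYYY'."""
    parts = label.split("-")
    if len(parts) not in (1, 2):
        raise ValueError(f"Can't parse {label!r} at yearly precision")
    start_year, end_year = int(parts[0]), int(parts[-1])
    # Widen each endpoint to the full year it names (closed interval).
    start = datetime.datetime(start_year, 1, 1, 0, 0)
    end = datetime.datetime(end_year, 12, 31, 23, 59)
    return start, end


print(year_range("1990-1999"))
print(year_range("1990"))
```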
Sampling frequencies¶
The frequency with which data is sampled is represented by the
DateFrequency class, which is essentially a wrapper
for the standard library’s datetime.timedelta that provides string
parsing logic.
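As an illustration of what such string parsing might involve, the sketch below maps a few frequency labels onto datetime.timedelta. The function name parse_frequency() and the set of accepted unit strings ("hr", "day", "wk") are assumptions made for the example, not the labels the framework actually recognizes.

```python
import datetime
import re

# Hypothetical mapping from unit labels to timedelta keyword arguments.
_UNITS = {"hr": "hours", "day": "days", "wk": "weeks"}
_FREQ_RE = re.compile(r"(?P<quantity>\d*)(?P<unit>hr|day|wk)")


def parse_frequency(label):
    """Convert a label such as '6hr' or 'day' to a datetime.timedelta."""
    m = _FREQ_RE.fullmatch(label)
    if m is None:
        raise ValueError(f"Can't parse frequency {label!r}")
    quantity = int(m.group("quantity") or 1)   # a bare unit means one of it
    return datetime.timedelta(**{_UNITS[m.group("unit")]: quantity})


print(parse_frequency("6hr"))
print(parse_frequency("day"))
```

Units such as months or years have no fixed length and can't be expressed as a timedelta, which is one reason a wrapper class is needed rather than timedelta alone.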
Static data¶
The module defines FXDateRange, FXDateMin, FXDateMax
and FXDateFrequency placeholder objects to describe
static data with no time dependence. These are defined at the module level, so
they behave like singletons. Comparisons and logic with normal DateRange, Date
and DateFrequency objects work correctly.
src.util.exceptions module¶
In order to simplify the set of modules imported by other framework modules, all
framework-specific exceptions are defined in this module, regardless of context.
All framework-specific exceptions inherit from MDTFBaseException.
src.util.filesystem module¶
Functionality that touches the filesystem: path operations, searching for and loading files (note that the parsing of files is done elsewhere), and (simple) HTML templating for the src.output_manager module.
src.util.logs module¶
Functionality involving logging configuration and output. Code in this module
extends the functionality of the Python standard library logging
module, which we use for all user communication during framework operation
(instead of print() statements). Python’s built-in logging facilities are
powerful, going most of the way towards implementing an event-driven programming
paradigm within the language, but they are not very clearly documented. The
standard library’s logging tutorial is a must-read.
Configuration¶
In keeping with the framework’s philosophy of extensibility, we want to allow
users to configure logging themselves (e.g., they may want errors raised by
the MDTF package to be reported to a larger workflow engine). We do this by
simply exposing the logging module’s configuration interface to the user:
specifically, the dictConfig() schema,
with the contents of the dict serialized as a .jsonc file. We do this rather
than using the fileConfig() interface, because the
latter uses files in .ini format, and we currently use .jsonc for all other
configuration files in the package.
Specifically, the framework looks for logging configuration in a file named
logging.jsonc, as part of the MDTFFramework's __init__ method. It first looks
in the site directory specified by the
user; if no file with that name is found, it falls back to the default
configuration in src/logging.jsonc.
The contents of this file are stored in the ConfigManager and are
actually used to configure the logger by case_log_config(),
which gets called by the __init__ method of DataSourceBase.
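For readers unfamiliar with the dictConfig() schema, a minimal configuration looks like the following. This is a sketch, not the contents of the framework's logging.jsonc: the logger name mdtf_sketch, the formatter, and the handler are invented for the example, and the dict is written inline rather than read from a file.

```python
import logging
import logging.config

# Hypothetical configuration; in the framework a dict of this shape would be
# obtained by parsing the .jsonc file.
LOGGING_CONFIG = {
    "version": 1,                        # required by the dictConfig schema
    "disable_existing_loggers": False,
    "formatters": {
        "brief": {"format": "%(levelname)s: %(message)s"},
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "formatter": "brief",
            "level": "INFO",
            "stream": "ext://sys.stderr",
        },
    },
    "loggers": {
        "mdtf_sketch": {
            "level": "DEBUG",
            "handlers": ["console"],
            "propagate": False,
        },
    },
}

logging.config.dictConfig(LOGGING_CONFIG)
log = logging.getLogger("mdtf_sketch")
log.info("logging configured")
```

Because the schema is plain nested dicts, lists and strings, it serializes naturally to JSON, which is what makes the .jsonc format a convenient fit.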
Caching¶
The configuration strategy described above creates a chicken-and-egg problem, as
we need to be able to log issues that arise before the logger itself has been
configured. We do this with the MultiFlushMemoryHandler
log handler, which acts as a temporary cache: all logging events prior to
configuration are captured by this handler. Once the “real” handlers have been
configured by case_log_config(), the contents of this
handler are copied (“flushed”) to each of them in turn. This handler is set up
in the top-level script, which also calls
configure_console_loggers() to set up conventional
stdout/stderr logging destinations.
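The caching idea can be sketched with the standard library's logging.handlers.MemoryHandler, which buffers records but natively flushes to a single target only. The code below is an illustration under stated assumptions and not the MultiFlushMemoryHandler implementation: the ListHandler class and the logger name are invented, and the replay loop reads the handler's buffer attribute directly.

```python
import logging
import logging.handlers


class ListHandler(logging.Handler):
    """Hypothetical stand-in for a "real" handler; stores messages in a list."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


log = logging.getLogger("cache_sketch")
log.setLevel(logging.DEBUG)
log.propagate = False

# Buffer everything: no target, and a flush level no record can reach.
cache = logging.handlers.MemoryHandler(
    capacity=1000, flushLevel=logging.CRITICAL + 1
)
log.addHandler(cache)
log.info("message logged before configuration")

# Later, once the real handlers exist: replay the cache into each in turn.
real_handlers = [ListHandler(), ListHandler()]
for handler in real_handlers:
    for record in cache.buffer:
        handler.handle(record)
    log.addHandler(handler)
log.removeHandler(cache)
cache.close()

log.info("message logged after configuration")
print(real_handlers[0].messages)
```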
Most of the rest of the code in this module deals with formatting and
presentation of logs, e.g. MDTFHeaderFileHandler, which
writes a header with useful debugging information (such as the git commit hash)
to the log file.
src.util.processes module¶
Functionality that involves external subprocesses spawned by the framework. This
is the mechanism by which the framework calls all external executables, e.g.
tar. We implement two main functions which take the same arguments:
run_shell_command(), for running commands in a shell
environment (e.g. permitting the use of environment variables), and
run_command(), for spawning a subprocess with the
executable directly, without the overhead of starting up a shell. Both of these
are effectively convenience wrappers around the Python standard library’s
subprocess.Popen.
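The distinction between the two functions corresponds to the shell argument of subprocess.Popen, as in the sketch below. The helper names, their parameters, and their return value (a list of output lines) are assumptions made for illustration and don't reproduce the signatures of run_command() or run_shell_command().

```python
import os
import subprocess
import sys


def _run(args, shell, env=None, timeout=None):
    """Shared helper: run a command, return its stdout as a list of lines."""
    proc = subprocess.Popen(
        args, shell=shell, env=env, text=True,
        stdout=subprocess.PIPE, stderr=subprocess.PIPE,
    )
    stdout, stderr = proc.communicate(timeout=timeout)
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(
            proc.returncode, args, output=stdout, stderr=stderr
        )
    return stdout.splitlines()


def run_command_sketch(args, env=None, timeout=None):
    """Spawn the executable directly; args is a list of strings."""
    return _run(args, shell=False, env=env, timeout=timeout)


def run_shell_command_sketch(command, env=None, timeout=None):
    """Run a command string through the shell, so $VARIABLES are expanded."""
    return _run(command, shell=True, env=env, timeout=timeout)


print(run_command_sketch([sys.executable, "-c", "print('hello')"]))
if os.name == "posix":
    env = dict(os.environ, GREETING="hi")
    print(run_shell_command_sketch('echo "$GREETING"', env=env))
```

Passing a list with shell=False also avoids shell quoting problems, since each argument reaches the executable exactly as given.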
Note that, for implementation reasons, SubprocessRuntimeManager doesn’t call
run_shell_command() but instead implements its own wrapper.