src.xr_parser module

Code for normalizing metadata in xarray Datasets; see Data layer: Preprocessing.

Familiarity with the cf_xarray package, used as a third-party dependency, as well as the src.data_model module is recommended.

src.xr_parser.ATTR_NOT_FOUND = sentinel.AttrNotFound

Sentinel object serving as a placeholder for netCDF metadata attributes that are expected, but not present in the data.
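
As a minimal sketch of how such a sentinel can be used: the repr sentinel.AttrNotFound suggests, but does not confirm, that the object comes from unittest.mock.sentinel. The helper get_attr and the dict attrs below are hypothetical and not part of this module.

```python
from unittest.mock import sentinel

# Assumption: a stand-in for src.xr_parser.ATTR_NOT_FOUND.
ATTR_NOT_FOUND = sentinel.AttrNotFound


def get_attr(attrs, name):
    """Hypothetical helper: return attrs[name], or the sentinel if absent."""
    return attrs.get(name, ATTR_NOT_FOUND)


attrs = {"units": "K", "long_name": None}

# An identity check distinguishes "attribute missing" from a legitimate
# None value stored in the metadata.
units_missing = get_attr(attrs, "units") is ATTR_NOT_FOUND
std_name_missing = get_attr(attrs, "standard_name") is ATTR_NOT_FOUND
```

Using a dedicated sentinel rather than None keeps "the attribute is absent" separate from "the attribute is present but empty".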

class src.xr_parser.PlaceholderScalarCoordinate(name: str, axis: str, standard_name: str = sentinel.AttrNotFound, units: str = sentinel.AttrNotFound)[source]

Bases: object

Dummy object used to describe scalar coordinates referred to by name only in the ‘coordinates’ attribute of a variable or dataset. We do this so that the attributes match those of coordinates represented by real netCDF Variables.

name: str
axis: str
standard_name: str = sentinel.AttrNotFound
units: str = sentinel.AttrNotFound
__init__(name: str, axis: str, standard_name: str = sentinel.AttrNotFound, units: str = sentinel.AttrNotFound) → None

Initialize self. See help(type(self)) for accurate signature.

__post_init__(*args, **kwargs)
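
The following sketch re-creates only the documented fields, assuming the class behaves like a standard dataclass (the presence of __post_init__ hints at this, but the real class may use a framework-specific decorator). The coordinate name "plev" is illustrative.

```python
import dataclasses
from unittest.mock import sentinel

ATTR_NOT_FOUND = sentinel.AttrNotFound  # stand-in for the module's sentinel


@dataclasses.dataclass
class PlaceholderScalarCoordinate:
    """Illustrative re-creation of the documented fields only."""
    name: str
    axis: str
    standard_name: str = ATTR_NOT_FOUND
    units: str = ATTR_NOT_FOUND


# A scalar coordinate named in a 'coordinates' attribute, with no netCDF
# variable behind it: only its name and axis are known.
placeholder = PlaceholderScalarCoordinate(name="plev", axis="Z")
```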
src.xr_parser.patch_cf_xarray_accessor(mod)[source]

Monkey-patches _get_axis_coord, a module-level function in cf_xarray, to obtain desired axis-to-coordinate lookup behavior. Specifically, if a variable has been recognized as one of the coordinates in the dict above and no variable has been set as the corresponding axis, recognize the variable as that axis as well. See discussion at https://github.com/xarray-contrib/cf-xarray/issues/23.

class src.xr_parser.MDTFCFAccessorMixin[source]

Bases: object

Properties we add to both xarray Dataset and DataArray objects via the accessor extension mechanism.

property is_static

Returns bool according to whether the Dataset/DataArray has/is a time coordinate.

property calendar

Reads ‘calendar’ attribute on time axis (intended to have been set by DefaultDatasetParser.normalize_calendar()). Returns None if no time axis.

property dim_axes_set

Returns a frozenset of names of axes which are dimension coordinates.

property axes_set

Returns a frozenset of all axes names.

__init__()

Initialize self. See help(type(self)) for accurate signature.

class src.xr_parser.MDTFCFDatasetAccessorMixin[source]

Bases: src.xr_parser.MDTFCFAccessorMixin

Methods we add for xarray Dataset objects via the accessor extension mechanism.

scalar_coords(var_name=None)[source]

Return a list of the Dataset variable objects corresponding to scalar coordinates on the entire Dataset, or on var_name if given. If a coordinate was defined as an attribute only, return its name in a PlaceholderScalarCoordinate object instead.

get_scalar(ax_name, var_name=None)[source]

If the axis label ax_name is a scalar coordinate, return the corresponding xarray DataArray (or PlaceholderScalarCoordinate), otherwise return None. Applies to the entire Dataset, or to var_name if given.

axes(var_name=None, filter_set=None)[source]

Override cf_xarray accessor behavior (from _old_axes_dict()).

Parameters
  • var_name (optional) – If supplied, return a dict containing the subset of coordinates used by the dependent variable var_name, instead of all coordinates in the dataset.

  • filter_set (optional) – Optional iterable of coordinate names. If supplied, restrict the returned dict to coordinates in filter_set.

Returns

Dict mapping axis labels to lists of the Dataset variables themselves, instead of their names.

dim_axes(var_name=None)[source]

Override cf_xarray accessor behavior by having values of the ‘axes’ dict be the Dataset variables themselves, instead of their names.

__init__()

Initialize self. See help(type(self)) for accurate signature.

property axes_set

Returns a frozenset of all axes names.

property calendar

Reads ‘calendar’ attribute on time axis (intended to have been set by DefaultDatasetParser.normalize_calendar()). Returns None if no time axis.

property dim_axes_set

Returns a frozenset of names of axes which are dimension coordinates.

property is_static

Returns bool according to whether the Dataset/DataArray has/is a time coordinate.

class src.xr_parser.MDTFDataArrayAccessorMixin[source]

Bases: src.xr_parser.MDTFCFAccessorMixin

Methods we add for xarray DataArray objects via the accessor extension mechanism.

dim_axes()[source]

Map axes labels to the (unique) coordinate variable name, instead of a list of names as in cf_xarray. Filter on dimension coordinates only (eliminating any scalar coordinates.)

axes()[source]

Map axes labels to the (unique) coordinate variable name, instead of a list of names as in cf_xarray.

property formula_terms

Returns dict of (name in formula: name in dataset) pairs parsed from formula_terms attribute. If attribute not present, returns empty dict.
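
A minimal sketch of parsing a CF-style formula_terms string into such a dict. The function name parse_formula_terms and the sample variable names are hypothetical; the accessor's own parsing may differ in detail.

```python
def parse_formula_terms(attr_value):
    """Parse 'term: variable term: variable ...' into {term: variable}.

    Hypothetical helper; an empty or missing attribute gives an empty dict.
    """
    if not attr_value:
        return {}
    tokens = attr_value.split()
    # Tokens alternate between 'term:' and the dataset variable name.
    return {
        term.rstrip(":"): var_name
        for term, var_name in zip(tokens[0::2], tokens[1::2])
    }


terms = parse_formula_terms("a: hyam b: hybm ps: PS")
```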

__init__()

Initialize self. See help(type(self)) for accurate signature.

property axes_set

Returns a frozenset of all axes names.

property calendar

Reads ‘calendar’ attribute on time axis (intended to have been set by DefaultDatasetParser.normalize_calendar()). Returns None if no time axis.

property dim_axes_set

Returns a frozenset of names of axes which are dimension coordinates.

property is_static

Returns bool according to whether the Dataset/DataArray has/is a time coordinate.

class src.xr_parser.MDTFCFDatasetAccessor[source]

Bases: src.xr_parser.MDTFCFDatasetAccessorMixin, object

Accessor that’s registered (under the attribute cf) for xarray Datasets. Combines methods in MDTFCFDatasetAccessorMixin and the cf_xarray Dataset accessor.

__init__()

Initialize self. See help(type(self)) for accurate signature.

axes(var_name=None, filter_set=None)

Override cf_xarray accessor behavior (from _old_axes_dict()).

Parameters
  • var_name (optional) – If supplied, return a dict containing the subset of coordinates used by the dependent variable var_name, instead of all coordinates in the dataset.

  • filter_set (optional) – Optional iterable of coordinate names. If supplied, restrict the returned dict to coordinates in filter_set.

Returns

Dict mapping axis labels to lists of the Dataset variables themselves, instead of their names.

property axes_set

Returns a frozenset of all axes names.

property calendar

Reads ‘calendar’ attribute on time axis (intended to have been set by DefaultDatasetParser.normalize_calendar()). Returns None if no time axis.

dim_axes(var_name=None)

Override cf_xarray accessor behavior by having values of the ‘axes’ dict be the Dataset variables themselves, instead of their names.

property dim_axes_set

Returns a frozenset of names of axes which are dimension coordinates.

get_scalar(ax_name, var_name=None)

If the axis label ax_name is a scalar coordinate, return the corresponding xarray DataArray (or PlaceholderScalarCoordinate), otherwise return None. Applies to the entire Dataset, or to var_name if given.

property is_static

Returns bool according to whether the Dataset/DataArray has/is a time coordinate.

scalar_coords(var_name=None)

Return a list of the Dataset variable objects corresponding to scalar coordinates on the entire Dataset, or on var_name if given. If a coordinate was defined as an attribute only, return its name in a PlaceholderScalarCoordinate object instead.

class src.xr_parser.MDTFCFDataArrayAccessor[source]

Bases: src.xr_parser.MDTFDataArrayAccessorMixin, object

Accessor that’s registered (under the attribute cf) for xarray DataArrays. Combines methods in MDTFDataArrayAccessorMixin and the cf_xarray DataArray accessor.

__init__()

Initialize self. See help(type(self)) for accurate signature.

axes()

Map axes labels to the (unique) coordinate variable name, instead of a list of names as in cf_xarray.

property axes_set

Returns a frozenset of all axes names.

property calendar

Reads ‘calendar’ attribute on time axis (intended to have been set by DefaultDatasetParser.normalize_calendar()). Returns None if no time axis.

dim_axes()

Map axes labels to the (unique) coordinate variable name, instead of a list of names as in cf_xarray. Filter on dimension coordinates only (eliminating any scalar coordinates.)

property dim_axes_set

Returns a frozenset of names of axes which are dimension coordinates.

property formula_terms

Returns dict of (name in formula: name in dataset) pairs parsed from formula_terms attribute. If attribute not present, returns empty dict.

property is_static

Returns bool according to whether the Dataset/DataArray has/is a time coordinate.

class src.xr_parser.DefaultDatasetParser(data_mgr, pod)[source]

Bases: object

Class containing MDTF-specific methods for cleaning and normalizing xarray metadata.

Top-level methods are parse() and get_unmapped_names().

__init__(data_mgr, pod)[source]

Constructor.

Parameters
  • data_mgr – DataSource instance calling the preprocessor.

  • pod (Diagnostic) – POD whose variables are being preprocessed.

setup(data_mgr, pod)[source]

Hook for use by child classes (currently unused) to do additional configuration immediately before parse() is called on each variable for pod.

Parameters
  • data_mgr – DataSource instance calling the preprocessor.

  • pod (Diagnostic) – POD whose variables are being preprocessed.

guess_attr(attr_desc, attr_name, options, default=None, comparison_func=None)[source]

Select and return the element of options equal to attr_name. If none are equal, try a case-insensitive string match.

Parameters
  • attr_desc (str) – Description of the attribute (only used for log messages.)

  • attr_name (str) – Expected name of the attribute.

  • options (iterable of str) – Attribute names that are present in the data.

  • default (str, default None) – If supplied, default value to return if no match.

  • comparison_func (optional, default None) – String comparison function to use.

Raises

KeyError – if no element of options can be coerced to match attr_name.

Returns

Element of options matching attr_name.
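
The matching order described above can be sketched as follows. This is an illustration under assumptions, not the method itself: it is a free function without the attr_desc logging argument, and treating multiple case-insensitive matches as "no match" is a guess.

```python
def guess_attr_sketch(attr_name, options, default=None, comparison_func=None):
    """Return the element of options matching attr_name (illustrative only)."""
    if comparison_func is None:
        # Assumed fallback comparison: case-insensitive equality.
        comparison_func = lambda a, b: a.lower() == b.lower()
    options = list(options)
    # 1. Prefer an exact match.
    if attr_name in options:
        return attr_name
    # 2. Otherwise accept a single match under the looser comparison.
    matches = [opt for opt in options if comparison_func(opt, attr_name)]
    if len(matches) == 1:
        return matches[0]
    # 3. Fall back to the default, or signal failure.
    if default is not None:
        return default
    raise KeyError(attr_name)


match = guess_attr_sketch("units", ["Units", "long_name"])
```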

normalize_attr(new_attr_d, d, key_name, key_startswith=None)[source]

Sets the value in dict d corresponding to the key key_name.

If key_name is in d, no changes are made. If key_name is not in d, we check possible nonstandard representations of the key (case-insensitive match via guess_attr() and whether the key starts with the string key_startswith.) If no match is found for key_name, its value is set to the sentinel value ATTR_NOT_FOUND.

Parameters
  • new_attr_d (dict) – dict to store all found attributes. We don’t change attributes on d here, since that can interfere with xarray.decode_cf(), but instead modify this dict in place and pass it to restore_attrs() so they can be set once that’s done.

  • d (dict) – dict of Dataset attributes, whose keys are to be searched for key_name.

  • key_name (str) – Expected name of the key.

  • key_startswith (optional, str) – If provided and if key_name isn’t found in d, a key starting with this string will be accepted instead.
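
A sketch of the lookup order described above, written as a free function over plain dicts. The name normalize_attr_sketch is hypothetical, and making the prefix match case-insensitive is an assumption.

```python
from unittest.mock import sentinel

ATTR_NOT_FOUND = sentinel.AttrNotFound  # stand-in for the module's sentinel


def normalize_attr_sketch(new_attr_d, d, key_name, key_startswith=None):
    """Record the value for key_name in new_attr_d; d is never modified."""
    if key_name in d:
        return  # already in standard form: nothing to record
    # 1. Case-insensitive match on the full key.
    for k in d:
        if k.lower() == key_name.lower():
            new_attr_d[key_name] = d[k]
            return
    # 2. Optional prefix match (case-insensitivity assumed here).
    if key_startswith is not None:
        for k in d:
            if k.lower().startswith(key_startswith.lower()):
                new_attr_d[key_name] = d[k]
                return
    # 3. Nothing matched: mark the attribute as expected but absent.
    new_attr_d[key_name] = ATTR_NOT_FOUND


d = {"Units": "K", "calendar_type": "noleap"}
found = {}
normalize_attr_sketch(found, d, "units")
normalize_attr_sketch(found, d, "calendar", key_startswith="cal")
normalize_attr_sketch(found, d, "standard_name")
```

Writing results to a separate dict leaves d untouched until decoding has finished, which is the design motivation given for new_attr_d.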

normalize_calendar(attr_d)[source]

Finds the calendar attribute, if present, and normalizes it to one of the values in the CF standard before xarray.decode_cf() decodes the time axis.

normalize_pre_decode(ds)[source]

Initial munging of xarray Dataset attribute dicts, before any parsing by xarray.decode_cf() or the cf_xarray accessor.

restore_attrs_backup(ds)[source]

xarray.decode_cf() and other functions appear to un-set some of the attributes defined in the netCDF file. Restore them from the backups made in munge_ds_attrs(), but only if the attribute was deleted.

normalize_standard_name(new_attr_d, attr_d)[source]

Method for munging standard_name attribute prior to parsing.

normalize_unit(new_attr_d, attr_d)[source]

Hook to convert unit strings to values that are correctly parsed by cfunits/UDUnits2. Currently we handle the case where “mb” is interpreted as “millibarn”, a unit of area (see UDUnits mailing list.) New cases of incorrectly parsed unit strings can be added here as they are discovered.
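
A minimal sketch of this kind of unit-string fix. It operates on a bare string rather than the (new_attr_d, attr_d) pair the method takes, and the table name UNIT_REPLACEMENTS and the replacement string "millibar" are assumptions for illustration.

```python
# Hypothetical lookup table of unit strings that are commonly misparsed.
UNIT_REPLACEMENTS = {
    "mb": "millibar",  # otherwise read as "millibarn", a unit of area
}


def normalize_unit_sketch(unit_str):
    """Return a unit string that parses as intended (illustrative only)."""
    cleaned = unit_str.strip()
    return UNIT_REPLACEMENTS.get(cleaned, cleaned)


fixed = normalize_unit_sketch("mb")
```

New problem cases would be handled by adding entries to the table.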

normalize_dependent_var(var, ds)[source]

Use heuristics to determine the name of the dependent variable from among all the variables in the Dataset ds, if the name doesn’t match the value we expect in our_var.

normalize_metadata(var, ds)[source]

Normalize name, standard_name and units attributes after the decode_cf and cf_xarray setup steps have run and the metadata dict has been restored, since those steps don’t touch these metadata attributes.

compare_attr(our_attr_tuple, ds_attr_tuple, comparison_func=None, fill_ours=True, fill_ds=False, overwrite_ours=None)[source]

Worker function to compare two attributes (on our_var, the framework’s record, and on ds, the “ground truth” of the dataset) and update one in the event of disagreement.

This handles the special cases where the attribute isn’t defined on our_var or ds.

Parameters
  • our_attr_tuple – tuple specifying the attribute on our_var

  • ds_attr_tuple – tuple specifying the same attribute on ds

  • comparison_func – function of two arguments to use to compare the attributes; defaults to __eq__.

  • fill_ours (bool) – If the attr on our_var is missing, fill it in with the value from ds.

  • fill_ds (bool) – If the attr on ds is missing, fill it in with the value from our_var.

  • overwrite_ours (bool) –

    Action to take if both attrs are defined but have different values:

    • None (default): Update our_var if fill_ours is True, but in any case raise a MetadataEvent.

    • True: Change our_var to match ds.

    • False: Change ds to match our_var.
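
The decision rules above can be sketched as a pure function over two values. This is an illustration under assumptions: compare_sketch and MISSING are hypothetical names, the real method works on attribute tuples and objects, and the returned mismatch flag stands in for raising a MetadataEvent.

```python
MISSING = object()  # hypothetical marker for "attribute not defined"


def compare_sketch(ours, ds, comparison_func=None, fill_ours=True,
                   fill_ds=False, overwrite_ours=None):
    """Return (new_ours, new_ds, mismatch) after reconciling two values."""
    if comparison_func is None:
        comparison_func = lambda a, b: a == b
    if ours is MISSING and ds is MISSING:
        return ours, ds, False  # nothing to compare (handling assumed)
    if ours is MISSING:
        return (ds if fill_ours else ours), ds, False
    if ds is MISSING:
        return ours, (ours if fill_ds else ds), False
    if comparison_func(ours, ds):
        return ours, ds, False
    # Both defined, but the values disagree.
    if overwrite_ours is None:
        return (ds if fill_ours else ours), ds, True  # flag the mismatch
    if overwrite_ours:
        return ds, ds, False  # change our record to match the dataset
    return ours, ours, False  # change the dataset to match our record


result = compare_sketch("K", "degC")
```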

reconcile_name(our_var, ds_var_name, overwrite_ours=None)[source]

Reconcile the name of the variable between the ‘ground truth’ of the dataset we downloaded (ds_var) and our expectations based on the model’s convention (our_var).

reconcile_attr(our_var, ds_var, our_attr_name, ds_attr_name=None, **kwargs)[source]

Compare attribute of a DMVariable (our_var) with what’s set in the xarray.Dataset (ds_var).

reconcile_names(our_var, ds, ds_var_name, overwrite_ours=None)[source]

Reconcile the name and standard_name attributes between the ‘ground truth’ of the dataset we downloaded (ds_var_name) and our expectations based on the model’s convention (our_var).

Parameters
  • our_var (TranslatedVarlistEntry) – Expected attributes of the dataset variable, according to the data request.

  • ds – xarray Dataset.

  • ds_var_name (str) – Name of the variable in ds we expect to correspond to our_var.

  • overwrite_ours (bool, default False) – If True, always update the name of our_var to what’s found in ds.

reconcile_units(our_var, ds_var)[source]

Reconcile the units attribute between the ‘ground truth’ of the dataset we downloaded (ds_var) and our expectations based on the model’s convention (our_var).

Parameters
  • our_var (TranslatedVarlistEntry) – Expected attributes of the dataset variable, according to the data request.

  • ds_var – xarray DataArray.

reconcile_time_units(our_var, ds_var)[source]

Special case of reconcile_units() for the time variable. In normal operation we don’t know (or need to know) the calendar or reference date (for time units of the form ‘days since 1970-01-01’), so it’s OK to set these from the dataset.

Parameters
  • our_var (TranslatedVarlistEntry) – Expected attributes of the dataset variable, according to the data request.

  • ds_var – xarray DataArray.

reconcile_scalar_value_and_units(our_var, ds_var)[source]

Compare scalar coordinate value of a DMVariable (our_var) with what’s set in the xarray.Dataset (ds_var). If there’s a discrepancy, log an error but change the entry in our_var.

reconcile_coord_bounds(our_coord, ds, ds_coord_name)[source]

Reconcile standard_name and units attributes between the ‘ground truth’ of the dataset we downloaded (ds_var_name) and our expectations based on the model’s convention (our_var), for the bounds on the dimension coordinate our_coord.

reconcile_dimension_coords(our_var, ds)[source]

Reconcile name, standard_name and units attributes between the ‘ground truth’ of the dataset we downloaded (ds_var_name) and our expectations based on the model’s convention (our_var), for all dimension coordinates used by our_var.

Parameters
  • our_var (TranslatedVarlistEntry) – Expected attributes of the dataset variable, according to the data request.

  • ds – xarray Dataset.

reconcile_scalar_coords(our_var, ds)[source]

Reconcile name, standard_name and units attributes between the ‘ground truth’ of the dataset we downloaded (ds_var_name) and our expectations based on the model’s convention (our_var), for all scalar coordinates used by our_var.

Parameters
  • our_var (TranslatedVarlistEntry) – Expected attributes of the dataset variable, according to the data request.

  • ds – xarray Dataset.

reconcile_variable(var, ds)[source]

Top-level method for the MDTF-specific dataset validation: attempts to reconcile name, standard_name and units attributes for the variable and coordinates in translated_var (our expectation, based on the DataSource’s naming convention) with attributes actually present in the Dataset ds.

check_calendar(ds)[source]

Checks the ‘calendar’ attribute has been set correctly for time-dependent data (assumes CF conventions).

Sets the “calendar” attr on the time coordinate, if it exists, in order to be read by the calendar property defined in the cf_xarray accessor.

check_metadata(ds_var, *attr_names)[source]

Wrapper for normalize_attr(), specialized to the case of getting a variable’s standard_name.

check_ds_attrs(var, ds)[source]

Final checking of xarray Dataset attribute dicts before starting functions in src.preprocessor.

Only checks attributes on the dependent variable var and its coordinates: any other netCDF variables in the file are ignored.

parse(var, ds)[source]

Calls the above metadata parsing functions in the intended order; intended to be called immediately after the Dataset ds is opened.

Note

decode_cf=False should be passed to the xarray open_dataset method, since that parsing is done here instead.

  • Calls normalize_pre_decode() to do basic cleaning of metadata attributes.

  • Calls xarray’s decode_cf, using cftime to decode CF-compliant date/time axes.

  • Assigns axis labels to dimension coordinates using cf_xarray.

  • Verifies that calendar is set correctly (check_calendar()).

  • Reconciles metadata in var and ds (reconcile_* methods).

  • Verifies that the name, standard_name and units for the variable and its coordinates are set correctly (check_* methods).

Parameters
  • var (VarlistEntry) – VarlistEntry describing metadata we expect to find in ds.

  • ds (Dataset) – xarray Dataset of locally downloaded model data.

Returns

ds, with data unchanged but metadata normalized to expected values. Except in specific cases, attributes of var are updated to reflect the ‘ground truth’ of data in ds.

static get_unmapped_names(ds)[source]

Get a dict whose keys are variable or attribute names referred to by variables in the Dataset ds, but not present in the dataset itself.

Returns

Values of the dict are sets of names of variables in the dataset that referred to the missing name (keys).

Return type

(dict)
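
To illustrate the shape of the returned dict, here is a sketch over plain dicts of variable attributes instead of an xarray Dataset. The function name unmapped_names_sketch is hypothetical, and which attributes are scanned for references ("coordinates" and "bounds" here) is an assumption.

```python
from collections import defaultdict


def unmapped_names_sketch(variables, ref_attrs=("coordinates", "bounds")):
    """Map each missing name to the set of variables that refer to it.

    variables: dict of variable name -> dict of that variable's attributes.
    """
    missing = defaultdict(set)
    for var_name, attrs in variables.items():
        for attr in ref_attrs:
            # Referring attributes hold space-separated variable names.
            for ref in str(attrs.get(attr, "")).split():
                if ref not in variables:
                    missing[ref].add(var_name)
    return dict(missing)


variables = {
    "tas": {"coordinates": "height lat"},
    "lat": {"bounds": "lat_bnds"},
}
unmapped = unmapped_names_sketch(variables)
```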