src.preprocessor module

Functionality for transforming model data into the format expected by PODs once it’s been downloaded; see Data layer: Preprocessing.

src.preprocessor.copy_as_alternate(old_v, data_mgr, **kwargs)[source]

Wrapper for replace() that creates a copy of an existing VarlistEntry old_v and sets appropriate attributes to designate it as an alternate variable.

src.preprocessor.edit_request_wrapper(wrapped_edit_request_func)[source]

Decorator implementing the most typical (so far) use case for PreprocessorFunctionBase.edit_request(), in which we look at each variable request in the varlist separately and, optionally, add a new alternate VarlistEntry based on that request.

This decorator wraps a function which either constructs and returns the desired new alternate VarlistEntry, or returns None if no alternates are to be added for the given variable request. It adds logic for updating the list of alternates for the pod’s varlist.
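The decorator pattern described above can be sketched as follows. This is only an illustration of the idea, not the package's implementation: `Request`, `Pod`, the `alternates` list, the `varlist` attribute, and `edit_request_sketch` are hypothetical stand-ins invented for this example.

```python
import functools
from dataclasses import dataclass, field


@dataclass
class Request:
    """Hypothetical stand-in for a VarlistEntry."""
    name: str
    alternates: list = field(default_factory=list)


@dataclass
class Pod:
    """Hypothetical stand-in for a POD holding a list of variable requests."""
    varlist: list = field(default_factory=list)


def edit_request_sketch(wrapped_func):
    """Call wrapped_func on each request and record any alternate it returns."""
    @functools.wraps(wrapped_func)
    def wrapper(self, data_mgr, pod):
        for v in pod.varlist:
            new_v = wrapped_func(self, v, pod, data_mgr)
            if new_v is not None:
                # The wrapped function proposed a fallback for this request.
                v.alternates.append(new_v)
    return wrapper


class RateToFluxSketch:
    @edit_request_sketch
    def edit_request(self, v, pod, data_mgr):
        # Only propose an alternate for requests this function can transform.
        if v.name == "precip_rate":
            return Request(name="precip_flux")
        return None


pod = Pod(varlist=[Request("precip_rate"), Request("tas")])
RateToFluxSketch().edit_request(None, pod)
```

In this sketch the wrapped method takes `(v, pod, data_mgr)` while the decorated method is called with `(data_mgr, pod)`, mirroring the two signatures that appear in this module's documentation.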

class src.preprocessor.PreprocessorFunctionBase(data_mgr, pod)[source]

Bases: abc.ABC

Abstract interface for implementing a specific preprocessing functionality. We prefer to put each set of operations in its own child class, rather than dumping everything into a general Preprocessor class, in order to keep the logic easier to follow.

It’s up to individual Preprocessor child classes to select which functions to use, and in what order to perform them.
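As a rough illustration of this design, the sketch below defines an abstract function interface and a preprocessor that applies a chosen tuple of functions in order. All class names here (`FunctionBaseSketch`, `Double`, `AddOne`, `PreprocessorSketch`) and the use of a plain dict as the dataset are assumptions made for the example.

```python
import abc


class FunctionBaseSketch(abc.ABC):
    """Hypothetical analogue of the abstract interface described above."""

    def edit_request(self, data_mgr, pod):
        # Default: this function adds no fallback requests.
        return None

    @abc.abstractmethod
    def process(self, var, dataset):
        """Return the transformed dataset."""


class Double(FunctionBaseSketch):
    def process(self, var, dataset):
        return {k: [2 * x for x in vals] for k, vals in dataset.items()}


class AddOne(FunctionBaseSketch):
    def process(self, var, dataset):
        return {k: [x + 1 for x in vals] for k, vals in dataset.items()}


class PreprocessorSketch:
    # The child class chooses which functions run, and in what order.
    functions = (Double, AddOne)

    def process(self, var, dataset):
        for func_cls in self.functions:
            dataset = func_cls().process(var, dataset)
        return dataset


result = PreprocessorSketch().process("pr", {"pr": [1, 2, 3]})
```

Because each operation lives in its own small class, reordering or omitting a step only requires changing the `functions` tuple.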

edit_request(data_mgr, pod)[source]

Edit the data requested in pod’s Varlist queue, based on the transformations the functionality can perform. If the function can transform data in format X to format Y and the POD requests X, this method should insert a backup/fallback request for Y.

abstract process(var, dataset)[source]

Apply functionality to the input dataset.

Parameters
  • var (VarlistEntry) – POD varlist entry instance describing POD’s data request, which is the desired end result of preprocessing work.

  • dataset – xarray.Dataset instance.

class src.preprocessor.CropDateRangeFunction(data_mgr, pod)[source]

Bases: src.preprocessor.PreprocessorFunctionBase

A PreprocessorFunctionBase class which trims the time axis of the dataset to the user-requested analysis period.

static cast_to_cftime(dt, calendar)[source]

Workaround to cast the Python datetime dt to a cftime.datetime with the given calendar. The Python standard library datetime has no support for different calendars.

process(var, ds)[source]

Parse quantities related to the calendar for time-dependent data. In particular, date_range was set from user input before we knew the model’s calendar. As a workaround, those values are cast to cftime.datetime objects here so that they can be compared with the model data’s time axis.

edit_request(data_mgr, pod)

Edit the data requested in pod’s Varlist queue, based on the transformations the functionality can perform. If the function can transform data in format X to format Y and the POD requests X, this method should insert a backup/fallback request for Y.

class src.preprocessor.PrecipRateToFluxFunction(data_mgr, pod)[source]

Bases: src.preprocessor.PreprocessorFunctionBase

Convert the dependent variable of var between precipitation rate and precipitation flux, so that a request for one of these quantities can be satisfied by data that provides the other.

edit_request(v, pod, data_mgr)[source]

Edit the POD’s Varlist prior to query. If the standard_name of v is one that this function recognizes, insert an alternate varlist entry whose translation requests the complementary type of variable (i.e., if given a rate, add an entry for the flux; if given a flux, add an entry for the rate).
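The rate/flux relationship behind this alternate request can be illustrated with a short sketch. It assumes liquid water with a density of 1000 kg m-3, so that a rate of 1 m s-1 corresponds to a mass flux of 1000 kg m-2 s-1; the function and constant names are hypothetical and are not taken from the module.

```python
WATER_DENSITY = 1000.0  # kg m-3; assumed density of liquid water


def rate_to_flux(rate_m_per_s):
    """Convert precipitation rate (m s-1) to mass flux (kg m-2 s-1)."""
    return rate_m_per_s * WATER_DENSITY


def flux_to_rate(flux_kg_per_m2_s):
    """Convert precipitation mass flux (kg m-2 s-1) to rate (m s-1)."""
    return flux_kg_per_m2_s / WATER_DENSITY


# 1 mm/day expressed as a rate in m s-1, then converted to a flux.
one_mm_per_day = 1.0e-3 / 86400.0
flux = rate_to_flux(one_mm_per_day)
```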

process(var, ds)[source]

Apply functionality to the input dataset.

Parameters
  • var (VarlistEntry) – POD varlist entry instance describing POD’s data request, which is the desired end result of preprocessing work.

  • dataset – xarray.Dataset instance.

class src.preprocessor.ConvertUnitsFunction(data_mgr, pod)[source]

Bases: src.preprocessor.PreprocessorFunctionBase

Convert units on the dependent variable of var, as well as its (non-time) dimension coordinate axes, from what’s specified in the dataset attributes to what’s given in the VarlistEntry.

process(var, ds)[source]

Convert units on the dependent variable and coordinates of var from what’s specified in the dataset attributes to what’s given in the VarlistEntry var. Units attributes are updated on the TranslatedVarlistEntry.

edit_request(data_mgr, pod)

Edit the data requested in pod’s Varlist queue, based on the transformations the functionality can perform. If the function can transform data in format X to format Y and the POD requests X, this method should insert a backup/fallback request for Y.

class src.preprocessor.RenameVariablesFunction(data_mgr, pod)[source]

Bases: src.preprocessor.PreprocessorFunctionBase

process(var, ds)[source]

Apply functionality to the input dataset.

Parameters
  • var (VarlistEntry) – POD varlist entry instance describing POD’s data request, which is the desired end result of preprocessing work.

  • dataset – xarray.Dataset instance.

edit_request(data_mgr, pod)

Edit the data requested in pod’s Varlist queue, based on the transformations the functionality can perform. If the function can transform data in format X to format Y and the POD requests X, this method should insert a backup/fallback request for Y.

class src.preprocessor.ExtractLevelFunction(data_mgr, pod)[source]

Bases: src.preprocessor.PreprocessorFunctionBase

Extract a single pressure level from a Dataset. Unit conversions of pressure are handled by cfunits (see the src.units module), but parametric vertical coordinates are not handled and interpolation is not implemented. If the data does not provide the exact level requested, KeyError is raised.

edit_request(v, pod, data_mgr)[source]

Edit the pod’s Varlist prior to data query. If given a VarlistEntry v which specifies a scalar Z coordinate, return a copy with that scalar_coordinate removed to be used as an alternate variable for v.

process(var, ds)[source]

Determine if level extraction is needed, and return appropriate slice of Dataset if it is.
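The exact-match behavior described for this class can be sketched without xarray. The example below is an assumption-laden illustration: `extract_level` is a hypothetical helper, plain lists stand in for the Dataset, and a two-entry lookup table replaces the real unit conversion.

```python
def extract_level(levels, values, target, level_units="hPa", target_units="hPa"):
    """Return the value at the exact pressure level; no interpolation is done."""
    to_pa = {"Pa": 1.0, "hPa": 100.0}  # tiny unit table for this sketch only
    target_pa = target * to_pa[target_units]
    for lev, val in zip(levels, values):
        if abs(lev * to_pa[level_units] - target_pa) < 1e-6:
            return val
    # Mirror the documented behavior: a missing level is an error.
    raise KeyError(f"Pressure level {target} {target_units} not found")


levels_pa = [100000.0, 85000.0, 50000.0]
temps = [288.0, 280.0, 252.0]
# The request is in hPa while the data axis is in Pa.
t500 = extract_level(levels_pa, temps, 500, level_units="Pa", target_units="hPa")
```

Requesting a level that is not on the axis (for example 700 hPa with the data above) raises KeyError in this sketch instead of interpolating between neighboring levels.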

class src.preprocessor.ApplyScaleAndOffsetFunction(data_mgr, pod)[source]

Bases: src.preprocessor.PreprocessorFunctionBase

If the variable has scale_factor and add_offset attributes set, apply the corresponding constant linear transformation to the variable’s values and unset these attributes. By default this function is not applied.

See CF convention documentation on the scale_factor and add_offset attributes.
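Under the CF conventions, packed data are unpacked by multiplying by scale_factor and then adding add_offset. A minimal sketch of that transformation, using a plain list and dict in place of an xarray variable and its attributes (`apply_scale_and_offset` is a hypothetical name), might look like this:

```python
def apply_scale_and_offset(values, attrs):
    """Unpack values as packed * scale_factor + add_offset, then drop both attrs."""
    scale = attrs.pop("scale_factor", 1.0)
    offset = attrs.pop("add_offset", 0.0)
    return [v * scale + offset for v in values]


attrs = {"units": "K", "scale_factor": 0.5, "add_offset": 273.0}
unpacked = apply_scale_and_offset([0, 10, 20], attrs)
```

Removing the two attributes after the transformation prevents a later reader from applying the scaling a second time.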

process(var, ds)[source]

Apply functionality to the input dataset.

Parameters
  • var (VarlistEntry) – POD varlist entry instance describing POD’s data request, which is the desired end result of preprocessing work.

  • dataset – xarray.Dataset instance.

edit_request(data_mgr, pod)

Edit the data requested in pod’s Varlist queue, based on the transformations the functionality can perform. If the function can transform data in format X to format Y and the POD requests X, this method should insert a backup/fallback request for Y.

class src.preprocessor.MDTFPreprocessorBase(*args, **kwargs)[source]

Bases: object

Base class for preprocessing data after it’s been fetched, in order to put it into a format expected by PODs. The only functionality implemented here is parsing data axes and CF attributes; all other functionality is provided by PreprocessorFunctionBase functions, which are called in order.

edit_request(data_mgr, pod)[source]

Edit pod’s data request, based on the child class’s functionality. If the child class has a function that can transform data in format X to format Y and the POD requests X, this method should insert a backup/fallback request for Y.

setup(data_mgr, pod)[source]

Method to do additional configuration immediately before process() is called on each variable for pod.

property open_dataset_kwargs

Arguments passed to xarray open_dataset() and open_mfdataset().

property save_dataset_kwargs

Arguments passed to xarray to_netcdf().

read_one_file(var, path_list)[source]

abstract read_dataset(var)[source]

clean_nc_var_encoding(var, name, ds_obj)[source]

Clean up the attrs and encoding dicts of ds_obj prior to writing to a netCDF file, as a workaround for the following known issues:

  • Missing attributes may be set to the sentinel value ATTR_NOT_FOUND by xr_parser.DefaultDatasetParser. Depending on context, this may not be an error, but attributes with this value need to be deleted before writing.

  • Delete the _FillValue attribute for all independent variables (coordinates and their bounds), which is specified in the CF conventions but isn’t the xarray default; see https://github.com/pydata/xarray/issues/1598.

  • ‘NaN’ is not recognized as a valid _FillValue by NCL (see https://www.ncl.ucar.edu/Support/talk_archives/2012/1689.html), so unset the attribute for this case.

  • xarray to_netcdf() raises an error if attributes set on a variable have the same name as those used in its encoding, even if their values are the same. We delete these attributes prior to writing, after checking equality of values.
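The four clean-up steps listed above can be sketched on plain dictionaries. This is an illustration under several assumptions: `ATTR_NOT_FOUND` here is a local placeholder for the parser's sentinel, `clean_attrs` and its `is_coordinate` flag are hypothetical, entries are simply deleted, and the equality check mentioned in the last item is omitted for brevity.

```python
import math

ATTR_NOT_FOUND = object()  # hypothetical stand-in for the parser's sentinel


def clean_attrs(attrs, encoding, is_coordinate=False):
    """Return copies of attrs and encoding with problematic entries removed."""
    attrs, encoding = dict(attrs), dict(encoding)
    # 1. Drop attributes that were never found in the source file.
    attrs = {k: v for k, v in attrs.items() if v is not ATTR_NOT_FOUND}
    # 2. Coordinates and their bounds should carry no _FillValue.
    # 3. A NaN _FillValue is not understood by NCL, so drop it as well.
    for d in (attrs, encoding):
        fill = d.get("_FillValue")
        is_nan = isinstance(fill, float) and math.isnan(fill)
        if "_FillValue" in d and (is_coordinate or is_nan):
            del d["_FillValue"]
    # 4. Drop any attribute whose name is also an encoding key.
    for k in list(attrs):
        if k in encoding:
            del attrs[k]
    return attrs, encoding


attrs = {
    "units": "K",
    "long_name": ATTR_NOT_FOUND,
    "_FillValue": float("nan"),
    "dtype": "float32",
}
clean, enc = clean_attrs(attrs, {"dtype": "float32"})
```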

clean_output_attrs(var, ds)[source]

Call clean_nc_var_encoding() on all sets of attributes in the Dataset ds.

log_history_attr(var, ds)[source]

Update the history attribute on xarray Dataset ds with log records of any metadata modifications logged to var’s _nc_history_log log handler. For simplicity, events are written in chronological rather than reverse chronological order.
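The chronological ordering can be illustrated with a small sketch. Here `update_history` is a hypothetical helper, a dict stands in for the Dataset's attributes, and a list of strings stands in for the records collected by the log handler.

```python
def update_history(attrs, log_records):
    """Append log records to the 'history' attribute, oldest first."""
    lines = [attrs["history"]] if attrs.get("history") else []
    lines.extend(log_records)  # keep the order in which events were logged
    attrs["history"] = "\n".join(lines)
    return attrs


attrs = update_history(
    {"history": "2020-01-01: file created"},
    ["renamed 'PRECT' to 'pr'", "converted units m s-1 -> kg m-2 s-1"],
)
```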

write_dataset(var, ds)[source]

Write the processed Dataset ds to the location specified by the dest_path attribute of var, using xarray to_netcdf().

load_ds(var)[source]

Top-level method to load dataset and parse metadata; spun out so that child classes can modify it. Calls child class read_dataset().

process_ds(var, ds)[source]

Top-level method to apply selected functions to dataset; spun out so that child classes can modify it.

write_ds(var, ds)[source]

Top-level method to write out processed dataset; spun out so that child classes can modify it. Calls child class write_dataset().

process(var)[source]

Top-level wrapper for doing all preprocessing of data files.
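Judging from the method descriptions above, the top-level wrapper presumably chains the three overridable stages in the order load, process, write. The sketch below shows that template-method structure only; `PipelineSketch`, its in-memory `written` dict, and the dummy data are assumptions for illustration and do not reflect how files are actually read or written.

```python
class PipelineSketch:
    """Hypothetical analogue of the load/process/write sequence."""

    def __init__(self):
        self.written = {}

    def load_ds(self, var):
        # A child class would read one or more files here.
        return {"name": var, "values": [1.0, 2.0, 3.0]}

    def process_ds(self, var, ds):
        # A child class would apply its selected functions here.
        ds["values"] = [v * 10 for v in ds["values"]]
        return ds

    def write_ds(self, var, ds):
        # A child class would write a netCDF file here.
        self.written[var] = ds

    def process(self, var):
        ds = self.load_ds(var)
        ds = self.process_ds(var, ds)
        self.write_ds(var, ds)


pp = PipelineSketch()
pp.process("pr")
```

Keeping each stage in its own method lets a child class override one stage without reimplementing the whole sequence.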

class src.preprocessor.SingleFilePreprocessor(*args, **kwargs)[source]

Bases: src.preprocessor.MDTFPreprocessorBase

An MDTFPreprocessorBase for preprocessing model data that is provided as a single netCDF file per variable, for example the sample model data.

read_dataset(var)[source]

Read a single file Dataset specified by the local_data attribute of var, using read_one_file().

clean_nc_var_encoding(var, name, ds_obj)

Clean up the attrs and encoding dicts of ds_obj prior to writing to a netCDF file, as a workaround for the following known issues:

  • Missing attributes may be set to the sentinel value ATTR_NOT_FOUND by xr_parser.DefaultDatasetParser. Depending on context, this may not be an error, but attributes with this value need to be deleted before writing.

  • Delete the _FillValue attribute for all independent variables (coordinates and their bounds), which is specified in the CF conventions but isn’t the xarray default; see https://github.com/pydata/xarray/issues/1598.

  • ‘NaN’ is not recognized as a valid _FillValue by NCL (see https://www.ncl.ucar.edu/Support/talk_archives/2012/1689.html), so unset the attribute for this case.

  • xarray to_netcdf() raises an error if attributes set on a variable have the same name as those used in its encoding, even if their values are the same. We delete these attributes prior to writing, after checking equality of values.

clean_output_attrs(var, ds)

Call clean_nc_var_encoding() on all sets of attributes in the Dataset ds.

edit_request(data_mgr, pod)

Edit pod’s data request, based on the child class’s functionality. If the child class has a function that can transform data in format X to format Y and the POD requests X, this method should insert a backup/fallback request for Y.

load_ds(var)

Top-level method to load dataset and parse metadata; spun out so that child classes can modify it. Calls child class read_dataset().

log_history_attr(var, ds)

Update the history attribute on xarray Dataset ds with log records of any metadata modifications logged to var’s _nc_history_log log handler. For simplicity, events are written in chronological rather than reverse chronological order.

property open_dataset_kwargs

Arguments passed to xarray open_dataset() and open_mfdataset().

process(var)

Top-level wrapper for doing all preprocessing of data files.

process_ds(var, ds)

Top-level method to apply selected functions to dataset; spun out so that child classes can modify it.

read_one_file(var, path_list)

property save_dataset_kwargs

Arguments passed to xarray to_netcdf().

setup(data_mgr, pod)

Method to do additional configuration immediately before process() is called on each variable for pod.

write_dataset(var, ds)

Write the processed Dataset ds to the location specified by the dest_path attribute of var, using xarray to_netcdf().

write_ds(var, ds)

Top-level method to write out processed dataset; spun out so that child classes can modify it. Calls child class write_dataset().

class src.preprocessor.DaskMultiFilePreprocessor(*args, **kwargs)[source]

Bases: src.preprocessor.MDTFPreprocessorBase

An MDTFPreprocessorBase that uses xarray’s dask support to preprocess model data provided as one or several netCDF files per variable.

__init__(data_mgr, pod)[source]

Initialize self. See help(type(self)) for accurate signature.

edit_request(data_mgr, pod)[source]

Edit POD’s data request, based on the child class’s functionality. If the child class has a function that can transform data in format X to format Y and the POD requests X, this method should insert a backup/fallback request for Y.

read_dataset(var)[source]

Open multi-file Dataset specified by the local_data attribute of var, wrapping xarray open_mfdataset().

clean_nc_var_encoding(var, name, ds_obj)

Clean up the attrs and encoding dicts of ds_obj prior to writing to a netCDF file, as a workaround for the following known issues:

  • Missing attributes may be set to the sentinel value ATTR_NOT_FOUND by xr_parser.DefaultDatasetParser. Depending on context, this may not be an error, but attributes with this value need to be deleted before writing.

  • Delete the _FillValue attribute for all independent variables (coordinates and their bounds), which is specified in the CF conventions but isn’t the xarray default; see https://github.com/pydata/xarray/issues/1598.

  • ‘NaN’ is not recognized as a valid _FillValue by NCL (see https://www.ncl.ucar.edu/Support/talk_archives/2012/1689.html), so unset the attribute for this case.

  • xarray to_netcdf() raises an error if attributes set on a variable have the same name as those used in its encoding, even if their values are the same. We delete these attributes prior to writing, after checking equality of values.

clean_output_attrs(var, ds)

Call clean_nc_var_encoding() on all sets of attributes in the Dataset ds.

load_ds(var)

Top-level method to load dataset and parse metadata; spun out so that child classes can modify it. Calls child class read_dataset().

log_history_attr(var, ds)

Update the history attribute on xarray Dataset ds with log records of any metadata modifications logged to var’s _nc_history_log log handler. For simplicity, events are written in chronological rather than reverse chronological order.

property open_dataset_kwargs

Arguments passed to xarray open_dataset() and open_mfdataset().

process(var)

Top-level wrapper for doing all preprocessing of data files.

process_ds(var, ds)

Top-level method to apply selected functions to dataset; spun out so that child classes can modify it.

read_one_file(var, path_list)

property save_dataset_kwargs

Arguments passed to xarray to_netcdf().

setup(data_mgr, pod)

Method to do additional configuration immediately before process() is called on each variable for pod.

write_dataset(var, ds)

Write the processed Dataset ds to the location specified by the dest_path attribute of var, using xarray to_netcdf().

write_ds(var, ds)

Top-level method to write out processed dataset; spun out so that child classes can modify it. Calls child class write_dataset().

class src.preprocessor.SampleDataPreprocessor(*args, **kwargs)[source]

Bases: src.preprocessor.SingleFilePreprocessor

Implementation class for MDTFPreprocessorBase intended for use on sample model data distributed with the package. Assumes all data is in one netCDF file.

clean_nc_var_encoding(var, name, ds_obj)

Clean up the attrs and encoding dicts of ds_obj prior to writing to a netCDF file, as a workaround for the following known issues:

  • Missing attributes may be set to the sentinel value ATTR_NOT_FOUND by xr_parser.DefaultDatasetParser. Depending on context, this may not be an error, but attributes with this value need to be deleted before writing.

  • Delete the _FillValue attribute for all independent variables (coordinates and their bounds), which is specified in the CF conventions but isn’t the xarray default; see https://github.com/pydata/xarray/issues/1598.

  • ‘NaN’ is not recognized as a valid _FillValue by NCL (see https://www.ncl.ucar.edu/Support/talk_archives/2012/1689.html), so unset the attribute for this case.

  • xarray to_netcdf() raises an error if attributes set on a variable have the same name as those used in its encoding, even if their values are the same. We delete these attributes prior to writing, after checking equality of values.

clean_output_attrs(var, ds)

Call clean_nc_var_encoding() on all sets of attributes in the Dataset ds.

edit_request(data_mgr, pod)

Edit pod’s data request, based on the child class’s functionality. If the child class has a function that can transform data in format X to format Y and the POD requests X, this method should insert a backup/fallback request for Y.

load_ds(var)

Top-level method to load dataset and parse metadata; spun out so that child classes can modify it. Calls child class read_dataset().

log_history_attr(var, ds)

Update the history attribute on xarray Dataset ds with log records of any metadata modifications logged to var’s _nc_history_log log handler. For simplicity, events are written in chronological rather than reverse chronological order.

property open_dataset_kwargs

Arguments passed to xarray open_dataset() and open_mfdataset().

process(var)

Top-level wrapper for doing all preprocessing of data files.

process_ds(var, ds)

Top-level method to apply selected functions to dataset; spun out so that child classes can modify it.

read_dataset(var)

Read a single file Dataset specified by the local_data attribute of var, using read_one_file().

read_one_file(var, path_list)

property save_dataset_kwargs

Arguments passed to xarray to_netcdf().

setup(data_mgr, pod)

Method to do additional configuration immediately before process() is called on each variable for pod.

write_dataset(var, ds)

Write the processed Dataset ds to the location specified by the dest_path attribute of var, using xarray to_netcdf().

write_ds(var, ds)

Top-level method to write out processed dataset; spun out so that child classes can modify it. Calls child class write_dataset().

class src.preprocessor.DefaultPreprocessor(*args, **kwargs)[source]

Bases: src.preprocessor.DaskMultiFilePreprocessor

Implementation class for MDTFPreprocessorBase for the general use case. Includes all implemented functionality and handles multi-file data.

__init__(data_mgr, pod)

Initialize self. See help(type(self)) for accurate signature.

clean_nc_var_encoding(var, name, ds_obj)

Clean up the attrs and encoding dicts of ds_obj prior to writing to a netCDF file, as a workaround for the following known issues:

  • Missing attributes may be set to the sentinel value ATTR_NOT_FOUND by xr_parser.DefaultDatasetParser. Depending on context, this may not be an error, but attributes with this value need to be deleted before writing.

  • Delete the _FillValue attribute for all independent variables (coordinates and their bounds), which is specified in the CF conventions but isn’t the xarray default; see https://github.com/pydata/xarray/issues/1598.

  • ‘NaN’ is not recognized as a valid _FillValue by NCL (see https://www.ncl.ucar.edu/Support/talk_archives/2012/1689.html), so unset the attribute for this case.

  • xarray to_netcdf() raises an error if attributes set on a variable have the same name as those used in its encoding, even if their values are the same. We delete these attributes prior to writing, after checking equality of values.

clean_output_attrs(var, ds)

Call clean_nc_var_encoding() on all sets of attributes in the Dataset ds.

edit_request(data_mgr, pod)

Edit POD’s data request, based on the child class’s functionality. If the child class has a function that can transform data in format X to format Y and the POD requests X, this method should insert a backup/fallback request for Y.

load_ds(var)

Top-level method to load dataset and parse metadata; spun out so that child classes can modify it. Calls child class read_dataset().

log_history_attr(var, ds)

Update the history attribute on xarray Dataset ds with log records of any metadata modifications logged to var’s _nc_history_log log handler. For simplicity, events are written in chronological rather than reverse chronological order.

property open_dataset_kwargs

Arguments passed to xarray open_dataset() and open_mfdataset().

process(var)

Top-level wrapper for doing all preprocessing of data files.

process_ds(var, ds)

Top-level method to apply selected functions to dataset; spun out so that child classes can modify it.

read_dataset(var)

Open multi-file Dataset specified by the local_data attribute of var, wrapping xarray open_mfdataset().

read_one_file(var, path_list)

property save_dataset_kwargs

Arguments passed to xarray to_netcdf().

setup(data_mgr, pod)

Method to do additional configuration immediately before process() is called on each variable for pod.

write_dataset(var, ds)

Write the processed Dataset ds to the location specified by the dest_path attribute of var, using xarray to_netcdf().

write_ds(var, ds)

Top-level method to write out processed dataset; spun out so that child classes can modify it. Calls child class write_dataset().