8. POD coding best practices¶
In this section we describe issues we’ve seen in POD code that have caused problems in the form of bugs, inefficiencies, or unintended consequences.
8.1. All languages¶
PS vs. EPS figures: Save vector plots as .eps (Encapsulated PostScript), not .ps (regular PostScript).
Why: PostScript (.ps) is perhaps the most common vector graphics format, and almost all plotting packages can output it. Encapsulated PostScript (.eps) adds bounding box information that describes the physical extent of the plot’s contents. The framework uses this to generate bitmap versions of the plots correctly: it calls ghostscript for the conversion, and if no bounding box is provided, ghostscript assumes the graphics occupy an entire sheet of (letter or A4) paper. Plots that extend outside this region will be cut off.
Note that many plotting libraries will set the format of the output file automatically from the filename extension. The framework will process both *.ps and *.eps files.
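As an illustration of the point above, here is a minimal matplotlib sketch (the figure contents and filename are hypothetical): choosing a .eps extension is all that is needed for matplotlib to write Encapsulated PostScript with bounding box information.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; writes files only
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])
# The ".eps" extension makes matplotlib write Encapsulated PostScript,
# which includes the bounding box information the framework needs.
fig.savefig("example_plot.eps")
```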
8.2. Python: General¶
Whitespace: Indent Python code with four spaces per indent level.
Why: Python uses indentation to delineate nesting and scope within a program, and inconsistent indentation is a syntax error. Using four spaces is not required, but is the generally accepted standard.
Beyond this, we don’t impose requirements on how your code is formatted, but voluntarily following standard best practices (such as described in PEP8 or the Google style guide) will make it easier for you and others to understand your code, find bugs, etc.
Filesystem operations: Use the functions in Python’s os and shutil modules for filesystem operations, rather than shelling out to unix commands via a subprocess.
Why: Hard-coding unix commands makes code less portable. Calling out to a subprocess introduces overhead and makes error handling and logging more difficult. The main reason, however, is that Python already provides these tools in a portable way. Please see the documentation for the os and shutil modules, summarized in this table:
Construct a path from dir1, dir2, …, filename
os.path.join(dir1, dir2, …, filename)
Split a path into directory and filename
os.path.split(path)
List files in directory dir
os.listdir(dir)
Move or rename a file or directory from old_path to new_path
shutil.move(old_path, new_path)
Create a directory or sequence of directories dir
os.makedirs(dir)
Copy a file from path to new_path
shutil.copy2(path, new_path)
Copy a directory dir, and everything inside it, to new_dir
shutil.copytree(dir, new_dir)
Delete a single file at path
os.remove(path)
Delete a directory dir and everything inside it
shutil.rmtree(dir)
In particular, using os.path.join is more verbose than joining strings but eliminates bugs arising from missing or redundant directory separators.
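The table above can be sketched in a few lines; the directory and file names here are hypothetical, and a temporary scratch directory keeps the example self-contained.

```python
import os
import shutil
import tempfile

# Work in a scratch directory so the example is self-contained.
scratch = tempfile.mkdtemp()

# Construct paths portably instead of concatenating strings with "/".
work_dir = os.path.join(scratch, "wkdir", "obs_data")
os.makedirs(work_dir)                   # create nested directories

src = os.path.join(work_dir, "input.nc")
open(src, "w").close()                  # stand-in for a real data file

dst = os.path.join(work_dir, "input_copy.nc")
shutil.copy2(src, dst)                  # copy a file (preserving metadata)

head, tail = os.path.split(dst)         # split into directory and filename
contents = sorted(os.listdir(work_dir)) # list directory contents

shutil.rmtree(scratch)                  # delete a directory tree
```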
8.3. Python: Arrays¶
To obtain acceptable performance for numerical computation, people use Python interfaces to optimized, compiled code. NumPy is the standard module for manipulating numerical arrays in Python. xarray sits on top of NumPy and provides a higher-level interface to its functionality; any advice about NumPy applies to it as well.
NumPy and xarray both have extensive documentation and many tutorials, such as:
A demonstration of the features of xarray using earth science data;
Eliminate explicit for loops: Use NumPy/xarray functions instead of writing for loops in Python that iterate over the indices of your data array. In particular, nested for loops over multidimensional data should never be necessary.
Why: For loops in Python are very slow compared to C or Fortran, because Python is an interpreted language. You can think of the NumPy functions as having those for loops already written for you in C, with a way to call them as Python functions.
It’s beyond the scope of this document to cover all possible situations, since this is the main use case for NumPy. We refer to the tutorials above for instructions, and to the following blog posts that discuss this specific issue:
“Look Ma, no for-loops,” by Brad Solomon;
“Turn your conditional loops to Numpy vectors,” by Tirthajyoti Sarkar;
“’Vectorized’ Operations: Optimized Computations on NumPy Arrays”, part of “Python like you mean it,” a free resource by Ryan Soklaski.
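A small sketch of the idea, using hypothetical temperature data: the explicit nested loop and the vectorized expression compute the same thing, but the latter runs in compiled code.

```python
import numpy as np

temps_k = np.array([[250.0, 260.0], [270.0, 300.0]])  # hypothetical 2-D data

# Instead of nested for loops over every index ...
out = np.empty_like(temps_k)
for i in range(temps_k.shape[0]):
    for j in range(temps_k.shape[1]):
        out[i, j] = temps_k[i, j] - 273.15

# ... apply the operation to the whole array at once; the loop runs in C.
temps_c = temps_k - 273.15

# Conditional logic vectorizes too: np.where replaces an if inside a loop.
is_freezing = np.where(temps_k < 273.15, 1.0, 0.0)
```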
Use xarray with netCDF data:
Why: This is xarray’s use case. You can think of NumPy as implementing multidimensional matrices in the fully general, mathematical sense, and xarray providing the specialization to the case where the matrix contains data on a lat-lon-time-(etc.) grid.
xarray lets you refer to your data with human-readable labels such as ‘latitude,’ rather than having to remember that that’s the second dimension of your array. This bookkeeping is essential when writing code for the MDTF framework, when your POD will be run on data from models you haven’t been able to test on.
In particular, xarray provides seamless support for time axes, with all CF convention calendars supported through the cftime library. You can, e.g., subset a range of data between two dates without having to manually convert those dates to array indices.
See the xarray tutorials linked above for more examples of xarray’s features.
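A minimal sketch of label-based selection, using a synthetic data array (the variable name, grid, and dates are hypothetical):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical daily time series on a small latitude grid.
times = pd.date_range("2000-01-01", periods=90, freq="D")
da = xr.DataArray(
    np.random.rand(90, 3),
    dims=("time", "lat"),
    coords={"time": times, "lat": [-45.0, 0.0, 45.0]},
    name="tas",
)

# Refer to dimensions by name rather than position ...
equator = da.sel(lat=0.0)

# ... and subset by date without converting dates to array indices.
february = da.sel(time=slice("2000-02-01", "2000-02-29"))
```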
Memory use and views vs. copies: Use scalar indexing and slices (index specifications of the form start_index:stop_index:stride) to get subsets of arrays whenever possible, and only use advanced indexing features (indexing arrays with other arrays) when necessary.
Why: When advanced indexing is used, NumPy will need to create a new copy of the array in memory, which can hurt performance if the array contains a large amount of data. By contrast, slicing or basic indexing is done in-place, without allocating a new array: the NumPy documentation calls this a “view.”
Note that array slices are native Python objects, so you can define a slice in a different place from the array you intend to use it on. Both NumPy and xarray arrays recognize slice objects.
This is easier to understand if you think about NumPy as a wrapper around C-like functions: array indexing in C is implemented with pointer arithmetic, since the array is implemented as a contiguous block of memory. An array slice is just a pointer to the same block of memory, but with different offsets. More complex indexing isn’t guaranteed to follow a regular pattern, so NumPy needs to copy the requested data in that case.
See the following references for more information:
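The view/copy distinction is easy to demonstrate: writes through a slice are visible in the original array, while writes to the result of advanced indexing are not.

```python
import numpy as np

a = np.arange(10)

# Basic slicing returns a view: no data is copied, and writing through
# the view modifies the original array.
view = a[2:5]
view[0] = 99          # a[2] is now 99

# Advanced (array) indexing returns a copy: the original is untouched.
copy = a[np.array([2, 3, 4])]
copy[0] = -1          # a[2] is still 99

# Slices are ordinary Python objects and can be defined separately
# from the array they are applied to.
middle = slice(2, 5)
subset = a[middle]
```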
MaskedArrays instead of NaNs or sentinel values: Use NumPy’s MaskedArrays for data that may contain missing or invalid values, instead of setting those entries to NaN or a sentinel value.
Why: One sometimes encounters code which sets array entries to fixed “sentinel values” (such as 1.0e+20 or NaN) to indicate missing or invalid data. This is a dangerous and error-prone practice, since it’s frequently not possible to detect if the invalid entries are being used by mistake. For example, computing the variance of a timeseries with missing elements set to 1e+20 will either result in a floating-point overflow, or return zero.
NumPy provides a better solution in the form of MaskedArrays, which behave identically to regular arrays but carry an extra boolean mask to indicate valid/invalid status. All the NumPy mathematical functions will automatically use this mask for error propagation. For example, trying to divide an array element by zero or taking the square root of a negative element will mask it off, indicating that the value is invalid: you don’t need to remember to do these sorts of checks explicitly.
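A short sketch of both behaviors, with a hypothetical time series containing one sentinel value:

```python
import numpy as np
import numpy.ma as ma

# Hypothetical time series with one missing observation marked by 1e20.
raw = np.array([1.0, 2.0, 1.0e20, 3.0])

# Mask the sentinel instead of leaving it in the data.
ts = ma.masked_values(raw, 1.0e20)

# Statistics automatically skip masked entries.
mean = ts.mean()      # mean of [1, 2, 3], not a huge nonsense value

# Operations that produce invalid values mask them off automatically:
# the square root of the negative element is masked, not NaN.
roots = ma.sqrt(ma.array([4.0, -1.0]))
```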
8.4. Python: Plotting¶
Use the ‘Agg’ backend when testing your POD: For reproducibility, set the shell environment variable MPLBACKEND to Agg when testing your POD outside of the framework.
Why: Matplotlib can use a variety of backends: interfaces to low-level graphics libraries. Some of these are platform-dependent, or require additional libraries that the MDTF framework doesn’t install. In order to achieve cross-platform portability and reproducibility, the framework selects the non-interactive 'Agg' backend (i.e., writing files only) for all PODs by setting the MPLBACKEND environment variable.
When developing your POD, you’ll want an interactive backend – for example, this is automatically set up for you in a Jupyter notebook. When it comes to testing your POD outside of the framework, however, you should be aware of this backend difference.
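One way to reproduce the framework's behavior in your own tests is to set the variable before matplotlib is first imported (the script name below is hypothetical):

```python
import os

# Matplotlib reads MPLBACKEND at import time, so set it before any
# "import matplotlib" statement runs anywhere in the program.
os.environ["MPLBACKEND"] = "Agg"

# Equivalently, from the shell:
#   MPLBACKEND=Agg python my_pod_script.py
```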
Pass the cartopy CRS to plotting functions: See cartopy’s documentation. A coordinate reference system (CRS) must be passed as the projection argument when plot axes are created. It should also be passed to subsequent functions that set the plot range (set_extent; avoid the use of set_xlim/set_ylim) and to plotting functions (as the transform argument).
Note that this applies even to simple lat/lon plots, for which the appropriate CRS is PlateCarree(). Not specifying a CRS in this case will give rise to subtle errors, e.g. when trying to set longitude ranges of [-180, 180] or [0, 360], in which the bounds map to the same location.
8.5. NCL¶
Large file support: By default, NCL cannot read netCDF files larger than 2 GB. To remove this limitation, call setfileoption with the following arguments in every script, before any file operations:
setfileoption("nc", "Format", getenv("MDTF_NC_FORMAT"))
"netCDF4" can also be used as the requested format in the above call.
Deprecated calendar functions: Check the function reference to verify that the functions you use are not deprecated in the current version of NCL. This is especially necessary for date/calendar functions.
Why: The framework uses a current version of NCL (6.6.x) to avoid plotting bugs that were present in earlier versions. This is especially relevant for calendar functions: the ut_* set of functions has been deprecated in favor of counterparts beginning with cd_ that take identical arguments (so code can be updated using find/replace). For example, use cd_calendar instead of the deprecated ut_calendar.
This change is necessary because only the cd_* functions support all calendars defined in the CF conventions, which is needed to process data from some models (e.g., weather or seasonal models are typically run with a Julian calendar).