5. POD coding best practices

In this section we describe issues we’ve seen in POD code that have caused problems in the form of bugs, inefficiencies, or unintended consequences.

5.1. All languages

  • PS vs. EPS figures: Save vector plots as .eps (Encapsulated PostScript), not .ps (regular PostScript).

    Why: PostScript (.ps) is perhaps the most common vector graphics format, and almost all plotting packages can output PostScript files. Encapsulated PostScript (.eps) additionally includes bounding box information that describes the physical extent of the plot’s contents. The framework uses this to generate bitmap versions of the plots correctly: it calls Ghostscript for the conversion, and if no bounding box is provided, Ghostscript assumes the graphics occupy an entire sheet of (letter or A4) paper. This can cause plots to be cut off if they extend outside that region.

    Note that many plotting libraries will set the format of the output file automatically from the filename extension. The framework will process both *.ps and *.eps files.
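
    As an illustration, here is a minimal sketch of saving an EPS figure with matplotlib (the file name is just an example; the format is inferred from the extension):

        import matplotlib
        matplotlib.use('Agg')   # non-interactive backend; see the plotting section below
        import matplotlib.pyplot as plt

        fig, ax = plt.subplots()
        ax.plot([0, 1, 2], [0, 1, 4])
        # A .eps extension produces Encapsulated PostScript; bbox_inches='tight'
        # shrinks the bounding box to the actual contents of the plot.
        fig.savefig('example_plot.eps', bbox_inches='tight')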

5.2. Python: General

  • Whitespace: Indent Python code with four spaces per indentation level.

    Why: Python uses indentation to delineate nesting and scope within a program, and inconsistent indentation is a syntax error. Using four spaces is not required, but it is the generally accepted standard.

    Indentation can be configured in most text editors, or fixed with scripts such as reindent.py. We recommend using a linter such as pylint to find common bugs and syntax errors.

    Beyond this, we don’t impose requirements on how your code is formatted, but voluntarily following standard best practices (such as those described in PEP 8 or the Google Python style guide) will make it easier for you and others to understand your code, find bugs, and so on.
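
    As a trivial illustration of consistent four-space indentation (a hypothetical helper function):

        def describe(values):
            # Each level of nesting is indented by exactly four more spaces.
            if not values:
                return 'empty'
            for v in values:
                if v < 0:
                    return 'contains negative entries'
            return 'all entries non-negative'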

  • Filesystem commands: Use functions in the os and shutil modules to interact with the filesystem, instead of running Unix commands with os.system(), the commands module (deprecated and removed in Python 3), or subprocess.

    Why: Hard-coding Unix commands makes code less portable. Calling out to a subprocess introduces overhead and makes error handling and logging more difficult. The main reason, however, is that Python already provides these tools in a portable way: see the documentation for the os and shutil modules.

    In particular, using os.path.join is more verbose than joining strings but eliminates bugs arising from missing or redundant directory separators.
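
    As a rough sketch (the paths and file suffix are hypothetical), the following replaces some common shell commands with their portable Python equivalents:

        import os
        import shutil

        src_dir = os.path.join('/path', 'to', 'input')    # hypothetical paths
        out_dir = os.path.join('/path', 'to', 'output')

        os.makedirs(out_dir, exist_ok=True)               # equivalent of 'mkdir -p'
        for name in os.listdir(src_dir):                  # equivalent of 'ls'
            if name.endswith('.nc'):
                # equivalent of 'cp'; os.path.join handles directory separators
                shutil.copy(os.path.join(src_dir, name), os.path.join(out_dir, name))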

5.3. Python: Arrays

To obtain acceptable performance for numerical computation, people use Python interfaces to optimized, compiled code. NumPy is the standard module for manipulating numerical arrays in Python. xarray sits on top of NumPy and provides a higher-level interface to its functionality; any advice about NumPy applies to it as well.

NumPy and xarray both have extensive documentation and many tutorials, such as:

  • NumPy’s own basic and intermediate tutorials; xarray’s overview and climate and weather examples;

  • A demonstration of the features of xarray using Earth science data;

  • The 2020 SciPy conference has open-source, interactive tutorials you can work through on your own machine or fully online using Binder. In particular, there are tutorials for NumPy and xarray.

  • Eliminate explicit for loops: Use NumPy/xarray functions instead of writing Python for loops over the indices of your data array. In particular, nested for loops over multidimensional data should never be necessary.

    Why: For loops in Python are very slow compared to C or Fortran, because Python is an interpreted language. You can think of the NumPy functions as someone having written those for loops for you in C and given you a way to call them as Python functions.

    It’s beyond the scope of this document to cover all possible situations, since this is the main use case for NumPy. We refer to the tutorials above for instructions, to the brief sketch at the end of this item, and to the following blog posts that discuss this specific issue:

    • a blog post by Tirthajyoti Sarkar;

    • part of “Python like you mean it,” a free resource by Ryan Soklaski.
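
    As a minimal sketch of the idea (the array and its dimensions are hypothetical), compare an explicit nested loop with the equivalent vectorized expression:

        import numpy as np

        # Hypothetical (time, lat, lon) temperature data.
        temps = np.random.uniform(250.0, 300.0, size=(100, 90, 180))
        time_mean = temps.mean(axis=0)

        # Slow: explicit nested loops over every index.
        anom_loop = np.empty_like(temps)
        for t in range(temps.shape[0]):
            for j in range(temps.shape[1]):
                for i in range(temps.shape[2]):
                    anom_loop[t, j, i] = temps[t, j, i] - time_mean[j, i]

        # Fast: the same computation as a single broadcast expression.
        anom_vec = temps - time_mean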

  • Use xarray with netCDF data:

    Why: This is xarray’s use case. You can think of NumPy as implementing multidimensional matrices in the fully general, mathematical sense, and xarray as providing the specialization to the case where the matrix contains data on a lat-lon-time-(etc.) grid.

    xarray lets you refer to your data with human-readable labels such as ‘latitude,’ rather than having to remember that that’s the second dimension of your array. This bookkeeping is essential when writing code for the MDTF framework, since your POD will be run on data from models you haven’t been able to test on.

    In particular, xarray provides seamless support for time axes, including all CF convention calendars through the cftime library. You can, e.g., subset a range of data between two dates without having to convert those dates to array indices by hand.

    See the xarray tutorials linked above, and the brief sketch below, for more examples of xarray’s features.
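
    A minimal sketch of this workflow (the file name and variable name are hypothetical):

        import xarray as xr

        # Hypothetical input file containing a variable named 'pr'.
        ds = xr.open_dataset('model_output.nc', use_cftime=True)

        # Select by coordinate labels rather than integer indices; the date
        # strings are interpreted with the file's own calendar via cftime.
        pr_subset = ds['pr'].sel(time=slice('1980-01-01', '1989-12-31'))

        # Reduce over a named dimension instead of remembering its position.
        pr_climatology = pr_subset.mean(dim='time')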

  • Memory use and views vs. copies: Use scalar indexing and slices (index specifications of the form start_index:stop_index:stride) to get subsets of arrays whenever possible, and only use advanced indexing features (indexing arrays with other arrays) when necessary.

    Why: When advanced indexing is used, NumPy will need to create a new copy of the array in memory, which can hurt performance if the array contains a large amount of data. By contrast, slicing or basic indexing is done in place, without allocating a new array: the NumPy documentation calls this a “view.”

    Note that array slices are native Python objects, so you can define a slice in a different place from the array you intend to use it on. Both NumPy and xarray arrays recognize slice objects.

    This is easier to understand if you think about NumPy as a wrapper around C-like functions: array indexing in C is implemented with pointer arithmetic, since the array is implemented as a contiguous block of memory. An array slice is just a pointer to the same block of memory, but with different offsets. More complex indexing isn’t guaranteed to follow a regular pattern, so NumPy needs to copy the requested data in that case.

    See the following references for more information:

    • a blog post by Jessica Yung;

    • a discussion on Stack Overflow.
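
    A minimal sketch of the difference (np.shares_memory reports whether two arrays overlap in memory):

        import numpy as np

        data = np.arange(12).reshape(3, 4)

        view = data[1:, ::2]        # basic slicing: a view, no data copied
        fancy = data[[0, 2], :]     # advanced (integer-array) indexing: a copy

        print(np.shares_memory(data, view))    # True:  same underlying buffer
        print(np.shares_memory(data, fancy))   # False: a new array was allocated

        view[0, 0] = -1             # modifying the view also modifies 'data'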

  • MaskedArrays instead of NaNs or sentinel values: Use NumPy’s MaskedArrays for data that may contain missing or invalid values, instead of setting those entries to NaN or a sentinel value.

    Why: One sometimes encounters code which sets array entries to fixed “sentinel values” (such as 1.0e+20 or NaN) to indicate missing or invalid data. This is a dangerous and error-prone practice, since it’s frequently not possible to detect if the invalid entries are being used by mistake. For example, computing the variance of a timeseries with missing elements set to 1e+20 will either result in a floating-point overflow, or return zero.

    NumPy provides a better solution in the form of MaskedArrays, which behave identically to regular arrays but carry an extra boolean mask to indicate valid/invalid status. All the NumPy mathematical functions will automatically use this mask for error propagation. For example, dividing an array element by zero or taking the square root of a negative element masks off that entry, indicating that the value is invalid: you don’t need to remember to do these sorts of checks explicitly.
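
    A minimal sketch of working with a MaskedArray in place of a sentinel value (the data and the sentinel 1.0e+20 are just examples):

        import numpy as np

        raw = np.array([272.1, 1.0e+20, 268.4, 265.9, 1.0e+20])  # sentinel-coded input
        ts = np.ma.masked_values(raw, 1.0e+20)                   # mask the sentinel entries

        mean = ts.mean()        # statistics use only the valid (unmasked) entries
        var = ts.var()

        anom = np.ma.sqrt(ts - 270.0)   # negative arguments are masked, not turned into NaN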

5.4. Python: Plotting

  • Use the ‘Agg’ backend when testing your POD: For reproducibility, set the shell environment variable MPLBACKEND to Agg when testing your POD outside of the framework.

    Why: Matplotlib can use a variety of backends: interfaces to low-level graphics libraries. Some of these are platform-dependent, or require additional libraries that the MDTF framework doesn’t install. In order to achieve cross-platform portability and reproducibility, the framework specifies the 'Agg' non-interactive (i.e., file-writing only) backend for all PODs by setting the MPLBACKEND environment variable.

    When developing your POD, you’ll want an interactive backend – for example, this is automatically set up for you in a Jupyter notebook. When it comes to testing your POD outside of the framework, however, you should be aware of this backend difference.
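
    When testing by hand, you can set the environment variable before launching Python (e.g. export MPLBACKEND=Agg in the shell), or select the backend from within the script itself. A minimal sketch of the latter (the output file name is just an example):

        import matplotlib
        matplotlib.use('Agg')     # select the backend before importing pyplot
        import matplotlib.pyplot as plt

        fig, ax = plt.subplots()
        ax.plot(range(10))
        fig.savefig('backend_test.eps')   # written to disk; no window is opened
        plt.close(fig)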