scipy/doc/source/dev/api-dev/nan_policy.rst

A Design Specification for ``nan_policy``
=========================================

Many functions in `scipy.stats` have a parameter called ``nan_policy``
that determines how the function handles data that contains ``nan``.  In
this section, we provide SciPy developer guidelines for how ``nan_policy``
is intended to be used, to ensure that as this parameter is added to new
functions, we maintain a consistent API.

The basic API
-------------

The parameter ``nan_policy`` accepts three possible strings: ``'omit'``,
``'raise'`` and ``'propagate'``.  The meanings are:

* ``nan_policy='omit'``:
  Ignore occurrences of ``nan`` in the input.  Do not generate a warning
  if the input contains ``nan`` (unless the equivalent input with the
  ``nan`` values removed would generate a warning). For example, for the
  simple case of a function that accepts a single array and returns a
  scalar (and ignoring the possible use of ``axis`` for the moment)::

      func([1.0, 3.0, np.nan, 5.0], nan_policy='omit')

  should behave the same as::

      func([1.0, 3.0, 5.0])

  More generally, for functions that return a scalar,
  ``func(a, nan_policy='omit')`` should behave the same as
  ``func(a[~np.isnan(a)])``.

  For functions that transform a vector to a new vector of the same
  size and for which each entry in the output array depends on
  more than just the corresponding value in the input array [#f1]_ (e.g.
  `scipy.stats.zscore`, `scipy.stats.boxcox` *when* ``lmbda`` *is None*),::

      y = func(a, nan_policy='omit')

  should behave the same as::

      nan_mask = np.isnan(a)
      y = np.empty(a.shape, dtype=np.float64)
      y[~nan_mask] = func(a[~nan_mask])
      y[nan_mask] = np.nan

  (In general, the dtype of ``y`` might depend on ``a`` and on the expected
  behavior of ``func``).  In other words, a `nan` in the input gives a
  corresponding `nan` in the output, but the presence of that `nan` does not
  affect the calculation of the non-`nan` values.

  Unit tests for this property should be used to test functions that
  handle ``nan_policy``.

  For functions that return a scalar and that accept two or more arguments
  but whose values are not related (e.g. `scipy.stats.ansari`,
  `scipy.stats.f_oneway`), the same idea applies to each input array.  So::

      func(a, b, nan_policy='omit')

  should behave the same as::

      func(a[~np.isnan(a)], b[~np.isnan(b)])

  For inputs with *related* or *paired* values (e.g. `scipy.stats.pearsonr`,
  `scipy.stats.ttest_rel`) the recommended behavior is to omit all the values
  for which any of the related values are ``nan``.  For a function with two
  related array inputs, this means::

      y = func(a, b, nan_policy='omit')

  should behave the same as::

      hasnan = np.isnan(a) | np.isnan(b)  # Union of the isnan masks.
      y = func(a[~hasnan], b[~hasnan])

  The docstring for such a function should clearly state this behavior.

* ``nan_policy='raise'``:
  Raise a ``ValueError``.
* ``nan_policy='propagate'``:
  Propagate the ``nan`` value to the output.  Typically, this means just
  execute the function without checking for ``nan``, but see

      https://github.com/scipy/scipy/issues/7818

  for an example where that might lead to unexpected output.


``nan_policy`` combined with an ``axis`` parameter
--------------------------------------------------
There is nothing surprising here--the principle mentioned above still
applies when the function has an ``axis`` parameter.  Suppose, for example,
``func`` reduces a 1-d array to a scalar, and handles n-d arrays as a
collection of 1-d arrays, with the ``axis`` parameter specifying the axis
along which the reduction is to be applied.  If, say::

    func([1, 3, 4])     -> 10.0
    func([2, -3, 8, 2]) ->  4.2
    func([7, 8])        ->  9.5
    func([])            -> -inf

then::

    func([[  1, nan,   3,   4],
          [  2,  -3,   8,   2],
          [nan,   7, nan,   8],
          [nan, nan, nan, nan]], nan_policy='omit', axis=-1)

must give the result::

    np.array([10.0, 4.2, 9.5, -inf])


Edge cases
----------
A function that implements the ``nan_policy`` parameter should gracefully
handle the case where *all* the values in the input array(s) are ``nan``.
The basic principle described above still applies::

    func([nan, nan, nan], nan_policy='omit')

should behave the same as::

    func([])

In practice, when adding ``nan_policy`` to an existing function, it is
not unusual to find that the function doesn't already handle this case
in a well-defined manner, and some thought and design may have to be
applied to ensure that it works.  The correct behavior (whether that be
to return ``nan``, return some other value, raise an exception, or something
else) will be determined on a case-by-case basis.


Why doesn't ``nan_policy`` also apply to ``inf``?
--------------------------------------------------
Although we learn in grade school that "infinity is not a number", the
floating point values ``nan`` and ``inf`` are qualitatively different.
The values ``inf`` and ``-inf`` act much more like regular floating
point values than ``nan``.

* One can compare ``inf`` to other floating point values and it behaves
  as expected, e.g. ``3 < inf`` is True.
* For the most part, arithmetic works "as expected" with ``inf``,
  e.g. ``inf + inf = inf``, ``-2*inf = -inf``, ``1/inf = 0``,
  etc.
* Many existing functions work "as expected" with ``inf``:
  ``np.log(inf) = inf``, ``np.exp(-inf) = 0``,
  ``np.array([1.0, -1.0, np.inf]).min() = -1.0``, etc.

So while ``nan`` almost always means "something went wrong" or "something
is missing", ``inf`` can in many cases be treated as a useful floating
point value.

It is also consistent with the NumPy ``nan`` functions to not ignore
``inf``::

    >>> np.nanmax([1, 2, 3, np.inf, np.nan])
    inf
    >>> np.nansum([1, 2, 3, np.inf, np.nan])
    inf
    >>> np.nanmean([8, -np.inf, 9, 1, np.nan])
    -inf


How *not* to implement ``nan_policy``
-------------------------------------
In the past (and possibly currently), some ``stats`` functions handled
``nan_policy`` by using a masked array to mask the ``nan`` values, and
then computing the result using the functions in the ``mstats`` subpackage.
The problem with this approach is that the masked array code might convert
``inf`` to a masked value, which we don't want to do (see above).  It also
means that, if care is not taken, the return value will be a masked array,
which will likely be a surprise to the user if they passed in regular arrays.


.. rubric:: Footnotes

.. [#f1] If an element of the output depends only on the corresponding
         element of the input (e.g. `numpy.sin`, `scipy.special.gamma`),
         then there is no need for a ``nan_policy`` parameter.