NaN, None and Experimental NA

An illustrative guide to missing values conventions in pandas

Deepak Tunuguntla
Towards Data Science


Modern-day technological innovations often involve processing and analysing datasets with missing values. To handle these datasets effectively, different libraries choose to represent the missing values in different ways. Given their importance and frequent occurrence, we will begin with the common approaches to representing missing values, discuss their trade-offs, and then chronologically illustrate the missing values conventions employed in pandas.

Image made by author, using diagrams.

In order to represent the missing values, we see two approaches that are commonly applied to the data in tables or dataframes. The first approach involves a mask to point out the missing values whereas the second uses a datatype-specific sentinel value to represent a missing value.

When masking, a mask can be either global or local. A global mask consists of a separate boolean array for each data array (Figure 1), whereas a local mask utilises a single bit in the element’s bit-wise representation. As an analogy, a signed integer reserves a single bit to indicate the positive/negative sign of the integer.

Figure 1: A global boolean mask approach. Note that MV denotes missing value. Image by author, using diagrams.
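The global mask idea can be sketched in a few lines of NumPy. This is only an illustrative sketch: the values and the placeholder stored in the missing slot are assumptions, not taken from Figure 1.

```python
import numpy as np

# Data array: the slot of the missing entry holds an arbitrary placeholder (0 here).
data = np.array([7, 0, 3, 9], dtype=np.int64)
# Global boolean mask of the same length: True marks a missing value (MV).
mask = np.array([False, True, False, False])

# Computations consult the mask to skip the missing entries.
valid_sum = data[~mask].sum()
print(valid_sum)  # 19

# NumPy's masked-array module packages the same idea.
masked = np.ma.masked_array(data, mask=mask)
print(masked.sum())  # 19
```

Note that the integer dtype is preserved, at the cost of carrying a second array around.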

On the other hand, in a sentinel approach, a datatype-specific sentinel value is defined. This could either be a typical value based on best practices or a uniquely defined bit-wise representation. For the missing values of floating-point types, libraries typically choose the standard IEEE 754 floating-point representation called NaN (Not a Number), see Figure 2. Similarly, some tools, for example R, also define unique bit-wise patterns for other data types.

Figure 2: Illustrates a bit-wise IEEE 754 single precision (32-bit) NaN representation. Based on Wikipedia, IEEE 754 NaNs are encoded with the exponent field filled with ones (like in infinity value representations), and some non-zero number “x” in the significand field (“x” equals zero denotes infinities). This allows for multiple distinct NaN values, depending on which bits are set in the significand field, but also depending on the value of the leading sign bit “s”. It appears that the IEEE 754 standard defines 16,777,214 (2²⁴-2) floating point values as NaNs, or 0.4% of all possible values. The subtracted two values are positive and negative infinity. Also note that the first bit from x is used to determine the type of NaN: “quiet NaN” or “signaling NaN”. The remaining bits are used to encode a payload (most often ignored in applications). Image by author, made using diagrams.
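The bit layout and the NaN count quoted in the caption can be checked with the standard library. The helper name float32_bits is hypothetical and only serves this sketch.

```python
import struct

def float32_bits(value: float) -> str:
    """Return the 32-bit IEEE 754 pattern as 'sign exponent significand'."""
    (raw,) = struct.unpack(">I", struct.pack(">f", value))
    bits = f"{raw:032b}"
    return f"{bits[0]} {bits[1:9]} {bits[9:]}"

# NaN: exponent all ones, significand non-zero (the exact payload may vary).
print(float32_bits(float("nan")))
# Infinity: exponent all ones, significand all zeros.
print(float32_bits(float("inf")))

# Exponent fixed to all ones, 2**23 - 1 non-zero significands, two sign bits.
nan_count = 2 * (2**23 - 1)
print(nan_count)           # 16777214
print(nan_count / 2**32)   # share of all 32-bit patterns, roughly 0.4%
```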

Although the above masking and sentinel approaches are widely employed, they have their trade-offs. A separate global boolean mask adds extra burden in terms of storage and computation, whereas a sentinel value reduces the range of valid values that can be represented, because the sentinel itself can no longer be used as data. Besides that, type-specific bit-wise patterns for sentinels also require additional logic to be implemented for performing bit-level operations.

As pandas is built on NumPy, it simply incorporates the IEEE standard NaN value as a sentinel value for floating-point data types. However, NumPy does not have built-in sentinels for non-floating-point data types. This implies that pandas could utilise either a mask or a sentinel for the non-floating-point types. That is, pandas could either have a global boolean mask, or locally reserve one bit in the element’s bit-wise representation, or have unique type-specific bit-wise representations similar to the IEEE NaN.

However, as mentioned earlier, each of the above-mentioned three possibilities [boolean mask, bit-level mask, and sentinels (bit-wise patterns)] comes at a price. When it comes to utilising global boolean masks, pandas could build upon NumPy’s masked array (ma) module. But the required upkeep of the code base, memory allocations, and computational effort make it less practical. Similarly, on a local level, pandas could also reserve a single bit in each element’s bit-wise representation. But then again, for smaller 8-bit data units, losing a bit to use as a local mask will remarkably reduce the range of values it can store. Therefore, both global and local masking are deemed less favourable. That brings us to the third option, which is type-specific sentinels. Although a possible solution, pandas’ dependence on NumPy makes type-specific sentinels infeasible. For example, the package supports 14 different integer types accounting for precisions, endianness, and signedness. So, if unique IEEE-like standard bit-wise representations are to be specified and maintained for all the different data types NumPy supports, pandas will again end up with a mammoth development task at hand.
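To make the cost of a local mask concrete, the following sketch compares the range of a regular 8-bit signed integer with a hypothetical layout in which one of the eight bits is reserved as a missing-value flag. The reserved-bit layout is an assumption for illustration only; neither NumPy nor pandas implements it.

```python
import numpy as np

full = np.iinfo(np.int8)
print(full.min, full.max)  # -128 127

# Hypothetical local mask: one of the 8 bits flags "missing",
# so only 7 bits remain for the (signed) value.
value_bits = full.bits - 1
reduced_min = -(2 ** (value_bits - 1))
reduced_max = 2 ** (value_bits - 1) - 1
print(reduced_min, reduced_max)  # -64 63
```

Half of the representable values are lost to a single flag bit.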

Because of the above-mentioned practical concerns, and as a good compromise between computational efficiency and upkeep, pandas utilises two existing sentinels to denote nullness. These are the IEEE standard floating-point value NaN (available as numpy.nan) and the Python singleton object None.

However, with v1.0, released in January 2020, pandas introduced an in-house experimental NA value (a singleton object) for scalar missing values 🎉 According to the documentation, the goal of this new pandas.NA singleton is to provide a generic “missing value” indicator that can be employed consistently across all the data types. That is, use pandas.NA throughout instead of hopping between numpy.nan, None and pandas.NaT, which are type-specific. Note that pandas.NaT is used for representing datetime missing values.
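As a quick sketch of how these indicators relate, the snippet below checks each of them with pandas.isna() and shows a little of how the pandas.NA singleton behaves. It assumes pandas v1.0 or later.

```python
import numpy as np
import pandas as pd

# All four indicators are recognised as missing by pandas.isna().
for marker in (np.nan, None, pd.NaT, pd.NA):
    print(repr(marker), pd.isna(marker))

# pandas.NA is a singleton and propagates through arithmetic and comparisons.
print(pd.NA is pd.NA)    # True
print(pd.NA + 1)         # <NA>
print(pd.NA == 1)        # <NA>

# NaN, in contrast, compares as unequal, even to itself.
print(np.nan == np.nan)  # False
```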

When it comes to NaN and None, pandas is built to conveniently switch (convert) between these two sentinels, as and when needed. For example, in the below illustration, we construct a simple dataframe from a list of floating-point values containing one missing value, which we denote using numpy.nan or None.

Figure 3: Illustrates a simple single-columned dataframe constructed from a list of floating-point values comprising one missing value. Note that, by default, pandas infers float values as float64. Image by author, using diagrams.
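A rough textual counterpart to Figure 3 could look as follows. The column name and the values are assumptions, since the original figure is an image.

```python
import numpy as np
import pandas as pd

# The same float data, with the missing value written as numpy.nan or as None.
df_nan = pd.DataFrame({"value": [0.5, np.nan, 2.5]})
df_none = pd.DataFrame({"value": [0.5, None, 2.5]})

# Both columns are inferred as float64, and None has been converted to NaN.
print(df_nan["value"].dtype, df_none["value"].dtype)  # float64 float64
print(df_none)
```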

Although floating-point missing values are conventionally denoted using numpy.nan, for convenience we could also use None to denote a missing value. In such cases, as shown above, pandas implicitly converts None to the NaN value.

Similarly, let us consider another example where we construct a dataframe from a list of integers with just one missing value, see below.

Figure 4: Illustrates the implicit int to float type-casting phenomenon in a simple dataframe constructed from a list of integers comprising a missing value. Image by author, using diagrams.
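The implicit type-casting of Figure 4 can be reproduced with a small sketch. Again, the column name and values are assumptions.

```python
import numpy as np
import pandas as pd

# Without missing values, integers are inferred as int64.
df_complete = pd.DataFrame({"id": [1, 2, 3]})
print(df_complete["id"].dtype)  # int64

# One missing value, written as numpy.nan or None, casts the column to float64.
df_nan = pd.DataFrame({"id": [1, np.nan, 3]})
df_none = pd.DataFrame({"id": [1, None, 3]})
print(df_nan["id"].dtype, df_none["id"].dtype)  # float64 float64
print(df_none)
```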

There are two things to notice in the above illustration. Firstly, when denoting the integer-type missing values using numpy.nan, pandas type-casts the integer-type data (inferred as int64 by default) to float-type data (float64 by default). Secondly, pandas again allows us to utilise either of the two sentinels to denote an integer-type missing value. That is, when None is used to denote a missing value, pandas implicitly converts it to the floating-point NaN value, see the above resulting dataframe. However, automatic type-casting of integers to floating-point values is not always handy, especially when we would like the integers to remain integers, for instance because they serve as identifiers for indexing purposes. Do not worry 😉 because pandas got this sorted 😎

Again as of v1.0, released in January 2020, all of pandas’ existing nullable-integer dtypes, such as (U)Int64, use the new experimental pandas.NA as a missing value indicator, instead of the NaN value. This is fantastic 😃 because by using any of pandas’ extension integer dtypes, we can avoid the integer-to-float type-casting, as and when needed. Note that to differentiate from NumPy’s integer types, e.g. “int64”, the first letter(s) in the extension dtype’s string alias is capitalised, e.g. “Int64”. Hence, as an example, let us construct a simple dataframe using the extension type Int64 and a list of integers comprising a missing value, see below.

Figure 5: Illustrates a single-columned dataframe constructed from a list of integers containing a missing value, using the Int64 extension type as the specified dtype. Note that instead of the dtype alias “Int64” you could also pass pd.Int64Dtype(). Additionally, you could also use smaller bit-size variants such as Int16 or Int8. Image by author, using diagrams.
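In code, the construction shown in Figure 5 might be sketched as below, assuming pandas v1.0 or later. The column name and values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Specifying the nullable-integer dtype keeps the integers as integers.
ids = pd.Series([1, None, 3], dtype="Int64")
df = pd.DataFrame({"id": ids})
print(df["id"].dtype)  # Int64
print(df)

# The alias "Int64" and pd.Int64Dtype() are interchangeable,
# and numpy.nan is converted to pandas.NA in the same way as None.
same = pd.Series([1, np.nan, 3], dtype=pd.Int64Dtype())
print(same.iloc[1] is pd.NA)  # True

# A smaller bit-size variant.
small = pd.Series([1, None, 3], dtype="Int8")
print(small.dtype)  # Int8
```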

As illustrated above, when Int64 is specified as a dtype, pandas does not type-cast integers to floating-point data, and it uses the new experimental NA value to represent all the scalar missing values, instead of NaN.

Version 1.0 also provides us with two new experimental extension types, similar to Int<bit-size>Dtype. These are the string and nullable-boolean dtypes, which are fully dedicated to string and boolean data. The new dtypes are available as StringDtype and BooleanDtype alongside their corresponding aliases “string” and “boolean”. As an example, below we illustrate a simple dataframe constructed using the new experimental extension type StringDtype together with a list of strings with missing values.

Figure 6: Illustrates a single-columned dataframe constructed from a list of strings containing missing values, using the pd.StringDtype() extension data type alias “string”. Image by author, using diagrams.

Even for the new extension string type, when we use numpy.nan or None to denote a missing value, pandas implicitly converts it to the new experimental NA scalar value in the resulting dataframe. Since v1.1, released in July 2020, all dtypes can be converted to StringDtype.

Similar to the string example in Figure 6, we can also use the new extension type BooleanDtype for constructing a dataframe from a list of bool and missing values, see below.
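The string example can be sketched in the same way, assuming pandas v1.1 or later for the conversion at the end. The column name and values are assumptions.

```python
import numpy as np
import pandas as pd

# Missing values written as numpy.nan or None both end up as <NA>.
names = pd.Series(["ada", np.nan, None], dtype="string")
df = pd.DataFrame({"name": names})
print(df["name"].dtype)  # string
print(df)

# Since v1.1, other dtypes can be converted to the string dtype.
as_text = pd.Series([1, 2, 3]).astype("string")
print(as_text.tolist())  # ['1', '2', '3']
```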

Figure 7: Illustrates a single-columned dataframe constructed from a list of bool and missing values, using the new extension pd.BooleanDtype() dtype alias “boolean”. Image by author, using diagrams.
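A comparable sketch for the nullable-boolean dtype of Figure 7 follows. The column name and values are assumptions.

```python
import pandas as pd

# The "boolean" alias selects pd.BooleanDtype(); the missing entry becomes <NA>.
flags = pd.Series([True, None, False], dtype="boolean")
df = pd.DataFrame({"flag": flags})
print(df["flag"].dtype)  # boolean
print(df)
```

Without the extension dtype, a list mixing bools and None would fall back to an object-dtype column.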

And, as of December 2020 with the release of v1.2, pandas supports two additional experimental extension data types, Float32Dtype and Float64Dtype, which are fully dedicated to floating-point data. Both nullable-float dtypes can hold the experimental pandas.NA value. Although NumPy’s float uses the NaN value to represent a missing value, these new extension dtypes are now in line with the already existing nullable-integer and -boolean dtypes. See below for an example that shows the nullable-float dtype Float64 at work.

Figure 8: Illustrates a dataframe construction using the pd.Float64Dtype() alias “Float64”, which is an extension of NumPy’s float64 data type that supports the pd.NA singleton. Note that, as an alternative, there is only one smaller bit-size variant available, which is Float32. Image by author, using diagrams.
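The nullable-float dtype can be tried out with the sketch below, assuming pandas v1.2 or later. The column name and values are assumptions.

```python
import pandas as pd

# With "Float64" the missing entry is stored as <NA> rather than NaN.
measurements = pd.Series([1.5, None, 3.0], dtype="Float64")
df = pd.DataFrame({"measurement": measurements})
print(df["measurement"].dtype)  # Float64
print(df)

# The only smaller bit-size variant.
small = pd.Series([1.5, None, 3.0], dtype="Float32")
print(small.dtype)  # Float32
```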

Besides the above-mentioned extension data types, i.e., the nullable-integer, -boolean, -float and string dtypes, pandas also comes with their corresponding extension array data types. These are the IntegerArray (available since v0.24, updated in v1.0), StringArray (v1.0), BooleanArray (v1.0), and FloatingArray (v1.2, see pandas-dev source for documentation). So, when the dtype is specified as “boolean”, all the BooleanDtype data is stored in a BooleanArray. Similarly, when the dtype is specified as “Int64”, all the Int64Dtype data is stored in an IntegerArray.

On a closer look, a dataframe that is constructed with an extension data type as its specified dtype actually uses the corresponding extension array data type to store the (series) data. Under the hood, these extension array data types are, in essence, represented by two NumPy arrays, see Figure 1. The first is used to store the data whereas the second is used as a global boolean mask, which indicates the values that are missing in the first array (boolean value True means missing). However, note that a StringArray only stores StringDtype objects and does not have a second NumPy array to be used as a boolean mask. On the other hand, when the extension array data types are not utilised in constructing a dataframe, all the array-like data with missing values are, by default, stored in float- or object-dtype NumPy arrays. These arrays use NaN or None to represent the missing values. With that, below we tabulate all the missing values conventions available in pandas.

Table 1: Missing values conventions in pandas. Note that the nullable and string dtypes are extension data types. Strings are typically stored in object-dtype NumPy arrays. However, with the introduction of the string dtype, they can now be stored in a StringArray. Image by author, made using diagrams.
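The extension arrays described above can be inspected through the public API. The sketch below deliberately avoids private attributes; the boolean mask is viewed through isna(), and the values are assumptions.

```python
import numpy as np
import pandas as pd

nullable = pd.Series([1, None, 3], dtype="Int64")
arr = nullable.array
print(type(arr).__name__)  # IntegerArray

# The mask side of the array: True means missing.
mask = arr.isna()
print(mask)  # [False  True False]

# Converting back to a plain NumPy array requires choosing a sentinel again.
as_float = arr.to_numpy(dtype="float64", na_value=np.nan)
print(as_float.dtype)  # float64

# Without an extension dtype, the data lives in a regular NumPy float array.
default = pd.Series([1, None, 3])
print(type(default.to_numpy()).__name__, default.dtype)  # ndarray float64
```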

And that brings us to the end. Hopefully, the above discussions and illustrations systematically showcase both the regular and the experimental missing values representations in pandas. Although the new pandas.NA scalar value is labelled as experimental, recent release features and enhancements do show the library’s drive and intention to make pandas.NA the generic missing value indicator. Thank you and keep having fun with pandas 😃 🙏

Another way of constructing dataframes

Rather than constructing the above illustrated dataframes from a list of data with missing values and specifying an (extension) dtype, we could also use the good old pandas.array() to create the data for a dataframe. As of v1.0, pandas.array(), by default, infers an extension dtype for an array-like input of integers, bools, strings, floats and more, see here.
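A minimal sketch of this route is given below, assuming a recent pandas version. The column names and values are assumptions, and only integer and boolean inputs are shown.

```python
import pandas as pd

# No dtype is specified: pandas.array() infers the nullable extension dtypes.
ints = pd.array([1, None, 3])
bools = pd.array([True, None, False])
print(ints.dtype, bools.dtype)  # Int64 boolean

# The extension arrays can be passed straight to the DataFrame constructor.
df = pd.DataFrame({"ints": ints, "bools": bools})
print(df.dtypes)
print(df)
```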
