Richer Missing Values
Exploring a data frame with greater power, flexibility, and safety

One of my responsibilities at work is to maintain a collection of survey data. At first it was a straightforward project. With a short script I could compile the first wave of data into a nicely organized data frame, summarize it in a few different ways, and send out a report. But then, one participant left the study without responding. Another few participants answered a question incorrectly. Later, after dozens of responses were already collected, we added a new question to the survey. With each new issue the data became harder to maintain.
Every time a piece of data was missing, regardless of the cause and consequences, it would be replaced with the same value, NA. And this oversimplification led to a variety of problems. The missing values looked the same, so it was easier to predict what values I might encounter in my dataset. But they affected calculations in different ways, and I found it harder to verify that operations were handling them safely. I looked for ways to represent the real story behind the NAs, but the workarounds I found were verbose and complicated. The simplicity of NA made my code more complex.
The most useful way to improve NA is to split it into two values: an applicable missing value, and an inapplicable one. Consider the example of a new survey question, added after half the responses have been collected. A response could be missing for two reasons: either someone skipped the question, or they were never asked at all. The first is applicable; the second is not. If you are taking an average, you can safely delete inapplicable values. But it is unsafe to ignore applicable missing values, because learning their true values (or imputing them) would change your results.
Unfortunately, most programming languages, including R and Python, make it difficult to represent this information. In R, NA is treated as an unknown but applicable value. If you add it to another value, the result is another NA. But there is no way to represent a different kind of missing value. Since you cannot distinguish between different types of NA, you must detect and handle them on your own.
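This behavior is easy to see at the console:

```
x <- c(3, NA, 5)
x + 1    # 4 NA 6: the unknown value stays unknown
sum(x)   # NA: a single unknown value makes the whole sum unknown
```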
In practice, people use a number of different workarounds to this problem. One option is to use an imputation method to deal with all applicable missing values, and then boldly use the na.rm = TRUE option available in most arithmetic and statistical functions. This treats all the NAs as inapplicable, allowing you to take a sum or average without removing missing values first. But blindly removing missing values can be unsafe. In cases when NA could be applicable, we almost always want the sum to be missing, because this helps us interpret the result. A missing result means there is a missing value left to deal with. A real value guarantees that we are done.
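For example, with a made-up vector of donation amounts in which one respondent’s amount is unknown:

```
donations <- c(20, NA, 35)
sum(donations)                # NA
sum(donations, na.rm = TRUE)  # 55: every NA is dropped, applicable or not
```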
In order to avoid na.rm = TRUE, you could create variables that indicate whether the record is applicable or not. This seems safer, but it can become verbose. Instead of a simple sum(variable) you start to accumulate expressions like sum(variable[isApplicable]), and the more complex the data becomes, the harder it is to remember how to write isApplicable. You will need multiple indicator variables, one for every way you want to determine whether something is applicable. And the data structure can’t tell you which indicator variable corresponds with which data variable, so you must keep track of them yourself. This can lead to a whole different set of errors.
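Using the names from above, the workaround looks like this:

```
variable     <- c(4, NA, 5, NA)
isApplicable <- c(TRUE, TRUE, TRUE, FALSE)  # FALSE: the question was never asked

sum(variable[isApplicable])  # NA: the remaining NA is a genuine skip, so it propagates
```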
The third and best strategy is to split the data into smaller data.frames, an approach that can remove the problematic missing values entirely. To do this, you group the records that are missing the same variables into their own table, which then doesn’t need the missing column at all. The goal is to remove all inapplicable values, so that every NA can be unambiguously treated as applicable and neither of the previous workarounds is necessary. The downside is that some datasets require a large number of tables, and working across different tables is harder than working within one. Although this approach is much cleaner than the other two, it still adds complexity and creates more work.
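A small sketch of the idea, with hypothetical survey waves:

```
# Respondents who were asked the new question go in one table,
# earlier respondents in another, so the inapplicable column disappears.
wave1 <- data.frame(id = 1:2, age = c(31, 45))                   # never asked newQ
wave2 <- data.frame(id = 3:4, age = c(52, 28), newQ = c(4, NA))  # this NA is a real skip

sum(wave2$newQ)  # NA: every NA left in wave2 can safely be treated as applicable
```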
Insights from relational database theory
In 1986, E.F. Codd wrote a paper titled “Missing Information (Applicable and Inapplicable) in Relational Databases.” Years earlier, Codd had developed the theory of the relational database, a model that remains extremely popular and, along with the matrix, inspired the design of data frames. (In fact, Hadley Wickham’s famous concept of “tidy” data is explicitly based on Codd’s third normal form.) In the 1986 paper, Codd explores missing data and responds to issues that had been raised about how it was handled in relational databases. The most popular method at the time had been choosing sentinel values, like -99, to represent different kinds of missing data. This put the burden on the user to parse the values correctly in order to manipulate the data. It was a flawed and unsafe approach.
Codd proposes a system of A-marks and I-marks, where A is applicable missing and I is inapplicable. His paper describes how the database can handle, and not just represent, both kinds of missing information, so that special values or additional variables are no longer required. In a later paper, Gessert argues that this system is superior to the alternative of splitting tables to remove all the inapplicable values, which creates a great deal of unnecessary work for programmers and users. Because the database can understand missing values, it can handle them appropriately, avoiding the need for workarounds by the user.
In his paper, Codd outlines proposals for how values, missing values, and inapplicable values should behave when combined in computations. He says that I-marks should be considered stronger than A-marks, which are in turn stronger than ordinary values, and that an operation combining different types should return the stronger one. For example, A + 2 = A: the sum is unknown because the value of A is unknown. Similarly, A + I = I, and this is appropriate because the sum A + I is meaningless; it cannot be applicable to anything. The same rules apply to all arithmetic operations, as well as other operations like concatenation.
When the operation is logical, such as an equality or comparison, the rules are more complicated. Codd presents a simple example: what is the value of a statement like “BIRTHDATE > 1-1-66”? It is true or false when the birthday is available, but it is unknown otherwise. So, Codd proposes a three-valued logic that includes TRUE, FALSE, and MAYBE.
The three-valued logic is widely used today. In R, MAYBE is represented as the logical `NA`. Logical operations are more complicated than they are with just TRUE and FALSE, because MAYBE can’t be interpreted as a concrete logical value. Instead, MAYBE is a temporary placeholder, reflecting our current uncertainty about the real value. For example, in R, NA | TRUE evaluates to TRUE, because the expression is true regardless of the value on the left side. However, NA | FALSE is NA, because we can’t know the result without learning the missing information. The logic for AND operations is similar: NA & TRUE is NA, because the result depends on the unknown value, while NA & FALSE is clearly FALSE. Negation is the simplest operation: !FALSE is TRUE, and !NA is NA. The three-valued logic can be counterintuitive at first, but all the rules stem from the central principle of NA: it is not a value but a state of uncertainty.
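All of these rules can be confirmed at the R console:

```
NA | TRUE    # TRUE
NA | FALSE   # NA
NA & TRUE    # NA
NA & FALSE   # FALSE
!NA          # NA
```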
While three-valued logic is the simplest way to treat missing data, Codd soon realized that it was inadequate. In a follow-up article the next year, he presents a four-valued logic, which, like his A-marks and I-marks, distinguishes between applicable MAYBE and inapplicable MAYBE. Interestingly, Codd writes that he doesn’t believe it is worth the effort to implement this logic, but notes that it is more precise than three-valued logic. He expects that it will be integrated later on (it never was).
The four-valued logic extends the three-valued logic, just as A-marks and I-marks extend missing values. Just as I-marks are stronger than A-marks, I-MAYBE is stronger than A-MAYBE, so A-MAYBE & I-MAYBE is I-MAYBE. In the case of AND operations, the strongest value is FALSE, because the result is guaranteed to be false regardless of the value on the other side. Similarly, an OR with TRUE will always be TRUE. The most complex expression is I-MAYBE | A-MAYBE, which is A-MAYBE. This is because A-MAYBE is a placeholder for a logical value, and if that value turns out to be TRUE, the whole expression is meaningful as well: it is TRUE. And the best way to represent a potentially TRUE but unknown value is A-MAYBE.
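These rules can be written down directly. Here is a small sketch, consistent with the rules above, that encodes the four values as the strings "T", "F", "A" (applicable MAYBE), and "I" (inapplicable MAYBE); the string encoding is only for illustration:

```
or4 <- function(x, y) {
  if (x == "T" || y == "T") return("T")  # TRUE dominates an OR
  if (x == "A" || y == "A") return("A")  # a potentially TRUE value comes next
  if (x == "I" || y == "I") return("I")
  "F"                                    # FALSE only when both sides are FALSE
}

and4 <- function(x, y) {
  if (x == "F" || y == "F") return("F")  # FALSE dominates an AND
  if (x == "I" || y == "I") return("I")  # the stronger MAYBE comes next
  if (x == "A" || y == "A") return("A")
  "T"
}

or4("I", "A")   # "A"
and4("A", "I")  # "I"
```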
Codd has mixed feelings about the four-valued logic, and thinks it may be too complex to be worthwhile. His belief is that users of relational databases prefer simple systems over complex ones because complex systems take too long to learn. But while three-valued logic is simpler than four-valued logic, it is far less expressive. Gessert later argues for four-valued logic, giving the simple example of a fee that is owed. It is crucial to know whether the fee is missing because it is not yet known, or missing because it is not applicable. Gessert says that restricting the database to three-valued logic prevents users from taking full advantage of that distinction. He proposes adding new operators to handle the new missing values, and notes that once users learned these few operators, they would understand the new logic completely.
A-marks, I-marks, and four-valued logic all present the same choice between simplicity and power. These features certainly add complexity to the data and the system that manages it, but they also add expressiveness. If we can distinguish between applicable and inapplicable information in our data, we can manipulate the data more efficiently and perform calculations more safely. The distinction is crucial when applicable missing values can obscure errors or bias our results.
Implications for data frames
The theory explored by Codd and Gessert matters because data frames have the same semantics as relational databases. In the dplyr package for R, many functions are named after their SQL equivalents: select(), *_join(), coalesce(), and others behave much as they do in SQL. In R, NA is the equivalent of the SQL NULL. And like SQL, R does not implement anything like Codd’s A-marks and I-marks. It is limited to three-valued logic, where the maybe value is represented by the logical NA.
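For instance, this small dplyr pipeline (over a made-up responses table) mirrors its SQL counterpart almost word for word:

```
library(dplyr)

responses <- tibble(id = 1:3, score = c(4, NA, 5))

# Roughly: SELECT id, COALESCE(score, 0) AS score FROM responses
responses %>%
  select(id, score) %>%
  mutate(score = coalesce(score, 0))
```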
If these data structures distinguished between applicable and inapplicable missing values, calculations would become simpler. Instead of using one of the current options — relying on na.rm = TRUE, or selecting subsets of records, or splitting tables — we would just drop all the inapplicable values. As an example, if we want to add up a variable containing dollar values, we can safely discard any entries that aren’t applicable. But if there is an applicable missing value, we will propagate this uncertainty, and our result will be “applicable missing.” This is much safer than arbitrarily discarding all the missing values, and it doesn’t require us to write extra code to select the applicable records. It is both safer and more efficient than having one NA.
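A small sketch of the behavior we would want, storing a mark alongside each value ("V" for an ordinary value, "A" for applicable missing, "I" for inapplicable missing); sum_marked() is a hypothetical helper, not an existing function:

```
sum_marked <- function(values, marks) {
  keep <- marks != "I"  # inapplicable entries are always safe to drop
  if (any(marks[keep] == "A")) NA_real_ else sum(values[keep])
}

dollars <- c(120, NA, 80, NA)
marks   <- c("V", "I", "V", "A")

sum_marked(dollars, marks)                    # NA: an applicable missing value remains
sum_marked(c(120, NA, 80), c("V", "I", "V"))  # 200: only inapplicable values were dropped
```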
The system of A-marks and I-marks also makes the structure of the data more flexible. For example, it is common to collect data from multiple observations of the same set of individuals. You may have some variables that you collect with every observation, and other variables that describe the individuals. The problem is that an individual’s variables only show up if you collect at least one observation from them. If no observations are made, the individual disappears from the data entirely. I-marks make this easier. We add at least one row for every individual and include the description variables; and whenever an individual has no data, we mark all of the observation fields as inapplicable. In theory, the same approach could be used to represent data of any shape, or to combine any number of tables, with no loss of information.
This system proposed by Codd is already powerful, but it could be extended even further. We could have additional marks that record applicability in the context of certain operations or other conditions. We could extend marks to known values, so that even a known value could be excluded from some calculations as inapplicable, and marks could serve as a way to group data. Marks could be extended in a number of ways that match a simple data structure to messy, real-world contexts.
Implementing a rich missing value system in R
Recent developments in R make it possible to actually implement these ideas, so it is worth briefly exploring how this would work. Historically, we have been limited to atomic, base R vectors. Every value in an atomic vector is the same type; in a double vector, every value must be a double or NA_real_. But the vctrs package offers a way to implement objects that behave like vectors. It works by letting you provide methods for the relevant base R behaviors: constructors, casting, subsetting with [ and [[, arithmetic and other calculations, and visual output with format(). Because these objects are more flexible than atomic vectors, they can model applicable and inapplicable missingness.
One possible proposal is to represent A-marks and I-marks using a record structure, which has one field for the data and one for the marks. A missing value can then be represented as an NA in the data field and an A or I in the mark field. By implementing the base R functions for our new type, we can achieve the desired behavior.
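Here is a minimal sketch of that record structure using vctrs; the class name marked_dbl and the mark codes ("V" for value, "A", "I") are my own, not an established convention:

```
library(vctrs)

new_marked_dbl <- function(x = double(), mark = rep("V", length(x))) {
  x <- vec_cast(x, double())
  mark <- vec_recycle(vec_cast(mark, character()), vec_size(x))
  new_rcrd(list(data = x, mark = mark), class = "marked_dbl")
}

# Show the underlying value when one exists, otherwise show the mark itself.
format.marked_dbl <- function(x, ...) {
  ifelse(field(x, "mark") == "V", format(field(x, "data")), field(x, "mark"))
}

new_marked_dbl(c(1, NA, NA), mark = c("V", "A", "I"))  # displays as 1, A, I
```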
Once we’ve implemented the data structure, we can write better functions for it. First, we can discard inapplicable missings by default. This is especially relevant to functions like sum(), mean(), quantile(), %in%,¹ and Reduce() (or purrr::reduce()). Second, we can implement binary operators so that they conform to Codd’s rules for arithmetic. In particular, the expression A + A should be A, I + I should be I, and A + I should be I. Finally, we can implement four-valued logic, extending base R’s three-valued logic to account for applicable and inapplicable results. This would result in a complete implementation of an A- and I-mark system, and is a project I plan to explore.
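As a taste of the second step, here is a rough sketch of Codd’s arithmetic rules built on the new_marked_dbl() constructor above, using vctrs’ double dispatch for operators; it only handles the case where both operands are marked_dbl:

```
vec_arith.marked_dbl <- function(op, x, y, ...) {
  UseMethod("vec_arith.marked_dbl", y)
}

vec_arith.marked_dbl.marked_dbl <- function(op, x, y, ...) {
  xm <- field(x, "mark")
  ym <- field(y, "mark")
  # The stronger mark wins: I beats A, and A beats an ordinary value.
  mark <- ifelse(xm == "I" | ym == "I", "I",
          ifelse(xm == "A" | ym == "A", "A", "V"))
  data <- vec_arith_base(op, field(x, "data"), field(y, "data"))
  data[mark != "V"] <- NA_real_
  new_marked_dbl(data, mark = mark)
}

a <- new_marked_dbl(c(1, NA), mark = c("V", "A"))
b <- new_marked_dbl(c(2, 3),  mark = c("V", "I"))
a + b  # first element is 3; the second is inapplicable, because A + I = I
```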
The cost of simplicity
It is extremely valuable to have representations that fit our data, because real datasets are complex. A theoretical plan for a dataset can quickly be derailed by real-world problems, like incomplete, corrupted, or lost data. Missing data in particular is hard to interpret, and it makes calculations more difficult. This creates a tension between structures that are simple to work with, and structures that are complex but can represent data with more nuance.
There is no doubt that it takes extra effort to learn how to work with a dataset that includes both applicable and inapplicable missing values. NA can be confusing enough on its own (for one thing, there are actually five different kinds of NA²). The presence of two missing values changes how all of the usual operations work, and while their behavior is guided by a core set of principles, it would take time for people to learn how to use them effectively and without errors. Complex tools can be dangerous when users don’t understand them fully.
But extra complexity is often justified. When we use a data structure with only one missing value, we are forced to write code that selects the records we want to operate over, or split our data into different tables. Because our data is overly simple, we are forced to make our code more complicated. And because our code is more complicated, it will be error-prone and difficult to maintain. In contrast, a more powerful missing value representation can allow for simpler and more reliable ways to manipulate the data.
Ever since the first relational databases, users and developers have opted for simple designs that are easy to learn and use. But sometimes an idea that sounds complex is a perfect match for the messiness of the real world.
Notes
¹ One base R function has an interesting behavior: %in%. It actually discards missing values by default. We would expect 4 %in% 1:3 to be FALSE, but we would also expect 4 %in% c(1:3, NA) to be NA, since we can’t rule out that the last value is 4. This implementation reflects the fact that most people think about missing values with %in% differently than with sum(), but it is clearly inconsistent with other logical operations. By distinguishing between applicable and inapplicable missing values, we can reimplement this in a way that is both correct and intuitive.
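The behavior in question:

```
4 %in% 1:3         # FALSE
4 %in% c(1:3, NA)  # FALSE, not NA: the missing value is silently ignored
```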
² NA is a logical value, so it cannot be stored directly in a non-logical vector. Instead, R defines NA_integer_ for integer vectors, NA_real_ for doubles, NA_complex_ for complex numbers, and NA_character_ for strings. These are hidden from the user by the way c() works and the way R prints out vectors.






