Missing values are commonly encountered when processing large collections of data. A missing value can correspond to an empty variable, an empty list, an empty dictionary, a missing element in a column, an empty data frame or even an invalid value. Being able to define empty variables and objects is important for many software applications, particularly when handling missing and invalid data. It underpins tasks such as variable initialization, type checking and specifying function default arguments.
Depending on the use case, there are a few options for specifying an empty variable. The most common is to store the keyword None, which clearly indicates that a variable's value is missing or not valid. While None is useful for marking missing values, it is not useful in situations where calculations are required. In those cases it is often better to convert a missing None into a NaN value, whether the underlying data is categorical, floating point or integer. NaN stands for "Not a Number," and it marks a value as missing while still allowing calculations to run. For example, pandas provides methods such as fillna that replace NaN values with statistics like the mean, median or mode.
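As a quick sketch of that pandas workflow (the column of ages here is invented for illustration):

```python
import numpy as np
import pandas as pd

# An example column of ages with one missing (NaN) value
ages = pd.Series([35.0, 42.0, np.nan, 28.0])

# Replace the NaN with the mean of the valid entries: (35 + 42 + 28) / 3 = 35.0
filled = ages.fillna(ages.mean())
print(filled.tolist())  # [35.0, 42.0, 35.0, 28.0]
```
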
In addition to None and NaN, it is also very useful to specify empty data structures. For example, if you want to populate a list in Python, you typically start by defining an initialized empty list. You may also have a function that requires a list as input in order to run properly, whether that list is populated or empty. A similar argument applies to dictionaries: it is often useful to define an empty dictionary and then add logic that populates it with keys and their respective values. As with lists, you may define a function that needs an empty dictionary default in order to run successfully. The same logic holds for data frames.
Finally, defining empty variables and data structures is useful for tasks such as type checking and setting default arguments. In terms of type checking, empty variables and data structures can inform control flow logic: for example, if presented with an empty data structure, perform some logic to populate it. For default arguments, there may be cases where an empty data structure should trigger logic that allows a function call to succeed under unexpected conditions. Suppose you define a function that is called on different lists of floating point numbers and calculates their average. It works as long as the function is given a list of numbers, but if it is given an empty list it fails, because the average of an empty list is undefined. Type checking and default arguments can then be used to attempt the calculation and return a default value if it fails.
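That last idea can be sketched as a small helper (the name safe_average and its NaN default are our own illustration, not code from later sections):

```python
import numpy as np

def safe_average(values, default=np.nan):
    # Try to compute the mean; fall back to a default (NaN here)
    # for empty lists or non-numeric input
    try:
        return sum(values) / len(values)
    except (TypeError, ZeroDivisionError):
        return default

print(safe_average([35, 42, 28]))  # 35.0
print(safe_average([]))            # nan
```
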
Defining Empty Variables with None and NaN
Defining empty variables in Python is straightforward. If you wish to define a placeholder for a missing value that will not be used in calculations, you can define an empty variable using the None keyword. For example, let’s say we have demographic data with values for age, income (USD), name and senior citizen status:
age1 = 35
name1 = "Fred Philips"
income1 = 55250.15
senior_citizen1 = False

age2 = 42
name2 = "Josh Rogers"
income2 = 65240.25
senior_citizen2 = False

age3 = 28
name3 = "Bill Hanson"
income3 = 79250.65
senior_citizen3 = False
For each person, we have a valid value for age, name, income and senior citizen status. There may be instances, however, where some information is missing or invalid. For example, we may receive data with a character or string for the age, or a floating point or integer value for the name. This can easily occur in web applications that use free text input boxes: if the app doesn’t detect the invalid input and alert the user, the invalid values end up in its database. Consider the following example:
age4 = "#"
name4 = 100
income4 = 45250.65
senior_citizen4 = "Unknown"
For this person, we have an age value of "#", which is clearly invalid. The name is an integer value of 100, which also doesn’t make sense, and the senior_citizen value is "Unknown". If we want to keep this record, since the income is valid, it is best to set age, name and senior_citizen to empty variables using the None keyword:
age4 = None
name4 = None
income4 = 45250.65
senior_citizen4 = None
This way, any developer looking at the data will clearly understand that valid values for age, name and senior_citizen are missing, while the income value can still be used to calculate statistics alongside the other valid data. A limitation of the None keyword is that it can’t be used in calculations. For example, suppose we want to calculate the average age of the four people we have defined:
avg_age = (age1 + age2 + age3 + age4)/4
If we try to run our script it will throw the following error:
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
This TypeError states that we are unable to use the '+' (addition) operator between integers and None values.
We can remedy this by using the NaN ("not a number") value from NumPy as our missing value placeholder (note that NumPy must be imported first):
import numpy as np

age4 = np.nan
name4 = np.nan
income4 = 45250.65
senior_citizen4 = np.nan

avg_age = (age1 + age2 + age3 + age4)/4
Now the code runs successfully. Since there is a NaN in the calculation the result is also NaN, but the script no longer fails. This is especially useful when dealing with data structures such as data frames, since pandas provides methods that handle NaN values directly.
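For example, if we still want a meaningful average despite the NaN, NumPy’s nanmean ignores missing entries (a self-contained sketch using the ages from above):

```python
import numpy as np

ages = [35, 42, 28, np.nan]

# The plain mean propagates the NaN...
print(np.mean(ages))     # nan

# ...while nanmean ignores missing entries: (35 + 42 + 28) / 3 = 35.0
print(np.nanmean(ages))  # 35.0
```
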
In addition to defining empty variables, it is often useful to store empty data structures in variables. This has many uses; here we will focus on initialization and on how default empty data structures can be used with type checking.
Defining Empty List for Initialization
The simplest application of storing an empty list in a variable is initializing a list that will be populated later. For example, we can initialize a list for each of the attributes we defined earlier (age, name, income, senior citizen status):
ages = []
names = []
incomes = []
senior_citizen = []
These empty lists can then be populated using the append method:
ages.append(age1)
ages.append(age2)
ages.append(age3)
ages.append(age4)
print("List of ages: ", ages)
We can do the same for name, income and senior status:
names.append(name1)
names.append(name2)
names.append(name3)
names.append(name4)
print("List of names: ", names)
incomes.append(income1)
incomes.append(income2)
incomes.append(income3)
incomes.append(income4)
print("List of incomes: ", incomes)
senior_citizen.append(senior_citizen1)
senior_citizen.append(senior_citizen2)
senior_citizen.append(senior_citizen3)
senior_citizen.append(senior_citizen4)
print("List of senior citizen status: ", senior_citizen)
Defining Empty Dictionary for Initialization
We can also use an empty dictionary for initialization:
demo_dict = {}
And use the lists we populated earlier to populate the dictionary:
demo_dict['age'] = ages
demo_dict['name'] = names
demo_dict['income'] = incomes
demo_dict['senior_citizen'] = senior_citizen
print("Demographics Dictionary")
print(demo_dict)
Defining Empty DataFrame for Initialization
We can also do something similar with dataframes:
import pandas as pd
demo_df = pd.DataFrame()
demo_df['age'] = ages
demo_df['name'] = names
demo_df['income'] = incomes
demo_df['senior_citizen'] = senior_citizen
print("Demographics Dataframe")
print(demo_df)
Notice that the logic for populating dictionaries and data frames is similar. Which data structure you use depends on your needs as an engineer, analyst or data scientist. For example, dictionaries are more useful if you need to produce JSON files and don’t need array lengths to be equal, while data frames are more useful for generating CSV files.
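To illustrate that trade-off (using a small stand-in dictionary rather than the full demo_dict, since JSON has no native NaN value):

```python
import json
import pandas as pd

# A small stand-in for demo_dict with no missing values
demo = {'age': [35, 42], 'name': ['Fred Philips', 'Josh Rogers']}

# Dictionaries map naturally onto JSON objects...
json_str = json.dumps(demo)

# ...while data frames map naturally onto tabular formats like CSV
csv_str = pd.DataFrame(demo).to_csv(index=False)

print(json_str)
print(csv_str)
```
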
NaN Default Function Arguments
Another use for empty variables and data structures is as default function arguments.
For example, consider a function that calculates income after federal tax. The tax rate for the range of incomes we’ve defined so far is around 22%. We can define our function as follows:
def income_after_tax(income):
    after_tax = income - 0.22*income
    return after_tax
If we call our function with income1 and print the result we get the following:
after_tax1 = income_after_tax(income1)
print("Before: ", income1)
print("After: ", after_tax1)
This works fine for this example, but what if we have an invalid value for income, like an empty string? Let’s pass in an empty string and call our function:
after_tax_invalid = income_after_tax('')
We get a TypeError stating that we can’t multiply a sequence (the empty string) by a non-int of type 'float'. The function call fails and after_tax never gets defined. Ideally, we would like to guarantee that the function runs for any value of income and that after_tax is at least defined with some default value. We can do this by giving after_tax a default NaN argument and type checking the income: we only calculate after_tax if income is a float; otherwise after_tax stays NaN:
def income_after_tax(income, after_tax=np.nan):
    if type(income) is float:
        after_tax = income - 0.22*income
    return after_tax
We can then pass any invalid value for income and still run our code successfully:
after_tax_invalid1 = income_after_tax('')
after_tax_invalid2 = income_after_tax(None)
after_tax_invalid3 = income_after_tax("income")
after_tax_invalid4 = income_after_tax(True)
after_tax_invalid5 = income_after_tax({})
print("after_tax_invalid1: ", after_tax_invalid1)
print("after_tax_invalid2: ", after_tax_invalid2)
print("after_tax_invalid3: ", after_tax_invalid3)
print("after_tax_invalid4: ", after_tax_invalid4)
print("after_tax_invalid5: ", after_tax_invalid5)
The reader may wonder why an invalid value is passed to a function to begin with. In practice, function calls are often made on thousands to millions of user inputs. If the user input is a free text response, and not a drop down menu, it is difficult to guarantee that the data types are correct unless it is explicitly enforced by the application. Because of this, we’d want to be able to process valid and invalid inputs without the application crashing or failing.
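One caveat: checking that income is exactly a float also rejects a whole-dollar income passed as an integer. A hedged variant (our own suggestion, not the only approach; the name income_after_tax_flexible is ours) accepts both ints and floats while excluding booleans, since bool is a subclass of int in Python:

```python
import numpy as np

def income_after_tax_flexible(income, after_tax=np.nan):
    # Accept ints as well as floats; exclude bools explicitly,
    # since isinstance(True, int) is True in Python
    if isinstance(income, (int, float)) and not isinstance(income, bool):
        after_tax = income - 0.22 * income
    return after_tax

print(income_after_tax_flexible(50000))  # roughly 39000.0
print(income_after_tax_flexible(True))   # nan
```
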
Empty List Default Function Arguments
Defining empty data structures as default arguments can also be useful. Let’s consider a function that takes our list of incomes and calculates the after tax income.
def get_after_tax_list(input_list):
    out_list = [x - 0.22*x for x in input_list]
    print("After Tax Incomes: ", out_list)
If we call this with our incomes list we get:
get_after_tax_list(incomes)
Now if we call this with a value that is not a list, for example an integer, we get a TypeError stating that the 'int' object is not iterable:
get_after_tax_list(5)
Now if we add a type check and include an empty list as the default value for our output list, our script runs successfully even with the invalid input:
def get_after_tax_list(input_list, out_list=[]):
    if type(input_list) is list:
        out_list = [x - 0.22*x for x in input_list]
    print("After Tax Incomes: ", out_list)

get_after_tax_list(5)
Empty Dictionary Default Function Arguments
Similar to defining default arguments as empty lists, it is also useful to define functions with empty dictionary default values. Let’s define a function that takes an input dictionary (we will use the demo_dict we defined earlier) and returns a new dictionary containing the mean income:
def get_income_truth_values(input_dict):
    output_dict = {'avg_income': np.mean(input_dict['income'])}
    print(output_dict)
    return output_dict
Let’s call our function with demo_dict
get_income_truth_values(demo_dict)
Now let’s try passing in an invalid value for input_dict. Let’s pass the integer value 10000:
get_income_truth_values(10000)
We get a TypeError stating that the 'int' object (10000) is not subscriptable. We can correct this by checking whether the input is a dictionary, checking whether the appropriate key is in the dictionary, and setting a default argument for the output dictionary that is returned when those conditions are not met. This way, if the conditions fail, we can still run our code without an error. For the default argument we will simply specify an empty dictionary for output_dict:
def get_income_truth_values(input_dict, output_dict={}):
    if type(input_dict) is dict and 'income' in input_dict:
        output_dict = {'avg_income': np.mean(input_dict['income'])}
    print(output_dict)
    return output_dict
And we can make the same function calls successfully
get_income_truth_values(10000)
We can also define a default dict with an NaN value for the ‘avg_income’. This way we will guarantee that we have a dictionary with the expected key even when we call our function with an invalid input:
def get_income_truth_values(input_dict, output_dict={'avg_income': np.nan}):
    if type(input_dict) is dict and 'income' in input_dict:
        output_dict = {'avg_income': np.mean(input_dict['income'])}
    print(output_dict)
    return output_dict
get_income_truth_values(demo_dict)
get_income_truth_values(10000)
Empty Data Frame Default Function Arguments
Similar to our examples with lists and dictionaries, a function with a default empty data frame argument can be very useful. Let’s modify the data frame we defined earlier to include each person’s state of residence:
demo_df['state'] = ['NY', 'MA', 'NY', 'CA']
Let’s also impute the missing values for age and income using the mean:
demo_df['age'] = demo_df['age'].fillna(demo_df['age'].mean())
demo_df['income'] = demo_df['income'].fillna(demo_df['income'].mean())
Next, let’s define a function that performs a groupby on state and calculates the mean of the age and income fields. The result gives us the average age and income for each state:
def income_age_groupby(input_df):
    output_df = input_df.groupby(['state'])[['age', 'income']].mean().reset_index()
    print(output_df)
    return output_df
income_age_groupby(demo_df)
As you’d be able to guess by this point, if we call our function with a data type that is not a data frame we get an error. If we pass a list, we get an AttributeError stating that the 'list' object has no attribute 'groupby'. This makes sense, since the groupby method belongs to data frame objects:
income_age_groupby([1,2,3])
We can define a default data frame containing NaNs for each of the expected fields and check if the necessary columns are present:
def income_age_groupby(input_df, output_df=pd.DataFrame({'state': [np.nan], 'age': [np.nan], 'income': [np.nan]})):
    if isinstance(input_df, pd.DataFrame) and set(['age', 'income', 'state']).issubset(input_df.columns):
        output_df = input_df.groupby(['state'])[['age', 'income']].mean().reset_index()
    print(output_df)
    return output_df
income_age_groupby([1,2,3])
We see that our code runs successfully with the invalid input. While we considered examples for made-up data, these methods extend to a wide variety of data processing tasks, whether for software engineering, data science or machine learning. I encourage you to try applying these techniques in your own data processing code!
The code in this post is available on GitHub.
Conclusions
Defining empty variables and data structures is an essential part of handling missing or invalid values. For variables such as floats, integers, booleans and strings, invalid types can often lead to failing or error-prone code. This can cause programs to crash midway through a large processing job, which can waste significant time and computational resources. Given that handling invalid and missing data is a big part of data processing, understanding how to define empty variables and data structures as function defaults can save the engineer or data scientist much headache down the road. Being able to define functions with sensible defaults, so that they return a consistent and expected output error free, is an essential skill for every programmer.
This post was originally published on the BuiltIn blog. The original piece can be found here.