
As a data scientist, you must have encountered this problem at least once in your Data Science journey: you import your data into a Pandas dataframe and the ID column is displayed as a numeric (integer or float) variable. You may have leading zeros in your original ID column, but they were removed because of the numeric data type.
This is such a common problem and almost always needs to be treated before you carry on your data munging and analysis tasks. This is especially important if you plan to use the ID field as a key to join with other tables later on. So how to fix it in Python?
Let’s look at a simple example. We have a sample dataframe that shows the median home sales price for each county in October 2020. In this dataframe, we have an ID field – FIPS which is a unique 5-digit geographic identifier for each county in the U.S.

The FIPS field is supposed to be a five-digit code and should be imported as a string; instead, it is shown as a float type which doesn’t make sense. We will need to change it to a string type and also add back the leading zeros. We can do it in two ways:
Method 1:
Using this method, we will first change the FIPS field from float
type to integer
type, and then change it to string
. We can use Pandas’ DataFrame.astype()
method to change the data types.
df['FIPS'] = df['FIPS'].fillna(0).astype(int).astype(str)

After the FIPS field is changed to the string
type, we can add back the leading zeros to make it a five-digit code. We can use Pandas’ zfill()
method to do it.
df['FIPS'] = df['FIPS'].str.zfill(5)

Method 2:
Another way to clean the FIPS field is to first change its data type to string
, and then use regex
(regular expressions) in Python to search and replace certain patterns in the string in order to remove the decimal places. The following code uses regex
to remove the ‘.0’ part from the string in the FIPS column.
df['FIPS'] = df['FIPS'].astype(str).replace('.0', '', regex=True)

We then use zfill()
to add back the leading zeros to the FIPS field and make it a five-digit code.
df['FIPS'] = df['FIPS'].str.zfill(5)

You can also choose to replace all the ’00nan’ with ‘00000’ or other values using the following code, but it is optional.
df['FIPS'] = df['FIPS'].replace('00nan', '00000')
One thing I want to point out is that in method 1, please make sure that you use fillna(0)
to replace all the missing values with 0 in the FIPS column before using astype(int)
to change the data type. This is because astype(int)
won’t work if your ID column (i.e., FIPS) has missing values.
For example, if you use the following code to try to change FIPS from float
to integer
directly, you will get a traceback error shown below:
df['FIPS'] = df['FIPS'].astype(int)

In summary, you can easily clean a ‘numeric’ ID column with leading zeros by using one of the following two methods. It is a very common and seemingly easy problem to fix but could be tricky to figure out for python beginners. Therefore, I hope you find this short tutorial and the code helpful, especially for those who just started their data science and python journey. Thanks for reading!
Method 1:
df['FIPS'] = df['FIPS'].fillna(0).astype(int).astype(str)
df['FIPS'] = df['FIPS'].str.zfill(5)
Method 2:
df['FIPS'] = df['FIPS'].astype(str).replace('.0', '', regex=True)
df['FIPS'] = df['FIPS'].str.zfill(5)
Data Source:
[Redfin](https://www.redfin.com/news/data-center/) Data Center: Redfin Monthly Housing Market Data – County Level. This is an open dataset provided by Redfin, a national real estate brokerage, that you can download for free and for your own purposes with citation.
You can unlock full access to my writing and the rest of Medium by signing up for Medium membership ($5 per month) through this referral link. By signing up through this link, I will receive a portion of your membership fee at no additional cost to you. Thank you!