Tips and Tricks

You might have ended up reading this article for a variety of reasons. If you think this post is about forcing panda bears to do hacky stuff with a computer, please close this tab and continue with whatever you were up to. You are doing yourself a favour. This never happened.

For the rest of you (us) nerdy Pythonistas and data scientists who cannot live without pandas, this might be your lucky day. Reflect upon the following question: how many times have you created a small DataFrame by hand to test some new piece of code you were working on? What about building a fake dataset to quickly test a machine learning model? If you are like me, probably quite a few times. It usually looks something like this:
import pandas as pd

data = {'Name': ['Tom', 'Brad', 'Kyle', 'Jerry'],
        'Age': [20, 21, 19, 18],
        'Height': [6.1, 5.9, 6.0, 6.1]}
df = pd.DataFrame(data)
It takes a little while to write those lines, and the result is a tiny 4×3 dataframe. I feel you there, it’s painful. You think your precious time should be dedicated to more worthy tasks, and I totally agree. That is why I want to introduce you to this short piece of code that could help you recover what’s yours. Please don’t read it; you can just copy and paste it into your editor (if you trust me). It’s better to understand how it works by running the examples below. Anyway, here it comes:
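(The full gist isn’t reproduced here, so as a stand-in, here is a minimal sketch that matches the behaviour described in the rest of this article. The category lists are illustrative placeholders of my own, and the real implementation is more thorough about validating its arguments.)

```python
import numpy as np
import pandas as pd

# Illustrative category families -- placeholders, each with 15 unique entries.
categories_dict = {
    "names":   ["Tom", "Brad", "Kyle", "Jerry", "Anna", "Lena", "Marc", "Ivan",
                "Nora", "Omar", "Paul", "Rita", "Sven", "Tara", "Ursula"],
    "animals": ["cat", "dog", "fox", "owl", "bat", "bee", "elk", "emu",
                "hen", "koi", "ram", "yak", "ant", "cod", "eel"],
    "cities":  ["Paris", "Tokyo", "Lima", "Oslo", "Cairo", "Quito", "Rome",
                "Seoul", "Accra", "Hanoi", "Kiev", "Lagos", "Miami", "Nice", "Pune"],
    "colors":  ["red", "blue", "green", "yellow", "purple", "orange", "pink",
                "brown", "black", "white", "gray", "cyan", "magenta", "teal", "navy"],
}

default_intervals = {"i": (0, 10), "f": (0, 100),
                     "c": ("names", 5), "d": ("2020-01-01", "2020-12-31")}

def generate_fake_dataframe(size, cols, col_names=None, intervals=None, seed=None):
    rng = np.random.default_rng(seed)
    families = list(categories_dict)   # rotated across categorical columns
    suffix = {"c": "cat", "i": "int", "f": "float", "d": "date"}
    data, cat_seen = {}, 0
    for i, t in enumerate(cols):
        # Resolve the interval: per-column list > per-type dict > built-in default.
        if isinstance(intervals, list):
            interval = intervals[i]
        elif isinstance(intervals, dict):
            interval = intervals.get(t)
        else:
            interval = None
        if interval is None:
            if t == "c":  # successive categorical columns use different families
                interval = (families[cat_seen % len(families)], 5)
            else:
                interval = default_intervals[t]
        if t == "i":
            column = rng.integers(interval[0], interval[1], size, endpoint=True)
        elif t == "f":
            column = rng.uniform(interval[0], interval[1], size)
        elif t == "d":
            column = rng.choice(pd.date_range(interval[0], interval[1]), size)
        elif t == "c":
            # Either ("family", n) or a raw list of objects to sample from.
            pool = (interval if isinstance(interval, list)
                    else categories_dict[interval[0]][:interval[1]])
            column = rng.choice(pool, size)
            cat_seen += 1
        else:
            raise ValueError(f"'{t}' is not a valid column type")
        name = col_names[i] if col_names is not None else f"column_{i}_{suffix[t]}"
        data[name] = column
    return pd.DataFrame(data)
```

The sketch already supports everything the examples below exercise: the cols string, custom col_names, reproducibility via seed, and intervals given as a dict, a list, or left out entirely.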
A bunch of examples
So how do you generate stuff with this? Pretty simple, you can go like:
generate_fake_dataframe(size=1000, cols="cififficcd")
and get back a 1000-row dataframe with ten columns.

The size parameter is obviously the number of rows. cols = "cififficcd" is not so straightforward: it lets you control the type of the columns you want to generate, and in which order, choosing among these options:
- c: generates a column with categorical variables.
- i: generates a column of integers.
- f: generates a column of floats.
- d: generates a column of date values.
You can see how the columns are named following the pattern column_n_dtype. Regarding the categorical features, we see that column_0_cat is filled with first names, column_7_cat has animals and column_8_cat has city names. There is one more family of categorical features ("colors"), and more can be added by editing the categories_dict. Each family has 15 unique categories used to populate categorical columns.
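Assuming the categories_dict is exposed somewhere you can reach it, adding a family is a plain dictionary update. (The toy registry and the "languages" family below are made-up examples, not part of the actual function.)

```python
# A toy stand-in for the function's category registry.
categories_dict = {
    "names":  ["Tom", "Brad", "Kyle", "Jerry", "Anna"],
    "colors": ["red", "blue", "green", "yellow", "purple"],
}

# Register a new 15-element family so future "c" columns can draw from it.
categories_dict["languages"] = [
    "Python", "R", "Julia", "Scala", "Go", "Rust", "C", "C++", "Java",
    "Kotlin", "Swift", "Ruby", "Perl", "Haskell", "Lua",
]
```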
Here you have two more examples, df1 and df2:

Ok, there are some new parameters. col_names accepts a list with the names you want to give to the columns, in case you don’t want the ones following the column_n_dtype pattern; of course, it should have the same length as the cols string. The seed parameter controls the random number generator and lets you reproduce the same results if the other parameters remain the same; if not set (or set to None), a different dataframe is returned each time the function is called. Finally, there is the intervals parameter, which allows you to fine-tune how each of the columns is generated.
When you don’t pass anything to intervals, each column type is generated with a default configuration. Integers are by default uniformly distributed in the interval (0, 10), floats get (0, 100), and dates are generated between ("2020-01-01", "2020-12-31"). Categorical columns are generated with 5 elements of the "names" family by default, and other families are chosen when there is more than one categorical column.
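Under the hood, those defaults boil down to a handful of numpy/pandas primitives. Roughly (using numpy’s Generator API for illustration, not necessarily the exact calls the function makes):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
size = 5

ints   = rng.integers(0, 10, size, endpoint=True)               # default (0, 10)
floats = rng.uniform(0, 100, size)                              # default (0, 100)
dates  = rng.choice(pd.date_range("2020-01-01", "2020-12-31"), size)
names  = rng.choice(["Tom", "Brad", "Kyle", "Jerry", "Anna"], size)  # 5 "names"
```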

Setting custom intervals
If the default configuration for each datatype does not suit your needs, you can change it through the intervals parameter. You might have noticed some "inconsistencies" in the code used to generate df1 and df2: in df1 we are passing a dictionary to the intervals parameter, and in df2 it is a list. Why is that? Well, we can use a dictionary to specify a configuration that will affect all columns of the same type. For example, we can set intervals to be:
intervals = {"d": ("1996-01-01", "1996-12-31"),
             "c": ("colors", 7)}
Then we are overriding the default configuration for date columns ("d"), which will now always contain values between those dates, and for categorical columns ("c"), which will now always choose among 7 different values from the colors family (instead of 5 from the names family). The column types that are not present in the intervals parameter will retain their default configuration.
So, to clarify, if we call generate_fake_dataframe(20, "cccd", intervals={"c": ("colors", 7)}) we will get a dataframe with 3 categorical columns, each with at most 7 different colors, and 1 date column with the default configuration.
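The dict form is essentially a per-type override of the defaults; conceptually it is just a dictionary merge (a sketch of the idea, not the function’s actual code):

```python
# Built-in defaults, one entry per column type.
default_intervals = {"i": (0, 10), "f": (0, 100),
                     "c": ("names", 5), "d": ("2020-01-01", "2020-12-31")}

# What the caller passed via intervals={...}.
user_intervals = {"c": ("colors", 7)}

# Types present in the user's dict win; everything else keeps its default.
resolved = {**default_intervals, **user_intervals}
```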
On the other hand, we can pass a list to intervals instead of a dictionary. We do this when we want even more control, that is, when we want a specific configuration for each of the columns separately. The ith element of the list just contains the interval we want to assign to the ith column. Let’s take a closer look at the intervals used to create df2.
cols = "cicffcd",
intervals = [("names", 10), (18, 25), ("cities", 15),
             (73.2, 95.0), (1.65, 1.95), ("animals", 11), None]
We see in cols that we have 7 columns, and the intervals list contains 7 elements, one for each column. Of course, each interval tuple should have the appropriate format depending on the column type (if we replaced ("names", 10) with, let’s say, (13, 20), the function would throw an error, since the first column is set to "c", categorical). With that interval configuration we can be as specific as we want: the first column contains names among 10 options; the second, integers between 18 and 25; the third, cities among 15 options, and so on. The last column is set to None, which means that for that column we want to use the default interval.
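The list form can be thought of as zipping cols with intervals and falling back to the per-type default whenever an entry is None. A sketch of that resolution, together with the kind of type check that produces the error mentioned above (the check function is my own illustration):

```python
default_intervals = {"i": (0, 10), "f": (0, 100),
                     "c": ("names", 5), "d": ("2020-01-01", "2020-12-31")}

cols = "cicffcd"
intervals = [("names", 10), (18, 25), ("cities", 15),
             (73.2, 95.0), (1.65, 1.95), ("animals", 11), None]

# Pair each column type with its interval; None falls back to the default.
resolved = [iv if iv is not None else default_intervals[t]
            for t, iv in zip(cols, intervals)]

def check(col_type, interval):
    """A categorical column needs ("family", n) or a list, not a numeric tuple."""
    if col_type == "c" and not (isinstance(interval, list)
                                or isinstance(interval[0], str)):
        raise ValueError(f"{interval} is not a valid categorical interval")

check("c", ("names", 10))  # fine
```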
One more card up my sleeve…

There is one more trick about generate_fake_dataframe I haven’t told you yet, and it is about introducing custom category families on the fly for categorical columns. We have said that all intervals should be defined as a tuple of two elements, no matter if it’s (int, int), (float, float), (date, date) or (category_family, number_of_elements). Well, this is not exactly true; there is one, and only one, exception. For categorical columns, we can replace the (family, n) tuple with a list of the objects we want the column to be populated from. Let me show you two more examples:

Notice that for the third and fourth columns in df3 we are not passing tuples but lists of elements (a list of possible skills, and a list containing True and False). This also works perfectly fine with the dictionary form, as in df4, where all categorical columns are fed from the list [0, 1].
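The raw-list escape hatch works because sampling from an explicit list is exactly what numpy’s choice does. A standalone illustration of the three list styles mentioned above (not the function itself):

```python
import numpy as np

rng = np.random.default_rng(7)
size = 10

skills = rng.choice(["SQL", "Python", "Excel", "Spark"], size)  # custom strings
flags  = rng.choice([True, False], size)                        # boolean column
binary = rng.choice([0, 1], size)                               # the [0, 1] trick
```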
Wrapping up
I hope you found this article useful; feel free to add generate_fake_dataframe to your toolbelt. I have dared to open a pull request on pandas to see if this could potentially be included in a future release. The current alternatives (pd.util.testing.makeDataFrame and pd.util.testing.makeMissingDataframe) don’t really get the job done. There are also other libraries like Faker with a richer variety of data types, but they are not designed to output DataFrames (as far as I know). So I firmly believe that there is room for this within pandas.
There are probably many changes and improvements to make before that happens: adapting the code to follow pandas’ own guidelines, rewriting it to improve clarity, renaming some of the parameters (or even the function itself), etc. If you want this included in pandas natively, or have any suggestions, feel free to speak your mind in the PR’s discussion. For any inquiries, you can reach me at [email protected]