
You Don’t Need Sample Data, You Need Python Faker

An extendable Python library that generates fake data to "fill" your project

Image by anncapictures from Pixabay

Python's built-in "random" module can generate many kinds of random data, such as numbers and strings. However, it can't generate "meaningful" data such as people's names. When we need dummy data for a demonstration or an experiment, a name like "Christopher Tao" is much better than "Llisdfkjwe Asdfsdf" 🙂

One common solution is to download sample data from an open-source dataset. However, if we have no particular requirements for the distribution or patterns of the dummy data, the easiest solution is to generate fake data ourselves.

In this article, I'll introduce a third-party Python library built for this purpose: Faker. It can generate many types of fake data, not only names. Let's begin.

1. Basics

Image by sarajuggernaut from Pixabay

First of all, we need to install the library using pip as follows.

pip install Faker

Then, let's begin with some basic usage patterns. Before we can generate any fake data, we need to create a Faker instance as follows.

from faker import Faker
fake = Faker()

Once we have the "fake" instance, everything is quite straightforward. For example, to generate a person’s name, we just use its name() method.

fake.name()

To generate an address, we can use its address() method.

fake.address()

We can also use Faker to generate some text. It won't make any sense, but at least it looks like real sentences. Don't expect too much at this stage: there is no magic such as a Machine Learning model here. Everything is based on randomness.

fake.text()

Each of the above "generators" is called a "provider" in Faker's terminology. There are too many providers to list here, but a few more will appear in the rest of this article. If you want the full list, the documentation is always your friend.

Welcome to Faker’s documentation! – Faker 13.3.3 documentation

2. Batch Generation

Image by S. Hermann & F. Richter from Pixabay

Generating one piece of fake data at a time is not very useful, so it is important to know that we can also generate fake data in bulk.

2.1 Using For-Loop

Suppose we want to generate some fake user profiles. Each profile should contain the user's name, street address, job title and the company they work for. The simplest way of doing so is to put everything in a for-loop as follows.

for _ in range(3):
    print('Name:', fake.name())
    print('Address:', fake.street_address())
    print('Job:', fake.job())
    print('Company:', fake.company())
    print()

As we can see, the profiles are all different even though we simply call the same methods repeatedly. Faker's internal mechanism is based on randomness, so the fake data we get from each call is random as well.

2.2 Using Built-In Providers

Faker is not limited to generating one type of data at a time. Taking the user profiles as an example, we don't actually have to generate each field individually, because Faker can generate whole fake profiles. Just call its fake.profile() method.

import pprint
for _ in range(3):
    pprint.pprint(fake.profile())

2.3 Generate Pandas Dataframe

The beauty of high-level providers such as profile() is that they return dictionaries. That means we can seamlessly integrate them with other tools such as pandas.

For example, suppose we want user profiles in a pandas DataFrame. Faker's design makes this very easy: it takes just one line of code.

import pandas as pd
pd.DataFrame([fake.profile() for _ in range(10)])

3. Extended Providers

Image by Wolfgang Eckert from Pixabay

Faker has many built-in providers that will satisfy our requirements most of the time. They are listed in the official documentation.

However, what if we can't find the provider we need? Don't worry, Faker exposes its provider interface so that the community can contribute more providers. Faker's documentation lists these community-contributed providers, which I'll call "extended providers" in this article. Here is the list so far.

Community Providers – Faker 13.3.3 documentation

I'll pick vehicles as the example for this showcase. Faker's built-in providers cannot generate vehicle makes, models and so on, but there is already an extended provider that can help.

Following its documentation, we first need to install the extended provider. It can be installed with pip as well.

pip install faker_vehicle

Then, we need to register this extended provider with the faker instance.

from faker_vehicle import VehicleProvider
fake.add_provider(VehicleProvider)

All done. We can start using it now, for example to generate a car make.

fake.vehicle_make()

This extended provider also has some high-level methods. For example, we can generate profiles of cars using the method fake.vehicle_year_make_model().

for _ in range(5):
    print(fake.vehicle_year_make_model())

Notice also that the make and the model match each other. Many thanks to the author of this extended provider.

4. Customized Lorem

Image by Gerhard G. from Pixabay

We can generate a sentence using Faker as follows.

fake.sentence()

However, what if we want the sentences to be generated from our own "dictionary"? In other words, we want a limited set of words to be the pool for all the random sentences we are going to generate.

Some Faker providers support customisation, and the sentence() method is one of them. We can pass in a list of words so that all the sentences are built from that predefined word pool.

my_words = [
    'only', 'these', 'words', 'are',
    'allowed', 'to', 'be', 'used'
]
for _ in range(5):
    print(fake.sentence(ext_word_list=my_words))

5. Unique Values

Image by anncapictures from Pixabay

Knowing that Faker picks its fake data at random from an existing pool behind the scenes, you may be concerned that it will produce duplicates if we generate enough values of one type.

That concern is justified. If we call a method as is, it can return duplicated data. Let's run a simple experiment by generating 500 first names. We generate first names rather than full names to make the problem easier to reproduce.

names = [fake.first_name() for _ in range(500)]
print('Unique names:', len(set(names)))

In this run, only 244 of the 500 generated names are unique.

Faker provides a solution to this problem: access the unique property before calling the method. For example, rather than using fake.first_name(), we should use fake.unique.first_name().

names = [fake.unique.first_name() for _ in range(500)]
print('Unique names:', len(set(names)))

This time, all 500 names are unique. However, if you are as curious as I am, you may ask: what if I generate many more names? Let's test with 2,000 first names.

names = [fake.unique.first_name() for _ in range(2000)]
print('Unique names:', len(set(names)))

This time the call fails. The pool of first names is exhausted long before we reach 2,000, and Faker raises a UniquenessException after 1,000 unsuccessful attempts to find a new value. So Faker has its limits as well, which is understandable because meaningful dummy data cannot be infinite. The good thing is that it raises an error when it can't satisfy our requirements.

6. Certainty In The Randomness

Image by anncapictures from Pixabay

We know that Faker generates data randomly. Does that mean we can't reproduce a demonstration? No, we can always reproduce the fake data if we plan for it from the beginning, by using a seed number.

Now, let's instantiate a new Faker object and then set the seed number as follows. The seed can be any integer. Let's just use 123.

fake = Faker()
Faker.seed(123)

Then, let’s generate 5 names.

for _ in range(5):
    print(fake.name())

Remember these names. Now, let's create a new Faker instance without setting a seed number, and then generate another 5 names.

fake = Faker()
for _ in range(5):
    print(fake.name())

Of course, they will be different because they are random.

Let’s do this another time with the same seed number 123.

If you compare these 5 names with the ones from the first time we used this seed number, they are identical.

Therefore, if we want to reproduce the same fake data generation, just use the seed number.

Summary

Image by JingSun from Pixabay

In this article, I have introduced the Faker library, which is one of the amazing "antigravity" libraries in the Python community. It is very easy to use and can be very useful. When you just need some dummy data and you don't care about its distribution, consider using Faker first.
