Generating Fake Data for Data Analytics

In the world of data analytics, getting your hands on a good dataset is of paramount importance. In the real world, you probably have access to a lot of raw data that you need to spend some time cleaning. But what if you do not have the required data and want to hack something together quickly for a proof-of-concept demo? In this type of situation, you often have to cook up your own data, and that data needs some degree of realism. So what do you do? Do you painstakingly make up the data manually, or is there an automated way of doing things?
In this article, I will show you some cool ways to fake your data and make it look real!
Generating Names
To generate some fake names, you can use the names package. To use it, first you need to install it:
!pip install names
You can now use the functions in the package to generate names. Pass 'male' or 'female' to get a gender-specific name; without an argument, the gender is picked at random:
import names
display(names.get_full_name('male'))
display(names.get_first_name())
display(names.get_last_name())
display(names.get_full_name('female'))
display(names.get_first_name())
display(names.get_last_name())
Here are some names generated over three runs:
'Gerald Paez'
'Matthew'
'Wiese'
'Dana Mcmullen'
'Heather'
'Oxley'
'Walter Walters'
'Connie'
'Vildosola'
'Nancy Correra'
'Aaron'
'Dawes'
'Randy Meli'
'Yvonne'
'Owen'
'Loretta Patague'
'Sidney'
'Oliver'
Generating UUIDs
Besides names, another type of data that you might want to generate is UUIDs. A UUID (Universally Unique Identifier) is a 128-bit value used to uniquely identify an object or entity on the internet. In the mobile world, UUIDs are often used to identify apps installed on devices.
To generate sample UUIDs, you can use the uuid module. It is part of the Python standard library, so there is nothing to install.
You can convert the UUID generated to a string:
import uuid
str(uuid.uuid4())
Here is a sample UUID generated:
'54487fd7-0632-450e-b6e3-bcc54bc83133'
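Because uuid4() is random, every run produces different IDs. If a demo needs the same IDs each time, one option is to draw the 128 random bits from a seeded generator yourself. The sketch below is my own assumption about how you might do that, not something the uuid module does for you, and the helper name seeded_uuids is made up for illustration:

```python
import random
import uuid

def seeded_uuids(count, seed=0):
    """Return `count` version-4 UUID strings that are identical for a given seed."""
    rng = random.Random(seed)  # private generator, leaves the global random state alone
    # uuid.UUID(int=..., version=4) overwrites the version and variant bits,
    # so each result is a well-formed version-4 UUID
    return [str(uuid.UUID(int=rng.getrandbits(128), version=4)) for _ in range(count)]

ids = seeded_uuids(3, seed=42)
print(ids)
```

Calling seeded_uuids(3, seed=42) again returns the same three strings, which is handy when screenshots or tests need stable IDs.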
The Faker Package
When generating fake data using Python, the Faker package is definitely worth mentioning. It generates all sorts of fake data for your use. Data that you can generate includes:
- address
- barcode
- credit card information
- ISBN
- phone number
- and more!
In the following sections, I will show you how to generate some commonly needed data.
Generating User Profile
The faker package can generate user profiles, such as username, sex, address, email, and date of birth. The following code snippet creates a simple profile for a male person:
from faker import Faker
fake = Faker()
fake.simple_profile(sex='M') # use 'F' for female
The output is a dictionary containing the various details of a male person:
{'username': 'lisa38',
 'name': 'Brandon Gibson',
 'sex': 'M',
 'address': '406 Brandi Inlet\nWest Christopherville, PR 41632',
 'mail': '[email protected]',
 'birthdate': datetime.date(2008, 9, 10)}
Generating Dates
One particular type of data I want to generate is the date of birth (DOB) of a person. When storing details of a person, it is always recommended to store the DOB rather than the age, because an age goes stale as time passes while the DOB lets you recompute the age whenever you need it.
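To make the point concrete, here is a minimal standard-library sketch that derives the age from a stored DOB for any reference date. The function name age_on is hypothetical, and the dates are only illustrative:

```python
from datetime import date

def age_on(born, today):
    """Return the number of completed years between `born` and `today`."""
    # subtract one year if this year's birthday has not happened yet
    before_birthday = (today.month, today.day) < (born.month, born.day)
    return today.year - born.year - before_birthday

dob = date(1963, 4, 18)
print(age_on(dob, date(2024, 4, 17)))  # 60, the birthday is one day away
print(age_on(dob, date(2024, 4, 18)))  # 61, the birthday is today
```

The same stored DOB gives the right answer on both days, whereas a stored age of 60 would have been wrong from the second day onward.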
Using the faker package, you can generate the birth date of a person who is between 18 and 60 years old:
fake.date_between(start_date='-60y', end_date='-18y')
The data returned is a date object:
datetime.date(1963, 4, 18)
If you want to convert the result to a string, you can use the strftime() method:
fake.date_between(start_date='-60y', end_date='-18y').strftime('%Y-%m-%d')
# '1973-07-16'
Note that every time you call a function on the Faker object, new data is generated. If you want the generated data to be deterministic (i.e., always the same), you can call the seed() function first, like this: Faker.seed(0).
Generating Locations
The next type of data I want to generate is location data. For example, say you want the latitude and longitude of a location in the US. You can use the local_latlng() function and specify the country_code parameter:
fake.local_latlng(country_code = 'US')
The function returns a location known to exist on land in the country specified by country_code. The information is returned as a tuple that looks like this:
('33.72255', '-116.37697', 'Palm Desert', 'US', 'America/Los_Angeles')
If you only want the latitude and longitude and not the rest, set coords_only to True:
fake.local_latlng(country_code = 'US', coords_only=True)
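Notice that the latitude and longitude come back as strings, not numbers. If you need to do arithmetic on them, you have to convert them first. Here is a small sketch that turns the five-item tuple into a named record with float coordinates; Place and to_place are hypothetical names of my own, and the sample tuple is the one shown earlier:

```python
from typing import NamedTuple

class Place(NamedTuple):
    latitude: float
    longitude: float
    name: str
    country_code: str
    timezone: str

def to_place(record):
    """Convert a 5-item local_latlng() result into a Place with numeric coordinates."""
    lat, lng, name, country_code, timezone = record
    return Place(float(lat), float(lng), name, country_code, timezone)

# sample tuple copied from the output shown earlier
sample = ('33.72255', '-116.37697', 'Palm Desert', 'US', 'America/Los_Angeles')
place = to_place(sample)
print(place.latitude, place.longitude, place.name)
```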
The country_code parameter accepts values from the land_coords constant, such as AU for Australia:
fake.local_latlng(country_code = 'AU')
# ('-25.54073', '152.70493', 'Maryborough', 'AU', 'Australia/Brisbane')
I couldn’t find the definition of the land_coords constant in the Faker documentation, but you can refer to the land_coords variable defined in https://rdrr.io/github/LuYang19/faker/src/R/init.R.
If you want a pair of coordinates that is guaranteed to exist on land, use the location_on_land() function:
fake.location_on_land(coords_only=True)
# ('54.58048', '16.86194')
Generating Addresses
If you want to generate some sample addresses, use the address(), current_country(), city(), country(), and country_code() functions:
display(fake.address())
# '910 Jason Green Apt. 954\nJonesland, IL 76881'
display(fake.current_country()) # based on the locale of the Faker object (en_US by default)
# 'United States'
display(fake.city())
# 'North Carolyn'
display(fake.country())
# 'Holy See (Vatican City State)'
display(fake.country_code())
# 'MU'
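The addresses in the default locale contain a newline between the street line and the city line. That is fine for printing, but it can be awkward in a CSV file or a table cell. The helpers below are a minimal sketch (the function names are my own) showing two ways to handle it, using the sample address from the output above:

```python
def split_address(address):
    """Split a multi-line address into a list of its lines."""
    return address.split('\n')

def one_line(address):
    """Join the lines with ', ' so the address fits in a single cell."""
    return ', '.join(split_address(address))

# sample address copied from the output above
sample = '910 Jason Green Apt. 954\nJonesland, IL 76881'
print(split_address(sample))
print(one_line(sample))
```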
Locales Support in Faker
So far, all the names and addresses generated are in English. However, the faker package also supports different locales. The list of supported locales can be found at https://faker.readthedocs.io/en/master/locales.html.
The following figure shows an example locale, zh_CN:

For example, in the zh_CN locale, you can find the following providers:
faker.providers.address
faker.providers.company
faker.providers.date_time
faker.providers.internet
faker.providers.job
faker.providers.lorem
faker.providers.person
faker.providers.phone_number
faker.providers.ssn
This means that all the providers listed above support the zh_CN locale. Take faker.providers.address (https://faker.readthedocs.io/en/master/locales/zh_CN.html#faker-providers-address) as an example. When instantiating a Faker object, you can pass in one or more locales:
fake = Faker(['zh_CN']) # Chinese in China locale
fake.address()
The above address() function returns the address in Chinese:
'内蒙古自治区飞市兴山深圳路b座 104347'
If you use the zh_CN locale, some functions will be tied to this locale, such as:
fake.name()
fake.address()
fake.current_country()
Here are some examples:
'洪凤兰'
'辽宁省波县永川王路s座 292815'
"People's Republic of China"
'吕峰'
'广东省凤英市吉区李路t座 385879'
"People's Republic of China"
'何秀梅'
'浙江省齐齐哈尔市上街潮州路M座 218662'
"People's Republic of China"
The addresses generated are all locations in China.
Calling other functions such as fake.country() will return other countries, but the result will be in Chinese (based on the zh_CN locale):
'越南'
越南 is Vietnam.
You can also generate Chinese names using the zh_CN locale:
fake = Faker(['zh_CN'])
display(fake.first_name_male())
display(fake.last_name_male())
display(fake.name())
Here is a sample output of the above code snippet:
'龙'
'马'
'雷春梅'
Putting It All Together
With all these ways to generate different types of fake data, I want to put them all together so that I can perform some data analytics on the result.
The following code snippet generates 1000 sets of the following data:
- UUID
- User name
- Latitude, longitude, and country from one of seven countries
- Gender
- Date of birth
from faker import Faker
import random
import uuid
uuids = []
usernames = []
latitudes = []
longitudes = []
genders = []
countries = []
dobs = []
n = 1000
fake = Faker()
country_codes = ['US','GB','AU','CN','FR','CH','DE']
for gender in ['M','F']:
    for i in range(n // 2): # 500 males and 500 females
        # uuids
        uuids.append(str(uuid.uuid4()))
        # username and sex
        profile = fake.simple_profile(sex=gender)
        usernames.append(profile['username'])
        genders.append(profile['sex'])
        # dob
        dobs.append(fake.date_between(start_date='-78y', end_date='-18y'))
        # lat and lng, and country (picked at random from the seven codes)
        location = fake.local_latlng(country_code = random.choice(country_codes))
        latitudes.append(location[0])
        longitudes.append(location[1])
        countries.append(location[3])
I then combined the 1000 sets of data into a Pandas DataFrame:
import pandas as pd
df = pd.DataFrame(data = [uuids, usernames, genders, countries, latitudes, longitudes, dobs])
df = df.T
df.columns = ['uuid', 'username', 'gender', 'country', 'latitude', 'longitude', 'dob']
df
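One thing to be aware of before analysing this frame: the coordinates come back from Faker as strings and the DOBs are plain Python date objects, so those columns are not numeric or datetime columns yet. If you plan to do arithmetic or date filtering, you may want to convert them. Here is a sketch on a two-row stand-in frame (called sample, built from values shown earlier in this article) rather than on df itself:

```python
import datetime
import pandas as pd

# two illustrative rows shaped like the generated data
sample = pd.DataFrame({
    'latitude':  ['33.72255', '-25.54073'],
    'longitude': ['-116.37697', '152.70493'],
    'dob':       [datetime.date(1963, 4, 18), datetime.date(1973, 7, 16)],
})

sample['latitude'] = pd.to_numeric(sample['latitude'])    # text -> float
sample['longitude'] = pd.to_numeric(sample['longitude'])  # text -> float
sample['dob'] = pd.to_datetime(sample['dob'])             # date -> pandas datetime

print(sample.dtypes)
```

The same three conversions can be applied to df if you need them; the rest of this article works without them.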
The dataframe now contains 1000 fictitious user accounts and their personal details like app ID, location information, gender, and DOB:

Plotting a map
With the latitude and longitude, it would be interesting to plot the geographical locations of my users. For this I used Folium:
import folium # pip install folium
mymap = folium.Map(location = [22.827806844385826, 4.363328554220703],
                   width = 950,
                   height = 600,
                   zoom_start = 2,
                   tiles = 'openstreetmap')
# Note: the built-in 'Stamen ...' tile names may not be available in newer
# folium releases; drop these three lines if they raise an error
folium.TileLayer('Stamen Terrain').add_to(mymap)
folium.TileLayer('Stamen Toner').add_to(mymap)
folium.TileLayer('Stamen Water Color').add_to(mymap)
folium.TileLayer('cartodbpositron').add_to(mymap)
folium.TileLayer('cartodbdark_matter').add_to(mymap)
folium.LayerControl().add_to(mymap)
for lat, lng in zip(df['latitude'], df['longitude']):
    station = folium.CircleMarker(
        location = [lat, lng],
        radius = 5,
        color = 'red',
        fill = True,
        fill_color = 'yellow',
        fill_opacity = 0.3)
    # add the circle marker to the map
    station.add_to(mymap)
mymap
Here’s the map showing the distribution of my users:

I can zoom into the map:

I can also change the tilesets:

Plotting a pie chart
I can visualize where my users are from:
df.groupby('country').count().plot.pie(y='username')

I could also make the pie chart more descriptive:
total = df.shape[0]
def fmt(x):
    # x is the wedge's share in percent; show it with the row count underneath
    return '{:.2f}%\n({:.0f})'.format(x, total * x / 100)
df.groupby('country').count().plot.pie(y='username', autopct=fmt)
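The function passed as autopct receives each wedge's share as a percentage, so the label has to work backwards from that percentage to the row count. The sketch below shows the same formatting in isolation with a fixed total, so you can see what a label looks like without drawing a chart. The name make_autopct is my own, and the total of 1000 matches the number of rows generated above:

```python
def make_autopct(total):
    """Build an autopct callback for a pie chart backed by `total` rows."""
    def fmt(pct):
        # pct is the wedge's share in percent, e.g. 12.5 for 12.5%
        return '{:.2f}%\n({:.0f})'.format(pct, total * pct / 100)
    return fmt

label = make_autopct(1000)
print(label(12.5))
```

Wrapping the formatter in a function that takes the total avoids relying on a global variable, which helps if you draw several pie charts from frames of different sizes.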

Plotting a bar chart
The total users from each country can also be plotted using a bar chart:
from matplotlib import cm
import numpy as np
color = cm.inferno_r(np.linspace(.4, .8, len(country_codes)))
df.groupby('country').count().plot.bar(y = 'username',
                                       color = color,
                                       legend = False)
From the chart you can see that, in this run, Great Britain has the most users while China has the fewest (your numbers will differ, since the data is random):

Plotting a histogram
I can also find out about the age distribution of my users. For this, I need to first calculate their current age based on their DOB:
from datetime import date
from dateutil import relativedelta
def cal_age(born):
    # whole years between the date of birth and today
    return relativedelta.relativedelta(date.today(), born).years
df['age'] = df['dob'].apply(cal_age)
df
The dataframe now has an additional column showing the age of each user:

You can now plot a histogram showing the age distribution:
ax = df['age'].hist(bins=15, edgecolor='black', linewidth=1.2, color='yellow')
ax.set_xlabel("Age")
ax.set_ylabel("Total")
ax.set_xticks(range(18,80,5))
ax.set_title("Users age distribution")

Summary
I hope you are now better equipped to generate any additional data that your projects need. Generating realistic demo data not only allows you to test your algorithms more accurately, it also provides more realism when using them for demos. Let me know in the comments what other types of data you usually need to generate!