How to Create a Custom Dataset in R

Make your own synthetic dataset to analyze for your portfolio

Martina Giron
Towards Data Science


Photo by Scott Graham on Unsplash

In your data science journey, you might have come across synthetic datasets, sometimes called toy or dummy datasets. These are useful for practicing data wrangling, visualization, and modelling techniques. Compared to real-world data, they are often clean and easy to understand, which makes synthetic data appealing for beginners. Today, I’ll walk you through making your own synthetic data. After reading this article, you’ll be able to create a dataset as large as you like, with as many variables as you like.

This tutorial will tackle how I created this Human Resources dataset. It contains the scores of 300 data scientists on a hypothetical company’s hiring exams, along with their performance evaluation scores after working at the company for one year.

Why create your own dataset?

Aside from being ready to analyze, synthetic datasets offer additional advantages over real-world data. Here are some scenarios to illustrate:

The data you are looking for is highly niche, specific, or confidential

At first, I tried to find a dataset for the project I had in mind. After a couple of hours of searching, I was empty-handed. I realized there may simply not be one online, because companies don’t go around publishing their applicants’ data.

You want to create your own tutorial

Let’s say you want to write a tutorial on performing a linear regression (like I will!). You want a guarantee that some predictors will be linearly related to your dependent variable, and some will not. If you searched for an existing dataset, you would have to test this for each candidate. On the other hand, if you create your own, you can design it so that this assumption is guaranteed to hold.

You want to own all the rights to your data

No more worrying about copyright, data privacy, or terms of use, since the data is completely yours.

Your goal is to practice your data visualization and communication skills

In my case, my purpose is to showcase my knowledge in HR analytics. Specifically, I want to show my knowledge in performing statistical tests and communicating the results to business managers.

What you’ll need

  • An idea of what your dataset will describe
  • Knowledge of the nature of the data
  • Some knowledge of common probability distributions
  • Working knowledge of R

Making the Dataset

R has several functions that allow you to instantly generate random data. The following steps will guide you through choosing the right functions, organizing their outputs, and exporting your finished dataset.

Step 1: List down all variables you want to include

Table of variables to be generated for the synthetic Human Resources dataset: applicant ID, management experience, situational judgment test score, coding exam scores (data cleaning, data visualization, and machine learning), interview scores from three panelists, number of absences, and performance evaluation score.

Note down how many units or rows of data you want. For this project, I want a total of 320 applicants/rows of data.

Step 2: Describe the requirements of each variable

Table of data requirements for each variable. Specifies minimum and maximum scores, lists of categories, and correlated variables.

This is where context and your field experience come in. For example, interview scores may lie between 0 and 10, but as a recruiter, you know that scores from 0 to 3 are unrealistic, so you set 4 and 10 as the minimum and maximum for this variable. Meanwhile, for scores on a machine learning exam, the realistic minimum and maximum are 40 and 100. However, you also know that the average score is 80 and that most applicants score near it, so you want your data to reflect that trend as well. If you want some variables to be correlated with one another, note which ones too.

Step 3: Determine an appropriate distribution for your variables

Table of probability distributions for each variable

This is the trickiest part, because it’s where you will use your knowledge of common probability distributions. If you’re just using your dataset to practice wrangling or visualization, you could probably get away with knowing only the uniform and normal distributions. In my case, though, I want my data to have all sorts of relationships and peculiarities, so I use a whole bunch.

I used a discrete uniform distribution for interview scores because I want the probability of an applicant scoring a 4 to be equal to that of scoring a 4.1, a 4.2, and so on, all the way up to 10. Meanwhile, I used a multivariate normal distribution for my coding exam scores and performance scores because I wanted them all to be normally distributed and correlated with one another at different strengths.
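As an illustration only (this is not the code used later in the article), one way to draw such a discrete uniform sample in R is to sample from an explicit grid of allowed scores; the 0.1 increment below is my assumption:

# Illustrative sketch: every value in 4.0, 4.1, ..., 10.0 is equally likely
interview_example <- sample(seq(4, 10, by = 0.1), size = 320, replace = TRUE)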

Take note that each distribution is characterized by its own set of parameters, and that’s where our requirements from step 2 will come in. You can read more about common probability distributions here.

Step 4: Write the Code

Of course, this part is easier said than done. So let’s work at it one variable (or group of variables) at a time.

Packages

library(MASS)
library(tidyverse)
library(writexl)

First, let’s load all the packages we’ll need: tidyverse for data manipulation, MASS for generating our multivariate normal distribution, and writexl to save our work as an Excel file.
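If any of these packages aren’t installed yet, a one-time install.packages() call takes care of that (MASS typically ships with R already):

# One-time setup, in case the packages aren't installed yet
install.packages(c("MASS", "tidyverse", "writexl"))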

Applicant ID Number

id <- paste0("Applicant #", 1:320)

The simplest way to generate unique ID numbers is to create a sequence of whole numbers. Here, we also prepended “Applicant #” to each number with paste0().
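If you would rather have fixed-width IDs that sort cleanly as text, sprintf is one option; the zero-padding width and the id_padded name below are my own choices:

# Alternative sketch: zero-padded IDs ("Applicant #001", "Applicant #002", ...)
id_padded <- sprintf("Applicant #%03d", 1:320)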

Management

# Assign Management Experience Randomly
set.seed(0717)
management_random <- rbinom(320, 1, 0.5)
management <- ifelse(management_random == 0, "manager", "non-manager")

One way to randomly generate a categorical variable is to first draw a random set of numbers, then assign a category to each number. Here, we randomly drew 0’s and 1’s using rbinom, then assigned “manager” to 0 and “non-manager” to 1.
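If you need more than two categories, sample() with a prob argument is a common shortcut. The department names and weights below are made up purely for illustration and are not part of this dataset:

# Sketch: a categorical variable with three levels and unequal probabilities
set.seed(0717)
department <- sample(c("Analytics", "Engineering", "Product"),
                     size = 320, replace = TRUE, prob = c(0.5, 0.3, 0.2))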

Situational Judgment Test Scores, Number of Absences, and Data Visualization Exam Scores

# Situational Judgment Test Scores
set.seed(0717)
situational <- round(runif(320, 60, 100))
# Number of Absences
set.seed(0717)
absences <- rpois(320, lambda = 1)
# Data Viz Exam Scores
set.seed(0717)
coding_viz <- rnorm(320, 85, 5) %>%
  cap_score(., 100) %>%
  round(2)

Since we already specified the distribution and parameters we want to use for these variables, all we need to do now is plug them into their respective functions: runif for the uniform distribution, rpois for the Poisson, and rnorm for the univariate normal. We’ll also use a small helper function, cap_score, to impose a maximum score of 100 on the Data Visualization Exam Scores; a sketch of this helper is shown below.
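Here is a minimal sketch of cap_score, assuming it simply truncates any value above the given maximum (define it before running the block above):

# Sketch of the cap_score helper: cap scores at a given maximum
cap_score <- function(score, max_score) {
  pmin(score, max_score)
}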

Correlated Variables

cor_var_means <- c(6.8, 7.2, 8.4, 77, 84, 80)
cor_var_matrix <- matrix(
  c(
    0.87, 0.60, 0.70, 0.36, 1.55, 0.57,
    0.60, 1.20, 0.52, 0.50, 1.20, 2.34,
    0.70, 0.52, 0.68, 0.45, 0.89, 0.75,
    0.36, 0.50, 0.45, 15.20, 1.09, 1.64,
    1.55, 1.20, 0.89, 1.09, 17.20, 1.88,
    0.57, 2.34, 0.75, 1.64, 1.88, 9.30
  ),
  byrow = TRUE, nrow = 6
)
set.seed(0717)
correlated_vars_df <- as.data.frame(mvrnorm(n = 320, mu = cor_var_means, Sigma = cor_var_matrix))

We will need a vector of means and a variance-covariance matrix to generate our multivariate normal distribution. We have already decided on our means in earlier steps, but now we have to turn our desired correlations and standard deviations into covariances using cov(X, Y) = r × sd(X) × sd(Y), where r is the correlation between X and Y. You might find it easier to do this step in Excel.
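You can also do the conversion in R. The sketch below builds a covariance matrix from target standard deviations and a correlation matrix; the standard deviations are roughly the square roots of the diagonal above, and the single correlation filled in (0.1 between the first interview score and the data cleaning exam) is just an illustration:

# Sketch: convert standard deviations and correlations into covariances
sds <- c(0.93, 1.10, 0.82, 3.90, 4.15, 3.05)    # approx. sqrt of the variances above
cors <- diag(6)                                 # identity matrix = no correlation yet
cors[1, 4] <- cors[4, 1] <- 0.1                 # e.g. interview_p1 vs. coding_cleaning
cov_sketch <- diag(sds) %*% cors %*% diag(sds)  # cov(X, Y) = r * sd(X) * sd(Y)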

correlated_vars_df_cols <- c("interview_p1", "interview_p2", "interview_p3", "coding_cleaning", "coding_ml", "performance")
colnames(correlated_vars_df) <- correlated_vars_df_cols
correlated_vars_df <- correlated_vars_df %>%
  mutate(
    interview_p1 = round(cap_score(interview_p1, 10), 1),
    interview_p2 = round(cap_score(interview_p2, 10), 1),
    interview_p3 = round(cap_score(interview_p3, 10), 1),
    coding_cleaning = round(cap_score(coding_cleaning, 100), 2),
    coding_ml = round(cap_score(coding_ml, 100), 2),
    performance = round(cap_score(performance, 100))
  )

Now we have our randomly generated interview, coding exam, and performance scores. Let’s clean them up a little. First, we assigned column names to all of them. Next, we round the interview scores to one decimal place and cap them at a maximum of 10. This is what cap_score, the helper sketched earlier, is for: it takes the column and the maximum score as its arguments. For the coding exam scores, we use two decimal places and a maximum of 100. For performance, we make them whole numbers, also with a maximum of 100.
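If you want to confirm that the simulated scores really are correlated roughly the way you specified, a quick look at the sample correlation matrix does the job (the rounding and capping above can shift the values slightly):

# Quick sanity check: sample correlations should sit near the targets
round(cor(correlated_vars_df), 2)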

Step 5: Gather and Save Your Data

applicant_scores <- cbind(
  id, management, situational, coding_viz, correlated_vars_df, absences
)
applicant_final <- applicant_scores[1:300, ]
write_xlsx(applicant_final, "Employee Selection Scores.xlsx")

We can easily gather the variables and data frames we generated using cbind. I’m saving only the first 300 rows of data, since I plan to use this as a training set. When you’re ready, you can save the final dataset in your desired format. I chose Excel’s .xlsx.
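If you’d rather have a plain-text file instead, saving to CSV is just as easy (write_csv comes with the tidyverse via readr); the file name here simply mirrors the Excel one:

# Alternative: save the same data frame as a CSV file
write_csv(applicant_final, "Employee Selection Scores.csv")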

(optional) Step 6: Share it!

You probably created this dataset because you had no access to any alternatives. That means your fellow data analysts in training probably don’t either. Give back to the community by sharing! This is also an opportunity to showcase your R skills and domain knowledge.

You can access my dataset here.

Some final words

Today, we tackled how to create your own dataset, with 320 rows and variables following the normal, Poisson, and discrete uniform probability distributions. You can now analyze this data and add it to your portfolio. Next time, try adding more data points (you can have hundreds of thousands, or even millions of rows!). You can also experiment by having your variables follow different distributions, such as the beta or geometric distributions.
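For example, here are two one-liner sketches with arbitrary parameters and made-up variable names, just to show what those draws look like:

# Arbitrary examples: a proportion between 0 and 1, and a non-negative count
engagement_rate <- rbeta(320, shape1 = 2, shape2 = 5)
attempts_before_pass <- rgeom(320, prob = 0.3)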
