
Generate Simulated Dataset for Linear Model in R

When the real dataset is hard to find, simulate it.

Photo by CHUTTERSNAP on Unsplash

Motivation

In recent years, research on Machine Learning (ML) has grown along with increasing computational capability. As a result, many ML models have been improved, and new ones proposed, that perform better than traditional models.

One of the main problems researchers encounter when trying to implement a proposed model is the lack of a proper real-world dataset that satisfies the model’s assumptions. In other cases, a real-world dataset exists, but it is expensive and hard to collect.

To overcome these problems, researchers usually generate a simulated dataset that follows the model’s assumptions. A simulated dataset can serve as a benchmark for the model, or as a replacement for a real-world dataset in the modeling process, since it is far more cost-effective to produce. This article explains how to generate a simulated dataset for a linear model using R.

The Concept

The process of generating a simulated dataset can be explained as follows. First, we specify the model that we want to simulate. Next, we determine each independent variable’s coefficient, then simulate the independent variables and the error from chosen probability distributions. Finally, we compute the dependent variable from the simulated independent variables (with their predetermined coefficients) and the error.

To see this process in practice, I will walk through several implementations of generating simulated datasets for linear models in R.

Implementation: Linear Regression

For the first example, suppose that we want to simulate the following linear regression model:

y = b0 + b1 x_1 + b2 x_2 + error

where x_1 follows a Normal distribution with mean 50 and variance 9, x_2 follows a Normal distribution with mean 200 and variance 64, and the error follows a Normal distribution with mean 0 and variance 16. Suppose, too, that b0, b1, and b2 are 150, -4, and 2.5, respectively. We can simulate the model with a few lines of code (note that the results may differ with different seeds).
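The original code block did not survive in this version of the article, so here is a minimal sketch consistent with the stated distributions. The sample size of 100 matches the 97 residual degrees of freedom in the output below; the seed is my own choice, so the exact estimates will differ from those shown.

```r
# Simulate y = b0 + b1*x1 + b2*x2 + e with n = 100 observations
set.seed(123)                              # any seed works; results vary with the seed
n  <- 100
b0 <- 150; b1 <- -4; b2 <- 2.5
x1 <- rnorm(n, mean = 50,  sd = sqrt(9))   # variance 9
x2 <- rnorm(n, mean = 200, sd = sqrt(64))  # variance 64
e  <- rnorm(n, mean = 0,   sd = sqrt(16))  # variance 16
y1 <- b0 + b1 * x1 + b2 * x2 + e
m1 <- lm(y1 ~ x1 + x2)
summary(m1)
```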

> summary(m1)
Call:
lm(formula = y1 ~ x1 + x2)
Residuals:
    Min      1Q  Median      3Q     Max 
-41.782 -12.913  -0.179  10.802  53.316
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 154.23621   10.71954   14.39   <2e-16 ***
x1           -3.98515    0.19636  -20.30   <2e-16 ***
x2            2.47327    0.02714   91.14   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 17.96 on 97 degrees of freedom
Multiple R-squared:  0.9885, Adjusted R-squared:  0.9883 
F-statistic:  4179 on 2 and 97 DF,  p-value: < 2.2e-16
The first example model's diagnostic checking using the simulated dataset (Image by the Author)

From the model m1, we can see that the model is significant (based on the p-value of the overall F-test), and that each independent variable (x_1 and x_2) and the intercept are significant (based on their individual p-values). We can also see that the estimated coefficients are quite close to the predetermined values.

Implementation: Linear Regression with Categorical Independent Variable

Now, for the second example, suppose that we want to simulate the following linear regression model:

y = b0 + b1 x_1 + b2 x_2 + b3 x_3 + error

where x_1 and x_2 (and their coefficients), the error, and b0 are the same as in the first example, but x_3 is a binary categorical variable that follows a Binomial distribution with probability of success (denoted as 1 in R) of 0.7, and b3 is 5. Using the same seed as before, we can simulate the model as follows.
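A sketch of this second simulation, again with my own seed; re-setting the same seed reproduces the x_1, x_2, and error draws from the first example before x_3 is generated with `rbinom`:

```r
# Re-use the same seed so x1, x2, and e match the first example
set.seed(123)
n  <- 100
b0 <- 150; b1 <- -4; b2 <- 2.5; b3 <- 5
x1 <- rnorm(n, mean = 50,  sd = sqrt(9))
x2 <- rnorm(n, mean = 200, sd = sqrt(64))
e  <- rnorm(n, mean = 0,   sd = sqrt(16))
x3 <- rbinom(n, size = 1, prob = 0.7)      # binary variable, P(x3 = 1) = 0.7
y2 <- b0 + b1 * x1 + b2 * x2 + b3 * x3 + e
m2 <- lm(y2 ~ x1 + x2 + x3)
summary(m2)
```

Note that x3 is kept numeric (0/1) rather than converted to a factor, which matches the single `x3` row in the coefficient table below.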

> summary(m2)
Call:
lm(formula = y2 ~ x1 + x2 + x3)
Residuals:
    Min      1Q  Median      3Q     Max 
-41.914 -12.804  -0.065  10.671  53.178
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 153.84275   11.32094  13.589   <2e-16 ***
x1           -3.98432    0.19751 -20.173   <2e-16 ***
x2            2.47330    0.02728  90.671   <2e-16 ***
x3            5.46641    4.11890   1.327    0.188    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 18.05 on 96 degrees of freedom
Multiple R-squared:  0.9885, Adjusted R-squared:  0.9882 
F-statistic:  2758 on 3 and 96 DF,  p-value: < 2.2e-16

From the model m2, we can see that the model is significant, and that every independent variable (except x_3) and the intercept are significant (based on their individual p-values).

Implementation: Count Regression

For the final example, suppose that we want to simulate the following count model (specifically, a Poisson regression model):

log(lambda) = b0 + b1 x_1 + b2 x_2, where y follows a Poisson distribution with mean lambda

where x_1 follows a Normal distribution with mean 2 and variance 1, x_2 follows a Normal distribution with mean 1 and variance 1, and b0, b1, and b2 are 5, -4, and 2.5, respectively. The difference from the first two examples is that we need to calculate the logarithm of lambda first using the equation above, then exponentiate it to obtain the Poisson mean for the dependent variable. We can simulate the model as follows.
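A sketch of the Poisson simulation under the stated distributions (seed and sample size are again my own assumptions; n = 100 matches the 97 residual degrees of freedom in the output below):

```r
# Poisson regression: log(lambda) = b0 + b1*x1 + b2*x2, y ~ Poisson(lambda)
set.seed(123)
n  <- 100
b0 <- 5; b1 <- -4; b2 <- 2.5
x1 <- rnorm(n, mean = 2, sd = 1)
x2 <- rnorm(n, mean = 1, sd = 1)
log_lambda <- b0 + b1 * x1 + b2 * x2
y3 <- rpois(n, lambda = exp(log_lambda))   # exponentiate to get the Poisson mean
m3 <- glm(y3 ~ x1 + x2, family = poisson(link = "log"))
summary(m3)
```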

> summary(m3)
Call:
glm(formula = y3 ~ x1 + x2, family = poisson(link = "log"))
Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.99481  -0.58807  -0.14819   0.00079   2.08933
Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  5.02212    0.03389   148.2   <2e-16 ***
x1          -3.96326    0.02871  -138.0   <2e-16 ***
x2           2.48380    0.02950    84.2   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 90503.729  on 99  degrees of freedom
Residual deviance:    69.329  on 97  degrees of freedom
AIC: 283.21
Number of Fisher Scoring iterations: 4

From the model m3, we can see that the variables and the intercept are significant, and that the estimated coefficients are again quite close to their predetermined values.

Conclusion

And that’s it! You have learned how to generate simulated datasets for linear models in R. The examples in this article are simple illustrations of how the dataset-generation process works. In real-world applications, you can generate more complex simulated datasets and use them for linear models with interaction effects or for advanced ML models.

As usual, feel free to ask and/or discuss if you have any questions! See you in my next article!

Author’s Contact

LinkedIn: Raden Aurelius Andhika Viadinugroho

Medium: https://medium.com/@radenaurelius


