Creating Synthetic Data

Using models to feed data centricity

Dr. Marc Jacobs
Towards Data Science
16 min read · Nov 21, 2022


I have blogged about synthetic data before. Twice actually. The first time I used synthetic data from a Dutch Cancer Registry to conduct Survival Analysis. The second time I used the same dataset to apply Specification Curve Analysis.

Synthetic data will play a key role in the years to come, as models become ever more complex yet run up against the limits of the available data. For some time now, the model-centric movement has slowly been giving way to a more data-centric view, and in time the models will once again take center stage.

Synthetic data is actually both worlds combined, and it is still in its infancy. It is data, but not the kind we are used to seeing. For the majority of researchers, data comes from observations made in the field or from sensors. Synthetic data is modelled data. In essence, it is data disguised as other data, in which the sensitive components are removed but the key components remain. If successful, a modeler should not even know that the data is synthetic. Perhaps a good analogy is standing before the Mona Lisa in the Louvre, except that it is not the actual Mona Lisa: the true Mona Lisa is hidden in a basement and you are looking at a heavily guarded replica.

Synthetic data is all about honoring the essence of the original without being the original. And from that starting point, synthetic data can become even more: training material for models on cases that seldom occur in real life, and simulation material that can be shared with everyone, because the data has lost its sensitivity but not its worth.

What I am going to do in this blog is take a dataset in R, the diamonds dataset, and create several models from it to establish synthetic data. Of course, the diamonds dataset is already open and does not contain sensitive information, but the point of the exercise is to recreate the essence of the dataset without recreating the dataset itself.

My reason for using the diamonds dataset is that it has sufficient rows and columns without having immediate and clear-cut relationships.

Let's get started.

rm(list = ls())
options("scipen"=100, "digits"=10)
#### LIBRARIES ####
library(DataExplorer)
library(ggplot2)
library(dplyr)
library(tidyr)
library(ggExtra)
library(skimr)
library(car)
library(GGally)

library(doParallel)
cl <- makePSOCKcluster(5)
registerDoParallel(cl)

After loading the libraries, I want to immediately take a look at the data itself. If I am to model the data to identify its essence (via cross-correlations) and from that essence replicate it, it is essential that I get a solid grip on that dataset. For that I will look at the covariance and correlation matrices first.

skimr::skim(diamonds)
str(diamonds)
diamonds$price <- as.numeric(diamonds$price)
diamonds %>%
  dplyr::select(where(is.numeric)) %>%
  cor()
diamonds %>%
  dplyr::select(where(is.numeric)) %>%
  corrgram::corrgram()
diamonds %>%
  dplyr::select(where(is.numeric)) %>%
  cor() %>%
  ggcorrplot::ggcorrplot(., hc.order = TRUE, type = "lower", lab = TRUE)
The diamonds dataset has a lot of rows, but only a couple of columns. That should make it easier to build the model, but the relationships are not clear-cut. Image by author.

I will start with the numerical variables, since that is the easiest entry point. This does not mean that we cannot create synthetic data for ordinal or nominal data; however, a correlation matrix for non-continuous data requires some additional transformations, which we leave out of the picture for now.
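For the curious, here is a minimal sketch of what such a transformation could look like, assuming the polycor package is installed (this is an aside and is not used further in this post). Its hetcor() function mixes Pearson, polyserial, and polychoric correlations depending on the types of the columns.

# Aside (not used below): a heterogeneous correlation matrix for mixed data
library(polycor)
diamonds %>%
  dplyr::select(carat, price, cut, clarity) %>%
  sample_n(1000) %>%             # hetcor() is slow, so use a subsample
  as.data.frame() %>%
  polycor::hetcor()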

The Correlation Matrix for all numerical variables. Image by author.
And three correlation matrices shown visually. To the left you see a heatmap with dendrograms attached to it; a dendrogram comes from a clustering technique that operates on the correlations, which is why the two are straightforward to combine. As you can see, there are some heavy correlations and some parts that are almost separate. Image by author.

We could create even fancier plots, but R will run into a bit of a problem plotting all 53k rows. So I will first select a random sample of 1,000 rows and then plot the distributions of each numerical variable and their underlying correlations.

diamonds %>%
  dplyr::select(where(is.numeric)) %>%
  sample_n(., 1000) %>%
  scatterplotMatrix()
The distributions show multiple peaks, which is interesting. Also, the relationships between variables are sometimes extreme and sometimes non-existent (which you could also call extreme). Image by author.

And there is a wonderful way to plot correlations and distributions per specific category (here cut) using the GGally library.

diamonds %>%
  dplyr::select(x, y, z, carat, depth, table, price, cut) %>%
  sample_n(., 1000) %>%
  ggpairs(.,
          mapping = ggplot2::aes(color = cut),
          upper = list(continuous = wrap("density", alpha = 0.5),
                       combo = "box_no_facet"),
          lower = list(continuous = wrap("points", alpha = 0.3),
                       combo = wrap("dot_no_facet", alpha = 0.4)),
          title = "Diamonds")
Image by author.

If we are to recreate a dataset based on original data containing only numerical values, we immediately need to think about two main issues that have to be taken care of:

  1. The summary statistics and distribution of each variable need to be as close as possible to the original, but do not need to be identical. That is, the structure of each variable needs to be maintained. If that is not the case, the descriptive part of the dataset cannot be used, although the modelling part can still hold up.
  2. The covariance/correlation between the variables needs to be the same. That is, the underlying relationship between the variables needs to be maintained. This is essential for the modelling part.

Extracting the covariance/correlation matrix from the data is not that difficult, but as we said, the standard formula only works for numeric values. Nonetheless, we can start our creation with the numerical values to get the idea going.

Here you see the covariance matrix.

diamond_cov<-diamonds%>%dplyr::select_if(., is.numeric)%>%cov()


carat depth table price
carat 0.22468665982 0.01916652822 0.1923645201 1742.76536427
depth 0.01916652822 2.05240384318 -0.9468399376 -60.85371214
table 0.19236452006 -0.94683993764 4.9929480753 1133.31806407
price 1742.76536426512 -60.85371213642 1133.3180640679 15915629.42430145
x 0.51848413024 -0.04064129579 0.4896429037 3958.02149078
y 0.51524781641 -0.04800856925 0.4689722778 3943.27081043
z 0.31891683911 0.09596797038 0.2379960448 2424.71261297
x y z
carat 0.51848413024 0.51524781641 0.31891683911
depth -0.04064129579 -0.04800856925 0.09596797038
table 0.48964290366 0.46897227781 0.23799604481
price 3958.02149078326 3943.27081043196 2424.71261297033
x 1.25834717304 1.24878933406 0.76848748285
y 1.24878933406 1.30447161384 0.76731957995
z 0.76848748285 0.76731957995 0.49801086259

From this procedure we obtain the covariance, but from that alone we cannot create data. To do so, we also need summary statistics, which can be obtained fairly easily.

diamonds%>%dplyr::select_if(., is.numeric)%>%summary()

carat depth table price
Min. :0.2000000 Min. :43.0000 Min. :43.00000 Min. : 326.00
1st Qu.:0.4000000 1st Qu.:61.0000 1st Qu.:56.00000 1st Qu.: 950.00
Median :0.7000000 Median :61.8000 Median :57.00000 Median : 2401.00
Mean :0.7979397 Mean :61.7494 Mean :57.45718 Mean : 3932.80
3rd Qu.:1.0400000 3rd Qu.:62.5000 3rd Qu.:59.00000 3rd Qu.: 5324.25
Max. :5.0100000 Max. :79.0000 Max. :95.00000 Max. :18823.00
x y z
Min. : 0.000000 Min. : 0.000000 Min. : 0.000000
1st Qu.: 4.710000 1st Qu.: 4.720000 1st Qu.: 2.910000
Median : 5.700000 Median : 5.710000 Median : 3.530000
Mean : 5.731157 Mean : 5.734526 Mean : 3.538734
3rd Qu.: 6.540000 3rd Qu.: 6.540000 3rd Qu.: 4.040000
Max. :10.740000 Max. :58.900000 Max. :31.800000

From these pieces, we should be able to recreate the data using various procedures. Perhaps the most straightforward is the multivariate normal distribution available in the MASS package. All I need is the mean of each variable and the correlation matrix; the multivariate normal distribution will do the rest. To obtain an equal comparison, I will create as many observations as the original dataset has.

sigma<-diamonds%>%dplyr::select_if(., is.numeric)%>%cor()%>%as.matrix()
mean<-diamonds%>%dplyr::select_if(., is.numeric)%>%as.matrix()%>%colMeans()
df<-as.data.frame(MASS::mvrnorm(n=dim(diamonds)[1], mu=mean, Sigma=sigma))
> dim(df)
[1] 53940 7
> head(df)
carat depth table price x y
1 -1.34032717822 61.49447797 57.19091527 3931.162855 3.545669821 3.931061364
2 -0.47751648630 61.16241371 57.86509627 3931.306295 4.906696688 5.057863929
3 2.24358594522 63.09062530 56.73718104 3932.957682 7.386140807 7.405936831
4 -0.03108967881 60.99439588 57.58369767 3931.677322 5.194041948 5.431802322
5 1.16179859890 62.39235813 57.96524508 3933.322044 6.213609464 6.592872841
6 -0.16757252197 60.84783867 56.68337288 3932.501268 4.987489939 5.118558015
z
1 1.138732182
2 2.596622246
3 5.674154202
4 3.089565271
5 4.387662667
6 2.577509666

Once created, the next task at hand is to check the validity and usability of the data applying two methods:

  1. Checking univariate characteristics.
  2. Checking multivariate characteristics.

With the ggpairs function I can do a little bit of both, and get a starting glimpse of the applicability of the process and the data it generates.

diamonds%>%dplyr::select_if(., is.numeric)%>%ggpairs()
ggpairs(df)
To the left the original dataset, to the right the simulated data, built from the means and the correlation matrix of the original data. The distributions of the synthetic data are, of course, normal. The correlations between the variables are maintained, but the summary statistics surely are not (except the mean). Image by author.
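A quick numeric check, using only the objects created above, backs this up. A minimal sketch comparing both criteria from earlier:

# Criterion 2: the correlation structure should be (almost) identical
round(cor(diamonds %>% dplyr::select(where(is.numeric))) - cor(df), 3)

# Criterion 1: the univariate spread clearly is not; every synthetic variable
# has a standard deviation of roughly 1, because the correlation matrix was
# used as Sigma
sapply(diamonds %>% dplyr::select(where(is.numeric)), sd)
sapply(df, sd)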

Now, we already said that our intention is to build synthetic data, which means building data that is in essence the same, but not necessarily in outlook. We HAVE achieved that goal. I will build a quick model on both datasets, using the caret package, to see if the synthetic data gives me essentially the same results as the original dataset.

diamonds_num <- diamonds %>% dplyr::select_if(., is.numeric)
trainIndex <- caret::createDataPartition(diamonds_num$carat,
                                         p = .6,
                                         list = FALSE,
                                         times = 1)
> wideTrain <- diamonds_num[ trainIndex,]; dim(wideTrain)
[1] 32366 7
> wideTest <- diamonds_num[-trainIndex,]; dim(wideTest)
[1] 21574 7

fitControl <- caret::trainControl(
  method = "repeatedcv",
  number = 20,
  repeats = 20)

lmFit1 <- caret::train(carat ~ .,
                       data = wideTrain,
                       method = "lm",
                       trControl = fitControl,
                       verbose = FALSE)
> summary(lmFit1)

Call:
lm(formula = .outcome ~ ., data = dat, verbose = FALSE)

Residuals:
Min 1Q Median 3Q Max
-0.54124488 -0.03836279 -0.00665401 0.03530248 2.71983984

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.5273956761070 0.0294158001479 -85.91966 < 0.000000000000000222 ***
depth 0.0188998891767 0.0003766002370 50.18555 < 0.000000000000000222 ***
table 0.0046678986326 0.0002266381898 20.59626 < 0.000000000000000222 ***
price 0.0000330063643 0.0000002519416 131.00798 < 0.000000000000000222 ***
x 0.2956915579921 0.0032315175272 91.50238 < 0.000000000000000222 ***
y 0.0130670340617 0.0028968529147 4.51077 0.000006482 ***
z -0.0026197077889 0.0025495576103 -1.02751 0.30419
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.0844222 on 32359 degrees of freedom
Multiple R-squared: 0.9684143, Adjusted R-squared: 0.9684085
F-statistic: 165354 on 6 and 32359 DF, p-value: < 0.00000000000000022204

Above is the summary of a simple linear regression based on the original data; below is the summary of a simple linear regression based on the synthetic data. Yes, the procedure is not exactly the same, and one can expect some variation from the selection of train and test data and from the repeated cross-validation, but the point of the exercise is to see whether the essence of the data is kept.

trainIndex <- caret::createDataPartition(df$carat,
                                         p = .6,
                                         list = FALSE,
                                         times = 1)
> wideTrain <- df[ trainIndex,]; dim(wideTrain)
[1] 32364 7
> wideTest <- df[-trainIndex,]; dim(wideTest)
[1] 21576 7

fitControl <- caret::trainControl(
  method = "repeatedcv",
  number = 20,
  repeats = 20)
lmFit2 <- caret::train(carat ~ .,
                       data = wideTrain,
                       method = "lm",
                       trControl = fitControl,
                       verbose = FALSE)
summary(lmFit2)

Call:
lm(formula = .outcome ~ ., data = dat, verbose = FALSE)

Residuals:
Min 1Q Median 3Q Max
-0.68929576 -0.11680934 0.00078743 0.11741046 0.74064948

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1079.584562077 8.210331042 -131.49099 < 0.000000000000000222 ***
depth 0.052707918 0.001155147 45.62874 < 0.000000000000000222 ***
table 0.019422785 0.001035408 18.75859 < 0.000000000000000222 ***
price 0.272537301 0.002088731 130.47987 < 0.000000000000000222 ***
x 0.708827904 0.006118409 115.85167 < 0.000000000000000222 ***
y 0.015186425 0.004344156 3.49583 0.00047322 ***
z 0.007592425 0.004694642 1.61725 0.10583335
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1736463 on 32357 degrees of freedom
Multiple R-squared: 0.9697885, Adjusted R-squared: 0.9697829
F-statistic: 173109.8 on 6 and 32357 DF, p-value: < 0.00000000000000022204

What becomes immediately clear is that the size of the intercept has changed, which is because the synthetic data has the same mean but not the same distribution. Another way to get a glimpse of the data is to look at the variable importance of each predictor in a model trying to predict the carat of a diamond.
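The importance objects printed below are not created in the code shown above; presumably they come from caret::varImp on the two fits, something like this sketch. With scale = FALSE, the importance reported for a linear model is the absolute t-statistic of each coefficient, which matches the numbers shown.

# Sketch: variable importance for both caret fits
lm1Imp <- caret::varImp(lmFit1, scale = FALSE)
lm2Imp <- caret::varImp(lmFit2, scale = FALSE)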

> lm1Imp;lm2Imp
lm variable importance

Overall
price 131.007979
x 91.502384
depth 50.185548
table 20.596258
y 4.510769
z 1.027515

Overall
price 130.479871
x 115.851671
depth 45.628736
table 18.758588
y 3.495829
z 1.617253

It seems as if the models do have the same variable importance ordering, and the distance between the predictors seems to be more or less retained. It is not perfect, for sure, but it does not have to be.

A second way to get some idea of the usability of the synthetic data is to look at cumulative distribution plots. A cumulative distribution plot is another way of showing the distributional form of a variable, but one in which it becomes immediately clear whether the two datasets share the same essence.

Important to realize is that there is no single way of determining whether the synthetic data has kept the same essence as the original data.

df$Model <- "Synthetic"
diamonds_num$Model <- "Original"
combined <- rbind(df, diamonds_num)
ggplot(combined, aes(carat, colour = Model)) +
  stat_ecdf() +
  theme_bw()
Here you can see the outcome I have tried to model using a simple linear regression. As you can see, the synthetic data is very clean and pristine compared to the original data, which is bumpy. Hence, a simple multivariate normal distribution built on the correlation matrix will not be enough. Image by author.

The next step is to plot this graph for each numerical variable in the dataset.

combined_long <- combined %>%
  tidyr::pivot_longer(!Model,
                      names_to = "Variable",
                      values_to = "Value",
                      values_drop_na = TRUE)
ggplot(combined_long, aes(Value, colour = Model)) +
  stat_ecdf() +
  theme_bw() +
  facet_wrap(~Variable, scales = "free")
As you can see, the cumulative distribution function (CDF) does not differ that much between the original and the synthetic data. The synthetic data is more pristine, and for carat we already saw certain deviations which can cause differences between analyzing the original and the synthetic data. However, this is nothing compared to price, which is absolutely wrong. Image by author.

Let's dig deeper into price and compare using densities. I will first plot price, which deviates severely, and then depth, which should have almost the same density structure.

g1 <- ggplot() +
  geom_density(data = diamonds_num, aes(x = price, fill = "Original"), alpha = 0.5) +
  theme_bw() +
  labs(fill = "")
g2 <- ggplot() +
  geom_density(data = df, aes(x = price, fill = "Synthetic"), alpha = 0.5) +
  theme_bw() +
  labs(fill = "")
gridExtra::grid.arrange(g1, g2, nrow = 2)
The reason the CDF for price in the synthetic dataset looked so strange is that it had to be plotted on the same scale as the original. And the original has a maximum exceeding fifteen thousand, unlike the synthetic data, which stops at around 3950. Image by author.

The difference is a direct result of the function that makes up the multivariate normal, in which each variable is described by its mean and its correlation with the other variables. The mean, in this particular example, does not tell the entire story.
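As a side note, if we also want the spread of each variable to come closer to the original, we could hand MASS::mvrnorm the covariance matrix computed earlier (diamond_cov) instead of the correlation matrix. A minimal sketch (df_cov is a name I introduce here); the margins are of course still normal in shape, so the multimodal structure of price is still lost.

# Sketch: same multivariate normal idea, but with the covariance matrix,
# so each synthetic variable keeps its original variance instead of variance 1
df_cov <- as.data.frame(MASS::mvrnorm(n = dim(diamonds)[1],
                                      mu = mean,
                                      Sigma = diamond_cov))
sd(diamonds_num$price); sd(df_cov$price)   # the spreads are now comparable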

ggplot() +
  geom_density(data = diamonds_num, aes(x = depth, fill = "Original"), alpha = 0.5) +
  geom_density(data = df, aes(x = depth, fill = "Synthetic"), alpha = 0.5) +
  theme_bw() +
  labs(fill = "")
The original and synthetic data are not the same, but they do show similar characteristics. However, if one were to take summary statistics from the original and compare them to the synthetic data, the match would be limited at best. That is because descriptive statistics require the distributions to be (almost) exactly the same. THAT is the difference between the descriptive and the modelling part of the synthetic dataset. Image by author.

And what if I employ the empirical form of the multivariate normal distribution in the MASS package (empirical = TRUE), meaning that the supplied means and covariance matrix are treated as the empirical, sample values and are reproduced exactly? This matters most in smaller samples.

df_emp <- as.data.frame(MASS::mvrnorm(n = dim(diamonds)[1],
                                      mu = mean, Sigma = sigma, empirical = TRUE))

g1 <- ggplot() +
  geom_density(data = diamonds_num, aes(x = price, fill = "Original"), alpha = 0.5) +
  theme_bw() +
  labs(fill = "")
g2 <- ggplot() +
  geom_density(data = df, aes(x = price, fill = "Synthetic"), alpha = 0.5) +
  theme_bw() +
  labs(fill = "")
g3 <- ggplot() +
  geom_density(data = df_emp, aes(x = price, fill = "Synthetic Emp"), alpha = 0.5) +
  theme_bw() +
  labs(fill = "")
gridExtra::grid.arrange(g1, g2, g3, nrow = 3)
No real difference between the two multivariate normal procedures, which is to be expected with 53k rows of data; the sample size does not play a role here. Image by author.

Okay, so the multivariate normal distribution does offer a good start for creating a synthetic dataset, but only from a modelling perspective, not from a descriptive perspective, and one needs to determine for oneself whether that is even necessary. Part of the synthetic nature of the data is making sure the essence is maintained, and for a modeler that means being able to get the same results when modelling.

Now, another way to create synthetic data (which means simulating correlated variables) is to dive into the world of copulas, which we will do now, in part, using the Gaussian copula. Copulas are a great way to understand and build joint probabilities of a multivariate distribution. The word copula means 'link', and that is exactly what they do.

According to Wikipedia, a copula is: a multivariate cumulative distribution function for which the marginal probability distribution of each variable is uniform on the interval [0, 1]. If we break down the steps, it looks like this (rephrased from this great blog; a compact sketch of the full pipeline follows the list):

  1. Sample correlated standardized (N[0,1]) distributions from a multivariate normal distribution.
  2. Transform them to correlated Uniform(0, 1) distributions with the normal CDF (probability integral transform).
  3. Transform them to any correlated probability distribution you desire with that probability distribution’s inverse CDF (inverse transform sampling).
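Here is that compact end-to-end sketch, using only the sigma matrix computed earlier. The exponential margin for price is purely an illustrative choice of mine, not something the rest of this post depends on.

# Sketch of the three copula steps in one go
Z0 <- MASS::mvrnorm(n = 10000, mu = rep(0, ncol(sigma)), Sigma = sigma)  # step 1
U0 <- pnorm(Z0)                                                          # step 2
# step 3: any margin via its inverse CDF; mean is the vector of column
# means computed earlier, so the exponential has the original mean price
price_exp <- qexp(U0[, "price"], rate = 1 / mean["price"])
hist(price_exp)   # an exponential margin, with the correlation structure kept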

Below is a hand-built multivariate normal sampling function, equivalent to the mvrnorm function of the MASS package. As a result, I will obtain (just like before) data from a multivariate normal, but this time standardized.


mvrnorm <- function(n = 1, mu = 0, Sigma) {
  nvars <- nrow(Sigma)
  # nvars x n matrix of Normal(0, 1)
  nmls <- matrix(rnorm(n * nvars), nrow = nvars)
  # scale and correlate Normal(0, 1), "nmls", to Normal(0, Sigma) by matrix mult
  # with lower triangular of cholesky decomp of covariance matrix
  scaled_correlated_nmls <- t(chol(Sigma)) %*% nmls
  # shift to center around mus to get goal: Normal(mu, Sigma)
  samples <- mu + scaled_correlated_nmls
  # transpose so each variable is a column, not
  # a row, to match what MASS::mvrnorm() returns
  t(samples)
}
df_new <- mvrnorm(dim(diamonds)[1], Sigma = sigma)
mean2 <- rep(0, dim(sigma)[2])
names(mean2) <- colnames(sigma)
df_new2 <- MASS::mvrnorm(dim(diamonds)[1],
                         mu = mean2,
                         Sigma = sigma)
> cor(df_new)
carat depth table price x y
carat 1.00000000000 0.02548515939 0.1756632819 0.92081606682 0.97483969924 0.95123398113
depth 0.02548515939 1.00000000000 -0.2966914653 -0.01267406029 -0.02968316709 -0.03193223836
table 0.17566328191 -0.29669146532 1.0000000000 0.12037212430 0.18934326066 0.17732479479
price 0.92081606682 -0.01267406029 0.1203721243 1.00000000000 0.88323677536 0.86373468972
x 0.97483969924 -0.02968316709 0.1893432607 0.88323677536 1.00000000000 0.97460644946
y 0.95123398113 -0.03193223836 0.1773247948 0.86373468972 0.97460644946 1.00000000000
z 0.95342001221 0.09193958958 0.1450393656 0.86003163701 0.97049061902 0.95189367284
z
carat 0.95342001221
depth 0.09193958958
table 0.14503936558
price 0.86003163701
x 0.97049061902
y 0.95189367284
z 1.00000000000
> cor(df_new2)
carat depth table price x y
carat 1.00000000000 0.02401766053 0.1879023338 0.92211539384 0.97544578111 0.95144071623
depth 0.02401766053 1.00000000000 -0.3013205641 -0.01515267326 -0.02925527573 -0.03155516412
table 0.18790233377 -0.30132056412 1.0000000000 0.13488569935 0.20173295788 0.18904146957
price 0.92211539384 -0.01515267326 0.1348856993 1.00000000000 0.88524849733 0.86534901465
x 0.97544578111 -0.02925527573 0.2017329579 0.88524849733 1.00000000000 0.97451646301
y 0.95144071623 -0.03155516412 0.1890414696 0.86534901465 0.97451646301 1.00000000000
z 0.95428998681 0.09122255285 0.1570967255 0.86180784719 0.97117225153 0.95206791685
z
carat 0.95428998681
depth 0.09122255285
table 0.15709672554
price 0.86180784719
x 0.97117225153
y 0.95206791685
z 1.00000000000

So, our first obtained values are standardized draws from a multivariate normal, which means that the variables are all on the same scale and carry with them the original cross-correlation matrix. We can check both (standardized scale and correlation) quite easily.

pairs(df_new)
hist(df_new[,1])
Left: correlation matrix. Right: distribution. Both coming from a standardized multivariate normal. Image by author.

Next up is the transformation to a uniform distribution whilst maintaining the underlying cross-correlation matrix.

U <- pnorm(df_new, mean = 0, sd = 1)
hist(U[,1])
cor(U)
carat depth table price x y z
carat 1.00000000000 0.02617867093 0.1682925760 0.9140135553 0.97247036750 0.94681926838 0.94931607318
depth 0.02617867093 1.00000000000 -0.2837151691 -0.0102210807 -0.02657673258 -0.02879604409 0.08933946454
table 0.16829257601 -0.28371516908 1.0000000000 0.1160484259 0.18057212310 0.16815896730 0.13787984049
price 0.91401355528 -0.01022108070 0.1160484259 1.0000000000 0.87409215149 0.85353166518 0.84964860456
x 0.97247036750 -0.02657673258 0.1805721231 0.8740921515 1.00000000000 0.97228684415 0.96785128673
y 0.94681926838 -0.02879604409 0.1681589673 0.8535316652 0.97228684415 1.00000000000 0.94769217826
z 0.94931607318 0.08933946454 0.1378798405 0.8496486046 0.96785128673 0.94769217826 1.00000000000
The first variable in the uniformly distributed matrix, which is carat, is now also uniformly distributed. The cross-correlation with all the other data is maintained. Image by author.

We can do the same for different variables, such as price. Below, you see me build new versions of price and carat, but this time I sample them from a Poisson distribution. This is the last of the three steps, which says that I can, via inverse transform sampling, create any distribution I would like from that matrix U of uniformly distributed data. This is quite a cool technique!

price <- qpois(U[, 4], 5)
par(mfrow = c(2, 1))
hist(price)
hist(diamonds_num$price)

carat <- qpois(U[, 1], 30)
par(mfrow = c(2, 1))
hist(carat)
hist(diamonds_num$carat)
The original distributions and the distributions I made by using a copula. Image by author.
> cor(diamonds_num$carat, diamonds_num$price)
[1] 0.9215913012
> cor(carat, price)
[1] 0.9097580747

As you can see, the correlation is only partly maintained. The reason for the deviation is that a correlation computed on discrete data, which a Poisson variable is, is not the same as a correlation computed on continuous data, which the Gaussian is.
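One additional, hedged way to probe this is to compare rank correlations, which are far less sensitive to the shape of the margins; this is an extra check of mine, not part of the original workflow.

# Spearman (rank) correlations as a margin-insensitive comparison
cor(diamonds_num$carat, diamonds_num$price, method = "spearman")
cor(carat, price, method = "spearman")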

We can now look at the modelling part. Instead of using caret, I am now choosing a more straightforward route, so that the sampling of the training and test sets, and the cross-validation, does not get in the way. Below, two straightforward linear regressions.

fit1<-lm(carat~price, data=diamonds_num)
fit2<-lm(carat~price, data=data.frame(cbind(carat,price)))
fit1;fit2

Call:
lm(formula = carat ~ price, data = diamonds_num)

Coefficients:
(Intercept) price
0.3672972042 0.0001095002


Call:
lm(formula = carat ~ price, data = data.frame(cbind(carat, price)))

Coefficients:
(Intercept) price
18.887067 2.224347

Clear to see is that the coefficients differ, but that should be expected, since I built different descriptives.

par(mfrow = c(2, 4))
plot(fit1);plot(fit2)
The model fit is of course also different; the upper matrix of four plots comes from the original data, which is not pristine. The lower four plots originate from the copula from which I created a Poisson distribution. Analyzing discrete data using a linear regression that assumes normality is wrong. Image by author.

A better test, still keeping in mind that a linear regression on Poisson data is wrong, is to model interactions. And what better way than to use splines!

library(splines)  # needed for ns()
depth <- qpois(U[, 2], 30)
fit1 <- lm(price ~ ns(carat, 3) * ns(depth, 3), data = diamonds_num)
fit2 <- lm(price ~ ns(carat, 3) * ns(depth, 3), data = data.frame(cbind(carat, price, depth)))

> fit1

Call:
lm(formula = price ~ ns(carat, 3) * ns(depth, 3), data = diamonds_num)

Coefficients:
(Intercept) ns(carat, 3)1 ns(carat, 3)2
2218.8390 17936.5330 -126840.8539
ns(carat, 3)3 ns(depth, 3)1 ns(depth, 3)2
-231168.4360 -1091.1322 1362.9843
ns(depth, 3)3 ns(carat, 3)1:ns(depth, 3)1 ns(carat, 3)2:ns(depth, 3)1
7416.9316 -329.7165 68951.7911
ns(carat, 3)3:ns(depth, 3)1 ns(carat, 3)1:ns(depth, 3)2 ns(carat, 3)2:ns(depth, 3)2
118173.3027 -19264.2274 263279.9813
ns(carat, 3)3:ns(depth, 3)2 ns(carat, 3)1:ns(depth, 3)3 ns(carat, 3)2:ns(depth, 3)3
480346.8166 -34626.0634 14745.8991
ns(carat, 3)3:ns(depth, 3)3
73406.4320

> fit2

Call:
lm(formula = price ~ ns(carat, 3) * ns(depth, 3), data = data.frame(cbind(carat,
price, depth)))

Coefficients:
(Intercept) ns(carat, 3)1 ns(carat, 3)2
-1.148528255 8.348102184 17.258451783
ns(carat, 3)3 ns(depth, 3)1 ns(depth, 3)2
15.909197993 -0.242271773 -1.327453563
ns(depth, 3)3 ns(carat, 3)1:ns(depth, 3)1 ns(carat, 3)2:ns(depth, 3)1
-0.862625027 -0.393382870 -0.090925281
ns(carat, 3)3:ns(depth, 3)1 ns(carat, 3)1:ns(depth, 3)2 ns(carat, 3)2:ns(depth, 3)2
-0.029317928 -0.006505344 0.323633739
ns(carat, 3)3:ns(depth, 3)2 ns(carat, 3)1:ns(depth, 3)3 ns(carat, 3)2:ns(depth, 3)3
-1.189341372 0.188006534 -0.109592519
ns(carat, 3)3:ns(depth, 3)3
-0.984283018

Of course the coefficients are not the same, since the data is not standardized, but let's look at the interaction plots. One would assume that IF the essence of the data is maintained, the relationships between the variables would be as well.

sjPlot::plot_model(fit1, type="pred")
sjPlot::plot_model(fit2, type="pred")
Above you see the relationships between the variables from the original data; below you see the relationships coming from the synthetic data. The relationship between carat and price seems to be maintained to some degree, but the one between depth and price certainly is not. The reason I chose splines is that they are used often AND they are quite data-dependent. Hence, a wrong turn in the transformation from original to synthetic will most likely be picked up by a spline. Image by author.

A good way to check the validity of the transformation, and the model, is to simply plot and compare the original to the synthetic data, because depth and price are not correlated at all, and yet both models show a link.

I will use ggplot and fit a spline within the graph on the original data.

ggplot(diamonds_num,
       aes(x = depth, y = price)) +
  geom_point() +
  geom_smooth() +
  theme_bw()

ggplot(diamonds_num,
       aes(x = carat, y = price)) +
  geom_point() +
  geom_smooth() +
  theme_bw()
This clearly shows the problem of fitting splines on data. As you can see, price and depth share no correlation on this bi-dimensional level, but the spline does tend to dance a bit in the middle. To the right we see carat and price, on top of which a spline first draws a clear relationship and then takes a heavy curve, hunting down whatever points it can find. The greatness AND danger of splines displayed within two plots. Image by author.

The above plots are from the original data. Let's also observe what happens if I plot the synthetic data, which now has completely different distributional attributes than the original (discrete instead of the original continuous scale).

g1 <- ggplot(diamonds_num,
             aes(x = carat, y = price)) +
  geom_point() +
  geom_smooth() +
  theme_bw()
g2 <- ggplot(data.frame(cbind(carat, price, depth)),
             aes(x = carat, y = price)) +
  geom_point() +
  geom_smooth() +
  theme_bw()
gridExtra::grid.arrange(g1, g2, nrow = 1)

g1 <- ggplot(diamonds_num,
             aes(x = carat, y = depth)) +
  geom_point() +
  geom_smooth() +
  theme_bw()
g2 <- ggplot(data.frame(cbind(carat, price, depth)),
             aes(x = carat, y = depth)) +
  geom_point() +
  geom_smooth() +
  theme_bw()
gridExtra::grid.arrange(g1, g2, nrow = 1)

Here we see each relationship plotted twice, once from the original data and once from the synthetic data: carat and price, and carat and depth. Both seem to adhere to the original relationship, even though the synthetic data comes from a different distribution, but it is NOT perfect. Image by author.

Looking at the plots above, we can see that the majority of the original relationship (or absence of one) is maintained. That the synthetic data is not perfect should be no surprise. Also, it should be no surprise that the models on the original and the synthetic data show different coefficients. I made the data so it would have different descriptive characteristics, even coming from a different type of distribution, while still maintaining its essence.

This blog post was just a short introduction to one way of building synthetic data, and only for numerical values. Using copulas, we can build many different types of synthetic data. Also, we have not ventured into deep learning, such as GANs, which are often used for building synthetic data. This example shows that we do not always have to go that far.

If anything is amiss, please let me know!


Scientist. Builder of models, and enthusiast of statistics, research, epidemiology, probability, and simulations for 10+ years.