Creating Synthetic data
Using models to feed data centricity
I have blogged about synthetic data before. Twice actually. The first time I used synthetic data from a Dutch Cancer Registry to conduct Survival Analysis. The second time I used the same dataset to apply Specification Curve Analysis.
Synthetic data will play a key role in the years to come, as models become ever more complex yet reach limitations bound by the limits of the data that is available. For some time now, the model-centric movement is slowly being replaced by a more data-centric view, and in time the models will once again take center stage.
Synthetic data is actually both worlds combined, and still in its infancy. Its data, but the way we are used to seeing and looking at data. For the majority of researchers, data comes from observations made in the field or from sensors. Synthetic data is modelled data. In essence its data disguised as other data in which sensitive components are removed, but they key components remain. If successful, a modeler should not even know that the data is synthetic. Perhaps a good analogy would be to depict the Louvre in which you stand before the Mona Lisa, but it is not the actual Mona Lisa. The actual true Mona Lisa is hidden in a basement and you are looking at a heavily guarded replica.
Synthetic data is all about honoring the essence of the original, without being the original. And from that starting point, synthetic data can even be more. It can become training material for models that will seldom occur in real life, but which can be used as simulation material to be shared with everyone. Because the data has lost its sensitivity, but not its worth.
What I am going to do in this block is take dataset in R, the diamond dataset, and create several models from it to establish synthetic data. Of course, the diamonds dataset is already open, and so it does not have sensitive information, but the essence of the exercise is to recreate the essence of the dataset without recreating the dataset.
My reason for using the diamonds dataset is that it has sufficient rows and columns without having immediate and clear-cut relationships.
Lets get started.
rm(list = ls())
options("scipen"=100, "digits"=10)
#### LIBRARIES ####
library(DataExplorer)
library(ggplot2)
library(dplyr)
library(tidyr)
library(ggExtra)
library(skimr)
library(car)
library(GGally)
library(doParallel)
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
After loading the libraries, I want to immediately take a look at the data itself. If I am to model the data to identify its essence (via cross-correlations) and from that essence replicate it, it is essential that I get a solid grip on that dataset. For that I will look at the covariance and correlation matrices first.
skimr::skim(diamonds)
str(diamonds)
diamonds$price<-as.numeric(diamonds$price)
diamonds%>%
dplyr::select(where(is.numeric))%>%
cor()
diamonds%>%
dplyr::select(where(is.numeric))%>%
corrgram::corrgram()
diamonds%>%
dplyr::select(where(is.numeric))%>%
cor()%>%
ggcorrplot::ggcorrplot(., hc.order = TRUE, type = "lower",
lab = TRUE)
I will start with the numerical values first, since that is the easiest start. This does not mean that we cannot recreate synthetic data for ordinal or nomial data. However, a correlation matrix for non-continuous data requires some additional transformations which we leave out of the picture for now.
We could create even more fancy plots, but R will run a bit into a problem plotting all 53k rows. So I will first select a random sample of 1000 rows and then plot the distributions of each numerical value and their underlying correlations.
diamonds%>%
dplyr::select(where(is.numeric))%>%
sample_n(., 1000)%>%
scatterplotMatrix()
And, there is a wonderfull way to plot correlations and distributions per specific category (here cut) using the GGally library.
diamonds%>%
dplyr::select(x, y, z, carat, depth, table, price, cut)%>%
sample_n(., 1000)%>%
ggpairs(.,
mapping = ggplot2::aes(color = cut),
upper = list(continuous = wrap("density", alpha = 0.5),
combo = "box_no_facet"),
lower = list(continuous = wrap("points", alpha = 0.3),
combo = wrap("dot_no_facet", alpha = 0.4)),
title = "Diamonds")
If we are to recreate a dataset based on the original data containing only numerical values, we need to immediately think about two main issues that need to be taken care of:
- The summary statistics and distribution of each variable need to be as close as possible to the original, but does not need to be identical. Meaning that the structure of each variables needs to be maintained. If that is not the case, the descriptive part of the dataset cannot be used. The modelling part can still be maintained.
- The covariance/correlation between the datapoints needs to be the same. Meaning that the underlying relationship between the variables needs to be maintained. This is essential for the modelling part.
Extracting the covariance / correlation matrix from the data is not that difficult, but like we said, the standard formula only works for numeric values. Nonetheless, we can start part of our creation with the numerical values to get the idea going.
Here you see the covariance matrix.
diamond_cov<-diamonds%>%dplyr::select_if(., is.numeric)%>%cov()
carat depth table price
carat 0.22468665982 0.01916652822 0.1923645201 1742.76536427
depth 0.01916652822 2.05240384318 -0.9468399376 -60.85371214
table 0.19236452006 -0.94683993764 4.9929480753 1133.31806407
price 1742.76536426512 -60.85371213642 1133.3180640679 15915629.42430145
x 0.51848413024 -0.04064129579 0.4896429037 3958.02149078
y 0.51524781641 -0.04800856925 0.4689722778 3943.27081043
z 0.31891683911 0.09596797038 0.2379960448 2424.71261297
x y z
carat 0.51848413024 0.51524781641 0.31891683911
depth -0.04064129579 -0.04800856925 0.09596797038
table 0.48964290366 0.46897227781 0.23799604481
price 3958.02149078326 3943.27081043196 2424.71261297033
x 1.25834717304 1.24878933406 0.76848748285
y 1.24878933406 1.30447161384 0.76731957995
z 0.76848748285 0.76731957995 0.49801086259
From this procedure, we obtain the covariance, but we cannot create data. To do so, we also need summary statistics which can be obtained fairly easily.
diamonds%>%dplyr::select_if(., is.numeric)%>%summary()
carat depth table price
Min. :0.2000000 Min. :43.0000 Min. :43.00000 Min. : 326.00
1st Qu.:0.4000000 1st Qu.:61.0000 1st Qu.:56.00000 1st Qu.: 950.00
Median :0.7000000 Median :61.8000 Median :57.00000 Median : 2401.00
Mean :0.7979397 Mean :61.7494 Mean :57.45718 Mean : 3932.80
3rd Qu.:1.0400000 3rd Qu.:62.5000 3rd Qu.:59.00000 3rd Qu.: 5324.25
Max. :5.0100000 Max. :79.0000 Max. :95.00000 Max. :18823.00
x y z
Min. : 0.000000 Min. : 0.000000 Min. : 0.000000
1st Qu.: 4.710000 1st Qu.: 4.720000 1st Qu.: 2.910000
Median : 5.700000 Median : 5.710000 Median : 3.530000
Mean : 5.731157 Mean : 5.734526 Mean : 3.538734
3rd Qu.: 6.540000 3rd Qu.: 6.540000 3rd Qu.: 4.040000
Max. :10.740000 Max. :58.900000 Max. :31.800000
From these combinations, we should be able to recreate the data using various procedures. Perhaps the most straightforward would be to use a multivariate normal distribution which is present in the MASS package. All I need is the mean of each variable, and the correlation matrix. The multivariate normal distribution will do the rest. To obtain an equal comparison I will create as many observations as the original dataset.
sigma<-diamonds%>%dplyr::select_if(., is.numeric)%>%cor()%>%as.matrix()
mean<-diamonds%>%dplyr::select_if(., is.numeric)%>%as.matrix()%>%colMeans()
df<-as.data.frame(MASS::mvrnorm(n=dim(diamonds)[1], mu=mean, Sigma=sigma))
> dim(df)
[1] 53940 7
> head(df)
carat depth table price x y
1 -1.34032717822 61.49447797 57.19091527 3931.162855 3.545669821 3.931061364
2 -0.47751648630 61.16241371 57.86509627 3931.306295 4.906696688 5.057863929
3 2.24358594522 63.09062530 56.73718104 3932.957682 7.386140807 7.405936831
4 -0.03108967881 60.99439588 57.58369767 3931.677322 5.194041948 5.431802322
5 1.16179859890 62.39235813 57.96524508 3933.322044 6.213609464 6.592872841
6 -0.16757252197 60.84783867 56.68337288 3932.501268 4.987489939 5.118558015
z
1 1.138732182
2 2.596622246
3 5.674154202
4 3.089565271
5 4.387662667
6 2.577509666
Once created, the next task at hand is to check the validity and usability of the data applying two methods:
- Checking univariate characteristics.
- Checking multivariate characteristics.
With the ggpairs function I can do a little bit of both, and get a starting glimpse of the applicability of the process and the data it generates.
diamonds%>%dplyr::select_if(., is.numeric)%>%ggpairs()
ggpairs(df)
Now, we already said that it is our intention to build synthetic data which means building data that is in essence the same, but not necessarily in outlook. We HAVE achieved that goal. I will build a quick model on both dataset, using the caret package, to see if the synthetic data will give me the exact same results as the original dataset.
diamonds_num<-diamonds%>%dplyr::select_if(., is.numeric)
trainIndex <- caret::createDataPartition(diamonds_num$carat,
p = .6,
list = FALSE,
times = 1)
> wideTrain <- diamonds_num[ trainIndex,];dim(wideTrain)
[1] 32366 7
> wideTest <- diamonds_num[-trainIndex,];dim(wideTest)
[1] 21574 7
fitControl <- caret::trainControl(
method = "repeatedcv",
number = 20,
repeats = 20)
lmFit1 <- caret::train(carat ~ .,
data = wideTrain,
method = "lm",
trControl = fitControl,
verbose = FALSE)
> summary(lmFit1)
Call:
lm(formula = .outcome ~ ., data = dat, verbose = FALSE)
Residuals:
Min 1Q Median 3Q Max
-0.54124488 -0.03836279 -0.00665401 0.03530248 2.71983984
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.5273956761070 0.0294158001479 -85.91966 < 0.000000000000000222 ***
depth 0.0188998891767 0.0003766002370 50.18555 < 0.000000000000000222 ***
table 0.0046678986326 0.0002266381898 20.59626 < 0.000000000000000222 ***
price 0.0000330063643 0.0000002519416 131.00798 < 0.000000000000000222 ***
x 0.2956915579921 0.0032315175272 91.50238 < 0.000000000000000222 ***
y 0.0130670340617 0.0028968529147 4.51077 0.000006482 ***
z -0.0026197077889 0.0025495576103 -1.02751 0.30419
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.0844222 on 32359 degrees of freedom
Multiple R-squared: 0.9684143, Adjusted R-squared: 0.9684085
F-statistic: 165354 on 6 and 32359 DF, p-value: < 0.00000000000000022204
Above the summary of a simple linear regression based on the original data. Below the summary of a simple linear regression based on synthetic data. Yes, the procedure is not exactly the same, and one can expect some variation based on the selection of train and test data, and the repeated cross-validation, but the essence of the procedure if to see if the essence of the data is kept.
trainIndex <- caret::createDataPartition(df$carat,
p = .6,
list = FALSE,
times = 1)
> wideTrain <- df[ trainIndex,];dim(wideTrain)
[1] 32364 7
> wideTest <- df[-trainIndex,];dim(wideTest)
[1] 21576 7
fitControl <- caret::trainControl(
method = "repeatedcv",
number = 20,
repeats = 20)
lmFit2 <- caret::train(carat ~ .,
data = wideTrain,
method = "lm",
trControl = fitControl,
verbose = FALSE)
summary(lmFit2)
Call:
lm(formula = .outcome ~ ., data = dat, verbose = FALSE)
Residuals:
Min 1Q Median 3Q Max
-0.68929576 -0.11680934 0.00078743 0.11741046 0.74064948
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1079.584562077 8.210331042 -131.49099 < 0.000000000000000222 ***
depth 0.052707918 0.001155147 45.62874 < 0.000000000000000222 ***
table 0.019422785 0.001035408 18.75859 < 0.000000000000000222 ***
price 0.272537301 0.002088731 130.47987 < 0.000000000000000222 ***
x 0.708827904 0.006118409 115.85167 < 0.000000000000000222 ***
y 0.015186425 0.004344156 3.49583 0.00047322 ***
z 0.007592425 0.004694642 1.61725 0.10583335
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1736463 on 32357 degrees of freedom
Multiple R-squared: 0.9697885, Adjusted R-squared: 0.9697829
F-statistic: 173109.8 on 6 and 32357 DF, p-value: < 0.00000000000000022204
What becomes immediately clear is that the size of intercept has changed, which is due to the fact that the synthetic data does have the same mean but not the same distribution. Another way to get a glimpse of the data is look at the variance importance of each predictor in a model trying to predict the carat of a diamond.
> lm1Imp;lm2Imp
lm variable importance
Overall
price 131.007979
x 91.502384
depth 50.185548
table 20.596258
y 4.510769
z 1.027515
Overall
price 130.479871
x 115.851671
depth 45.628736
table 18.758588
y 3.495829
z 1.617253
It seems as if the the models do have the same variable importance, and the distance between them seems to be sort-of retained. It is not perfect, for sure, but it does not have to be.
A second way to get some idea of the usability of the synthetic data is to look at cumulative distribution plots. A cumulative density plot is another way of showing the distributional form of the variable, but one in which it becomes immediately clear if the two datasets show the same essence.
Important to realizes is that there is no single way of determining if the synthetic data has kept the same essence as the original data.
df$Model<-"Synthetic"
diamonds_num$Model<-"Original"
combined<-rbind(df, diamonds_num)
ggplot(combined, aes(carat, colour = Model)) +
stat_ecdf()+
theme_bw()
The next step is to plot this graph for each numerical variable in the dataset.
combined_long<-combined%>%tidyr::pivot_longer(!Model,
names_to = "Variable",
values_to = "Value",
values_drop_na = TRUE)
ggplot(combined_long, aes(Value, colour = Model)) +
stat_ecdf()+
theme_bw()+
facet_wrap(~Variable, scales="free")
Lets dig deeper into price and compare using densities. I will first plot price, which is deviating severely, and then depth which should almost the same density structure.
g1<-ggplot()+
geom_density(data=diamonds_num, aes(x=price, fill="Original"), alpha=0.5)+
theme_bw()+
labs(fill="")
g2<-ggplot()+
geom_density(data=df, aes(x=price, fill="Synthetic"), alpha=0.5)+
theme_bw()+
labs(fill="")
gridExtra::grid.arrange(g1,g2,nrow=2)
The difference is a direct result of the function that makes up the multivariate normal, which means that each variable has a mean and a correlation with other functions. The mean, in this particular example, does not tell the entire story.
ggplot()+
geom_density(data=diamonds_num, aes(x=depth, fill="Original"), alpha=0.5)+
geom_density(data=df, aes(x=depth, fill="Synthetic"), alpha=0.5)+
theme_bw()+
labs(fill="")
And what if I employ the empirical form of the multivariate normal distribution of the MASS package — meaning that the sample size gets a bigger role to play.
df_emp<-as.data.frame(MASS::mvrnorm(n=dim(diamonds)[1],
mu=mean, Sigma=sigma, empirical = TRUE))
g1<-ggplot()+
geom_density(data=diamonds_num, aes(x=price, fill="Original"), alpha=0.5)+
theme_bw()+
labs(fill="")
g2<-ggplot()+
geom_density(data=df, aes(x=price, fill="Synthetic"), alpha=0.5)+
theme_bw()+
labs(fill="")
g3<-ggplot()+
geom_density(data=df_emp, aes(x=price, fill="Synthetic Emp"), alpha=0.5)+
theme_bw()+
labs(fill="")
gridExtra::grid.arrange(g1,g2,g3,nrow=3)
Okay, so the the multivariate normal distribution does offer a good start to creating a synthetic dataset, but only from a modelling perspective. Not from a descriptive perspective, and one needs to determine for themselves if that is even necessary. Part of the synthetic nature of the data is to make sure the essence is maintained, and for a modeler that means being able to get the same results when modelling.
Now, another way to create synthetic data (which means simulating correlated variables) is dive into the world of copulas which we will do now, for some part, using the Gaussian. Copulas are a great way to understand and built joing probabilities of a multivariate distribution. The word copula means ‘link’ and that is exactly what they do.
According to Wikipedia, a copula is: a multivariate cumulative distribution function for which the marginal probability distribution of each variable is uniform on the interval [0, 1]. If we would break down the steps, it would look like this (which I refraised from this great blog):
- Sample correlated standardized (N[0,1]) distributions from a multivariate normal distribution.
- Transform them to correlated Uniform(0, 1) distributions with the normal CDF (probability integral transform).
- Transform them to any correlated probability distribution you desire with that probability distribution’s inverse CDF (inverse transform sampling).
Below is a function showing a built multivariate normal distribution function, which is equal to the mvnorm function of the MASS package. As a result, I will obtain (just like before) data from a multivariate normal, but this time standardized.
mvrnorm <- function(n = 1, mu = 0, Sigma) {
nvars <- nrow(Sigma)
# nvars x n matrix of Normal(0, 1)
nmls <- matrix(rnorm(n * nvars), nrow = nvars)
# scale and correlate Normal(0, 1), "nmls", to Normal(0, Sigma) by matrix mult
# with lower triangular of cholesky decomp of covariance matrix
scaled_correlated_nmls <- t(chol(Sigma)) %*% nmls
# shift to center around mus to get goal: Normal(mu, Sigma)
samples <- mu + scaled_correlated_nmls
# transpose so each variable is a column, not
# a row, to match what MASS::mvrnorm() returns
t(samples)
}
df_new <- mvrnorm(dim(diamonds)[1], Sigma = sigma)
mean2<-rep(0,dim(sigma)[2])
names(mean2)<-colnames(sigma)
df_new2 <- MASS::mvrnorm(dim(diamonds)[1],
mu=mean2,
Sigma = sigma)
> cor(df_new)
carat depth table price x y
carat 1.00000000000 0.02548515939 0.1756632819 0.92081606682 0.97483969924 0.95123398113
depth 0.02548515939 1.00000000000 -0.2966914653 -0.01267406029 -0.02968316709 -0.03193223836
table 0.17566328191 -0.29669146532 1.0000000000 0.12037212430 0.18934326066 0.17732479479
price 0.92081606682 -0.01267406029 0.1203721243 1.00000000000 0.88323677536 0.86373468972
x 0.97483969924 -0.02968316709 0.1893432607 0.88323677536 1.00000000000 0.97460644946
y 0.95123398113 -0.03193223836 0.1773247948 0.86373468972 0.97460644946 1.00000000000
z 0.95342001221 0.09193958958 0.1450393656 0.86003163701 0.97049061902 0.95189367284
z
carat 0.95342001221
depth 0.09193958958
table 0.14503936558
price 0.86003163701
x 0.97049061902
y 0.95189367284
z 1.00000000000
> cor(df_new2)
carat depth table price x y
carat 1.00000000000 0.02401766053 0.1879023338 0.92211539384 0.97544578111 0.95144071623
depth 0.02401766053 1.00000000000 -0.3013205641 -0.01515267326 -0.02925527573 -0.03155516412
table 0.18790233377 -0.30132056412 1.0000000000 0.13488569935 0.20173295788 0.18904146957
price 0.92211539384 -0.01515267326 0.1348856993 1.00000000000 0.88524849733 0.86534901465
x 0.97544578111 -0.02925527573 0.2017329579 0.88524849733 1.00000000000 0.97451646301
y 0.95144071623 -0.03155516412 0.1890414696 0.86534901465 0.97451646301 1.00000000000
z 0.95428998681 0.09122255285 0.1570967255 0.86180784719 0.97117225153 0.95206791685
z
carat 0.95428998681
depth 0.09122255285
table 0.15709672554
price 0.86180784719
x 0.97117225153
y 0.95206791685
z 1.00000000000
So, our first obtained values are the standardized values coming from a multivariate normal. Which means that the variables are all on the same scale, and carry with them the original cross-correlation matrix. We can check both (standardized scale and correlation) quite easily.
pairs(df_new)
hist(df_new[,1])
Next up is the transformation to a uniform distribution whilst maintaining the underlying cross-correlation matrix.
U <- pnorm(df_new, mean = 0, sd = 1)
hist(U[,1])
cor(U)
carat depth table price x y z
carat 1.00000000000 0.02617867093 0.1682925760 0.9140135553 0.97247036750 0.94681926838 0.94931607318
depth 0.02617867093 1.00000000000 -0.2837151691 -0.0102210807 -0.02657673258 -0.02879604409 0.08933946454
table 0.16829257601 -0.28371516908 1.0000000000 0.1160484259 0.18057212310 0.16815896730 0.13787984049
price 0.91401355528 -0.01022108070 0.1160484259 1.0000000000 0.87409215149 0.85353166518 0.84964860456
x 0.97247036750 -0.02657673258 0.1805721231 0.8740921515 1.00000000000 0.97228684415 0.96785128673
y 0.94681926838 -0.02879604409 0.1681589673 0.8535316652 0.97228684415 1.00000000000 0.94769217826
z 0.94931607318 0.08933946454 0.1378798405 0.8496486046 0.96785128673 0.94769217826 1.00000000000
We can do the same for different variables, such as price. Below, you see me build new variables of price and carat, but this time I am sampling them from a Poisson distribution. This is the last part of the three steps which says that I can, via inverse transform sampling, create any distribution I would like from that matrix U which contains uniformly distributed data. This is quite a cool technique!
price <- qpois(U[, 4], 5)
par(mfrow = c(2, 1))
hist(price)
hist(diamonds_num$price)
carat <- qpois(U[, 1], 30)
par(mfrow = c(2, 1))
hist(carat)
hist(diamonds_num$carat)
> cor(diamonds_num$carat, diamonds_num$price)
[1] 0.9215913012
> cor(carat, price)
[1] 0.9097580747
As you can see, the correlation is partly maintained. The reason for the deviation is that a correlation on discrete data, which a Poisson is, is not the same as a correlation from continuous data, which the Gaussian is.
We can now look at the modelling part. Instead of using carat I am now chosing a more straightforward way to ensure that the sampling of the training and test set, and the cross-validation, is not getting in the way. Below, two straightfoward linear regressions.
fit1<-lm(carat~price, data=diamonds_num)
fit2<-lm(carat~price, data=data.frame(cbind(carat,price)))
fit1;fit2
Call:
lm(formula = carat ~ price, data = diamonds_num)
Coefficients:
(Intercept) price
0.3672972042 0.0001095002
Call:
lm(formula = carat ~ price, data = data.frame(cbind(carat, price)))
Coefficients:
(Intercept) price
18.887067 2.224347
Clear to see is that the coefficients differ, but that should be expected, sinceI built different descriptives.
par(mfrow = c(2, 4))
plot(fit1);plot(fit2)
A better test, still keeping in mind that a linear regression on Poisson data is wrong, is to model interactions. And what better way then to use splines!
depth <- qpois(U[, 2], 30)
fit1<-lm(price~ns(carat,3)*ns(depth,3), data=diamonds_num)
fit2<-lm(price~ns(carat,3)*ns(depth,3), data=data.frame(cbind(carat,price, depth)))
> fit1
Call:
lm(formula = price ~ ns(carat, 3) * ns(depth, 3), data = diamonds_num)
Coefficients:
(Intercept) ns(carat, 3)1 ns(carat, 3)2
2218.8390 17936.5330 -126840.8539
ns(carat, 3)3 ns(depth, 3)1 ns(depth, 3)2
-231168.4360 -1091.1322 1362.9843
ns(depth, 3)3 ns(carat, 3)1:ns(depth, 3)1 ns(carat, 3)2:ns(depth, 3)1
7416.9316 -329.7165 68951.7911
ns(carat, 3)3:ns(depth, 3)1 ns(carat, 3)1:ns(depth, 3)2 ns(carat, 3)2:ns(depth, 3)2
118173.3027 -19264.2274 263279.9813
ns(carat, 3)3:ns(depth, 3)2 ns(carat, 3)1:ns(depth, 3)3 ns(carat, 3)2:ns(depth, 3)3
480346.8166 -34626.0634 14745.8991
ns(carat, 3)3:ns(depth, 3)3
73406.4320
> fit2
Call:
lm(formula = price ~ ns(carat, 3) * ns(depth, 3), data = data.frame(cbind(carat,
price, depth)))
Coefficients:
(Intercept) ns(carat, 3)1 ns(carat, 3)2
-1.148528255 8.348102184 17.258451783
ns(carat, 3)3 ns(depth, 3)1 ns(depth, 3)2
15.909197993 -0.242271773 -1.327453563
ns(depth, 3)3 ns(carat, 3)1:ns(depth, 3)1 ns(carat, 3)2:ns(depth, 3)1
-0.862625027 -0.393382870 -0.090925281
ns(carat, 3)3:ns(depth, 3)1 ns(carat, 3)1:ns(depth, 3)2 ns(carat, 3)2:ns(depth, 3)2
-0.029317928 -0.006505344 0.323633739
ns(carat, 3)3:ns(depth, 3)2 ns(carat, 3)1:ns(depth, 3)3 ns(carat, 3)2:ns(depth, 3)3
-1.189341372 0.188006534 -0.109592519
ns(carat, 3)3:ns(depth, 3)3
-0.984283018
Of course the coefficients are not the same since the data is not standardized, but lets look at the interaction plots. One would assumes that IF the essence of the data is maintained, the relationship between the variables would be as well.
sjPlot::plot_model(fit1, type="pred")
sjPlot::plot_model(fit2, type="pred")
A good way to check the validity of the transformation, and the model, is to just plot and compare original to synthetic data, because depth and price are not correlated at all. And both models show a link.
I will use ggplot and fit a spline within the graph on the original data.
ggplot(diamonds_num,
aes(x=depth, y=price))+
geom_point()+
geom_smooth()+
theme_bw()
ggplot(diamonds_num,
aes(x=carat, y=price))+
geom_point()+
geom_smooth()+
theme_bw()
The above plots are the original data. Lets also observe what happens if I plot on the synthetic data, which now has completely different distributional attributes than the original (discrete instead of the original continuous scale).
g1<-ggplot(diamonds_num,
aes(x=carat, y=price))+
geom_point()+
geom_smooth()+
theme_bw()
g2<-ggplot(data.frame(cbind(carat,price, depth)),
aes(x=carat, y=price))+
geom_point()+
geom_smooth()+
theme_bw()
gridExtra::grid.arrange(g1,g2,nrow=1)
g1<-ggplot(diamonds_num,
aes(x=carat, y=depth))+
geom_point()+
geom_smooth()+
theme_bw()
g2<-ggplot(data.frame(cbind(carat,price, depth)),
aes(x=carat, y=depth))+
geom_point()+
geom_smooth()+
theme_bw()
gridExtra::grid.arrange(g1,g2,nrow=1)
Looking at the plots above, we can see that the majority of the original relationship (or absence of) is maintained. That the synthetic data will not be perfect should be no surprise. Also, it should be no surprise that the models on the original and the synthetic data show different coefficients. I have made the data so it would have different descriptive characteristics, even coming from a different type of distribution, but still being able to maintain it essense.
This blog post was just a short introduction to one way of building synthetic data, and just on numerical values. Using copulas, we can build many different types of synthetic data. Also, we have not ventured into deep learning, such as GANs, which are mostly used for building synthetic data. This example shows that we do not have to go that far.
If anything is amiss, please let me know!