In this article, we present a use case of synthetic data generation for a very common problem faced by digital marketers and brand strategists. Because campaigns are launched in fast-changing contexts, we cannot expect to collect detailed data from them. We show how generating large synthetic datasets can increase the value of data collected with small sample sizes.
The challenges of digital marketing
Marketers and brand strategists are spoiled for choice as digital marketing continues to innovate, with new trends emerging every few months. Reaching the right audience at the right time, with relevant content and at the right cost, remains a huge challenge for brands. To answer the question, "Where should I invest my next dollar?", data is the brand's best friend. But digital marketing is an empirical discipline, and not all brands or campaigns have a lot of data.
Marketers must constantly adapt to new and engaging content formats, privacy issues, rapidly changing omnichannel strategies, and so on. They collect a lot of data, but it is not always statistically meaningful.
As a result, they end up with small or medium-sized datasets for any specific problem. It is difficult to forecast campaign success when we only have data for one year because the approach simply did not exist two years ago. In this context, synthetic data can be a new tool for extracting value from existing data and addressing new challenges.
Tabular synthetic data: the forgotten data
One of today's most exciting technologies is synthetic data. Everyone seems to be talking about DALL-E and other image- or text-generating tools, yet tabular data has not made the same strides. Tabular synthetic data remains relevant only to a few data science and machine learning experts, and it is difficult to communicate the benefits of its use to non-experts. We are all familiar with making a photo look better: image processing technologies improve the resolution of an image so that details that are unclear in the original become visible. This improved image can be considered a "synthetic dataset" generated from the original image. Can we, however, improve a tabular dataset? Can we increase its "resolution" to see more details? The answer is yes, it is possible, but it is not as easy to grasp as it is with images.
Generative Adversarial Networks
Generative Adversarial Networks, or GANs, are one of the most innovative technologies in machine learning. They were pioneered by Ian Goodfellow and colleagues in 2014 [1]. The idea is to build two distinct neural networks and pit them against each other. The first network (the generator) generates new data that is statistically similar to the input data. The second network (the discriminator) is tasked with identifying which data is artificially created and which is not. We can imagine a game where the first network tries to trick the second, and the second has to guess what the first is doing. This game is what makes the combination so powerful. For more information about these networks, you can read this article, where we introduce the Python open source library we are going to use in this work.
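To make the game concrete, here is a minimal sketch of the generator-versus-discriminator loop in Keras. It is purely illustrative: the layer sizes, `n_features` and `latent_dim` are arbitrary assumptions, and this is not the architecture used later in the article.

```python
# A minimal sketch of the generator-vs-discriminator game (illustrative only).
import numpy as np
from tensorflow import keras

n_features = 4    # columns of the tabular dataset (assumption)
latent_dim = 16   # size of the random noise vector fed to the generator

# The generator maps random noise to a synthetic tabular row.
generator = keras.Sequential([
    keras.Input(shape=(latent_dim,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(n_features),
])

# The discriminator scores a row with the probability of it being real.
discriminator = keras.Sequential([
    keras.Input(shape=(n_features,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

# Stacked model used to train the generator while the discriminator is frozen.
discriminator.trainable = False
gan = keras.Sequential([generator, discriminator])
gan.compile(optimizer="adam", loss="binary_crossentropy")

def train_step(real_rows, batch_size=32):
    real = real_rows[:batch_size]
    noise = np.random.normal(size=(len(real), latent_dim))
    fake = generator.predict(noise, verbose=0)
    # 1) Teach the discriminator to separate real rows (label 1) from fakes (label 0).
    x = np.vstack([real, fake])
    y = np.concatenate([np.ones(len(real)), np.zeros(len(fake))])
    discriminator.train_on_batch(x, y)
    # 2) Teach the generator to produce rows the discriminator labels as real.
    gan.train_on_batch(noise, np.ones(len(noise)))
```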
The S-Curve in digital marketing
Gossen's Law of diminishing marginal value [2] tells us that overspending can be a threat to a marketing strategy. This idea is also known as the theory of diminishing marginal utility and, in marketing, as the advertising S-curve or diminishing curve [3]. The diminishing curve models the relationship between advertising spend and sales or market share and postulates that, after a certain point, additional advertising spending does not result in an increase in sales (or revenue, or market share). This relationship has an 'S-shape': it is neither linear nor symmetrical, and it has a saturation point.
Marketers and brand strategists are fully aware of the diminishing curve: after a certain point, no amount of advertising effort increases revenue. However, we need a consistent amount of data to draw this S-shaped curve; otherwise, we will see a linear relationship or, in the worst case, no relationship at all. This is why it is difficult for most marketers to estimate the saturation point correctly. Because some brands and agencies run numerous campaigns at the same time, an imprecise estimate of this point means wasting a significant amount of money. Additionally, these curves are very useful quantitative tools for comparing campaigns and modelling future strategies.
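As a rough illustration of what estimating a saturation point involves, the sketch below fits a logistic (S-shaped) response curve to spend-versus-revenue pairs and looks for the point where the marginal return becomes negligible. The spend and revenue arrays are made-up numbers, and a logistic curve is only one of several common ways to model the S-curve.

```python
# Illustrative only: fit a logistic (S-shaped) response curve to spend vs.
# revenue and look for the point of diminishing returns. Made-up numbers.
import numpy as np
from scipy.optimize import curve_fit

def s_curve(spend, top, steepness, midpoint):
    """Logistic response: revenue saturates at `top` as spend grows."""
    return top / (1.0 + np.exp(-steepness * (spend - midpoint)))

spend = np.array([1, 2, 4, 6, 8, 10, 12, 15, 20], dtype=float)    # e.g. k$
revenue = np.array([3, 5, 12, 25, 38, 44, 47, 49, 50], dtype=float)

params, _ = curve_fit(s_curve, spend, revenue, p0=[50.0, 0.5, 8.0])

# Marginal return = slope of the fitted curve; past the point where the
# slope becomes a small fraction of its peak, extra spend adds little revenue.
grid = np.linspace(spend.min(), spend.max(), 200)
marginal = np.gradient(s_curve(grid, *params), grid)
saturation_spend = grid[np.argmax(marginal < 0.05 * marginal.max())]
print(f"Approximate saturation point: spend ~ {saturation_spend:.1f}")
```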
A practical case
We were contacted by a young brand that launched two years ago. Since then, they have been investing in different campaigns (Google, Facebook, LinkedIn, etc.). They have collected a limited amount of data and asked us to make sense of it in order to improve their advertising strategy. As a summary of their most costly campaign, they built the following table (Figure 3).
The table includes four columns and nineteen rows (the months they have been operating). They want to know whether they have already reached the saturation point in order to plan their next strategic step. In Figure 4, we plot the 'total utility', in this case the Monthly Recurring Revenue (MRR), against the 'quantity', the amount spent on advertising.
We should be able to estimate a saturation point, but based on the plot this does not seem viable: with so few samples, many different curves could fit the data. Furthermore, the impact of seasonality is significant (the month of the year influences the relationship). We want to explore whether a synthetic dataset can help them with this problem.
Generating synthetic data
In order to generate synthetic data, we are going to use the open source Python library nbsynthetic. We launched this library recently and have included new packages specifically to solve this problem. We have used a non-conditional Wasserstein Generative Adversarial Network (WGAN). According to its creators, the WGAN improves the stability of learning, gets rid of problems like mode collapse, and provides meaningful learning curves that are useful for debugging and hyper-parameter searches [4]. It is not the intention of this article to dive deep into the basics of this technology; please refer to [4] for detailed information.
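To show what distinguishes a WGAN from the vanilla GAN sketched earlier, here is a conceptual training step based on Arjovsky et al. [4]: the critic outputs an unbounded score instead of a probability, the losses are plain means of critic scores, and the critic's weights are clipped to enforce the Lipschitz constraint. This is a sketch of the idea, not the nbsynthetic implementation; network sizes and hyper-parameters are placeholder assumptions.

```python
# Conceptual WGAN training step after Arjovsky et al. [4] (illustrative only).
import tensorflow as tf
from tensorflow import keras

n_features, latent_dim, clip_value = 4, 16, 0.01   # placeholder assumptions

generator = keras.Sequential([
    keras.Input(shape=(latent_dim,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(n_features),
])
critic = keras.Sequential([
    keras.Input(shape=(n_features,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1),            # unbounded score, not a probability
])
g_opt = keras.optimizers.RMSprop(5e-5)
c_opt = keras.optimizers.RMSprop(5e-5)

def wgan_step(real_rows):
    noise = tf.random.normal((tf.shape(real_rows)[0], latent_dim))
    # Critic: maximise score(real) - score(fake), i.e. minimise the negative.
    with tf.GradientTape() as tape:
        c_loss = (tf.reduce_mean(critic(generator(noise)))
                  - tf.reduce_mean(critic(real_rows)))
    c_opt.apply_gradients(zip(tape.gradient(c_loss, critic.trainable_variables),
                              critic.trainable_variables))
    for w in critic.trainable_variables:          # weight clipping (Lipschitz)
        w.assign(tf.clip_by_value(w, -clip_value, clip_value))
    # Generator: push the critic to score generated rows higher.
    with tf.GradientTape() as tape:
        g_loss = -tf.reduce_mean(critic(generator(noise)))
    g_opt.apply_gradients(zip(tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
```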
Results
We have generated a 2,000-sample synthetic dataset from the original 19-sample table. The code can be found here. In Figure 5, we can see a comparison of both datasets.
We need to assess whether the synthetic data is indeed "similar" to the real data and whether it can replace it when making predictions. When it comes to tabular synthetic data, this is not a simple question. As previously mentioned, everything is considerably simpler with synthetic images: when we compare an original image with a synthetic one (for example, one with increased resolution), everyone agrees it is the same image. But when we create a synthetic dataset, this association is not visual. There are several methods to check data similarity, and even so, the notion of "similarity" remains a very complex mathematical concept. The most common methods are visual comparison, using a machine learning model, statistical tests, and topological data analysis. In our analysis, we are going to use visual comparison and a machine learning model; in the GitHub repo, the reader can also find the comparison using statistical tests and topological data analysis. Additionally, we are going to introduce a new method based on manifold learning.
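As a simple illustration of what a statistical check can look like (the repository relies on more elaborate two-sample tests [6,7]), the sketch below runs a two-sample Kolmogorov-Smirnov test on each numeric column shared by the real and synthetic tables. The function name and column handling are assumptions, not the repository's code.

```python
# Illustrative statistical check: a two-sample Kolmogorov-Smirnov test on
# each numeric column shared by the real and synthetic tables.
import pandas as pd
from scipy.stats import ks_2samp

def column_similarity(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Compare the marginal distribution of every shared numeric column."""
    rows = []
    for col in real.select_dtypes("number").columns:
        stat, p_value = ks_2samp(real[col], synthetic[col])
        rows.append({"column": col, "ks_statistic": stat, "p_value": p_value})
    return pd.DataFrame(rows)

# A high p-value means the test finds no evidence that the real and
# synthetic marginal distributions differ for that column.
```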
Visual comparison
This is the easiest and most straightforward method to check data similarity, when it is possible. With multidimensional data, we can only compare features in two- or three-dimensional plots, so if we have, for example, twenty features, we have to make many plots to check all possible feature combinations; from a practical point of view, this can be difficult to implement. In Figure 5, we can see the direct representation of our problem, and Figure 6 gives us an idea of where the saturation point is. We also see that the curve strongly depends on the month of the year. Of course, this visual information can be useful for making decisions.
Comparison with a machine learning model
To test the "interchangeability" of both datasets, we can use them in a machine learning problem. In our case, we used a Random Forest Regressor [5] to predict the MRR variable on the original dataset. Then we used the same algorithm to make the same prediction on the synthetic dataset. Finally, we used the algorithm trained on synthetic data to predict MRR from the original data values. The results are shown in the table below.
Original data
-------------
Score without cross validation = 0.32
Scores with cross validation = [ 0.19254948 -7.0973158 0.1455913 0.18710539 -0.14113018]
Synthetic data
--------------
Score without cross validation = 0.80
Scores with cross validation = [0.8009446 0.81271862 0.79139598 0.81252436 0.83137774]
Check algorithm with original data
----------------------------------
Score with cross validation prediction = 0.71
As we can see, predicting on the original dataset results in fairly unstable accuracy, depending on how we split the data for training and testing. When we apply a cross-validation strategy, the results are highly dispersed, with at best a modest prediction accuracy.
We get significantly more stable accuracy, with better results, when we train the model on synthetic data. As a conclusion, making the prediction on the synthetic dataset makes more sense than doing it on the original data (with its limited sample size) and results in an interesting accuracy. Lastly, we used the algorithm trained on synthetic data to predict the original data, again with a cross-validation strategy. The results reveal that, while the accuracy is slightly lower than that obtained by training on synthetic data, it is clearly far more robust and attractive than that obtained by training on the original data.
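A minimal sketch of the protocol described above, assuming `original` and `synthetic` are pandas DataFrames with the same columns and an "MRR" target column (the function names and the five-fold setting are assumptions, not the exact notebook code):

```python
# Sketch of the interchangeability protocol: cross-validate a Random Forest
# on each dataset, then train on synthetic rows only and score on the original.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score

def cv_scores(df: pd.DataFrame, target: str = "MRR"):
    """Five-fold cross-validated R^2 scores of a Random Forest on one dataset."""
    X, y = df.drop(columns=[target]), df[target]
    return cross_val_score(RandomForestRegressor(random_state=0), X, y, cv=5)

def synthetic_to_original_score(original: pd.DataFrame,
                                synthetic: pd.DataFrame,
                                target: str = "MRR") -> float:
    """Train on the synthetic rows only, then score against the original rows."""
    model = RandomForestRegressor(random_state=0)
    model.fit(synthetic.drop(columns=[target]), synthetic[target])
    preds = model.predict(original.drop(columns=[target]))
    return r2_score(original[target], preds)
```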
A different approach using Manifold learning
Manifold learning is a family of nonlinear dimensionality reduction techniques. Many datasets are thought to have an artificially high dimensionality: all of their information can be extracted from a lower-dimensional manifold embedded in the data space. Intuitively, for every high-dimensional data space there is an equivalent lower-dimensional one. This helps to simplify operations because it eliminates many of the challenges that arise when analyzing high-dimensional data spaces [9,10,11,12]. High-dimensional phenomena such as the curse of dimensionality (and its 'blessings') simply vanish.
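As a quick illustration of the idea (not part of the original analysis), scikit-learn's manifold module can project a 64-dimensional dataset onto a two-dimensional embedding; the digits dataset below is just a stand-in for any wide table:

```python
# Quick illustration: project a 64-dimensional dataset onto a 2-D manifold.
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap

X, _ = load_digits(return_X_y=True)            # 1797 samples x 64 features
embedding = Isomap(n_components=2).fit_transform(X)
print(X.shape, "->", embedding.shape)          # (1797, 64) -> (1797, 2)
```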
The variance concentration ratio (VCR) is a rigorous and explainable metric for quantifying data [13]. It was initially proposed by Han et al. (2021) to analyse high-frequency trading data, and it can be adopted to examine the explainability of manifold learning on high- and low-dimensional data.
First, we have to recall the concept of singular value decomposition (SVD) [14]. SVD seeks to convert a rank-R matrix into a rank-K matrix; in other words, we can approximate a list of R unique vectors as a linear combination of K unique vectors.
Singular values can be thought of as providing a "bridge" between two subsystems (two matrices): they are a measure of how much interaction exists between them (source: Math3ma).
SVD is a common technique for dimensionality reduction and manifold learning.
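The low-rank approximation described above can be written in a few lines of NumPy. The matrix shape (19 rows by 4 columns, echoing the table in this case study) and the chosen rank are illustrative assumptions:

```python
# Low-rank (rank-K) approximation with SVD, as described above.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(19, 4))                    # 19 months x 4 columns

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                                           # keep a rank-2 approximation
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print("Singular values:", np.round(s, 3))
print("Rank-2 reconstruction error:", np.linalg.norm(X - X_k))
```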
Given a dataset X with n observations and p variables, and singular values σ₁ ≥ σ₂ ≥ … ≥ σₚ, the variance concentration ratio is defined as the ratio between the largest singular value and the total sum of all singular values:

VCR = σ₁ / (σ₁ + σ₂ + … + σₚ)

It answers the question: what percentage of the data variance is concentrated along the direction of the first singular value?
An intuition for this metric could be as follows: we assume there is a low-dimensional manifold embedded in the data space that is equivalent to the high-dimensional space. This manifold also has several dimensions, and we can consider the first one to be the most important. We then measure how much of the data variation (variance) is reflected in this first dimension. When comparing original and synthetic data, we first check whether the dimensions of the low-dimensional manifolds are equivalent, and then whether the same concentration of data variation appears in the first manifold dimension.
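The figures reported below can be reproduced with a computation along the following lines (an interpretation of the definition above, not the exact nbsynthetic code): singular values are scaled so that the largest equals 1, and the VCR is the first singular value's share of the total.

```python
# Scaled singular values and VCR, as defined above.
import numpy as np

def scaled_singular_values(X: np.ndarray) -> np.ndarray:
    """Singular values divided by the largest one, as reported below."""
    s = np.linalg.svd(X, compute_uv=False)
    return s / s.max()

def vcr(X: np.ndarray) -> float:
    """Variance concentration ratio: sigma_1 / sum(sigma_i)."""
    s = np.linalg.svd(X, compute_uv=False)
    return s[0] / s.sum()
```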
Below we can see the results obtained from our original 19-sample table and the 2,000-sample synthetic dataset. We also include a comparison with a randomly generated dataset with the same number of instances as the synthetic one.
Singular values
---------------
Singular values for original dataset:  [1.0, 0.114, 0.051, 0.0]
Singular values for synthetic dataset: [1.0, 0.110, 0.046, 0.0]
Singular values for random dataset:    [1.0, 0.184, 0.027, 0.0]
Variance concentration ratio (VCR)
----------------------------------
Variance concentration ratio original data  = 85.85%
Variance concentration ratio synthetic data = 86.49%
Variance concentration ratio random data    = 82.56%
Our empirical results suggest the following rules:
- The singular values of the original and synthetic datasets have to be similar.
- The variance concentration ratio (VCR) of the synthetic dataset has to be equal to or higher than the VCR of the original data.
Conclusion
We have seen a common problem faced by many brands and agencies when preparing new campaigns: they want to base their decisions on the available data, but this data frequently does not allow them to extract relevant, actionable insights. We have seen that, by using synthetic data, they can get more value from the available data. Additionally, we have introduced several methods to show that synthetically generated data is "similar" to, and "interchangeable" with, real data when used in the decision-making process.
Final notes
- nbsynthetic is an open source project launched by NextBrain.ai
- We would like to thank Atomic 212 for its guidance in helping us understand marketing concepts and the requirements for data-driven solutions in digital marketing.
- Code for this article can be found here.
References
- Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., & Bengio, Y. et al. (2014). Generative adversarial nets. Advances in neural information processing systems, 27.
- Gossen, H. (1854). The Laws of Human Relations and the Rules of Human Action Derived Therefrom. (1983). Cambridge, MA: MIT Press.
- Johansson, J. K. (1979). Advertising and the S-Curve: A New Approach. Journal of Marketing Research, 16(3), 346–354. https://doi.org/10.1177/002224377901600307
- Arjovsky, M., Chintala, S. & Bottou, L.. (2017). Wasserstein Generative Adversarial Networks. Proceedings of the 34th International Conference on Machine Learning, in Proceedings of Machine Learning Research 70:214–223.
- Pedregosa et al. (2012). Scikit-learn: Machine Learning in Python, JMLR 12, pp. 2825–2830.
- Ilya Tolstikhin, Bharath K. Sriperumbudur, and Bernhard Schölkopf (2016). Minimax estimation of maximum mean discrepancy with radial kernels. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS’16). Curran Associates Inc., Red Hook, NY, USA, 1938–1946.
- Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. 2012. A kernel two-sample test. J. Mach. Learn. Res. 13, null (3/1/2012), 723–773.
- Donoho, David. (2000). High-Dimensional Data Analysis: The Curses and Blessings of Dimensionality. AMS Math Challenges Lecture. 1–32.
- Aggarwal, C., Hinneburg, A., & Keim, D. (2001). On the Surprising Behavior of Distance Metrics in High Dimensional Space. In Database Theory – ICDT 2001, 8th International Conference, London, UK, January 4–6, 2001. Berlin: Springer, pp. 420–434.
- K. S. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. (1999). When is "nearest neighbor" meaningful? in Proc. 7th Int. Conf. Database Theory, pp. 217–235.
- Alexander Hinneburg, Charu C. Aggarwal, and Daniel A. Keim. (2000). What Is the Nearest Neighbor in High Dimensional Spaces? In Proceedings of the 26th International Conference on Very Large Data Bases (VLDB ’00). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 506–515.
- François, D., Wertz, V., & Verleysen, M. (2007). The Concentration of Fractional Distances. IEEE Transactions on Knowledge and Data Engineering, 19, 873–886.
- Han, Henry & Teng, Jie & Xia, Junruo & Wang, Yunhan & Guo, Zihao & Li, Deqing. (2021). Predict high-frequency trading marker via manifold learning. Knowledge-Based Systems. 213. 106662. 10.1016/j.knosys.2020.106662.
- Trefethen, L. N., & Bau III, D. (1997). Numerical linear algebra (Vol. 50). Siam.