A novel method for training generative machine learning models

You can reproduce the experiments in this article by cloning https://github.com/narroyo1/pmt.
The previous article in this series, Approximating stochastic functions, introduced a novel method to train generative machine learning models capable of approximating any stochastic function with a single output variable. From this point on I will refer to this method as Pin Movement Training, or PMT for short, because of the analogy of placing pins on fabric and moving them that is used to illustrate it.
The method was described for functions with any number of inputs X but with only a single output Y. The present article will generalize PMT to functions with any number of outputs. A summary of the method will be provided and should be enough to understand how it works, but if you would like a more in-depth description you can read the previous article.
The generalized method, for reasons you will learn below, utilizes an architecture similar to that of autoencoders. Because of this and because the uniform sampling distribution may be more convenient for many applications, I believe this method is a valid alternative to Variational Autoencoders.
Refresher of the original method
Let’s say that we want to use a neural network to approximate a stochastic function defined as 𝑓(𝑥) → 𝑌 where x is an input of any number of dimensions in X and Y is a one-dimensional random variable.
The first thing we will do is to introduce a secondary input Z that we define as a uniformly distributed random variable in a range [Zₘᵢₙ, Zₘₐₓ]. This is necessary in order to introduce randomness to an otherwise deterministic system. This gives us a neural network defined by 𝑓𝜃(𝑥,𝑧∼𝑍) → 𝑌 where 𝜃 represents the network trained weights.
Now let’s visualize any given point 𝑥′, 𝑠.𝑡. x′ ∈ X. For this x′ we want to map the whole range [Zₘᵢₙ, Zₘₐₓ] to Yₓ′. That is, f(x′, Zₘᵢₙ) should be as similar as possible to min(Yₓ′) and f(x′, Zₘₐₓ) should be as similar as possible to max(Yₓ′). Additionally the mid-points f(x′, mid(Z)) and mid(Yₓ′) should be as similar as possible, and of course the same goes for every other point in the range (see Fig. 1).

In order to achieve this let’s think of model 𝑓𝜃 as a stretchable and transparent fabric on which X is represented horizontally and Z is represented vertically. Also let’s imagine a board with all the data points in the dataset plotted in it, in this board X is represented horizontally and Y is represented vertically. We then proceed to place the fabric on top of the board.
For every data point we place a "pin" on the fabric at the vertical midpoint of Z or mid(Z). We then compare the positions of the pin and the data point. If the data point is higher than the pin then, without unpinning the pin on the fabric, we move the pin upwards a predefined distance so that it lands in a higher position on the board. The pin will stretch or shrink the fabric with this motion. If it is lower then we move the pin downwards a predefined distance. We add the distances moved upwards and downwards and call the sum total movement.
After processing every data point, if the pin was not initially in the midpoint, the total movement will be greater in the direction of the actual midpoint. After repeating the process enough times the pin will reach a position close to the midpoint where the total movement upwards and downwards is equal, that is, the number of data points above it is the same as the number of data points below it. See Fig. 2 for an animation of how this process stabilizes.

Now suppose that instead of putting the pin at the midpoint of Z we put it at a point 1/3 of the distance along the range [Zₘᵢₙ, Zₘₐₓ] from the lowest point Zₘᵢₙ, and that instead of moving it the same predetermined distance upwards and downwards, we move it 1.5 times the predetermined distance when going downwards and 0.75 times the predetermined distance when going upwards. Then this pin will reach a stability point (where the total movement upwards and the total movement downwards are equal) at a place roughly above 1/3 of the data points.
This is because distance upwards × higher data points = distance downwards × lower data points, i.e. 0.75 ∗ 2/3 = 1.5 ∗ 1/3 = 0.5. See Fig. 3 for an animation of how this process stabilizes for pins at Zₘᵢₙ + 1/3 and Zₘᵢₙ + 2/3.

How do we achieve this movement using a neural network? In order to move the "pins on the fabric" with a neural network, we select a value in Z (which we call a z-pin) and do backpropagation with the target value being the z-pin plus/minus the predetermined distance, like this:

Leveraging this principle we can select points uniformly in Z and over a number of epochs we obtain the mapping we require. i.e. 𝑓𝜃(𝑥,𝑧∼𝑍) → 𝑌.
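The target construction described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the repository's actual code; the constant M and the function names are hypothetical:

```python
import numpy as np

# Sketch of building 1-D backpropagation targets: each pin's prediction is
# moved a predetermined distance M toward its observed data point.
M = 0.1  # predetermined movement distance (illustrative value)

def pin_targets(y_pred, y_data):
    """Target = current prediction moved distance M toward the observed data."""
    direction = np.sign(y_data - y_pred)  # +1 if the data point is above the pin
    return y_pred + M * direction

y_pred = np.array([0.0, 0.5])
y_data = np.array([1.0, 0.2])
targets = pin_targets(y_pred, y_data)  # moves the first pin up, the second down
```

In a real training step the movement would additionally be weighted by the per-side scalars before backpropagation.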
Notes from the original article
- In the original article the fabric stretching/shrinking analogy referred to pins that were used to reshape the model, while the model definitions and the training method used the term z-samples for the same concept. In the present article and in the future these will be referred to exclusively as z-pins.
- When selecting z-pins the original article always placed them evenly distributed in Z and also used the same positions on every data point for every epoch. This is not necessary though, the only requirement is that the z-pins are uniformly distributed in Z.
- The original article would use multiple z-pins per data point. This is also not necessary and it is sufficient to select a single z-pin per data point. In the present article all the experiments will select a single z-pin per data point per epoch.
Generalizing for multiple outputs
Having revisited the original method for one output, let’s move on to the changes necessary to make it work for multiple outputs.
Redefining the Z-space
Let’s define Z, the sampling space from which we select our z-pins. In the original article Z was defined as a single dimensional range described simply by lower and upper bounds [Zₘᵢₙ, Zₘₐₓ]. However in the generalized method, in order to be able to handle multidimensional outputs Y, Z must be defined in multiple dimensions as well (note however that the number of dimensions in Z and Y need not be the same).
In theory it could be any bounded n-dimensional space, but because it makes calculating the scalars easier, as you’ll see later, I chose to use a hyper-sphere, which can be defined by an origin O, a radius R and a dimensionality N (see Fig. 4).

Now let’s define a few concepts related to Z that will be needed to move ahead.
- z-pins: These are uniformly sampled points in Z. They can be defined as an N-dimensional vector like this: zₚᵢₙ = (z₀, z₁, …, zₙ) where z₀, z₁, … are coordinates in Z.
- z-dirs: A z-dir is a direction in Z that can be defined as a unit vector based at origin O like this: z-dir = O + (ž₀, ž₁, …, žₙ)
- z-lines: A z-line is a line in Z such that it runs between any two points in Z. We will define it as a line with a z-pin origin and a z-dir including all points in it that are inside of Z like this: zₗᵢₙₑ = zₚᵢₙ + z-dir s.t.∀z ∈ zₗᵢₙₑ : z ∈ Z
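These definitions translate almost directly into code. The following NumPy sketch (function names are mine, not from the repository) samples uniform z-dirs on the unit hypersphere surface and uniform z-pins inside the hypersphere defined by O and R:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_z_dirs(n, dims):
    """Uniformly distributed directions: normalize Gaussian draws onto the
    surface of the unit hypersphere."""
    v = rng.normal(size=(n, dims))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def sample_z_pins(n, dims, origin, radius):
    """Uniformly distributed points inside the hypersphere Z: a uniform
    direction scaled by radius * u**(1/dims) so the volume is covered evenly."""
    dirs = sample_z_dirs(n, dims)
    r = radius * rng.uniform(size=(n, 1)) ** (1.0 / dims)
    return origin + r * dirs

dirs = sample_z_dirs(1000, 3)
pins = sample_z_pins(1000, 3, origin=np.zeros(3), radius=2.0)
```

The `u**(1/dims)` radius correction is what keeps the density uniform over the volume rather than concentrated at the center.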
The model
Moving into a multidimensional Z introduces an important challenge. In the case of one-dimensional Z and 𝑌 spaces, it was very simple to tell whether the selected z-pin projection, i.e. 𝑓𝜃(x, zₚᵢₙ), was greater or smaller than the observed data point in order to decide which direction to move it in. In one dimension "greater than" in 𝑌 could simply be translated to "greater than" in Z and the z-pin could simply be moved up. This is because we were mapping a line to another line.
But with multidimensional 𝑌 and Z it is not possible to assume that the spaces will have the same shape or even the same number of dimensions. This means that in order to decide the direction in which to move a z-pin based on its relation to a data point, it is necessary to map that data point from 𝑌 to Z. So in addition to training function 𝑓𝜃 to generate values in 𝑌, we’ll also need to train an inverse function 𝑓𝜃⁻¹ to map data points to Z. This fact changes our model architecture to something like this:

The left side of the model allows us to map points in 𝑌 to Z. The right side of the model allows us to generate random samples in 𝑌 by sampling points in Z.
You may have noticed that this architecture is similar to that of plain autoencoders and indeed it is. This has the added benefit of making the method useful for learning latent representations that are bounded and evenly distributed.
The Method
Having defined all the concepts we need we can proceed to discuss how pin movement works in multiple dimensions.
Mapping data points to Z
The first step is to use the inverse function 𝑓𝜃⁻¹ (or encoder, going with autoencoder terminology) to map all data points in the batch from 𝑌 space to Z. We will call the original data points y-data and the mapped data points z-data.

Selecting the z-pins
Next we must select some z-pins. In order to do so, we start by selecting uniformly sampled z-dirs, one for every data point. The easiest way to do so is by choosing random points on a hypersphere surface with the same dimensionality as Z. Then we translate the selected z-dirs to have the data points mapped in the previous step (z-data) as origins. This gives us some z-lines, as you can see in Fig. 7.

Once we have our z-lines we proceed to randomly select points on these lines; these will be our z-pins. Fig. 8 shows what this can look like.
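A sketch of this step, assuming a hyperspherical Z (names are hypothetical): intersect the z-line with the sphere to find its endpoints, then pick a point uniformly on the resulting segment.

```python
import numpy as np

def z_line_bounds(z_data, z_dir, origin, radius):
    """Intersect the line z_data + t * z_dir with the sphere |z - origin| = R.
    Assumes |z_dir| == 1 and that z_data lies inside the sphere."""
    rel = z_data - origin
    b = rel @ z_dir
    c = rel @ rel - radius ** 2
    disc = np.sqrt(b * b - c)  # real because z_data is inside the sphere
    return -b - disc, -b + disc

def sample_z_pin(z_data, z_dir, origin, radius, rng):
    """Select a z-pin uniformly on the part of the z-line inside Z."""
    t_lo, t_hi = z_line_bounds(z_data, z_dir, origin, radius)
    return z_data + rng.uniform(t_lo, t_hi) * z_dir

rng = np.random.default_rng(0)
bounds = z_line_bounds(np.zeros(2), np.array([1.0, 0.0]), np.zeros(2), 1.0)
pin = sample_z_pin(np.zeros(2), np.array([1.0, 0.0]), np.zeros(2), 1.0, rng)
```

Since |z_dir| = 1, the intersection reduces to a quadratic in t with coefficients b and c as above.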

For the method to work, it is necessary that for any given z-line in Z, every mapped data point z-data on it has an equal probability of occurring; otherwise the equations in Calculating the movement scalars would not hold up.
Given a 2-dimensional Z and any given z-line in it, let’s picture the line as having a minimal width 𝜖, so that it looks like a long rectangle, similar to the z-lines in Fig. 8. The probability of any given 𝑧 existing in it is the area of this "thin" z-line over the area of Z.

Since this "thin" z-line is rectangular, any given segment 𝑠 of minimal length 𝛿 along its length has an equal area and therefore any given 𝑧 has an equal probability of being in the segment.

Also, the probability that any given 𝑧 inside this "thin" z-line selects the z-dir of this "thin" z-line is constant, given that the z-dirs are selected using a uniform distribution.

Taking equations (2) and (3) together, we get the probability of any 𝑧 being on a given segment of a z-line and selecting that z-line’s z-dir, and this probability is the same for every segment, which satisfies the requirement above.

The probability is independent of the position of 𝑧 in the z-line, so the distribution in any z-line is uniform.
Calculating the target values
After selecting the z-pins we can proceed to calculate the target values (or z-targets) to use in our backpropagation. All we have to do for this is to add to every z-pin the movement constant 𝑀 in the direction of the mapped data point 𝑧-𝑑𝑎𝑡𝑎.

Fig. 9 shows how the z-targets are calculated.
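In code, computing the z-targets is a one-liner per pin. A sketch with hypothetical names, where M is the movement constant:

```python
import numpy as np

M = 0.05  # movement constant (illustrative value)

def z_targets(z_pins, z_data):
    """Move every z-pin a fixed distance M in the direction of its mapped
    data point z-data."""
    delta = z_data - z_pins
    dirs = delta / np.linalg.norm(delta, axis=1, keepdims=True)
    return z_pins + M * dirs

targets = z_targets(np.array([[0.0, 0.0]]), np.array([[1.0, 0.0]]))
```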

Calculating the movement scalars
The way that movement scalars are calculated is similar to the way it was done in the original one-dimensional method.
Let’s start by picturing a z-line along with a z-pin and some mapped data points 𝑧-𝑑𝑎𝑡𝑎, as we see in Fig. 10.

Let’s call a the distance from the z-pin to one end of the z-line and b the distance to the other end. And let’s call the number of data points on the former side a’ and the number of data points on the latter side b’. Our purpose is to make quantity a’ proportional to distance a and b’ proportional to b, i.e. 𝑎:𝑏::𝑎′:𝑏′.
Next we will call 𝛼 the scalar that we will use on the movement applied to the z-pin for all data points on the side of length a. And we will call 𝛽 the scalar that we will use on the movement applied to the z-pin for all data points on the side of length b.
We will also call T the total movement, which is the sum of moving the z-pin a constant movement M towards the side of every data point multiplied by that side’s scalar.

We want T to be 0 (i.e. stabilized) when 𝑎′/(𝑎′+𝑏′) ≈ 𝑎/(𝑎+𝑏) ∧ 𝑏′/(𝑎′+𝑏′) ≈ 𝑏/(𝑎+𝑏), that is, when the z-pin divides the intended proportion of data points to both sides. Substituting T with 0 in (5) gives us the equation:

Now let’s remember that not all z-lines will have the same length, since they are bounded by the hypersphere defined by Z the ones towards the center will be longer than the ones at the edges. Longer lines will represent larger spaces in Z (see equation (1)) so their influence in the movement should be proportional to their length. We want T to be linearly proportional to the length of the z-line which gives us the equation:

If we put together (6) and (7) we get that the scalars should have these values:

These equations are similar to the ones in the original article.
You may have noticed that these equations break down towards the edges, i.e. when either a or b tends to 0.
In order to solve this problem a maximum scalar constant S is introduced to clamp the scalars. Of course when clamping the scalars we have to be careful to adjust the value for both sides; for example if a is very small (and therefore 𝛼 is large) but the data point is on side b, the scalar 𝛽 must be adjusted as well, otherwise equation (5) will not hold.
We start by selecting the largest of the 2 scalars, 𝑚𝑎𝑥(𝛼, 𝛽). Then we calculate an adjustment value by dividing S by 𝑚𝑎𝑥(𝛼, 𝛽) and clamping it to 1.0 so that it is always a number in the range [0, 1]. We will use the adjustment value to prevent scalars from going over S. Finally, if a is 0.0, then the values of 𝛼 and 𝛽 are S and 0.0 respectively, and the other way around if b is 0.0. This gives us the revised equation (8b).

Below you can see how the plots for the scalars proportional to a or b look like. Notice how they are clamped beyond the selected S.

Having calculated both scalars we can choose the one to use by determining the side on which the data point resides.
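The scalar computation with clamping can be sketched as follows. The exact proportionality constant is my reading of equations (6)-(8b), so treat the values as illustrative:

```python
S = 10.0  # maximum scalar constant (illustrative value)

def movement_scalars(a, b):
    """Scalars inversely proportional to the pin's distance to each end of the
    z-line and proportional to the line's length, clamped so neither exceeds S.
    At the degenerate edges (a == 0 or b == 0) the pair is set to (S, 0) or (0, S)."""
    if a == 0.0:
        return S, 0.0
    if b == 0.0:
        return 0.0, S
    alpha, beta = (a + b) / a, (a + b) / b
    adjust = min(1.0, S / max(alpha, beta))  # shrink both sides together
    return alpha * adjust, beta * adjust
```

Scaling both scalars by the same adjustment preserves their ratio, which is what keeps equation (5)'s balance intact.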

Training the model
Now that all the concepts involved are clear we can move on to describe the training algorithm.
1. Pretraining and selecting the Z hyperparameters
The algorithm described makes the assumption that the models 𝑓𝜃⁻¹ and 𝑓𝜃 inversely match each other. This can lead to a slow start if we train these 2 models to match each other at the same time as we do pin movement. So it has been found beneficial to do a "pretrain" stage in which we only train 𝑓𝜃⁻¹ and 𝑓𝜃 to match each other. This stage is essentially ordinary autoencoder training. After the reconstruction error has reached a reasonably low value the algorithm can proceed to the main training.
This pretrain stage has the added advantage that it makes it easier to define Z when it completes. In the section Redefining the Z-space it was mentioned that Z is defined by an origin O and a radius R. Having pretrained the model for some time, all we do is run a batch of data points through the inverse model to calculate a set Z-data.

Then we take the mean of this set and use it as the origin O.

We can also use the mean distance in Z-data to O as R, however it has been seen that experimenting with and tuning this value may give better results.

This works because after the "pretrain" stage, the model has found a region capable of representing the data, so defining Z in its vicinity will likely have a low reconstruction error.
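A sketch of how O and R could be derived from the pretrained encoder's outputs (the `radius_scale` parameter is the tunable value mentioned above; names are hypothetical):

```python
import numpy as np

def fit_z_space(z_data, radius_scale=1.0):
    """Define Z from a batch mapped by the pretrained inverse model:
    origin = mean of z-data, radius = scaled mean distance to that origin."""
    origin = z_data.mean(axis=0)
    radius = radius_scale * np.linalg.norm(z_data - origin, axis=1).mean()
    return origin, radius

origin, radius = fit_z_space(np.array([[1.0, 0.0], [-1.0, 0.0]]))
```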
2. Pin Movement
To start pin movement we select a batch of data y-data = {y-data₀, y-data₁, …, y-dataₙ} from the training dataset and map it to z-data = {z-data₀, z-data₁, …, z-dataₙ} as explained in Mapping data points to Z.
The next step is to randomly select the set of z-pins {z-pin₀, z-pin₁, …, z-pinₙ} (one for every data point in the batch) in the way described in the section Selecting the z-pins.
Note that it is possible to select multiple z-pins per data point, but it is not necessary, and for simplicity we will use only one in the experiments.
Then we calculate the target values z-targets = {z-target₀, z-target₁, …, z-targetₙ} and scalars s = {s₀, s₁, …, sₙ} as explained in the sections Calculating the target values and Calculating the movement scalars.
Having the z-targets, we calculate the current model predictions by running them through 𝑓𝜃, this gives us:

Now we have everything for the first component of our loss function:

Notice that we are using a Weighted Mean Absolute Error (WMAE) function instead of a Weighted Mean Squared Error (WMSE). This is because the latter is designed to punish larger differences while we are moving all of our pins the same distance.
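The pin movement component can therefore be sketched as a weighted MAE over the distances between predictions and z-targets (a simplified NumPy stand-in for the backpropagated loss; names are hypothetical):

```python
import numpy as np

def wmae(pred, target, weights):
    """Weighted Mean Absolute Error: per-pin distances to the z-targets,
    each weighted by its movement scalar."""
    return np.mean(weights * np.linalg.norm(pred - target, axis=-1))

loss_p = wmae(np.array([[0.0], [0.0]]),
              np.array([[1.0], [2.0]]),
              np.array([1.0, 0.5]))
```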
3. Reconstruction Loss
The next component to our loss function is the difference between our model 𝑓𝜃 and our inverse model 𝑓𝜃⁻¹. This is very similar to the Reconstruction Loss in both variational and ordinary autoencoders. It is necessary to pass the batch data points to 𝑓𝜃⁻¹, take the results and pass them on to 𝑓𝜃 and then run backpropagation using the results and the original data points.

4. Inverse Reconstruction Loss
Before we define the last component of the loss function, let’s explain why it is necessary. Ideally at the end of the training both 𝑓𝜃 and 𝑓𝜃⁻¹ will be bijective, meaning that there will be an exact one-to-one correspondence between the domain and codomain Z and Y. However during training this is not guaranteed to be the case and it is possible that areas in Z will not be mapped to Y.

As you can see in Fig. 12, as a result of training with component loss-y, 𝑓𝜃 and 𝑓𝜃⁻¹ agree with each other as far as Y goes. That is, ∀y ∈ Y, 𝑓𝜃(𝑓𝜃⁻¹(𝑦)) ≈ y. However not all of Z is used and some points in Y map outside of it. This is a problem because it breaks the assumption that moving a z-pin moves it to a position that will map to a point in Y that both 𝑓𝜃 and 𝑓𝜃⁻¹ agree on.
Fig. 12 shows 2 of the problems that may happen. A "fold" in the fabric happens when 2 or more points in Z map to the same point in Y. An "out-of-bounds" happens when a point in Z maps to a point outside of Y.
In order to solve this problem we add a third component to the loss function:

What this does is synchronize 𝑓𝜃 and 𝑓𝜃⁻¹ with regards to Z, and it does so by selecting random points in Z instead of using the training set data points.
Notice that for both the reconstruction loss and the inverse reconstruction loss we simply use Mean Squared Error (MSE).
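Both reconstruction terms are plain MSE round trips. Here is a sketch where `encode` and `decode` stand in for 𝑓𝜃⁻¹ and 𝑓𝜃 (the identity stand-ins are only for illustration):

```python
import numpy as np

def reconstruction_loss(encode, decode, y_batch):
    """MSE between data points and their round trip through f^-1 then f."""
    return np.mean((decode(encode(y_batch)) - y_batch) ** 2)

def inverse_reconstruction_loss(encode, decode, z_random):
    """MSE between random points in Z and their round trip through f then f^-1."""
    return np.mean((encode(decode(z_random)) - z_random) ** 2)

# With identity stand-ins both losses are zero.
identity = lambda v: v
l_y = reconstruction_loss(identity, identity, np.ones((4, 2)))
l_z = inverse_reconstruction_loss(identity, identity, np.zeros((4, 2)))
```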
5. Loss function
Now that we have all the components of the loss function, all that’s left is to define weights for each of them, which we will name 𝛾-𝑝, 𝛾-𝑦 and 𝛾-𝑧. We can put together (10), (11) and (12) to define the loss function like this:

All that’s left after this is to run backpropagation on the loss.
Testing the model
In the original article we used goal 1 and goal 2 testing, which measured the density of data points between z-pins and compared it to the densities of the test dataset. However in a multi-dimensional space this approach is not practical, since the spaces between z-pins scale rapidly in number.
The original article also used Earth Mover’s Distance (EMD) as an indicator of the model’s performance. For multi-dimensional PMT we will use EMD to measure the model’s accuracy. We will define the EMD error by comparing data points from the training dataset against data points generated by the PMT model.

And in order to have an idea of what the lowest EMD error would be we will also calculate a base EMD by comparing data points from the training dataset against data points in the test dataset.

This gives us a baseline that we can compare against E-emd to measure the accuracy of the models.
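One way to compute an exact EMD between two equal-sized point sets is optimal one-to-one matching (the Hungarian algorithm, available in SciPy). This is a sketch of how the E-emd / B-emd comparison could be implemented, not necessarily the repository's approach:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def emd(samples_a, samples_b):
    """Earth Mover's Distance between two equal-sized point sets: the mean
    pairwise distance under the optimal one-to-one matching."""
    cost = np.linalg.norm(samples_a[:, None, :] - samples_b[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()

a = np.array([[0.0, 0.0], [1.0, 0.0]])
b = np.array([[1.0, 0.0], [0.0, 0.0]])
zero_emd = emd(a, b)  # identical sets (reordered) have zero EMD
```

For large samples the O(n³) matching becomes expensive, which is one reason the MNIST experiment below skips the EMD plot.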
Comparison with Variational Autoencoders
The most similar generative model to PMT is the Variational Autoencoder (VAE). It has an almost identical neural network architecture and acts as both a generative model and a latent representation mapper. The biggest difference between the two is that the source distribution in VAE is unbounded (Gaussian) while the one in PMT is bounded (uniform).
The experiments show, however, that for both bounded and unbounded target distributions PMT outperforms VAE. Furthermore, the reconstruction error in PMT is significantly lower than in VAE. The reason for this may be that the components of the loss function cooperate with each other in PMT, as opposed to competing with each other in VAE. Also, because the target distribution is uniform, the spacing between data points in Z can be larger.
Another difference is that PMT takes a larger number of hyperparameters: 𝑆 (maximum scalar), 𝛾-𝑝 (pin movement weight), 𝛾-𝑦 (reconstruction loss weight), 𝛾-𝑧 (inverse reconstruction loss weight) and 𝑀 (movement constant), compared to VAE’s hyperparameters, which are just the KLD weight. This may make PMT training more difficult.
Finally, PMT takes longer to train per epoch than VAE. This is because it is necessary to do an extra pass to calculate the z-targets, and also because the loss function has an additional component (see equation (12)).
Experiments
Now I will try out the model in several datasets. In order to make them easier to plot, the experiments presented will have no X inputs.
Due to the similarities with VAE, every experiment will be done using both PMT and VAE models for comparison. In every experiment both models will have identical neural network architectures.
The source code and everything needed to reproduce the experiments below can be found in https://github.com/narroyo1/pmt.
Multiple blobs
The first dataset I’ll try is generated using make_blobs() from the sklearn library. As its name suggests, it generates a number of Gaussian blobs, and it is a good dataset to test how PMT performs with unbounded datasets.





Fig. 13a shows the test data generated by the make_blobs() function. Fig. 13b and Fig. 13c show animations of the PMT and VAE training methods respectively.
Fig. 13d shows the plot of the EMD errors (𝐸-𝑒𝑚𝑑) calculated for both PMT and VAE, along with the base value (𝐵-𝑒𝑚𝑑). As you can see, PMT’s 𝐸-𝑒𝑚𝑑 values are closer to 𝐵-𝑒𝑚𝑑 than those of VAE, which means that its performance is better.
Fig. 13e shows the plot of the reconstruction error for both PMT and VAE. As you can see, PMT’s reconstruction error is an order of magnitude lower than that of VAE.
Square with another square
The second dataset is quite simple. We just have an outer square with a uniform distribution of data points inside it, and an inner square that is also filled with a uniform distribution but with a greater density. This will help us test for non-Gaussian distributions with sharp details.





Fig. 14d shows the EMD errors plot and you can see from this that PMT outperforms VAE.
Fig. 14e shows the reconstruction error values, and you can see that PMT’s reconstruction error is over 2 orders of magnitude lower than that of VAE.
Human behavior
The next dataset is made of human body motion sensor data acquired by performing several physical activities. It was extracted from Mobile Health Human Behavior Analysis Dataset¹. This one has 3 dimensions instead of 2 like the previous datasets.





Fig. 15d shows the EMD errors plot, and once again PMT outperforms VAE.
Fig. 15e shows that PMT‘s reconstruction error is over an order of magnitude lower than that of VAE.
MNIST
Lastly we have the famous MNIST Dataset². As you know, it contains bitmaps of numbers written by humans, and the task here is to generate new data points that look like real hand-drawn numbers. This is an interesting dataset because it has a large number of output dimensions (784) and a latent space of 4 dimensions.







This dataset does not have an EMD error plot since it would be very difficult to calculate (and not indicative) given the large number of output dimensions.
Fig. 16b plots the reconstruction errors, once again PMT‘s is lower than VAE‘s.
Conclusion and what’s next
Approximating stochastic functions with a single output is very useful for forecasting single value distributions, like temperatures or market values. But the ability to produce multiple outputs makes the method apt for a large variety of use cases like simulation and generative tasks.
The multiple output method described in this article has proved capable of outperforming VAEs in both probabilistic likeness and reconstruction on the experiment datasets. I believe it will produce better results in a variety of real-world use cases too.
In the future I would like to continue testing PMT on higher dimensional datasets for generative purposes, such as Fashion MNIST and CelebA. For this purpose it will also be necessary to experiment with deep networks and Convolutional Neural Networks (CNN).
Feel free to reach out to me with any questions or comments.
[1]: Mobile Health Human Behavior Analysis
https://www.kaggle.com/datasets/gaurav2022/mobile-health
CC0 Public Domain https://creativecommons.org/publicdomain/zero/1.0/
[2]: MNIST handwritten digit database, Yann LeCun, Corinna Cortes and Chris Burges