Hands-On Tutorial; Shifting Perspective in AI

Revisiting MNIST with Fresh Eyes

A spatial perspective on MNIST data, rather than the usual vector treatment

Raveena Jay
Towards Data Science
8 min read · Mar 21, 2022

--

Photo by nine koepfer on Unsplash

By 2022, the MNIST dataset had become the "Hello World" dataset for people entering AI. Hell, it's probably been that for a few years already.

In the usual AI treatment of MNIST, whether the data is loaded into a neural network, logistic regression, random forest, or insert-AI-algorithm-here, each handwritten digit image, with its pixel values, is treated as a vector: the square image is flattened into a long vector, and that vector is fed into the algorithm. For thousands of digits, training the algorithm means feeding in thousands of these flattened pixel vectors, stacked into a matrix, for the algorithm to crunch.

Image credit: https://medium.com/dataman-in-ai/module-6-image-recognition-for-insurance-claim-handling-part-i-a338d16c9de0

However, I want you, the reader well-versed in AI, to take a different perspective on the digits, one you almost take for granted with your own human vision because it feels so intuitive; our eyes are quite literally built for it.

Image data has a trait that other types of data, such as tabular records, NLP text, or audio, don't have: the intuitive spatial configuration of the pixels in the image. So instead of throwing away that spatial structure by flattening the rows into a single vector, let's take advantage of the property our eyes and brain use intuitively!

Converting Image Pixels into Cartesian Points

The first step we need to take is to convert the image pixels into Cartesian (x, y) points. We take a single image and transform its values so that we can recreate it as a set of dots in the xy-plane. This is the original pixelated image of a 3:

Image Credit: Author. The original sample MNIST image of a “3”

Each pixel in the black-and-white image above has an index pair: (0,0) is the top-left pixel, and the vertical index grows downward, so the bottom-left corner of the image sits at vertical index 27. That's the opposite of the usual Cartesian xy-plane, where (0,0) is in the bottom-left and y grows upward. So the first step is a transformation that converts pixel coordinates to Cartesian coordinates, which is done in the code below:

import numpy as np
import itertools

normalize_image = np.asarray(image).astype('float32') / 255.0      # scale pixel intensities into [0, 1]
coordinates = list(itertools.product(range(0, 28), range(0, 28)))  # all (row, column) pixel indices
...  # keep only the coordinates whose pixel value exceeds a threshold ("filtered_points")
x_values = np.float16(np.asarray([fp[1] for fp in filtered_points]))       # column index becomes x
y_values = 27 - np.float16(np.asarray([fp[0] for fp in filtered_points]))  # flip the row index so y grows upward
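For readers who want to run this end to end, here is a minimal self-contained sketch of the conversion. The function name pixels_to_points and the 0.5 intensity threshold are my own assumptions for illustration, not the article's original code:

import itertools
import numpy as np

def pixels_to_points(image, threshold=0.5):
    # Convert a 28x28 grayscale image into Cartesian (x, y) points.
    # The 0.5 threshold is an assumed cutoff, not the author's value.
    normalized = np.asarray(image).astype('float32') / 255.0
    coords = itertools.product(range(28), range(28))
    # keep only pixels brighter than the threshold
    filtered_points = [(r, c) for (r, c) in coords if normalized[r, c] > threshold]
    x_values = np.asarray([c for (r, c) in filtered_points], dtype=np.float16)
    y_values = 27 - np.asarray([r for (r, c) in filtered_points], dtype=np.float16)
    return np.stack([x_values, y_values], axis=1)

Calling something like pixels_to_points(some_28x28_image) then returns an (N, 2) array of points ready for the next step.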

Then we add statistical noise (drawn from a normal distribution) to each point so that it "fills up" the space between the pixel points. Each pixel sits at an integer coordinate, so converting to Cartesian only gives us integer coordinates, which isn't a very data-rich representation. The basic idea is to go to each integer Cartesian coordinate converted from a pixel, sample 4 new points as random noise around it, and plot all 5 points together. The code is this:

for i in range(0, data.shape[0]):  # loop through each Cartesian coordinate
    point = data[i, :]
    mean = point                                   # center the noise on the original point
    cov = np.array([[0.0001, 0], [0, 0.0001]])     # tiny isotropic spread
    ...
    # draw 4 noisy points around the original coordinate
    test_x, test_y = np.random.multivariate_normal(mean, cov, 4).T
    ...  # collect the original point plus its 4 noisy neighbors into filled_in
return filled_in
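Here is a minimal runnable sketch of that loop wrapped in a function; the name add_jitter and the way the points are collected are my assumptions, not the author's exact code:

import numpy as np

def add_jitter(data, n_noise=4, var=0.0001, seed=None):
    # Return the original (x, y) points plus n_noise Gaussian-jittered copies of each.
    rng = np.random.default_rng(seed)
    cov = np.array([[var, 0.0], [0.0, var]])
    filled_in = []
    for i in range(data.shape[0]):
        point = data[i, :]
        filled_in.append(point)                                # keep the original point
        filled_in.extend(rng.multivariate_normal(point, cov, n_noise))  # 4 jittered neighbors
    return np.asarray(filled_in)

So result = add_jitter(pixels_to_points(image)) would produce the noise-filled point cloud described above.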

And the finished Cartesian-coordinate xy-plane result looks like this!

Image Credit: Author. You can see that this image, made from the (normalized) Cartesian pixel-integer coordinates with noise added, looks really similar to the MNIST black-and-white image above!

Here Comes the Machine Learning!

Okay, so you’re probably asking: why did we go through this whole process just to make another look-alike of the MNIST image with Cartesian points? What’s the big deal?

Well, here's the big deal: because we now have (x, y) points as data, we can perform machine learning on them as if they were any other 2D dataset! The machine learning technique we're going to use is called a Gaussian mixture model. Basically, it's a fancy way of teaching the computer to understand the distribution of points using Gaussian densities (just think of them as Gaussian ovals; they literally look like ovals, as you're about to see in a second).

The code to create this is below ("plot_gmm" is a custom visualization function; credit goes to Jake VanderPlas for the code that visualizes the clusters. His function can be found here: https://jakevdp.github.io/PythonDataScienceHandbook/05.12-gaussian-mixtures.html):

from sklearn.mixture import GaussianMixture

# gaussian_learning() is the author's helper that picks the number of components for this image
selected_model = GaussianMixture(n_components=gaussian_learning(result), random_state=1, covariance_type='full')
plot_gmm(gmm=selected_model, X=result)  # plot the fitted clusters over the points. Credit to Jake VanderPlas.
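The article doesn't show gaussian_learning itself. A common way to pick the number of components is to fit models with increasing component counts and keep the one with the lowest BIC; the sketch below is my assumed stand-in for that helper, not the author's actual implementation:

import numpy as np
from sklearn.mixture import GaussianMixture

def gaussian_learning(points, max_components=12, random_state=1):
    # Pick a component count by minimizing BIC (an assumed stand-in for the author's helper).
    bics = []
    for k in range(1, max_components + 1):
        gmm = GaussianMixture(n_components=k, covariance_type='full', random_state=random_state)
        gmm.fit(points)
        bics.append(gmm.bic(points))
    return int(np.argmin(bics)) + 1  # +1 because component counts start at 1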
Image credit: Author

Nice colors! But what does this tell us, concretely? Well, it shows us that the AI algorithm was able to figure out the mixture pieces required to create points in the shape of that “3”.

But wait a minute: this was for only one specific “3” in the dataset. Wouldn’t we have to do this visualization for every single “3” in the MNIST set? Well, yes, but instead of doing a visualization, let’s focus on a different property gleaned from this set-up: the number of mixture components used in the mixture machine-learning process.

Aggregating Interesting Properties Across the Data

Let’s look at the image above again. We can see that the ML algorithm used 6 mixture components to derive the distribution of points that make a “3”. Is this common across all “3”s in the set? And why 6 components? To our human eyes, it just looks like “3” is composed of two simple curves joined at the center.

To answer the second question (why 6 components?), look at the visualization above: there's a limitation to using Gaussian clusters. Each component is an ellipse with a pair of imaginary perpendicular axes, kind of like this:

Image Credit: https://en.wikipedia.org/wiki/Ellipse

Why is this important? Well, an ellipse with these perpendicular axes can certainly cover a "cloud" of points vaguely distributed like an oval. But that imposes a restriction: in the image above of data points shaped like a 3, the main sections of points that make up the curves of the 3 look like curved, bent "jellybeans". An ellipse, by contrast, is rigidly convex; it can't be bent into a jellybean shape without changing its equation, and changing the equation means we're no longer doing Gaussian mixture learning.

So this basically tells us that Gaussian mixture modeling, applied to MNIST in this 2D spatial configuration, can really only model "straight pieces" of the figure. It's almost as if you were asked to take the picture of a 3, break it down into straight polygon-like sections, and then draw ovals around those straight lines.

Image Credit: Author. Breaking a "3" into piecewise linear segments, then enveloping each straight line with an oval. Looks similar to the Gaussian colored clustering above, doesn't it?

So imagine that every data point that composed the 3 in the visualization above belonged to one of these ovals. As you can see, the straight lines here lie along the major axes of the ellipses, and depending on the sizes of the ellipses and the configuration of the data points, there might be more or fewer of them. Gaussian mixture learning is essentially attempting to dissect the image into these clusters, which explains why some "3"s are clustered with 6 components.

But obviously I've drawn more than 6 ellipse components in my illustration, so which digits require more or fewer than 6? To answer that, we need to visualize a histogram of the number of components. I won't show all the code here, but for this example we'll take a random sample of 300 images of the digit "3" out of roughly 6,000 in the MNIST dataset (about a 5% sample), feed them through the Gaussian learning process, and visualize the results.
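A minimal sketch of that sampling loop, assuming the pixels_to_points, add_jitter, and gaussian_learning helpers sketched earlier (the Keras MNIST loader is just one convenient option, not necessarily what the author used), might look like this:

import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.datasets import mnist  # any MNIST loader works here

(train_images, train_labels), _ = mnist.load_data()
threes = train_images[train_labels == 3]

rng = np.random.default_rng(0)
sample = threes[rng.choice(len(threes), size=300, replace=False)]

# number of Gaussian components chosen for each sampled "3"
component_counts = [gaussian_learning(add_jitter(pixels_to_points(img))) for img in sample]

plt.hist(component_counts, bins=range(min(component_counts), max(component_counts) + 2))
plt.xlabel('Number of Gaussian components')
plt.ylabel('Count of sampled "3"s')
plt.show()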

Image Credit: Author

According to this chart, and extrapolating to the whole MNIST dataset under a sampling-normality assumption, between 37% and 43% of "3"s require six components, that is, six piecewise clusters to create the image of a 3. Meanwhile, between 24% and 30% of "3"s would require five components, and between 15% and 20% would require seven.
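As a rough sanity check on that extrapolation (my own back-of-the-envelope arithmetic, not the author's): with a sample of 300 and an observed share of about 40% six-component digits, the standard error of the proportion is about 2.8 percentage points, so an interval of plus or minus one standard error lands close to the 37% to 43% range quoted above.

import math

n, p_hat = 300, 0.40                       # sample size and assumed observed share of six-component "3"s
se = math.sqrt(p_hat * (1 - p_hat) / n)    # standard error of a sample proportion
print(f"{p_hat - se:.3f} to {p_hat + se:.3f}")  # roughly 0.372 to 0.428, i.e. about 37% to 43%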

So…what can we glean from this information? Let's look back at the visual illustration of the dots. If the "3" is drawn small and compact, it doesn't require many piecewise straight lines to compose the image, and hence five or six Gaussian components should be enough. But if the "3" is drawn much larger, or with a loop in the middle (which sometimes happens), then the drawing requires more components to cover it. According to the statistics of the histogram, the average number of components is around 6 and the sample standard deviation is about 1 component, so it's not hard to find an example where seven components are necessary. In fact, here's one:

Image Credit: Author. You can count seven components here, because the person drew an extra curl on the bottom-left end of the 3 and an emphasized bend in the middle.

Final Thoughts

This isn't, by any means, necessarily a path to a new machine-learning algorithm for MNIST images. MNIST has been put through its paces by AI algorithms for ages, like an old dog running through familiar tricks. Rather, what I want to present to the reader is an alternative, spatially inspired perspective on applying machine learning techniques to handwritten digits.

This Gaussian mixture learning process won't necessarily get better results than a neural network, but the goal here isn't results; it's simply a change in perspective. Taking advantage of the spatial positioning of pixels, and translating them into data points, lets us use the visual geometry of Gaussian clustering to look at what is basically a geometrical object: a handwritten digit. It also lets us use concepts from high school and college geometry, like connecting dots on a curve piecewise with straight lines and treating them almost like the sides of a polygon. The concept isn't new; piecewise straight-line approximations are exactly what calculus uses to find the arc length of a curve! But here, reasoning geometrically about those piecewise lines gives us good intuition for the internal structure of the mixture model, and explains why some handwritten drawings of the same digit use more pieces (components) than others.

Thanks so much for reading, and stay tuned for my next article! In the meantime, feel free to check out some of my past articles on my thoughts about AI!



I recently earned my B.A. in Mathematics and I'm interested in AI's social impact & creating human-like AI/ML systems. @raveena-jay.