
What I learned about human learning from machine learning

The principles in machine learning that enabled me to become a better learner. From becoming a real estate investor to learning Indonesian.


Photo by Gerd Altmann from Pixabay

(External links to books are affiliate links, thank you for your support!)

Introduction

Anyone who has dabbled in multiple domains notices how principles and meta-narratives in one area carry over to others with surprising frequency. For example, Satya Nadella (Microsoft’s CEO) often compared strategies from the sport of cricket to business strategies in his memoir about turning Microsoft around [1]. In Conscious Business, Fred Kofman cross-applies a number of Buddhist principles to corporate leadership. Stephen Covey does something similar in The 7 Habits of Highly Effective People with occasional reflections on his Mormon faith. In Fooled By Randomness, Nassim Taleb draws parallels between the principles of randomness he observed in financial markets and other aspects of life, including business.

This disquisition is of that genre. I’ve been an autodidact most of my life, but after I started working in the field of machine learning, I came to understand some amusing but useful parallels between human learning and machine learning. Now granted, despite all the hype about machine learning (better known by the moniker "artificial intelligence" in popular media), machines don’t learn the way humans do. Most of the impressive feats of "artificial intelligence" that made headlines recently are, when one pulls back the curtain Wizard-of-Oz style, little more than statistical parlor tricks in controlled environments – a far cry from anything we would consider "intelligent." No, machines cannot "think" or "learn" the way humans do. Despite being an A.I. practitioner, I remain an A.I. skeptic.

Nonetheless, some of the principles I learned from machine learning aided me in quests quite different from that field, and I’d like to share a ternion of them here. This essay is targeted at a broad audience but sprinkled with some technical expositions. Non-technical readers are encouraged to scroll quickly through those parts; the overall message will hopefully remain intact.

1. Recurrent Neural Networks and Real Estate Investing

At the beginning of 2018 I decided that I would become a real estate investor. As a total neophyte, I knew that ignorantly forking out large sums of cash made for a precarious venture. For this to succeed, I needed to intelligently fork out large sums of cash, as opposed to ignorantly.

A friend of mine inquired, "How do you know, that you know, enough?" Reading one book on a topic surely isn’t sufficient – without any contextual knowledge, I couldn’t judge whether the book was instructive or oversimplified. The same could be said about reading just two books. But what about three? How many books need to be read?

That was a reasonable question. There aren’t any good ways for small-time real estate investors to objectively self-measure their competency. And given that humans are prone to overestimate their competence in a subject (e.g. illusory superiority, also known as the above-average effect [2]), these innate psychological traps compound the danger.

This wasn’t the first time I had nerded out about a topic, read deeply about it, and afterwards at least appeared to gain some mastery of it. I wasn’t worried about my ability to acquire knowledge. I was worried about fooling myself into thinking I was more knowledgeable than I really was, and then unwittingly investing with dangerous blind spots.

Thankfully, I was able to make one of those cross-domain insights that turned out to be extremely useful, and that is what I want to share in this section.

Generating fake Shakespeare

My first foray into modern Natural Language Processing (NLP) was a toy recurrent neural network that would generate Shakespeare-sounding text. It accomplished this by processing a lot of Shakespeare text (https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt) and eventually learning to mimic it. By now, this toy example is well known enough that it appears in the TensorFlow documentation, so I won’t regurgitate the code. The interested programmer can find it here: https://www.tensorflow.org/tutorials/text/text_generation

Here is some synthetic Shakespeare from that page:

Photo by WikiImages from Pixabay

COMINIUS: I’ll carried, him in.

GREMIO: Y she had much. God as ’twere an ill. I’ll pings thee. And cet your slave, direcliful there! O, thy dear keavens and to’ no more surse!

Nurse: Happy Mangel, a pruse is again; One hooth against me; That they lives swear having but a fence, I have fish to

It’s easy to laugh at that as having old-English characteristics while being obviously fake to anyone familiar with the author. However, the model clearly has some "understanding" of what Shakespeare plays tend to look like. How is it able to do that? It uses the same technique behind autocompletion: given a string of words, which word is most likely to follow? If I give you a Shakespearean string like "I beseech ___," you can probably guess that "thee" – or perhaps a proper noun like "Horatio" or "Lucentio" – goes next. Being able to "fill in the blank" with a sensible word indicates some level of proficiency in the subject.
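To make that concrete, here is a toy sketch of the counting version of this trick (my own illustration, not the tutorial’s code, and the miniature corpus is a placeholder): tally which word most often follows each word, then "autocomplete" by picking the most frequent follower.

```python
# Toy "fill in the blank" via bigram counts.
from collections import Counter, defaultdict

corpus = "i beseech thee to hear me i beseech thee to stay i beseech you".split()

# For every word, count the words that immediately follow it.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

# Autocomplete: the most likely word after "beseech".
print(following["beseech"].most_common(1))  # [('thee', 2)]
```

A recurrent neural network learns a far richer version of these statistics, conditioning on the entire preceding string rather than on a single word.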

On the other hand, if we drop into a totally unfamiliar domain like hematology-oncology and see a sentence like

The postoperative phase of wound healing is prolonged in immunocompromised patients. Complete healing can often not be waited for, as the intensity of chemotherapy necessitates the cytotoxic ____ …

I haven’t got any clue what goes in the blank. (I copied this blurb from a random paper on Google Scholar in a field I have no domain knowledge of [3].) This serves as a tangible indicator that I lack proficiency in hematology-oncology.

A machine learning algorithm would not be able to fill in the blank convincingly without some "understanding" of the domain.

As such, I can say a machine has some level of "understanding" of a subject if it can predict what is coming next in a sequence of words – and not just single words but entire sentences and lines of thought. This "grasp" of text has a quantifiable measure called "perplexity" that the interested reader can investigate further.
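For the technically inclined, here is a minimal sketch of the computation (the probabilities below are made up for illustration): perplexity is the exponentiated average negative log-probability the model assigned to the words that actually appeared, so a lower score means the model was less "surprised" by the text.

```python
import math

# Probability the model assigned to each successive true word (made up).
word_probs = [0.20, 0.05, 0.30, 0.10]

# Perplexity = exp(average negative log-probability).
avg_neg_log_prob = -sum(math.log(p) for p in word_probs) / len(word_probs)
perplexity = math.exp(avg_neg_log_prob)
print(round(perplexity, 2))  # ~7.6; a model that always guessed right would score 1.0
```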

That was my approach to studying real estate. Keep reading books until I could predict what the author was going to say next.

Needless to say, that was a fairly expensive and time-consuming process. I read at least 20 books before I bought my first investment property. Was it worth it? Bragging about investment success is uncouth, so I’ll spare the reader. But suffice it to say, I’m still enthusiastic, active, and solvent in that endeavor today.

Could I have been more efficient with this process? Frankly, I don’t think so as I will discuss in the last section of this essay. In the meantime, let’s talk about word embeddings.

2. Word Embeddings and Learning a Foreign Language

Photo by Author. Di Bali, monyet makan pisang di atas kepalaku. Yours truly in Bali with a monkey eating a banana on my head.

Let’s fast-forward to 2020 and tell another story. A few months ago, I decided to learn Indonesian. Predictably, I downloaded Duolingo and started memorizing the simple phrases the app fed me.

Saya minum air (I drink water)

Selamat malam, sampai jumpa lagi (good evening, see you again!)

Kamu mau nasi dengan ayam goreng? (Do you want rice with fried chicken?)

Then it gets hard

And so forth. I felt I was learning pretty quickly until I hit a snag. At the beginning, the app effectively teaches you phrases, not individual words.

Somewhere around the 10th lesson, it decided to teach me a bunch of verbs:

Berhasil = succeed

Berdiri = stand up

Bersaing = compete

Berkunjung = visit

Bekerja = work

Bermain = play

The app would teach the word by providing a sentence like "I ____ with him," except in Indonesian, so it would be "saya ____ dengan dia."

But there’s hardly anything in the context to suggest that "visit", "work", or "compete" are better alternatives. This is hardly better than learning a word via flash card. And since all those words look and sound similar to my linguistically untrained eye and ear, it was a challenging way to learn.

When you learn a word in isolation, you can’t do better than rote memorization, and you aren’t studying for the task you actually care about. When using a language, you generally utter phrases, not individual words. You don’t converse with your foreign interlocutor by flipping flashcards over. So it makes a lot of sense to skip memorizing words and go straight to phrases, which Duolingo for Indonesian did well in the beginning. Learning phrases helps you learn words. But if the phrase is excessively generic, you are essentially learning words in isolation again. Phrases give meaning to words just as much as words give meaning to phrases. [4]

In fact, that’s what linguist John Firth pointed out:

You shall know a word by the company it keeps – (John Rupert Firth)

Ah, but I didn’t learn that concept by studying linguistics (I have not); I learned it because I had studied word embeddings in machine learning.

Word embeddings

Humans understand words by their relationship to real world objects or phenomena – as well as how those words are used in relation to other words. Machine learning, at least at this point in time, only understands words in relation to other words. Despite this handicap, computers can still do useful and interesting things with words, as anyone who has interacted with a chatbot or digital assistant knows.

To use an analogy, imagine you were born without sight. You could still come to understand that "the balloon is red" is a valid sentence and that "red is the balloon" is gibberish. Similarly, after listening to enough literature, you could associate "blue" with "ocean" despite lacking a visual concept of blue. To further illustrate, humans are able to understand that phrases like "notwithstanding the foregoing, provision shall be made for…" tend to appear in legal documents even though we cannot sensorially experience that phrase. Words do not need a physical reification to be understood.

So how can a machine understand words? Let’s start with the end result and work backward. Consider the following image.

Photo by Author

This plot makes sense at a glance – the machine seems to have some understanding of these words (code provided later). Words that appear in the same contexts appear closer to each other. How did the machine gain an understanding of words such that it could cluster them sensibly?

In machine learning, the relationships between words are captured with embeddings, the most famous algorithm being word2vec.

In word2vec, each word is represented as a point in 300-dimensional space (although 300 is a somewhat arbitrary choice). The plot above is the 300-dimensional space "projected" down to two dimensions. In isolation, a certain point in 300-dimensional space has almost no interpretation and is basically useless.

There are no 300-dimensional landmarks to put the word in context. However, if the computer learns how words relate to each other, it will be able to put the words into clusters that correspond with their mutual contexts. As the linguist said, the machine will know the word by the company it keeps. That words are circularly defined is no more of a hindrance to the machine than it is to a human.

If the machine learns the words properly, similar-context words will appear close to each other, as they do in the plot above. Before the machine starts learning, each word sits in an essentially random location, and the machine must iteratively rearrange the words in 300-dimensional space as it reads large volumes of text. After enough rearranging, the points will be positioned relative to each other in a meaningful way, even though their absolute positions remain meaningless.

The exact implementation details can be found elsewhere, but roughly, the algorithm takes in a lot of text and randomly blacks out a word in a sentence. Then the algorithm tries to guess the missing word. It keeps doing this over and over, and with each successive guess it updates the 300-dimensional position of the hidden word. It does this for all words in the corpus. As the words move around the 300-dimensional space, they eventually converge to clusters and structures that mimic the contexts they appear in.

A machine (or human) can’t make sense of an isolated 300-dimensional point representing a word; it only knows the word’s relationship to other words. Therefore, when learning a language as a human, you shouldn’t care as much about where exactly a word floats in thought-space as about how it relates to other words in that space.

Here is a minimal sketch of the kind of code that produces a figure like the one above; it uses gensim’s word2vec and a PCA projection, and the toy corpus and word list are placeholders (substitute a large real corpus for meaningful clusters):
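```python
# A sketch, not the original notebook: train word2vec on a toy corpus,
# then project the 300-dimensional vectors down to 2-D for plotting.
import matplotlib.pyplot as plt
from gensim.models import Word2Vec
from sklearn.decomposition import PCA

# Placeholder corpus: in practice, feed in a large body of tokenized text.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "dog", "chases", "the", "cat"],
    ["the", "cat", "chases", "the", "mouse"],
]

# vector_size=300 mirrors the 300-dimensional space discussed above.
model = Word2Vec(sentences, vector_size=300, window=2, min_count=1, epochs=200)

words = ["king", "queen", "dog", "cat", "mouse"]
vectors = [model.wv[w] for w in words]

# "Project" the 300 dimensions down to two for the scatter plot.
points = PCA(n_components=2).fit_transform(vectors)

for (x, y), word in zip(points, words):
    plt.scatter(x, y)
    plt.annotate(word, (x, y))
plt.show()
```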

The full notebook is at this link: https://github.com/jeffreyscholz/Blog-word2vec/blob/master/Blog-Word2Vec.ipynb

Word embeddings applied

In case you are wondering how I circumvented the flashcard education the app had devolved into: an Indonesian friend of mine intervened and gave me Indonesian folk songs and nursery rhymes to memorize. That music can aid memorization is a separate topic, but this method introduced words to me in the context that they keep.

Thankfully, my flatmate doesn’t understand Indonesian, so he doesn’t know I’m singing about a parrot sitting in a window or green balloons popping or other childish silliness. As a matter of fact, I’m sure I sound sophisticated to him. Sure enough, I find it much easier to recall words from memorable sentences than from contextless phrases like "I ____ with him."

Taking this methodology to its logical conclusion, I try to see Indonesian words in as many different contexts as possible so that I can learn words by the company they keep – or, to be very nerdy, so that I can iteratively refine the 300-dimensional representations of Indonesian words in my mind. Practically, this means interacting with tutors, listening to music in the language, reading different didactic books on the subject, and so forth. Of course, a polyglot won’t find this conclusion remarkable – of course one should immerse oneself in a language to learn it effectively. Ah, but I did not unearth this phenomenon as an experienced linguaphile; I rediscovered the principle by importing its analogue from machine learning.

3. Does Learning for Humans and Machines Have Irreducible Complexity?

Photo by Wikipedia, Public Domain

Going back to my journey of learning real estate – was there a less strenuous way to get started in the endeavor besides reading all those books? And speaking of painful learning, it would be nice to learn a language without so much repetition, mental strain, and social embarrassment.

But when we consider how machines learn, the process is equally barbaric, if not more so. Machine learning in practice consists of buying a bunch of expensive hardware, collecting a ton of data, burning a lot of electricity, and boom – the algorithm learns the pattern in the data. Sure, the impressive results sometimes make for a good headline, but having seen how the sausage is made, the results hardly seem mystical. As I mentioned in the introduction, I consider a lot of "artificial intelligence" to be statistical parlor tricks in controlled environments.

At various points I have mused whether there was some way to feed a massive dataset into a clever algorithm and jump straight to the model parameters that enable useful predictions.

Technical discussion about computational complexity

For example, in linear regression, the parameters needed to "understand" a dataset can often be solved "in one go" with the following formula: [5]

θ = (XᵀX)⁻¹Xᵀy

Here, X is your dataset, where each row is a datapoint, and y is the vector of labels. Inverting a matrix is reasonably quick – it can be done in O(n^2.376) time [6]. However, the "impressive" neural networks (GPT-3, Megatron, etc.) that can generate convincing essays were trained on datasets around 40 gigabytes in size, so X multiplied by its transpose would go well past gigabytes squared – that’s certainly not going to fit into any computer’s memory. Besides, we are discussing linear regression here, and neural networks are not linear. Even though the algorithm for solving linear regression analytically is polynomial in time and space, the sheer size of the datasets used in A.I. makes analytical, or "nice," solutions infeasible.
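As a quick sanity check of the formula above, here is a minimal sketch (my own, on synthetic data) that solves for the regression parameters in one go with NumPy; np.linalg.solve is used in place of an explicit inverse for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))                        # 1000 datapoints, 10 features
true_theta = rng.normal(size=10)
y = X @ true_theta + rng.normal(scale=0.1, size=1000)  # noisy labels

# theta = (X^T X)^(-1) X^T y, computed "in one go" with no iteration.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(theta, true_theta, atol=0.05))       # True: parameters recovered
```

At this scale the closed form is trivial; the point above is that it stops being an option once X is too large to hold in memory.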

Very technical discussion: It isn’t too relevant whether a problem is "NP-hard" or "polytime." One can find reasonably good approximate solutions to NP-hard problems – the USPS delivers packages efficiently despite the traveling salesman problem being NP-hard. Conversely, a problem requiring O(n³) time to solve exactly isn’t tractable when n is large. Even in the relatively innocent case of O(n²) space requirements, the example above shows the data won’t even fit in a computer when the dataset is large. In both cases, an approximate solution that respects real-world resource constraints is needed – but this necessarily creates a computational bottleneck.

This example with linear regression illustrates the larger matter at hand.

Implications of computational complexity

I won’t dive into computational learning theory here, as I think it would unnecessarily distract from the point I’m trying to make. But just looking at mathematics, computational complexity, and statistics, we can show that there are many classes of problems for which there is no "nice" way to learn statistical patterns. That is, given a dataset we want a machine to understand, we can show mathematically that there does not exist an elegant or efficient way for the machine to discover the patterns. The only way to discover the pattern is to start with a wild guess and then very slowly improve that guess by painstakingly iterating over the data again and again. This becomes a very slow and expensive process when there is a lot of data.

If we can show that there is no mathematically elegant and efficient way for a computer to learn from a large dataset, then we shouldn’t get frustrated with ourselves as humans when we take a long time to learn a subject of interest. The pain of reading a stack of real estate books, or of trying to understand a foreign language, comes to mind.

There seems to be encoded into the universe (as described by mathematics as we currently understand it) a fundamental law: distilling large bodies of information into succinct, describable patterns requires expending a minimum amount of energy proportional to the size of that data – whether compute cycles for a machine or study-hours for a human. For fairly large "datasets" like real estate or a foreign language, that minimum energy can be quite large. There is just no way to get around it. Just as it is not possible to comparison-sort an arbitrary array in less than O(n log n) time, it isn’t possible to "understand" a dataset without expending a certain amount of energy.

This of course poses some philosophical questions about the limits of how much knowledge humans can acquire. But in a practical sense, I find it comforting. When I encounter a topic that is taking me a long time to learn, I don’t need to attribute the difficulty entirely to my dull wit; I can also attribute it to this mysterious metaphysical property of the universe: learning is often provably difficult in a mathematical sense.

Conclusion

Again, none of this is to say that machine learning is anything like human learning. The legendary computer scientist Edsger Dijkstra pointed out that "the question of whether a computer can think is no more interesting than the question of whether a submarine can swim." As I pointed out in the introduction, authors have written extensively about the parallels between such disparate fields as sports, religion, probability, and business. I believe the parallels between machine learning and human learning are like that: quite disparate, but sometimes interesting and illuminating.

Have you seen any other parallels between machine learning and human learning? Please share in the comments!

P.S. I plan to follow up with an essay on knowledge distillation. That was originally going to be a section inside of this one, but it made the entire piece too long. Stay tuned!

Footnotes

[1] Links are affiliate links, thank you for your support!

[2] Illusory Superiority. Accessed 2020 https://en.wikipedia.org/wiki/Illusory_superiority

[3] Arne Simon et al. Wound Care With Antibacterial Honey (Medihoney) in Pediatric Hematology-Oncology. https://www.researchgate.net/profile/Kai_Santos2/publication/7684399_Wound_care_with_antibacterial_honey_Medihoney_in_pediatric_hematology-oncology/links/564c74d008aeab8ed5e990b6.pdf

[4] I’m resisting the urge to discuss Wittgenstein’s language games here, but the interested can read further here: https://en.wikipedia.org/wiki/Language_game_(philosophy)

[5] I would like to credit the post here, from which I derived this illustration: https://stats.stackexchange.com/questions/23128/solving-for-regression-parameters-in-closed-form-vs-gradient-descent

[6] Dr. Math. Complexity of Matrix Inversion. Accessed 2020 http://mathforum.org/library/drmath/view/51908.html

