Machine Learning: Use Small Words

I like the word heterodox. It is a cool way of saying “thinking outside the box, and against the mainstream”. It describes well how Hinton and others moved forward with deep learning research when few researchers believed in the approach. Big words like heterodox get me thinking about using smaller words and simpler words to describe what we do.

There is a big gap between what we do in deep learning and artificial intelligence, and what people think we do. At a recent machine learning event, a senior research scientist from another company gave me a hard time for giving no credit to Marvin Minsky and others from his generation, for kicking off the field of artificial intelligence many decades ago. Clients don’t care about this stuff. Personally, I prefer Alan Turing, but his point was fair. I usually think about Hinton, Bengio, LeCun, and so on, starting in the 1990s. The people I admire are those who dreamed when all hope was lost: guys like Hinton, Nick Bostrom, and Elon Musk, who toiled in obscurity. We techie types tend to come to the dance with one technology, head straight for the spiked punch, and immediately ditch our date for the hottest tech in the room. There are valid reasons for operating this way, but it can be a bit brutal. Outside of academia, if you get too nostalgic and loyal to old technologies, you become irrelevant very quickly. Just ask your blacksmith.

In the machine learning era, history is moving almost too fast to keep up with, and our clients rely on us as consultants to distill the key messages and capabilities, rather than to give them historical context. I spend a lot of my time just learning the newest stuff.

I have been consumed by the progress in deep learning these past few years, and now the “old” wisdom of orthodox mathematics and statistics is creeping back into the mainstream after being beaten and bloodied by the unexpected success of deep learning over all other techniques. The “real” mathematicians are taking back control of the car. I see improvements in machine learning from here on out as following a distribution model similar to software compilers. The math guys will make new magic boxes we can use in frameworks like Keras, similar to how the contributors to the gcc compiler expose interfaces to high-level languages and abstract away all the lower-level stuff.

As it turns out, progress is moving in more than one direction. There are new kinds of neural architectures (e.g. Hinton’s capsule network vs. the convolutional net), but also state-of-the-art “old” models emerging as upgrades to our existing ones. The Bayesian neural network (BNN), for example, propagates uncertainty through the network, rather than relying only on the gradient of the loss at the network output. This can be much more powerful than a standard deep neural network (DNN), because we can make claims about the probability of things, rather than only talking about how well the model fits the data. The problem with only minimizing loss in a DNN is that it can lead to overfitting the training data, rather than generalizing beyond it. Data science has invented ways of fighting overfitting, but the basic idea is that backpropagating the gradient of a scalar loss does not give us enough information to say how certain a deep learning system is in its answer. This leaves open the still real possibility of big mistakes, such as adversarial pixel attacks on deep learning systems.
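To make "claims about the probability of things" a bit more concrete, here is a minimal sketch. It does not build a BNN; it uses Bayesian linear regression as the simplest stand-in I can think of, because the posterior has a closed form and needs only NumPy. The function names, the prior precision `alpha`, the noise variance, and the toy data are all assumptions made up for illustration.

```python
import numpy as np

def features(x):
    """Map scalar inputs to [1, x] so the model has an intercept and a slope."""
    return np.column_stack([np.ones_like(x), x])

def fit_bayesian_linear(X, y, alpha=1.0, noise_var=0.04):
    """Closed-form posterior over weights for y = X @ w + noise.

    Assumes a Gaussian prior w ~ N(0, I / alpha) and a known noise variance.
    """
    precision = alpha * np.eye(X.shape[1]) + X.T @ X / noise_var
    cov = np.linalg.inv(precision)
    mean = cov @ X.T @ y / noise_var
    return mean, cov

def predict(X_new, mean, cov, noise_var=0.04):
    """Return the predictive mean and standard deviation for each row of X_new."""
    pred_mean = X_new @ mean
    # Weight uncertainty contributes x^T cov x; observation noise is added on top.
    pred_var = noise_var + np.sum((X_new @ cov) * X_new, axis=1)
    return pred_mean, np.sqrt(pred_var)

# Toy data (hypothetical): a noisy line observed only on [-1, 1].
rng = np.random.default_rng(0)
x_train = rng.uniform(-1.0, 1.0, 20)
y_train = 0.5 + 2.0 * x_train + rng.normal(0.0, 0.2, size=20)

mean, cov = fit_bayesian_linear(features(x_train), y_train)

# One query inside the training range, one far outside it.
x_new = np.array([0.0, 5.0])
mu, sigma = predict(features(x_new), mean, cov)
for x, m, s in zip(x_new, mu, sigma):
    print(f"x = {x:4.1f}: prediction = {m:6.2f} +/- {s:.2f}")
```

The point to notice is that the model reports a wider spread at x = 5, far from anything it has seen, than at x = 0. A network trained only to minimize a scalar loss would hand back a single number in both cases.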

Examples of a good model with some error (left) and an overfitting model with lower error (right). Credit:
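The same contrast can be reproduced in a few lines. This is a rough sketch with made-up data, not the code behind the figure: a simple polynomial and a very flexible one are fit to the same ten noisy points, and the error is measured both on those points and on the underlying curve. The quadratic `true_fn`, the noise level, and the polynomial degrees are assumptions chosen for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

def true_fn(x):
    """Hypothetical ground truth: a simple quadratic."""
    return 1.0 + 0.5 * x - 2.0 * x**2

def mse(model, x, y):
    """Mean squared error of a fitted model on the points (x, y)."""
    return float(np.mean((model(x) - y) ** 2))

rng = np.random.default_rng(0)

# Ten noisy training points, plus a dense noise-free grid for held-out error.
x_train = np.linspace(-1.0, 1.0, 10)
y_train = true_fn(x_train) + rng.normal(0.0, 0.2, size=x_train.shape)
x_test = np.linspace(-1.0, 1.0, 200)
y_test = true_fn(x_test)

# Degree 2 matches the true curve; degree 9 can pass through all ten points.
simple = Polynomial.fit(x_train, y_train, deg=2)
flexible = Polynomial.fit(x_train, y_train, deg=9)

for name, model in [("degree 2", simple), ("degree 9", flexible)]:
    print(f"{name}: train MSE = {mse(model, x_train, y_train):.4f}, "
          f"held-out MSE = {mse(model, x_test, y_test):.4f}")
```

The degree 9 fit always wins on training error, since it can thread every training point, noise included. On the held-out grid it will typically do worse than the simple fit, which is the gap that a low training loss hides.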

Like the GAN, the BNN is an improvement over the DNN because it models real-world stuff, which is nondeterministic and follows probability density curves: a probability distribution. The joint probability of variables is old hat compared to the newer deep learning stuff, and yet by providing an uncertainty measurement, the BNN provides a better model than simply minimizing or observing the model’s error. I guess the heterodoxy of deep learning has its limits, and the pendulum is now swinging back in favor of the pure math guys (and gals), who will apply a huge arsenal of classic math approaches to beat deep learning back down from the overbearing position it has taken on. I think in the grand scheme of things, the new approaches will augment, rather than diminish, existing deep learning techniques. It’s an exciting time.

Take a step back from this breakneck progress in basic R&D, and think about how we can express the ideas coming from the research community in less complicated language. As an engineer working on deep learning technology, I posit that it is time for large organizations to look at deep learning not as a tool for modeling data, but as a component of a cognitive computing strategy. This comprehensive approach to data includes fusing internal corporate data with external open source data, and then generating insights using workflows and deep machine learning. To do all this, you don’t need big fancy words. Instead, you need to generate insights from data. I saw a nice quote on LinkedIn calling this “self-aware data”. Somehow many companies have forgotten that even though machine learning is excellent at learning from data, they still need to understand the data and turn it into insights and actions, not just numbers.

Enterprises adopting machine learning should consider that ML is one small part of a bigger enterprise toolbox. “Stuff” like BNNs represents new wrenches in the ML compartment of that toolbox. But don’t get distracted by the shiny wrenches. The point you care about is using the toolbox to fix leaks and build fast cars. Nobody cares what wrench you used when you cross the finish line.

It’s always great when you develop a new model that can synthesize information you did not know. Rather than just fitting the data, you learn something new and unexpected. I plan to write more about cognitive computing and how it is changing the way business gets done.

Until then, happy coding!

CTO & Co-founder @
