
Introduction to Artificial Intelligence, Machine Learning, and Deep Learning with TensorFlow

A starting point into Machine Learning and TensorFlow

Wildflowers in Montana. Image by author

This is the first part of a long series I hope to do on advanced Machine Learning. TensorFlow recently came out with 2.x, and its integration with Keras makes it a really easy-to-use and functional framework to learn. At this point, TensorFlow and PyTorch are pretty comparable, so learning either will serve you really well in participating in the Artificial Intelligence renaissance.

This isn’t meant to be a mathematical walkthrough, but a practical one – one that will help you get started with TensorFlow yourself. That said, I will add explanations of certain concepts along the way if you wish to dig deeper.

This article will assume you know some Python.

Table of Contents

  1. A New Programming Paradigm
  2. What’s going on here?
  3. Predicting Median Value of Homes in Boston House Prices Dataset

A New Programming Paradigm

To get on the same page, here are some definitions clarified.

What are Artificial Intelligence, Machine Learning, and Deep Learning?

Artificial Intelligence: the simulation of human intelligence by computer systems. AI can include hardware and software systems, and it focuses on 3 cognitive processes: learning, reasoning, and self-correction. As a society, we’re currently at the base form of AI – Artificial Narrow Intelligence. This essentially means that AI is mainly a phenomenal pattern matcher for complex, unstructured datasets, and because of this, its most common applications are natural language processing, computer vision, and speech recognition.

Machine Learning: often referred to as a subfield of AI, Machine Learning is the practice of learning from examples seen in data. It takes examples with answers and learns the rules (patterns) that yield those answers given the data. ML models are built on top of a foundation of statistics, ML optimizers (how they learn those patterns) are built on calculus, and efficient ML programming is built on linear algebra.

Deep Learning: a subfield of Machine Learning; it is the practice of constructing Neural Networks with multiple layers. Common use cases of deep learning are image classification, time series forecasting, and fraud detection.

Note that Deep Learning just represents a set of methods related to Neural Networks for complex datasets; they are not a silver bullet for ML and will not be the best modeling choice in every scenario. A big reason for the excitement around them is how well they have been able to learn rules that take data and produce answers for a variety of unstructured tasks.

Before unpacking more of what’s going on with Deep Learning and Neural Networks, we’ll start with an example.

Starting Simple

Let’s say you wanted to predict the price of a 7-bedroom house. Imagine house pricing were as simple as 50k plus 50k per bedroom, so that a 1-bedroom house costs 100k, a 2-bedroom house costs 150k, and so on. You go out, collect data on a handful of houses, and record each one’s bedroom count and price.

To create stability in our model, we could appropriately scale the data down to individual digits (and interpret the answer in hundreds of thousands).
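The exact values aren’t reproduced here, but scaled data following the 50k-plus-50k-per-bedroom rule might look like this in NumPy (the array names bedrooms and prices, and the specific values, are my own illustration):

```python
import numpy as np

# Bedroom counts and prices scaled down to single digits
# (prices are in hundreds of thousands: 1.0 == $100k)
bedrooms = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
prices   = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0], dtype=float)
```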

Image by author

Building the Simplest Neural Network
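The original code block isn’t reproduced above, but a single-layer, single-neuron Keras model along the lines the text describes would look roughly like this (a sketch, not necessarily the exact code behind the screenshot):

```python
import tensorflow as tf
from tensorflow import keras

# One Dense layer with one neuron, taking a single input feature (bedroom count)
model = keras.Sequential([
    keras.layers.Dense(units=1, input_shape=[1])
])

# Print a summary of the (very small) network
model.summary()
```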

This prints out:

Image by author

Essentially, we have one layer in the network with one node/neuron. Dense defines a fully connected layer of neurons, and Sequential stacks successive layers (in this case we only have one neuron and one layer).

Allowing the Neural Network to Learn

So now the network is created, but we need to provide it two things:

  1. A way to learn
  2. What to learn on

The compile line includes two really important concepts:

  1. The Optimizer: the method by which the neural network traverses the field of possibilities as it searches for the set of patterns that best represents the data.
  2. The Loss function: the metric that tells the neural network whether it’s headed in the right direction or not.
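As a hedged sketch, the two lines being described might look like this, assuming the model object from the previous snippet and the bedrooms/prices arrays from earlier:

```python
# Compile: choose the optimizer (how to search) and the loss (how to score a guess)
model.compile(optimizer='sgd', loss='mean_squared_error')

# Fit: run the training loop over the collected data for 500 epochs
model.fit(bedrooms, prices, epochs=500)
```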

Think of the model we built as traversing a set of hills, trying to find the lowest point in the landscape. It can take small steps in sequence or giant leaps (or use other, smarter techniques) – this is the job of the optimizer. ‘sgd’ stands for Stochastic Gradient Descent, which takes each step based on a randomly sampled subset of the data, adding some noise to the path it follows downhill. That randomness is really useful because it helps the model avoid getting stuck in a local minimum that isn’t actually the bottom of the entire landscape, the global minimum.

The loss function penalizes the model for taking a step in the wrong direction, which is how it knows whether it’s getting closer to the bottom – also called convergence – or not. Mean Squared Error is a common loss function for Regression tasks (which this is, since we’re predicting a continuous value). It takes the residual error (actual minus predicted), squares it, sums those squared errors across all the values, and divides by the count of values to get a single metric.

The second line fits the model on the data we collected and runs for 500 epochs. Epochs represent how many times the model will go through the training loop. The training loop is the process of the model "guessing" a value and then measuring how close that is to the actual value and iterating on that until it gets as good as it can.

Predicting for a New Value

Now we can predict for 7 bedrooms.
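A sketch of that prediction call, reusing the assumed names from the earlier snippets:

```python
import numpy as np

# Predict the (scaled) price of a 7-bedroom house
print(model.predict(np.array([7.0])))
```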

[3.9685223]

You may be a bit confused: our data maps directly to a linear equation, y = 0.5x + 0.5, and if we enter 7 for x, we should get 4 – so why did we get ~3.97?

This is because the basis of ML models, including Neural Networks, is probability. These models don’t operate in the world of certainty, and we actually don’t want them to. The whole goal of Machine Learning is to build models that can accurately and ethically generalize to unseen data. That isn’t possible if the model "definitively" identifies patterns – it needs to factor in probabilities. Additionally, we only trained the model on 9 data points, which heavily influences how close it can get to the "true formula" behind the data.

The whole goal of Machine Learning is to build models that can accurately and ethically generalize to unseen data.

What’s going on here?

Neural Network Structure

A Neural Network is composed of 3 types of layers:

  1. Input Layer: In our case, we only had bedrooms, but a new node would be added for each additional feature as we expand our dataset (zip code, square footage, wealth, etc.)
  2. Hidden Layer: We only had one hidden layer with one neuron in it, but the neuron functionality stays the same for Deep NNs. Each neuron finds some linear combination of its features and then applies a non-linear activation function to it.
  3. Output Layer: This was our house price. It can also be a categorical output layer, but then we would want to measure via a different loss function.

A Deep Neural Network (or Multi-layer Perceptron), the bread and butter of Deep Learning, stacks multiple Hidden Layers.

Image by author

Neural Networks need at least these things to function well (generalize to new data appropriately):

  1. Lots of data (hence the obsession with Big Data)
  2. Sensible activation function
  3. Effective and efficient optimization algorithm
  4. Appropriate loss function

When we talk about lots of data, we typically mean data with large volume and variety. If the model has any hope of learning the examples effectively and generalizing to new data, it needs a representative set of examples to learn from.

The technical details of how each optimization algorithm and loss function works are a bit out of scope of this story, but I’ll cover their purpose and common choices.

Activation Functions

Activation functions are a way each node can non-linearly transform the data to find more complex patterns. Here are some of the most popular:

Sample of activation functions. Wikipedia

A good way to determine which is best for your model is to know your data and your objective well. For example, ReLU is a common choice, and we’d likely want to use it if we were to build a full-on House Price Prediction model – but why? ReLU clamps values below 0 to 0, which models our output, price, really well, because it wouldn’t make sense for our model to predict a negative price for a house.
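In Keras, the activation is just an argument on the layer. For instance, a ReLU-activated output layer for a price model could be declared like this (an illustrative sketch, not code from the article):

```python
from tensorflow import keras

# ReLU clamps anything below 0 to 0, so this layer can never output a negative price
price_output = keras.layers.Dense(units=1, activation='relu')
```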

Fun fact: if you recognize a familiar name (Logistic) in the activation functions, that’s not by accident. Logistic Regression is a fantastic example of what a Neural Network is doing, at the simplest scale. It takes a variety of inputs, linearly combines them to learn their respective weights, and then smashes them into a sigmoid function to yield an output bounded between 0 and 1.

Optimization Algorithm

The way Neural Networks learn is through their optimization technique. Gradient Descent is the main algorithm here, but variants of GD have been proposed and they work incredibly well. Here are a few of the popular ones:

  1. Stochastic Gradient Descent
  2. RMSProp
  3. Adam

Essentially, each one proposes a different approach to traversing an expansive landscape to find the lowest point – the global minimum – and depending on your use case, different optimizers may be more valuable.
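In Keras, swapping optimizers is a one-argument change, and the optimizer classes expose knobs like the learning rate. The snippet below is illustrative (it assumes a model object like the earlier sketches, and the learning rates are arbitrary):

```python
from tensorflow import keras

# String shorthand uses each optimizer's default settings
model.compile(optimizer='sgd', loss='mean_squared_error')

# Explicit objects let you tune things like the learning rate
model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=0.001), loss='mean_squared_error')
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss='mean_squared_error')
```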

The red line is an example of an optimization alg trying to find the minima. Wikipedia

Loss Functions

Loss functions are largely dictated by the task at hand. If we’re predicting a real-valued number (Regression), the common loss function is Mean Squared Error. If we’re predicting a categorical variable (Classification), the common loss function is Binary Cross Entropy (and Sparse Categorical Cross Entropy for multiclass classification).
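In Keras, the task dictates the loss argument passed to compile; a quick sketch (assuming some already-built model object):

```python
# Regression: predicting a real-valued number
model.compile(optimizer='adam', loss='mean_squared_error')

# Binary classification: two classes
model.compile(optimizer='adam', loss='binary_crossentropy')

# Multiclass classification with integer labels
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
```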

But what’s actually happening?

As the model propagates information forward through the network (forward propagation), it guesses an output, and the loss measures how far off that guess is. After each forward pass, it propagates that error backward to update the weights and be "smarter" the next time around (backpropagation). This is the functional idea behind Gradient Descent and how math so beautifully ties all these pieces together.
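The core weight update behind Gradient Descent is small enough to write out directly. Here is a minimal NumPy sketch of the update for a single-feature linear model under MSE loss (all names and values are illustrative):

```python
import numpy as np

def gradient_step(w, b, x, y, lr=0.05):
    """One gradient descent update for predictions y_hat = w*x + b under MSE loss."""
    y_hat = w * x + b                # forward pass: make a guess
    error = y_hat - y                # how far off the guess is
    dw = 2 * np.mean(error * x)      # derivative of the loss with respect to w
    db = 2 * np.mean(error)          # derivative of the loss with respect to b
    return w - lr * dw, b - lr * db  # step downhill

w, b = 0.0, 0.0
x = np.array([1.0, 2.0, 3.0])
y = 0.5 * x + 0.5                    # the "true" rule from the earlier example
for _ in range(1000):
    w, b = gradient_step(w, b, x, y)
print(w, b)                          # approaches 0.5 and 0.5
```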

Image by author
The derivatives learned along the way updating weights. Image by author

Predicting Median Value of Homes in Boston House Prices Dataset

Instead of using a contrived list of 9 values, we can use the Boston House Prices dataset (description here) and attempt to build a deeper neural network to assess performance.

Image by author

We’re predicting the ‘MEDV’ value, so we separate our dataset into X and y and then use sklearn’s train_test_split() to split the data into training and testing sets via a 75/25 split. This allows the model to be fitted on training data and then validated on the testing data, which is how we create a model capable of generalizing to new, unseen data.
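A sketch of that split, assuming the dataset has already been loaded into a pandas DataFrame named df with a 'MEDV' column:

```python
from sklearn.model_selection import train_test_split

# Separate the features (X) from the target we want to predict (y)
X = df.drop('MEDV', axis=1)
y = df['MEDV']

# 75/25 train/test split; random_state just makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
```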

What if we wanted to have the model stop training when the accuracy (or any other metric we desire) hit a certain point? Essentially, we don’t need the model to keep training for all the epochs we set if it is able to learn the patterns in the dataset much sooner, according to the metric we choose to measure.

We can set up a callback, like the one below, to stop training when the mean squared error drops below 1. Since the target values are in $1000s, that roughly means the model’s typical prediction error is under about $1,000.
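One way to write such a callback (a sketch that assumes the model is compiled with MSE as its loss, so logs['loss'] is the training MSE):

```python
import tensorflow as tf

class StopAtLowMSE(tf.keras.callbacks.Callback):
    """Stop training once the training MSE drops below a threshold."""

    def __init__(self, threshold=1.0):
        super().__init__()
        self.threshold = threshold

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        if logs.get('loss', float('inf')) < self.threshold:
            print(f"\nMSE dropped below {self.threshold} at epoch {epoch}; stopping training.")
            self.model.stop_training = True
```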

We can also easily add multiple layers to expand our network; this, in combination with our activation functions, allows each layer to learn far more intricate patterns (a sketch of a deeper model follows below). If adding more layers enables more complex patterns to be learned, why not just overload the network with endless layers and nodes? There are at least a few reasons to be deliberate in your design choices:

  1. Overfitting can easily happen when you overcomplicate the model architecture. If the model has the capacity to find patterns in anything, it easily picks up noise as signal too, and that hinders its ability to generalize to unseen data.
  2. Computational costs are a real thing. Training large scale networks with a ton of data takes up a lot of compute resources.
  3. Occam’s Razor is a fundamental pillar of Machine Learning. Simpler solutions that are easier to understand, design, and explain are usually the better implementation choice. This obviously doesn’t hold for every single problem, but it does for most and definitely does for this one.

Occam’s Razor, "the principle of parsimony or law of parsimony is the problem-solving principle that ‘entities should not be multiplied beyond necessity’…"

Furthermore, we can easily change the optimizer by changing ‘sgd’ to ‘adam’. This can have effects on the result, so if you’re experimenting, make sure to try out multiple optimizers.
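Putting the pieces together, a deeper network compiled with 'adam' might look roughly like this (the layer sizes are arbitrary illustrative choices, and StopAtLowMSE is the callback sketched earlier):

```python
from tensorflow import keras

# Two hidden ReLU layers feeding a single linear output for the price
model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=[X_train.shape[1]]),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dense(1)
])

model.compile(optimizer='adam', loss='mean_squared_error')

history = model.fit(X_train, y_train, epochs=500, callbacks=[StopAtLowMSE(threshold=1.0)])
```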

Image by author

The model seems to do decently on the training data, but it’s certainly not the best. Let’s evaluate on the test data.
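Evaluation on the held-out set is a single call (again assuming the names from the earlier sketches):

```python
test_mse = model.evaluate(X_test, y_test)
print(f"Test MSE: {test_mse:.2f}")
```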

Image by author

The model does noticeably worse on the testing set, and it’s clear overfitting is occurring. A likely reason is that our model was far too complicated for a dataset of roughly 500 rows to start with (and the training set had only 75% of that). On top of that, we didn’t apply common techniques to alleviate overfitting in Neural Networks (dropout, regularization, etc.). Not to mention that Gradient Boosted Trees typically perform far better on structured data, but that’s for a different day.

If we did want to improve this model on this dataset, here are some key things we could try:

  1. Hyperparameter tuning: There are a lot of hyperparameters that could significantly affect model performance. Some key ones to name are learning rate, batch size, number of nodes, etc.
  2. Changing Optimization Algorithm: Stochastic Gradient Descent may be a better choice for this than Adam, but it could also be a tuning of Adam that’s needed.
  3. Cross Validation: A technique that rotates which portion of the data is held out for evaluation, allowing the model to be assessed on unseen data more rigorously. It also, consequently, gives the model less data to train on in each fold.
  4. Dropout/Regularization/other overfitting techniques: Each of these could really assist in penalizing the model for learning the noise in our dataset too closely.

Disclaimer: I recently (after publishing) found out that this dataset casually assumes that people prefer to buy housing in racially segregated neighborhoods; had I known this prior I wouldn’t have used it for the example. Please be cautious when using this dataset going forward for uses other than teaching ethical practice of the craft.


Tune in Next Time

That’s it for today! This was a dive into the "deep" end of Machine Learning that deals with Neural Networks and TensorFlow, but it really is just the beginning and the world is genuinely so fascinating.

If you’re interested in learning about Computer Vision, Natural Language Processing, Sequence Models, Graph Machine Learning, and more then be sure to follow/subscribe!


