Write Your First Machine Learning Program in Under 50 Lines Using Scikit-Learn

Harness the power of machine learning to analyze chocolate chip cookies!

Rohan Vij
Towards Data Science


Photo by Arseny Togulev on Unsplash

Machine learning’s increasing omnipresence in the world can make it seem like a technology that is impossible to understand and implement without thorough knowledge of math and computer science. However, the truth is far from that. In a world where today’s top companies were built out of garages and FOSS (Free and Open Source Software) is everywhere you look, several community-built libraries exist to simplify the process of developing a machine learning model.

What is Scikit-Learn?

Scikit-learn is a machine learning library for the Python programming language. It is built on top of several Python libraries, including NumPy (mathematical functions), SciPy (more math!), and Matplotlib (data visualization).

If you are already familiar with some jargon in the machine learning space, you might be wondering why we are not using TensorFlow (developed by Google). While TensorFlow is also a machine learning library, it primarily focuses on deep learning and neural networks, whereas scikit-learn covers a broader range of general machine learning algorithms. Scikit-learn is also widely considered easier for beginners than TensorFlow.

Companies such as JPMorgan and Spotify use scikit-learn for tasks such as predictive analysis or song recommendations. You can see a full list of testimonials here.

Getting Started

For this tutorial, you need:

  • Python (version 3.7 or above) — basic experience recommended

(Installation tutorial: https://www.tutorialspoint.com/how-to-install-python-in-windows)

Then, install three packages using pip in your terminal:

  • pip install notebook
  • pip install numpy
  • pip install scikit-learn

Run jupyter notebook in your terminal. Your default web browser should open a tab where you see a file explorer. Simply go to the directory you wish to create your program in and then create a Python 3 Notebook (select new in the upper right-hand corner). You should now see this screen:

A screenshot of the Jupyter Notebook environment.

You can rename the file by clicking on “Untitled.”

Code Time!

Run each cell after we finish writing it by pressing the Run button at the top of the screen.

In the first cell, start by importing the libraries we need:
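The import cell itself is not reproduced here. A minimal sketch, assuming the program only needs NumPy for the data arrays and scikit-learn's MLPClassifier for the model, might look like this:

```python
# Assumed imports for this tutorial (the original cell is not shown).
import numpy as np  # arrays that hold the survey data
from sklearn.neural_network import MLPClassifier  # the neural network model
```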

Now, time for the data we will train our model on. Let’s say that we go to a cookie store and survey people based on the cookies they tried:
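The survey values below are made up for illustration, since the original data cell is not shown. This sketch assumes each row is one person's response, encoded as [taste, verdict], with 1 standing for sweet or good and 0 for bitter or bad:

```python
import numpy as np

# Hypothetical encoding: [taste, verdict]
#   taste:   1 = sweet, 0 = bitter
#   verdict: 1 = good,  0 = bad
survey = np.array([
    [1, 1],
    [0, 0],
    [1, 1],
    [1, 1],
    [0, 0],
    [0, 0],
    [1, 1],
    [0, 0],
])

print(survey.shape)  # (8, 2): eight responses, two columns
```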

Take a second to look at this data (notice the comment at the top). As humans, we quickly identify a pattern: sweet cookies were good, and bitter cookies were bad. That simple conclusion is what we will train our model to recognize.

Next, we define the features of our data and the labels of our data:

This is pretty self-explanatory — the features that we are training the model on are whether the cookie is sweet or bitter, and the label is whether the cookie is good or not.
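As a sketch of that split, assume the survey is stored as a NumPy array of [taste, verdict] rows (a hypothetical encoding where 1 means sweet or good and 0 means bitter or bad). The first column then becomes the features and the second the labels. The names features and labels are illustrative:

```python
import numpy as np

# Hypothetical survey rows: [taste, verdict]
survey = np.array([[1, 1], [0, 0], [1, 1], [0, 0]])

# scikit-learn expects features as a 2-D array (samples x features),
# so slice with 0:1 to keep the column dimension.
features = survey[:, 0:1]
# Labels are a 1-D array with one entry per sample.
labels = survey[:, 1]

print(features.shape, labels.shape)  # (4, 1) (4,)
```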

Press alt + enter to create a new cell.

Now, we need a test set to evaluate the model once it has trained on the survey data. This will give us an idea of how accurate the model is.
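One way to sketch this, assuming the same hypothetical encoding (1 = sweet or good, 0 = bitter or bad), is a small hand-written test set kept separate from the training data. The values and variable names here are invented for illustration:

```python
import numpy as np

# Hypothetical test rows: [taste, verdict]
test_survey = np.array([[1, 1], [0, 0], [0, 0], [1, 1]])

test_features = test_survey[:, 0:1]  # 2-D: one column of taste values
test_labels = test_survey[:, 1]      # 1-D: the expected verdicts
```

For larger datasets, scikit-learn's train_test_split function can hold out a random portion of the data instead of writing the test rows by hand.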

Press alt + enter to create a new cell.

Neural Networks

We will be using scikit-learn’s MLPClassifier as our model. MLP simply (or not so simply) stands for multilayer perceptron. Loosely, a multilayer perceptron is a feed-forward artificial neural network (where the inputs and outputs are 0 or 1):

A diagram of the artificial neural network used in this program.
Credit: Image by the author.

The name “neural network” is no coincidence: the nodes (or neurons) in neural networks are analogous to neurons in a human brain. If a neuron is sufficiently stimulated, it is triggered.

The hidden layers are where the magic happens (referred to as “hidden” because they cannot be viewed outside of the network):

Each neuron in a hidden layer assigns a weight to each of its inputs to reflect how important that input is. For instance, if we were to add more traits than just sweet or bitter to our data, the model would learn a weight for each of them. While saltiness and sweetness both contribute to a good cookie, saltiness might have a smaller weight (e.g., 0.2) than sweetness (e.g., 0.5) because the model found that sweetness is more important.

Each neuron also has a bias (a constant number) that is added or subtracted to offset the result of a neuron. It is a bit too complex for this tutorial, so we will skip the details.
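Without going into the full details, here is a minimal sketch of what a single neuron computes. It reuses the example weights from above (0.5 for sweetness, 0.2 for saltiness). The bias value and the step activation are assumptions chosen for illustration, not what scikit-learn uses internally:

```python
import numpy as np

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a step function."""
    total = np.dot(inputs, weights) + bias
    return 1 if total > 0 else 0  # "fires" only if sufficiently stimulated

weights = np.array([0.5, 0.2])  # [sweetness, saltiness], from the example above
bias = -0.4                     # hypothetical offset

print(neuron(np.array([1, 0]), weights, bias))  # sweet only: 0.5 - 0.4 > 0, prints 1
print(neuron(np.array([0, 1]), weights, bias))  # salty only: 0.2 - 0.4 < 0, prints 0
```

With this bias, sweetness alone is enough to trigger the neuron, while saltiness alone is not.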

The neural network we will create is not a deep learning network. A neural network with enough hidden layers is called a deep neural network; there is no strict cutoff, though more than three layers is a commonly cited threshold. Depth does not determine whether labeled data is needed. When humans label the dataset, as we did when we marked which cookies were good and which were not, the approach is known as supervised learning, and both shallow and deep networks are commonly trained this way. Unsupervised learning, on the other hand, works with unlabelled data and clusters it into groups based on characteristics the model determines on its own.

At last, the code (quite underwhelming, I know). We simply define that our network will have one hidden layer with 5 neurons and that it will loop through our data up to 3000 times.
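A minimal sketch of such a definition, assuming the 5 and the 3000 are passed through MLPClassifier's hidden_layer_sizes and max_iter parameters. The variable name mlp and the random_state argument are additions for illustration:

```python
from sklearn.neural_network import MLPClassifier

# One hidden layer with 5 neurons; train for at most 3000 passes over the data.
# random_state is an added assumption so repeated runs give the same result.
mlp = MLPClassifier(hidden_layer_sizes=(5,), max_iter=3000, random_state=0)
```

Note the trailing comma in (5,): it makes the value a one-element tuple, meaning a single hidden layer of 5 neurons. A tuple such as (5, 5) would mean two hidden layers.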

Press alt + enter to create a new cell.

Now, we will fit, or train, the model to the data we provide it. Our model will go through the data up to 3000 times (each pass is one epoch), as defined when we created our network. Training can stop earlier if the model stops improving.

We will then score our model against both the training set and the testing set, using the weights and biases it developed.
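Training and scoring can be sketched as follows. The data values are hypothetical (1 = sweet or good, 0 = bitter or bad), and the variable names and random_state are assumptions. The exact scores depend on the data and on how training goes, so treat the printed numbers as something to check rather than a guarantee:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Hypothetical data: 1 = sweet / good, 0 = bitter / bad
train_features = np.array([[1], [0], [1], [0], [1], [0]])
train_labels = np.array([1, 0, 1, 0, 1, 0])
test_features = np.array([[0], [1], [1], [0]])
test_labels = np.array([0, 1, 1, 0])

mlp = MLPClassifier(hidden_layer_sizes=(5,), max_iter=3000, random_state=0)
mlp.fit(train_features, train_labels)  # learn the weights and biases

# score() returns the fraction of correct predictions (accuracy)
train_score = mlp.score(train_features, train_labels)
test_score = mlp.score(test_features, test_labels)
print(f"Training set score: {train_score:.3%}")
print(f"Testing set score: {test_score:.3%}")
```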

Output:

Training set score: 100.000%
Testing set score: 100.000%

Press alt + enter to create a new cell.

Finally, we can use our tried and tested model to ascertain whether a cookie will be good or not. This is some code I used for testing:
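The original testing code is not shown. Below is a self-contained sketch, assuming the hypothetical encoding 1 = sweet or good, 0 = bitter or bad, and a helper named describe that is invented for this example:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Hypothetical training data: 1 = sweet / good, 0 = bitter / bad
features = np.array([[1], [0], [1], [0], [1], [0]])
labels = np.array([1, 0, 1, 0, 1, 0])

mlp = MLPClassifier(hidden_layer_sizes=(5,), max_iter=3000, random_state=0)
mlp.fit(features, labels)

def describe(taste):
    """Print the model's verdict for one cookie (1 = sweet, 0 = bitter)."""
    # predict() expects a 2-D array, even for a single sample
    prediction = mlp.predict(np.array([[taste]]))[0]
    print("Type:", "Sweet cookie" if taste == 1 else "Bitter cookie")
    print("Good cookie!" if prediction == 1 else "Bad cookie!")
    return prediction

describe(1)
describe(0)
```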

Remember the conclusion we had developed from the data? Our model successfully matched our thinking!

Type: Sweet cookie
Good cookie!


Type: Bitter cookie
Bad cookie!

Conclusion

Of course, the data we used in this model does not need a neural network, but a simple implementation with simple data lets us focus on how machine learning works. Feel free to update the cookie survey and test sets to your liking! Add more traits (crunchiness, saltiness, etc.) and put the model to the test!

Thank you for reading! I hope you thoroughly enjoyed the tutorial and are now more comfortable with the idea of machine learning.

Hi! 👋 I’m a high school student who enjoys writing about technology and astronautics.