Welcome to Part 2 of this series of articles about learning Python and Data Science on-the-job, or without formal education. Part 1 talked about some of my experience over the past 10 years of learning both on-the-job and in formal education settings. If you’re interested in philosophy of learning and some ideas for how to motivate yourself to get started, feel free to check it out. Or if you’re like me and learn best by doing concrete examples hands-on, read on!
Problem Formulation
The files for this can all be found on my Github. However, I’d encourage you to do this completely from scratch by following along with the code blocks and screenshots provided here.
For this case study, we’ll perform a simple linear regression. We have two categories of input data, and based on those inputs, we want to train a linear model to predict an output, based on actual observed data. In the data.csv file, these inputs are called x1 and x2, and the observed data are called y. The model will be in the form Ax1 + Bx2 + C = y. You may notice that x2 = x1². This is intentional, and as you progress in data science, you may want to keep this little trick handy: you can create additional model inputs by simply squaring or taking the log of already existing inputs (inputs are more commonly referred to as features in data science).
Setup
To start, open both an Excel spreadsheet and a Jupyter notebook. Normally, you might start by copying and pasting the raw data immediately to the Excel file, but for this particular problem, we’re going to do something first that mirrors what you’ll commonly do in Python. You will need the solver add-in to solve this problem. If you’ve never used the solver add-in, follow the instructions here. Enabling the add-in gives you additional functionality within Excel that doesn’t come standard.
While you rarely do this in Excel, you will almost always do something similar in Python. You enable additional functionality by importing libraries: behind-the-scenes code that lets you execute more powerful and efficient commands. You do this by typing the line import [library_name].
This tells Python that you will be using the specified library. You can optionally give the library a shortened name. For example, you can write import pandas as pd. Any time you use some functionality from the pandas library, you can simply type pd instead of pandas. While you could name the library anything you want, you will quickly notice that most packages have common abbreviations.
Many libraries come pre-installed when you download Python, just as Excel already has buttons that allow you to make plots or perform mathematical functions. You will likely have no problems importing pandas (for data manipulation) and matplotlib (for plotting). You may need to pip install the scikit-learn (or sklearn) library, though, just as you had to do some special work to get the Excel Solver add-in (sklearn is coincidentally going to be used in a similar capacity in this exercise as the Excel Solver add-in). Your first block of code should look like this:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
The line starting with from sklearn looks a little different. That’s because sklearn is a vast library (take a look around their website and documentation), and we will only use a small fraction of it. So what we’re telling Python in that line is to import only that specific functionality from scikit-learn, not all of it. It may be a rabbit hole, but note that the following two code blocks do the same thing:
import sklearn
lr = sklearn.linear_model.LinearRegression()
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
Loading Data
That’s a lot of work to get things set up, compared to Excel, but it’s one of the elements that makes Python so much more versatile. Now, you’ll load your data into your program. In Excel, you can simply copy and paste from the data.csv file. In Python, you can load it as a dataframe (think of an Excel sheet on steroids). Your next line of code should be:
df = pd.read_csv('data.csv')
This line tells pandas to read the values in the data.csv file and store them in the df (short for dataframe) variable. Make sure that the data.csv file is in the same directory as your Jupyter notebook; otherwise, you will have to specify the path to the file.
You may be frustrated that you don’t actually see what’s going on with each line of code. As visual creatures, this is perhaps a drawback of programming for us. However, as you write code, you can readily display outputs. For example, type df and execute it (press Ctrl+Enter) in a new code block and see what happens. Then try df.head(). Finally, try df.head(3). What do you notice about each one? This is where the versatility and efficiency of coding starts to show up in comparison to spreadsheets. The code is minimalistic but powerful, and once you get past the initial lack of visualization, you may find coding far preferable. As a side note, you’ll also appreciate it much more when you start dealing with datasets that have millions of entries; running operations on those datasets in Excel really bogs it down, while code continues to run smoothly (until you get to truly enormous datasets).
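If you want to check what you found, here is a sketch with a stand-in dataframe (the real data.csv has its own values; these are made up):

```python
import pandas as pd

# Stand-in for the real data.csv
df = pd.DataFrame({'x1': [i / 10 for i in range(10)]})
df['x2'] = df['x1'] ** 2
df['y'] = 2 * df['x1'] + 1

print(df)          # the entire dataframe
print(df.head())   # just the first 5 rows (the default)
print(df.head(3))  # just the first 3 rows
```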
Model Setup
We’ve been moving pretty slowly so far, but hopefully this next portion really highlights the advantages, versatility, and speed of coding in comparison to spreadsheets. Let’s walk through the problem setup in Excel first, then see how we can do the same thing in just a few lines of Python code.
In Excel, we’ll find the coefficients for our model by setting up a sum of squares system. Create a new box of cells to track coefficients, and make a guess value of the coefficients. You can probably set the guesses all to 1 to start, but sometimes your guess does have a great effect on the outcome (Python is more convenient and doesn’t require you to provide an initial guess, although it does give you the option). Then program a new column of cells with your model, as seen in the screenshot, to make a model prediction.
Finally, generate a new column, called the "squared error" column, calculated as shown below. The final step before opening up Solver is to sum all of the squared errors – hence the "sum of squares" objective (the squared l2-norm of the error vector). You can do this by going to cell E23 and typing the formula =SUM(E2:E22).
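For comparison, that same sum-of-squares objective is a couple of lines in Python. This sketch uses made-up observed and predicted values rather than the spreadsheet’s columns:

```python
import numpy as np

# Made-up observed values and model predictions
y = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.2])

squared_errors = (y - y_pred) ** 2  # the "squared error" column
sse = squared_errors.sum()          # the single cell Solver minimizes
print(sse)
```

This is exactly the quantity Solver will drive down by adjusting A, B, and C (and what scikit-learn minimizes for us behind the scenes).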
We finally have the problem set up. I don’t know your proficiency in Excel or if you’ve used Solver before, but the point of going about the exercise this way is to show you how much simpler this process is in Python. This isn’t a tutorial on Excel Solver, linear regression, or why sum of squares works (although I could go into that!) so I’m not going to go into any more details here. We can get everything set up in Python in just 3 lines:
X = df[['x1','x2']].values
y = df['y'].values
model = LinearRegression()
Model Solution
Let’s take a step back and remember what we’re after (we’re so close!). We want to develop a model that allows us to predict a value y based on two input features, x1 and x2. We hypothesized that the model is a linear regression of the form Ax1 + Bx2 + C = y. It seems like we’ve taken a roundabout way to do that, but we’re one step away. In Excel, open up the Solver dialog box, and fill it out as follows (make sure especially that the checkbox constraining values to be positive is unchecked). Run the solver, and you’ll see it all change on your screen. You will see your values of A, B, and C in the corresponding cells.
We’ll come back to that yellow box and show something similar in Python. But to set up the problem in Python, you can do all this work by writing a single line of code:
model.fit(X,y)
Again, the most noticeable difference is that you don’t see anything change in Python. But you actually already have the solution. If you dig into the documentation, you’ll find that you can output the values. For a linear regression, you can retrieve them with a few print statements, as follows:
print('Coefficient A:',model.coef_[0])
print('Coefficient B:',model.coef_[1])
print('Coefficient C:',model.intercept_)
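If you want to convince yourself that these attributes really are the A, B, and C from the model form, you can rebuild a prediction by hand. This sketch uses made-up data generated from known coefficients (A = 2, B = 0.5, C = 1), so the fit should recover them exactly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data following A*x1 + B*x2 + C = y with A=2, B=0.5, C=1
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 4.0], [3.0, 9.0]])
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + 1.0

model = LinearRegression()
model.fit(X, y)

A, B = model.coef_
C = model.intercept_

# model.predict is just A*x1 + B*x2 + C under the hood
manual = A * 1.5 + B * 2.25 + C
print(np.isclose(manual, model.predict([[1.5, 2.25]])[0]))
```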
Your values should match! So real quick, back to that yellow box. It’s basically asking if your A, B, and C coefficients should all be positive. Sometimes that’s important to do, especially if you’re modeling a system that has real physical meaning and the coefficients are constrained by natural phenomena to be positive. If you poke around the documentation for scikit-learn’s LinearRegression, you’ll discover that you can pass an argument when you initialize the model to do the same thing. It looks like this:
model = LinearRegression(positive=True)
The point of that detour is to show you one of the most unintuitive things about programming: the options are there, you just have to find them! There’s no easy visual checkbox like in Excel, but they’re there! The default value that it tells you in the documentation is comparable to whether the box is checked or unchecked in Excel when you open up Solver.
Model Predictions
Great, so now we have a working model. How do we make predictions? Let’s say we want to know the predicted value for x1 = 0.65 and x2 = 0.65² = 0.4225. In Excel, you need to put the values in some new cells, then program the equation into another cell to get the answer, just like in the screenshot below.
In Python, you can do the exact same thing by typing the following code:
x1_predict = 0.65
x2_predict = x1_predict ** 2  # careful not to use the ^ symbol to square values!
X_predict = [[x1_predict, x2_predict]]  # double brackets: predict expects a list of samples
y_predict = model.predict(X_predict)
print(y_predict[0])
This probably looks a little more tedious, since we have to type variable names. But here’s another interesting little fact: did you know that you can give Excel cells unique variable names? It’s the same as defining a Python variable, and then using it in a future equation. Google "excel give cell a variable name" or something similar, then you can rewrite your equation like you see in the screenshot. I almost never do this in practice, but since this article is all about comparing Excel to Python, hopefully this gives you a better sense of what you’re doing.
Plotting Results
Here is an area where Excel may seem much better than Python, but that’s only because there’s a user interface to interact with. To make customizations in Python, you have to type a line of code.
I won’t go through how to make a plot in Excel – you’re likely plenty competent in that. In Python, we’ll use matplotlib, but be aware that there are many other options to explore, such as plotly, seaborn, and altair. I think it’s time to turn you loose, so I won’t walk through every line of this code. Instead, take this as an exercise for you to try and understand what each line does. Then, look at the documentation, and see if you can change some of the inputs to make the plot your own!
yp = model.predict(X)  # model predictions for every row of the data
plt.plot(df['x1'], df['y'], '.', label = 'raw data')
plt.plot(df['x1'], yp, label = 'model prediction')
plt.xlabel('x1', size = 14)
plt.ylabel('y', size = 14)
plt.legend(fontsize = 12, loc = 'lower right')
plt.show()
Summary
You made it! One of your first complete programs, starting from scratch and finishing with usable results, is under your belt. Hopefully you got a lot of insight into how to learn programming without wasting time on a 4-hour tutorial that goes in one ear and out the other. The cool thing is that this activity, which maybe took longer than you’d like, will be something you can whip out in 5 minutes within a month or two. It took me less time to develop the whole thing in both Excel and Python than it takes to read this article!
If I could summarize what I anticipate the biggest struggle will be, it’s this: we are visual creatures, and programming isn’t visual. Excel is easy because there are buttons and a graphical user interface to work with. With programming, you have to create visualizations on-the-fly. As you’re learning, you can never go wrong with extra print statements, data plots, data tables, etc. Even as a more advanced programmer, you may still find yourself going through new data in Excel to get a quick feel for it. That’s totally fine! However, I hope you, like me, end up with a strong preference for doing things in Python – not only because it’s more powerful and versatile, but also because it becomes easier!
As always, you can connect with me on LinkedIn, and feel free to follow me on Towards Data Science to see my regular posts on data science case studies. I’d be happy to hear if some types of post are more useful or interesting than others. Until next time!