
Mathematical modeling and machine learning can often feel like difficult topics to explore and learn, especially for those unfamiliar with computer science and mathematics. It surprises me to hear from my non-STEM friends that they feel overwhelmed trying to use basic modeling techniques in their own projects and that they get caught up in the terminology of the field. This is a shame, because linear modeling can be very helpful in a number of instances, and with all the open-source code on the internet, implementing your own model has never been easier. Hence, here is my no-frills guide to understanding and implementing a basic linear regression model in Python.
Contents:
- What is Linear Regression?
- How to Prepare Your Data for Linear Regression
- Performing Linear Regression
- Interpreting the Results
- Final Thoughts and Further Reading
What is Linear Regression?

Linear regression is a form of mathematical modeling commonly used to evaluate the relationship between a dependent variable, such as weight, and an independent variable, such as height. Our brains do this naturally, just in a less precise manner. If I asked you to decide who the heavier person is between someone who is 6’2" and someone who is 5’2", you would likely pick the 6’2" person. Sure, the 5’2" person could weigh more, but I’ll bet that in your experiences interacting with people, you have established some sort of a relationship between the height of the person and the weight of the person. Linear regression is just a precise and mathematical way of creating this relationship and drawing meaning from it.
So how does it work?
Linear regression works by creating a line of best fit. The line of best fit is the line that best captures the relationship between the variable on the x-axis and the variable on the y-axis. For example, that relationship could be that as "X" increases, "Y" also increases:

Alternatively, the relationship could be that as "X" increases, "Y" decreases.

Establishing the general direction of the trend is easy enough in the examples above; however, depending on the data, it can become much more complex. Furthermore, the precise details of the line can be hard to calculate by hand. In many cases, having the exact equation of the line is very helpful, allowing us to understand the relationship between two variables and to estimate the value of one variable based on the value of the other.
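If you are curious what that exact equation looks like in code, here is a minimal sketch using NumPy's polyfit on a handful of made-up height and weight values (we will fit a proper model to a real NBA dataset with SKLearn below):
import numpy as np #NumPy gives us polyfit and array handling
#Toy example values, not real measurements
heights = np.array([160.0, 170.0, 180.0, 190.0, 200.0]) #cm
weights = np.array([60.0, 72.0, 80.0, 91.0, 100.0]) #kg
#Fit a degree-1 polynomial, i.e. a straight line, to the points
slope, intercept = np.polyfit(heights, weights, 1)
print("slope:", round(slope, 2), "intercept:", round(intercept, 2))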
How to Prepare Your Data for Linear Regression:
For linear regression to work effectively, you’ll need at least two things: a variable you think might be dependent, such as the weight in kilograms of an NBA player, and a variable you think might influence the dependent variable, such as the height in centimeters of an NBA player.

Linear regression works best if both of these variables are continuous. By continuous I mean that the variable can take on any value within a range: someone can weigh 151 lbs, 152 lbs, 151.5 lbs, 151.72 lbs, and so on. This is unlike discrete or categorical variables such as movie ratings in stars or letter grades given in a classroom. There are other techniques for handling those types of data; however, we will be focusing on linear regression for now.
Height and weight are two perfect examples of continuous variables between which a linear relationship can be established. If you are following along in Python, make sure both of your continuous variables are stored as floats; this will help with later steps.
If you are interested in trying this out with an already cleaned dataset you can follow along with the NBA dataset I am using, which is found here.
To load in the data, I recommend the Pandas package for Python:
import pandas as pd #Load the Pandas package
df = pd.read_csv("archive/all_seasons.csv") #Read the NBA file
df.head() #Display the NBA file's data
The output should look like the table above.
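If you want to double-check that both columns are stored as floats (as mentioned in the previous section), Pandas can tell you directly. A quick, optional check using the column names from this NBA dataset:
#Check the data types of our two columns of interest
print(df[["player_height", "player_weight"]].dtypes)
#If either column were not a float, it could be converted like so:
#df["player_height"] = df["player_height"].astype(float)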
Performing Linear Regression:
So now that we have our data loaded in let’s take a look at the relationship between the weight and height of the NBA players:
df.plot.scatter("player_height","player_weight", figsize=(15,10))

Apart from a few outliers, we can already see that there is a clear positive correlation between the height of the player and the weight of the player. As we explained above, linear regression is like drawing a line from the left of the plot to the right of the plot that best follows the relationship of the data. For our NBA example, we can guess that a line of best fit would start somewhere around the 60 kg mark and head towards the top right corner of the plot. The problem is that we humans are nowhere near precise enough to draw the line that would perfectly capture the trend of the data. Instead, let's use a tool.
Scikit-Learn, or SKLearn, is a Python package with a variety of machine learning tools, including one for building linear regression models in a simple and effective manner. To use SKLearn we need to isolate our two variables from the Pandas dataframe:
from sklearn import linear_model
#By calling to_numpy() we convert the series into a numpy array
#We then reshape the numpy array into a format parsable for sklearn
X = df["player_height"].to_numpy().reshape(-1, 1)
y = df["player_weight"].to_numpy().reshape(-1, 1)

As you can see, the "X" array contains all of the heights and the "y" array contains all of the weights. Now we can fit the model. Fitting the model, in this case, means we are presenting the data to the function and allowing SKLearn to find the line that best captures the relationship between "X" and "y".
#First we call the linear regression function from SKLearn linear_model
#Then using this object we fit the data to a linear model.
lm = linear_model.LinearRegression()
model = lm.fit(X,y)
Now that the model is fitted, let's take a look at what it came up with.
Interpreting the Results:

With our model fitted, it's time for us to take a look at what it has established from the data we provided. First, let's look at the parameters it has estimated from the data:
print(model.coef_) #prints the slope of the line
[1]: [[1.13557995]]
print(model.intercept_) #prints the intercept of the line
[2]: [-127.40114263]
For those familiar with math, you might remember the equation of a line, y = mx + b. In this case, "b" is the intercept, which can be thought of as where the line crosses the y-axis, and "m" is the slope of the line. So for our fitted linear regression model, the equation would roughly be y = 1.13x - 127.4. This means that for every one unit increase in "x", "y" increases by about 1.13, or rather, for every cm taller a player is, the weight of the player should increase by roughly 1.13 kg. Visually, if we plot this line over the scatterplot of player height and weight we get:
import numpy as np #numpy can be used for creating an array of example values
line_X = np.arange(150,250) #create a range of example heights in cm
#plot the scatterplot, then draw the fitted line over it in black
ax = df.plot.scatter("player_height","player_weight", figsize=(15,10))
ax.plot(line_X, model.predict(line_X.reshape(-1,1)), color='k')

In this case, the black line is the line we have fitted to our data. Based on this line, we can surmise that a player at 180 cm will roughly weigh around 70 kg. However, rather than eyeballing the plot, we can use SKLearn and the model we have created to get a precise estimate:
model.predict(np.array(180).reshape(-1,1))
[3]: array([[77.00324856]])
Hence, a player at 180 cm should roughly weigh 77 kg.
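As a sanity check, we can recover the same number by plugging 180 cm into the line equation ourselves, using the slope and intercept we printed earlier:
#y = mx + b, computed by hand from the fitted parameters
manual_estimate = model.coef_[0][0] * 180 + model.intercept_[0]
print(manual_estimate) #roughly 77, matching model.predict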
Now that the model is trained, you can try this on any value. Here would be my NBA weight:
model.predict(np.array(188).reshape(-1,1))
[4]: array([[86.08788817]])
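You can also pass several heights at once; each row of the input array produces one estimated weight:
#Predict weights (in kg) for three different heights in one call
model.predict(np.array([[180], [188], [200]]))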
Final Thoughts
Whilst this is just a trivial example, linear regression can be very useful for a number of tasks and projects and hence should be as accessible as possible to everyone. My full code for this project can be found below and I encourage you to try it out on your own. Furthermore, I have included some additional readings below if you want to learn more.
import pandas as pd
import numpy as np
from sklearn import linear_model
df = pd.read_csv("archive/all_seasons.csv")
X = df["player_height"].to_numpy().reshape(-1, 1)
y = df["player_weight"].to_numpy().reshape(-1, 1)
lm = linear_model.LinearRegression()
model = lm.fit(X,y)
model.predict(np.array(188).reshape(-1,1))
Wow! Just 9 lines of code.
Further Reading
- Yale course page that goes into a bit more depth: http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm
- A Youtube video with a great explanation: https://www.youtube.com/watch?v=zPG4NjIkCjc&ab_channel=statisticsfun
A special thanks to Justinas Cirtautas for supplying the dataset.
All images used are either created by myself or used with the explicit permission of the authors. Links to the author’s material are included under each image.