How to explain your coefficients so anyone can understand them
Though I briefly summarize linear regression and logistic regression below, this post focuses on the models’ coefficients. For more background on these models in general, see this [introduction to logistic regression](https://towardsdatascience.com/introduction-to-logistic-regression-66248243c148).
When I used to work at a restaurant, the beginning of every shift was marked by the same conversation amongst the staff: how busy we were going to be and why. Is it a holiday weekend? How’s the weather? What big sports events are scheduled?
Though we spoke with the authority of behavioral economists, our predictions were based more on anecdotal evidence and gut feeling than on data. But we were on to something. Certainly there was a relationship between any number of factors (e.g. weather) and how busy we were going to be. We were, in our own way, conducting a little folk regression analysis.
Regression Analysis
Regression analysis seeks to define the relationship between a dependent variable (y) and any number of independent variables (X).

In linear regression, the y variable is continuous (i.e., it has an infinite set of possible values). In logistic regression, the y variable is categorical (and usually binary), but the logit function makes it possible to model the y variable as if it were continuous.
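As a point of reference, the logit is just the natural log of the odds; here is a minimal sketch of it, mapping probabilities from the (0, 1) interval onto the whole real number line:

```python
import numpy as np

def logit(p):
    # Log odds: maps a probability in (0, 1) onto the entire real line
    return np.log(p / (1 - p))

logit(0.5)   # 0.0 -- even odds
logit(0.9)   # ~2.2
logit(0.1)   # ~-2.2 (symmetric around p = 0.5)
```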
In either linear or logistic regression, each X variable’s effect on the y variable is expressed in the X variable’s coefficient. Though both models’ coefficients look similar, they need to be interpreted in very different ways, and the rest of this post will explain how to interpret them.

_(I will be using sklearn’s built-in `load_boston` housing dataset for both models. For linear regression, the target variable is the median value (in $1,000s) of owner-occupied homes in a given neighborhood; for logistic regression, I split the y variable into two categories, with median values over $21k labelled "1" and median values under $21k labelled "0.")_
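If you want to follow along, here is a minimal setup sketch, assuming an older scikit-learn release (`load_boston` was deprecated in 1.0 and removed in 1.2); `y_class` is my name for the binarized target used in the logistic section:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2
from sklearn.linear_model import LinearRegression, LogisticRegression

boston = load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = boston.target                  # median home value in $1,000s

# Binary target for logistic regression: 1 if median value > $21k, else 0
y_class = (y > 21).astype(int)
```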
Linear Regression Coefficients
First, let’s look at the more straightforward coefficients: linear regression. After instantiating and fitting the model, use the `.coef_` attribute to view the coefficients.
```python
linreg = LinearRegression()
linreg.fit(X, y)       # continuous target: median home values
linreg.coef_           # one coefficient per feature
```
I like to create a pandas DataFrame that clearly shows each independent variable alongside its coefficient:
```python
pd.DataFrame(linreg.coef_,
             index=X.columns,
             columns=['coef']).sort_values(by='coef', ascending=False)
```

As I said, interpreting linear regression coefficients is fairly straightforward, and you would verbally describe the coefficients like this:
"For every one-unit increase in [X variable], the [y variable] increases by [coefficient] when all other variables are held constant."
So for the variable RM (average number of rooms per house), this means that as the average number of rooms increases by one unit (think "5" to "6"), the median value of homes in that neighborhood increases by ~$6,960 when all other variables are held constant. On the other hand, as the concentration of nitric oxide increases by one unit (measured in parts per 10 million), the median value of homes decreases by ~$10,510.
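If you want to sanity-check this reading, here is a quick sketch using the `X` and `linreg` objects from above: bump RM by exactly one unit for a single row, and the prediction moves by exactly the RM coefficient.

```python
row = X.iloc[[0]].copy()            # any single neighborhood works
base_pred = linreg.predict(row)[0]

row['RM'] += 1                      # one-unit increase, all else held constant
bumped_pred = linreg.predict(row)[0]

# The change in the prediction equals the RM coefficient exactly
print(bumped_pred - base_pred)
print(linreg.coef_[X.columns.get_loc('RM')])
```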
Logistic Regression Coefficients
Logistic regression models are instantiated and fit the same way, and the `.coef_` attribute is also used to view the model’s coefficients. (Note: for logistic regression, `.coef_` is a two-dimensional array, with shape (1, n_features) for a binary problem, so you will need `.coef_[0]` to put the coefficients into a DataFrame.)
```python
logreg = LogisticRegression(max_iter=1000)  # extra iterations help convergence on these unscaled features
logreg.fit(X, y_class)                      # binary target from the setup above
log_odds = logreg.coef_[0]

pd.DataFrame(log_odds,
             index=X.columns,
             columns=['coef']).sort_values(by='coef', ascending=False)
```

However, logistic regression coefficients aren’t as easily interpreted. This is because logistic regression uses the logit link function to "bend" our line of best fit, converting our classification problem into a regression problem.
Because of the logit function, logistic regression coefficients represent the change in the log odds that an observation is in the target class ("1") for a one-unit increase in the corresponding X variable. Thus, these log-odds coefficients need to be converted to regular odds in order to make sense of them. Happily, this is done by simply exponentiating the log-odds coefficients, which you can do with `np.exp()`:
```python
odds = np.exp(logreg.coef_[0])   # exponentiate log odds to get odds multipliers

pd.DataFrame(odds,
             index=X.columns,
             columns=['coef']).sort_values(by='coef', ascending=False)
```

Now these coefficients are beginning to make more sense, and you would verbally describe the odds coefficients like this:
"For every one-unit increase in [X variable], the odds that the observation is in (y class) are [coefficient] times as large as the odds that the observation is not in (y class) when all other variables are held constant."
So, as the variable RM (again, average number of rooms) increases by one unit, the odds that the houses represented in the observation are in the target class ("1") become over 6x as large as the odds that they are not. On the other hand, as the concentration of nitric oxide increases by one unit, the odds of being in the target class are multiplied by only ~0.15. For odds less than 1 (our negative log-odds coefficients), we can take 1/odds to make even better sense of them. So as nitric oxide increases by one unit, the odds that the house is NOT in the target class become 1/0.15, or 6.66x (ominous!), as large as the odds that it IS.
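Here is a small sketch of that inversion trick, using the `odds` array from above (the exact numbers will depend on your fitted model; ~0.15 comes from the run described here):

```python
nox_odds = odds[X.columns.get_loc('NOX')]
print(nox_odds)       # odds multiplier for being in the target class
print(1 / nox_odds)   # flipped: odds multiplier for NOT being in the target class
```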
The End
Simple enough! But as I was first learning how to model data using linear and logistic regression, this difference in the models’ coefficients was not clear to me, and I wasn’t quite sure how to verbally explain coefficients from a logistic regression model. I hope this helps students who are new to these concepts understand how to interpret coefficients in linear and logistic regression.