
How To Run Logistic Regression On Aggregate Data In Python

3 Simple Solutions that Every Data Scientist Should Know

Photo by Leif Christoph Gottwald on Unsplash

I will show you three techniques for handling aggregate data in Python when you want to perform a logistic regression.

Let’s create some dummy data.

import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Dummy data: 200 observations with categorical Gender and Age,
# and a binary Response drawn from a Bernoulli(0.2) distribution
df = pd.DataFrame(
    {
        "Gender": np.random.choice(["m", "f"], 200, p=[0.6, 0.4]),
        "Age": np.random.choice(["[<30]", "[30-65]", "[65+]"], 200, p=[0.3, 0.6, 0.1]),
        "Response": np.random.binomial(1, 0.2, size=200),
    }
)

df.head()
Gender      Age  Response
0      f  [30-65]         0
1      m  [30-65]         0
2      m    [<30]         0
3      f  [30-65]         1
4      f    [65+]         0

Logistic Regression on Non-Aggregate Data

First, we will run a logistic regression model on the non-aggregate data. We will use statsmodels because it is the same library we will use for the aggregated data, which makes the models easier to compare, and because it can give us a model summary in the classic statistical style of R.

_Tip: If you don’t want to convert your categorical data into binary dummy variables yourself before performing a logistic regression, you can use the statsmodels formula API instead of scikit-learn._
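For contrast, here is a minimal sketch of the scikit-learn route (assuming scikit-learn is installed): you have to one-hot encode the categoricals yourself, and you must switch off the default L2 penalty to reproduce the plain maximum-likelihood fit.

from sklearn.linear_model import LogisticRegression

# One-hot encode the categoricals manually (statsmodels formulas do this for us)
X = pd.get_dummies(df[['Gender', 'Age']], drop_first=True)

# penalty=None requires scikit-learn >= 1.2 (older versions use penalty='none');
# without it, the default L2 regularization shifts the coefficients
clf = LogisticRegression(penalty=None).fit(X, df['Response'])
print(clf.intercept_, clf.coef_)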

model = smf.logit('Response ~ Gender + Age', data=df)
result = model.fit()
print(result.summary())

Logit Regression Results                           
==============================================================================
Dep. Variable:               Response   No. Observations:                  200
Model:                          Logit   Df Residuals:                      196
Method:                           MLE   Df Model:                            3
Date:                Mon, 22 Feb 2021   Pseudo R-squ.:                 0.02765
Time:                        18:09:11   Log-Likelihood:                -85.502
converged:                       True   LL-Null:                       -87.934
Covariance Type:            nonrobust   LLR p-value:                    0.1821
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       -2.1741      0.396     -5.494      0.000      -2.950      -1.399
Gender[T.m]      0.8042      0.439      1.831      0.067      -0.057       1.665
Age[T.[65+]]    -0.7301      0.786     -0.929      0.353      -2.270       0.810
Age[T.[<30]]     0.1541      0.432      0.357      0.721      -0.693       1.001
================================================================================
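Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to communicate:

# Odds ratios and their 95% confidence intervals
print(np.exp(result.params))
print(np.exp(result.conf_int()))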

Logistic Regression on Aggregate Data


1. Logistic Regression Using Responders and Non-Responders

In the following code, we group the data and create columns for the responders (Yes) and non-responders (No). With the statsmodels formula API, a binomial GLM can take a two-column response of successes and failures, written as Yes + No on the left-hand side of the formula.

grouped = (
    df.groupby(['Gender', 'Age'])
      .agg({'Response': ['sum', 'count']})
      .droplevel(0, axis=1)
      .rename(columns={'sum': 'Yes', 'count': 'Impressions'})
      .eval('No = Impressions - Yes')
)
grouped.reset_index(inplace=True)
grouped
Gender      Age  Yes  Impressions  No
0      f  [30-65]    9           38  29
1      f    [65+]    2            7   5
2      f    [<30]    8           25  17
3      m  [30-65]   17           79  62
4      m    [65+]    2           12  10
5      m    [<30]    9           39  30

glm_binom = smf.glm('Yes + No ~ Age + Gender', data=grouped,
                    family=sm.families.Binomial())
result_grouped = glm_binom.fit()
print(result_grouped.summary())

Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:          ['Yes', 'No']   No. Observations:                    6
Model:                            GLM   Df Residuals:                        2
Model Family:                Binomial   Df Model:                            3
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -8.9211
Date:                Mon, 22 Feb 2021   Deviance:                       1.2641
Time:                        18:15:15   Pearson chi2:                    0.929
No. Iterations:                     5                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       -2.1741      0.396     -5.494      0.000      -2.950      -1.399
Age[T.[65+]]    -0.7301      0.786     -0.929      0.353      -2.270       0.810
Age[T.[<30]]     0.1541      0.432      0.357      0.721      -0.693       1.001
Gender[T.m]      0.8042      0.439      1.831      0.067      -0.057       1.665
================================================================================
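As a quick sanity check, the fitted model can return the predicted response rate for each group directly (a small sketch; calling predict on the grouped frame returns the fitted probabilities):

# Fitted response probability for every Gender/Age combination
print(result_grouped.predict(grouped))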

2. Logistic Regression with Weights

For this method, we need to create a new column with the response rate of every group.

grouped['RR'] = grouped['Yes'] / grouped['Impressions']
glm = smf.glm('RR ~ Age + Gender', data=grouped,
              family=sm.families.Binomial(),
              freq_weights=np.asarray(grouped['Impressions']))
result_grouped2 = glm.fit()
print(result_grouped2.summary())

Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                     RR   No. Observations:                    6
Model:                            GLM   Df Residuals:                      196
Model Family:                Binomial   Df Model:                            3
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -59.807
Date:                Mon, 22 Feb 2021   Deviance:                       1.2641
Time:                        18:18:16   Pearson chi2:                    0.929
No. Iterations:                     5                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       -2.1741      0.396     -5.494      0.000      -2.950      -1.399
Age[T.[65+]]    -0.7301      0.786     -0.929      0.353      -2.270       0.810
Age[T.[<30]]     0.1541      0.432      0.357      0.721      -0.693       1.001
Gender[T.m]      0.8042      0.439      1.831      0.067      -0.057       1.665
================================================================================
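A closely related option in statsmodels is var_weights. A sketch, for comparison: it reproduces the same point estimates here, but the log-likelihood and degrees of freedom are book-kept differently, so the summaries are not identical.

# Same model with var_weights instead of freq_weights
glm_var = smf.glm('RR ~ Age + Gender', data=grouped,
                  family=sm.families.Binomial(),
                  var_weights=np.asarray(grouped['Impressions']))
print(glm_var.fit().summary())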

3. Expand the Aggregate Data

Lastly, we can "ungroup" the data and transform the dependent variable back into a binary column, so we can perform a logistic regression as usual.

# Replace each count with a list of 0s/1s of that length, then combine them
grouped['No'] = grouped['No'].apply(lambda x: [0] * x)
grouped['Yes'] = grouped['Yes'].apply(lambda x: [1] * x)
grouped['Response'] = grouped['Yes'] + grouped['No']

# Explode the lists into one row per individual observation
expanded = grouped.explode('Response')[['Gender', 'Age', 'Response']]
expanded['Response'] = expanded['Response'].astype(int)

expanded.head()
Gender      Age Response
0      f  [30-65]        1
0      f  [30-65]        1
0      f  [30-65]        1
0      f  [30-65]        1
0      f  [30-65]        1

model = smf.logit('Response ~ Gender + Age', data=expanded)
result = model.fit()
print(result.summary())

Logit Regression Results                           
==============================================================================
Dep. Variable:               Response   No. Observations:                  200
Model:                          Logit   Df Residuals:                      196
Method:                           MLE   Df Model:                            3
Date:                Mon, 22 Feb 2021   Pseudo R-squ.:                 0.02765
Time:                        18:29:33   Log-Likelihood:                -85.502
converged:                       True   LL-Null:                       -87.934
Covariance Type:            nonrobust   LLR p-value:                    0.1821
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       -2.1741      0.396     -5.494      0.000      -2.950      -1.399
Gender[T.m]      0.8042      0.439      1.831      0.067      -0.057       1.665
Age[T.[65+]]    -0.7301      0.786     -0.929      0.353      -2.270       0.810
Age[T.[<30]]     0.1541      0.432      0.357      0.721      -0.693       1.001
================================================================================
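As a side note, the same expansion can be done in a vectorized way. Here is a minimal sketch (it recomputes the numeric counts from df, since the Yes/No columns of grouped were turned into lists above; counts and expanded_alt are just illustrative names):

# Recompute the numeric counts per group
counts = (df.groupby(['Gender', 'Age'])['Response']
            .agg(Yes='sum', Impressions='count')
            .reset_index())

# Repeat each group's row once per impression, then fill in 1s followed by 0s
expanded_alt = counts.loc[counts.index.repeat(counts['Impressions']),
                          ['Gender', 'Age']].reset_index(drop=True)
expanded_alt['Response'] = np.concatenate(
    [[1] * yes + [0] * (imp - yes)
     for yes, imp in zip(counts['Yes'], counts['Impressions'])]
).astype(int)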

Conclusions

All four models produce the same coefficients and p-values.

In my experience, getting raw data for a project is not that common; in most cases, we are dealing with aggregated/grouped data. These techniques will help you handle such data easily, which is why I think they are a great addition to your Python toolkit.

If you are using R, you can read this very useful post.


I am going to write more beginner-friendly posts in the future. Follow me on Medium or visit my blog to stay informed about them.

I welcome questions, feedback, and constructive criticism and can be reached on Twitter, LinkedIn, or Instagram.


Originally published at https://predictivehacks.com

