I will show you three techniques that will help you deal with aggregate data in Python when you want to perform a Logistic Regression.
Let’s create some dummy data.
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# 200 dummy observations: a Gender, an Age band, and a ~20% baseline Response
# np.random.seed(42)  # uncomment for reproducible draws
df = pd.DataFrame(
    {
        'Gender': np.random.choice(["m", "f"], 200, p=[0.6, 0.4]),
        'Age': np.random.choice(["[<30]", "[30-65]", "[65+]"], 200, p=[0.3, 0.6, 0.1]),
        'Response': np.random.binomial(1, 0.2, size=200)
    }
)
df.head()
Gender Age Response
0 f [30-65] 0
1 m [30-65] 0
2 m [<30] 0
3 f [30-65] 1
4 f [65+] 0
Logistic Regression on Non-Aggregate Data
First, we will run a Logistic Regression model on the non-aggregated data. We will use statsmodels because it is the same library we will use for the aggregated data, which makes the models easier to compare. statsmodels also produces a model summary in the classic statistical style you may know from R.
Tip: If you don't want to convert your categorical data into binary columns just to perform a Logistic Regression, you can use the statsmodels formula API instead of scikit-learn; the formula handles the categorical variables for you.
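For contrast, here is a minimal sketch of the scikit-learn route (assuming scikit-learn is installed; on versions older than 1.2, use penalty='none' instead of penalty=None):

from sklearn.linear_model import LogisticRegression

# scikit-learn needs the categorical columns encoded as dummy variables first
X = pd.get_dummies(df[['Gender', 'Age']], drop_first=True)
y = df['Response']
sk_model = LogisticRegression(penalty=None).fit(X, y)  # no penalty, to mirror plain MLE

With the formula API, none of this encoding is needed, as the next model shows.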
model = smf.logit('Response ~ Gender + Age', data=df)
result = model.fit()
print(result.summary())
Logit Regression Results
==============================================================================
Dep. Variable: Response No. Observations: 200
Model: Logit Df Residuals: 196
Method: MLE Df Model: 3
Date: Mon, 22 Feb 2021 Pseudo R-squ.: 0.02765
Time: 18:09:11 Log-Likelihood: -85.502
converged: True LL-Null: -87.934
Covariance Type: nonrobust LLR p-value: 0.1821
================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept -2.1741 0.396 -5.494 0.000 -2.950 -1.399
Gender[T.m] 0.8042 0.439 1.831 0.067 -0.057 1.665
Age[T.[65+]] -0.7301 0.786 -0.929 0.353 -2.270 0.810
Age[T.[<30]] 0.1541 0.432 0.357 0.721 -0.693 1.001
================================================================================
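The coefficients are on the log-odds scale. If you prefer odds ratios, you can exponentiate the estimates and their confidence intervals:

# Odds ratios and 95% confidence intervals from the fitted model
print(np.exp(result.params))
print(np.exp(result.conf_int()))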
Logistic Regression on Aggregate Data
1. Logistic Regression Using Responders and Non-Responders
In the following code, we group our data and create columns for the responders (Yes) and the non-responders (No).
grouped = (
    df.groupby(['Gender', 'Age'])
      .agg({'Response': ['sum', 'count']})   # responders and total impressions
      .droplevel(0, axis=1)
      .rename(columns={'sum': 'Yes', 'count': 'Impressions'})
      .eval('No = Impressions - Yes')        # non-responders
)
grouped.reset_index(inplace=True)
grouped
Gender Age Yes Impressions No
0 f [30-65] 9 38 29
1 f [65+] 2 7 5
2 f [<30] 8 25 17
3 m [30-65] 17 79 62
4 m [65+] 2 12 10
5 m [<30] 9 39 30
# The left-hand side 'Yes + No' tells statsmodels the response is binomial
# counts: successes (Yes) and failures (No) for every row.
glm_binom = smf.glm('Yes + No ~ Age + Gender', data=grouped, family=sm.families.Binomial())
result_grouped=glm_binom.fit()
print(result_grouped.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: ['Yes', 'No'] No. Observations: 6
Model: GLM Df Residuals: 2
Model Family: Binomial Df Model: 3
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -8.9211
Date: Mon, 22 Feb 2021 Deviance: 1.2641
Time: 18:15:15 Pearson chi2: 0.929
No. Iterations: 5
Covariance Type: nonrobust
================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept -2.1741 0.396 -5.494 0.000 -2.950 -1.399
Age[T.[65+]] -0.7301 0.786 -0.929 0.353 -2.270 0.810
Age[T.[<30]] 0.1541 0.432 0.357 0.721 -0.693 1.001
Gender[T.m] 0.8042 0.439 1.831 0.067 -0.057 1.665
================================================================================
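As a quick sanity check, the fitted GLM can give us the expected response rate of every group (a small sketch; the Predicted_RR column name is just for illustration):

# Predicted response rate per Gender/Age group
preds = result_grouped.predict(grouped)
print(grouped[['Gender', 'Age']].assign(Predicted_RR=preds))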
2. Logistic Regression with Weights
For this method, we need to create a new column with the response rate of every group.
grouped['RR'] = grouped['Yes'] / grouped['Impressions']   # response rate of every group
glm = smf.glm('RR ~ Age + Gender', data=grouped,
              family=sm.families.Binomial(),
              freq_weights=np.asarray(grouped['Impressions']))
result_grouped2=glm.fit()
print(result_grouped2.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: RR No. Observations: 6
Model: GLM Df Residuals: 196
Model Family: Binomial Df Model: 3
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -59.807
Date: Mon, 22 Feb 2021 Deviance: 1.2641
Time: 18:18:16 Pearson chi2: 0.929
No. Iterations: 5
Covariance Type: nonrobust
================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept -2.1741 0.396 -5.494 0.000 -2.950 -1.399
Age[T.[65+]] -0.7301 0.786 -0.929 0.353 -2.270 0.810
Age[T.[<30]] 0.1541 0.432 0.357 0.721 -0.693 1.001
Gender[T.m] 0.8042 0.439 1.831 0.067 -0.057 1.665
================================================================================
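As a side note, recent statsmodels versions also accept var_weights for proportion responses. A sketch of the equivalent call is below; the coefficients should match, although the log-likelihood and degrees of freedom are reported differently than with freq_weights:

# Alternative: var_weights instead of freq_weights (statsmodels >= 0.9)
glm_vw = smf.glm('RR ~ Age + Gender', data=grouped,
                 family=sm.families.Binomial(),
                 var_weights=np.asarray(grouped['Impressions']))
print(glm_vw.fit().summary())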
3. Expand the Aggregate Data
Lastly, we can "ungroup" our data and transform the dependent variable back into binary so we can perform a Logistic Regression as usual.
# Turn every count into a list of 0/1 outcomes, then explode to one row per impression
grouped['No'] = grouped['No'].apply(lambda x: [0] * x)
grouped['Yes'] = grouped['Yes'].apply(lambda x: [1] * x)
grouped['Response'] = grouped['Yes'] + grouped['No']   # list concatenation per row
expanded = grouped.explode('Response')[['Gender', 'Age', 'Response']]
expanded['Response'] = expanded['Response'].astype(int)
expanded.head()
Gender Age Response
0 f [30-65] 1
0 f [30-65] 1
0 f [30-65] 1
0 f [30-65] 1
0 f [30-65] 1
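As an aside, the same expansion can be done without building Python lists. This sketch uses index.repeat on the integer counts, so run it instead of (not after) the list-based steps above, while Yes and No still hold numbers:

# Repeat every group's row once per impression, then lay out the 1s and 0s
alt = grouped.loc[grouped.index.repeat(grouped['Impressions']), ['Gender', 'Age']]
alt['Response'] = np.repeat(np.tile([1, 0], len(grouped)),
                            grouped[['Yes', 'No']].to_numpy().ravel())

Either way, we can now fit the logit model on the expanded data as usual: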
model = smf.logit('Response ~ Gender + Age', data=expanded)
result = model.fit()
print(result.summary())
Logit Regression Results
==============================================================================
Dep. Variable: Response No. Observations: 200
Model: Logit Df Residuals: 196
Method: MLE Df Model: 3
Date: Mon, 22 Feb 2021 Pseudo R-squ.: 0.02765
Time: 18:29:33 Log-Likelihood: -85.502
converged: True LL-Null: -87.934
Covariance Type: nonrobust LLR p-value: 0.1821
================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept -2.1741 0.396 -5.494 0.000 -2.950 -1.399
Gender[T.m] 0.8042 0.439 1.831 0.067 -0.057 1.665
Age[T.[65+]] -0.7301 0.786 -0.929 0.353 -2.270 0.810
Age[T.[<30]] 0.1541 0.432 0.357 0.721 -0.693 1.001
================================================================================
Conclusions
All four models produced the same coefficients and p-values.
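You can verify this by lining up the estimates side by side (note that the last fit reused the name result and overwrote the first model, so three result objects remain):

# Coefficients from the expanded, counts, and weighted fits, side by side
print(pd.concat({'expanded': result.params,
                 'counts': result_grouped.params,
                 'weights': result_grouped2.params}, axis=1).round(4))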
In my experience, getting raw data for a project is not that common; in most cases, we are dealing with aggregated/grouped data. These techniques will help you handle it easily, which is why I think they are a great addition to your Python toolkit.
If you are using R, you can read this very useful post.
I am going to write more beginner-friendly posts in the future. Follow me on Medium or visit my blog to stay informed about them.
I welcome questions, feedback, and constructive criticism and can be reached on Twitter, LinkedIn, or Instagram.
Originally published at https://predictivehacks.com