Dive into some messy real-world data using a combination of Python and R.

This notebook is about using a combination of Python and R to perform data preprocessing and some exploratory statistical analysis on an untidy dataset. Blogs and articles for aspiring data scientists often stress the importance of working with messy, real-world data and of learning data cleaning and preprocessing for this profession. However, most blogs and tutorials use standard datasets and skip the preprocessing step. Similarly, real-world data analysis may require a combination of EDA and statistical analysis, which means data scientists may need to use both Python and R, switching between them depending upon the micro-task at hand.
In this tutorial, I share some steps of data processing and exploratory analysis which I came across as part of an experiment during my PhD. I have introduced a toy dataset for this purpose, but its structure and messiness are similar to what I encountered. Imagine a retail company with five stores in different geographic locations. Each store has two billing counters (tills). The company is trialing a new product which is placed only at the billing counter, and the cashiers are supposed to pitch it to customers during billing. Each cashier asks every customer three questions about the product and then asks whether they want to buy it. The questions concern three attributes of the product, and the customers answer ‘Yes’ if they like the attribute or ‘No’ if they don’t. Thus we have 4 columns in our dataset for each copy of the product – 3 for the attributes and 1 recording whether the customer ends up buying the item. All values are categorical – ‘Yes’ or ‘No’.
However, the data has been stored in a single csv file with 40 columns – each of the five stores, and each of the two billing counters within each store, has its own set of columns. Thus we get multi-level columns, as shown in the screenshot below.

Untidy Data
There is one more problem with the way the data is stored. If you look at the data for Store 1 and compare the two tills, you see that the records are mutually exclusive: for a customer who visits Till 1, the corresponding record for Till 2 is left empty. There may have been valid (or lazy) reasons why it was stored this way, but right now, as data scientists, this is the dataset we have to work with.
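Before any cleaning, it can help to look at the raw header rows that give the file its multi-level structure. A minimal sketch (reading the same csv file we load below, with no header so the label rows stay visible as data):

import pandas as pd

# Peek at the first three raw rows: the first two hold the Store/Till
# labels and the third holds the variable names (which is why we pass
# header=[2] when reading the file properly below).
url = 'https://raw.githubusercontent.com/sagaruprety/data_science/master/multi_attribute_buying_behaviour.csv'
raw = pd.read_csv(url, header=None, nrows=3)
print(raw)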
So we begin. We first read our csv file and store it as a pandas dataframe. Note that we use header=[2] to use the third row as the header and skip the first two rows.
import pandas as pd
d1 = pd.read_csv('https://raw.githubusercontent.com/sagaruprety/data_science/master/multi_attribute_buying_behaviour.csv', header=[2])
d1.head()

Merge Tills within a Store
We see that pandas has labelled the duplicate columns with a ‘.’ suffix. So the ‘Buy’ variable for {Store_1, Till_1} remains ‘Buy’, that of {Store_1, Till_2} becomes ‘Buy.1’, and the ‘Buy’ variable for {Store_2, Till_1} becomes ‘Buy.2’. Also note that, within any store, an item’s record is NaN in one till if it was purchased at the other.
In total there are 40 columns here – 4 variables x 5 stores x 2 tills per store.
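To make the naming scheme concrete, here is a small, purely illustrative helper (not part of the cleaning pipeline) that maps a 1-indexed store and till to the column name pandas produced:

# Pandas numbers duplicate column names left to right, so the suffix
# index for a (store, till) pair is 2*(store-1) + (till-1), where an
# index of 0 means the original, unsuffixed name.
def till_column(col, store, till):
    idx = 2 * (store - 1) + (till - 1)
    return col if idx == 0 else f'{col}.{idx}'

print(till_column('Buy', 1, 1))  # Buy
print(till_column('Buy', 1, 2))  # Buy.1
print(till_column('Buy', 2, 1))  # Buy.2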
Our final aim of data processing is to merge the item records of all tills and all stores. This will result in a dataframe with only four columns, corresponding to the three attributes and the Buy decision.
The first step is to merge the data of the two tills of each store. We iterate over the columns of the dataframe, replace the NaNs with an empty string ('') and then concatenate the columns corresponding to the two tills of each store. The new dataframe is stored in a different variable, d2.
columns = ['Attribute_1', 'Attribute_2', 'Attribute_3', 'Buy']
num_stores = 5
d2 = pd.DataFrame()
for col in columns:
    for i in range(num_stores):
        if i == 0:
            d2[col+'.'+str(i)] = d1[col].fillna('') + d1[col+'.1'].fillna('')
        else:
            d2[col+'.'+str(i)] = d1[col+'.'+str(2*i)].fillna('') + d1[col+'.'+str(2*i+1)].fillna('')
d2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 202 entries, 0 to 201
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Attribute_1.0  202 non-null    object
 1   Attribute_1.1  202 non-null    object
 2   Attribute_1.2  202 non-null    object
 3   Attribute_1.3  202 non-null    object
 4   Attribute_1.4  202 non-null    object
 5   Attribute_2.0  202 non-null    object
 6   Attribute_2.1  202 non-null    object
 7   Attribute_2.2  202 non-null    object
 8   Attribute_2.3  202 non-null    object
 9   Attribute_2.4  202 non-null    object
 10  Attribute_3.0  202 non-null    object
 11  Attribute_3.1  202 non-null    object
 12  Attribute_3.2  202 non-null    object
 13  Attribute_3.3  202 non-null    object
 14  Attribute_3.4  202 non-null    object
 15  Buy.0          202 non-null    object
 16  Buy.1          202 non-null    object
 17  Buy.2          202 non-null    object
 18  Buy.3          202 non-null    object
 19  Buy.4          202 non-null    object
dtypes: object(20)
memory usage: 31.7+ KB
Combine data for all stores
As we see above, we have merged the till-level information into store-level information. In the next step, we merge the records of all stores into one. We create subsets of the dataframe corresponding to the five stores and then append them below each other.
num_stores = 5
store_dfs = [pd.DataFrame() for _ in range(num_stores)]
col_ind = 0
for col in columns:
    for store in range(num_stores):
        store_dfs[store][col] = d2.iloc[:, col_ind]
        col_ind += 1
store_dfs[4].head()

Above is the dataframe corresponding to Store 5. Similarly, we have a dataframe for each of the other stores. Note below that each of them has 202 rows.
store_dfs[4].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 202 entries, 0 to 201
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Attribute_1  202 non-null    object
 1   Attribute_2  202 non-null    object
 2   Attribute_3  202 non-null    object
 3   Buy          202 non-null    object
dtypes: object(4)
memory usage: 6.4+ KB
Now we concatenate the dataframes corresponding to each store below each other and reset the index – otherwise the index would repeat from 0 to 201 for each of the 5 dataframes.
# DataFrame.append was removed in pandas 2.0, so we use pd.concat;
# ignore_index=True stacks the five dataframes and resets the index.
df = pd.concat(store_dfs, ignore_index=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1010 entries, 0 to 1009
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Attribute_1  1010 non-null   object
 1   Attribute_2  1010 non-null   object
 2   Attribute_3  1010 non-null   object
 3   Buy          1010 non-null   object
dtypes: object(4)
memory usage: 31.7+ KB
Use R for statistical analysis with magic commands
Note that the final dataframe has 202 x 5 = 1010 rows and 4 columns. It represents the data for the whole company, rather than store-wise and till-wise. Let us now proceed to some basic statistical analysis of the data. R is one of the most popular languages for statistical analysis, and we will use it here. Jupyter notebooks (the name nods to Julia, Python and R) allow us to embed R code snippets in a Python notebook.
We first need to load the rpy2 extension in order to do so (the rpy2 Python package must be installed in your environment):
%load_ext rpy2.ipython
%R is a magic command which helps us switch from Python to R. Beginning a line of code with %R enables us to write that line in R.
The following line instructs the machine to take the Python variable named df and convert it into an R variable of the same name. That is, we convert our pandas dataframe into an R dataframe.
%R -i df
Next, we examine the structure of the R dataframe. This shows that internally the values of the different attributes are stored as strings.
%R str(df)
'data.frame': 1010 obs. of 4 variables:
$ Attribute_1: chr "No" "Yes" "Yes" "No" ...
$ Attribute_2: chr "Yes" "Yes" "No" "Yes" ...
$ Attribute_3: chr "Yes" "No" "No" "Yes" ...
$ Buy : chr "No" "Yes" "No" "No" ...
Convert to categorical variables in R dataframe
The following lines of code store the values of the different attributes as categorical variables (called factors in R). But we see that each categorical variable has 3 levels: besides ‘Yes’ and ‘No’ there is a null string, which is due to some missing values in the dataset.
Note that you can also use the magic command %%R which enables the whole cell to be used for R.
%%R
df$Attribute_1 <- as.factor(df$Attribute_1)
df$Attribute_2 <- as.factor(df$Attribute_2)
df$Attribute_3 <- as.factor(df$Attribute_3)
df$Buy <- as.factor(df$Buy)
str(df)
'data.frame': 1010 obs. of 4 variables:
 $ Attribute_1: Factor w/ 3 levels "","No","Yes": ...
 $ Attribute_2: Factor w/ 3 levels "","No","Yes": ...
 $ Attribute_3: Factor w/ 3 levels "","No","Yes": ...
 $ Buy        : Factor w/ 3 levels "","No","Yes": ...
Remove null strings from dataframe using Pandas
This is something we should have checked in the first place. We can easily do it in pandas. Note that the empty values are not NaN but empty strings, so we first convert them into NaN and then use pandas' dropna function to drop the rows containing NaN values.
import numpy as np
df.replace(to_replace='', value=np.nan, inplace=True)  # np.NaN was removed in NumPy 2.0; use np.nan
df.dropna(inplace=True)
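As a quick sanity check, we can confirm that no empty strings remain and that the row count matches the 1007 observations we will see in the R output below:

# No empty strings should be left, and 1010 - 3 = 1007 rows remain.
print((df == '').sum().sum())  # 0
print(len(df))                 # 1007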
Now we run the above piece of R code again, and this time we get an R dataframe with 2 levels for each categorical variable.
%R -i df
%R df$Attribute_1 <- as.factor(df$Attribute_1)
%R df$Attribute_2 <- as.factor(df$Attribute_2)
%R df$Attribute_3 <- as.factor(df$Attribute_3)
%R df$Buy <- as.factor(df$Buy)
%R str(df)
'data.frame': 1007 obs. of 4 variables:
$ Attribute_1: Factor w/ 2 levels "No","Yes": 1 2 2 1 2 1 1 2 2 2 ...
$ Attribute_2: Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 2 2 2 ...
$ Attribute_3: Factor w/ 2 levels "No","Yes": 2 1 1 2 2 2 2 1 2 2 ...
$ Buy : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 2 2 2 ...
Analysing Data
Now we can perform some analysis on the data. The purpose is to find out which of the three attributes most influences the buying decision. We first use xtabs to calculate cross-category frequencies, which gives us a glimpse of the influence of each attribute on the Buy decision.
%%R
print(xtabs(~Buy+Attribute_1, data=df))
print(xtabs(~Buy+Attribute_2, data=df))
print(xtabs(~Buy+Attribute_3, data=df))
     Attribute_1
Buy    No Yes
  No  372 122
  Yes  48 465
     Attribute_2
Buy    No Yes
  No  267 227
  Yes 180 333
     Attribute_3
Buy    No Yes
  No  272 222
  Yes 155 358
Logistic Regression Analysis
Note in the above tables that Attribute_1 appears to influence the buying decision heavily. Almost 80% of customers who like attribute 1 of the new product buy it, and almost 90% of those who do not like attribute 1 do not end up buying it. The same cannot be said of the other two attributes, where the splits are much closer to 50–50.
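These fractions are easy to verify from the cross-tab counts above:

# Recompute the fractions quoted above from the Attribute_1 table.
p_buy_given_like = 465 / (465 + 122)      # ~0.79 of those who like it buy it
p_nobuy_given_dislike = 372 / (372 + 48)  # ~0.89 of those who dislike it don't
print(round(p_buy_given_like, 2), round(p_nobuy_given_dislike, 2))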
Since we are dealing with categorical variables, we need the binomial family of generalised linear models (GLMs) to analyse the effect of the different attributes. So we fit a logistic regression model to the data.
In the summary below, we find a significant influence of attribute 1 on the buying decision, as the corresponding p-value is very low.
The negative intercept means that if all the attributes have the value ‘No’, then the customer is highly unlikely to buy the product, which makes sense. Note that in logistic regression the output variable ‘Buy’ is modelled on a log-odds scale, so a negative value actually means the odds are stacked against the customer buying the product.
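As a quick illustration of the log-odds scale, we can convert the intercept reported in the summary below back into a probability:

import math

# Intercept estimate from the model summary, on the log-odds scale.
intercept = -2.9599
# Inverse-logit: probability of buying when all three answers are 'No'.
p_all_no = 1 / (1 + math.exp(-intercept))
print(round(p_all_no, 3))  # ~0.049, i.e. roughly a 5% chance of buying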
If you are not familiar with the concepts and intuition behind logistic regression, refer to this excellent series of videos explaining logistic regression.
%%R
logistic <- glm(Buy ~ Attribute_1 + Attribute_2 + Attribute_3, data = df, family = "binomial")
print(summary(logistic))
Call:
glm(formula = Buy ~ Attribute_1 + Attribute_2 + Attribute_3,
    family = "binomial", data = df)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.0684  -0.5250   0.5005   0.6474   2.4538

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)     -2.9599     0.2241 -13.206  < 2e-16 ***
Attribute_1Yes   3.3681     0.1912  17.620  < 2e-16 ***
Attribute_2Yes   0.5579     0.1757   3.174   0.0015 **
Attribute_3Yes   1.0479     0.1782   5.881 4.07e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1395.6  on 1006  degrees of freedom
Residual deviance:  850.7  on 1003  degrees of freedom
AIC: 858.7

Number of Fisher Scoring iterations: 5
Logistic Regression Model to Predict Buying Decisions
We want to know whether we need all three variables to model the data, or whether some variable, maybe Attribute_2, does not contribute to the buying decision. We can test this using an analysis of deviance (ANOVA), as shown below.
The deviance reduction for attribute_1 is much higher than for the others; attributes 2 and 3 do not contribute as much to the buying decision. So if you want to keep the model simpler, attribute_1 alone can be used to predict the buying decision. That said, the residual deviance is lowest when all three attributes are used, so the model fits best with all three.
%R anova(logistic, test='Chisq')

We can use this fitted model to make predictions for new customers. For example, we see below that if a customer does not like the first attribute but likes the other two, the probability of them buying the product is about 0.2.
%%R
new_customer <- data.frame(Attribute_1=as.factor('No'), Attribute_2=as.factor('Yes'), Attribute_3=as.factor('Yes'))
new_customer_buys <- predict(logistic, new_customer, type='response')
paste(new_customer_buys)
[1] "0.205188948698921"
Conclusion and Future Ideas
So this was just an introduction to some real-world data science work. We started from a very untidy dataset and prepared the data in order to carry out some statistical analysis and predictions, using a combination of Python and R along the way. The model is very simple, with just three attributes, and even then one of them alone is almost sufficient to predict the decision.
For those of you interested in the behavioural aspects of decision-making, it would be interesting to consider whether the predictions would change if the cashier asked the three questions in a different order. Machine learning models do not take the order of features into account, yet we know that human answers are affected by the order in which questions are asked (order effects). So, when the features of a machine learning model correspond to human decisions, answers or judgements, the model may need to take the ordering of features into account when predicting a decision.
Thank you for going through this post. I am still learning the trade of data science and analysis and would love to get feedback on any aspect of this post.