Getting used to the Pandas package — lesson, exercises and corrections
Learning pandas
pandas and DataFrames
Since you’re using pandas, the first line you will have to write in (almost) all your notebooks is
import pandas as pd
In pandas, a DataFrame is a table. You can do many things on a DataFrame: sum over a column, plot a histogram, do a pivot table…
As a good start, simply write ‘pd.’ in an empty cell and use the keyboard shortcut Tab to see all the available functions within pandas. You’ll the
Creating DataFrames
To create a DataFrame, use:
pd.DataFrame({
'age':[10, 45, 13]
})
The function ‘pd.DataFrame()’ uses a dictionary:
{
'age':[10, 45, 13]
}
where ‘age’ is the name of the column, [10, 45, 13] are the values of the rows.
Note you could have multiple columns:
{
'age':[10, 45, 13],
'name': ['George', 'Damien', 'Ingrid'],
'gender': ['male', 'male', 'female']
}
You can see this is a little bit unintuitive since we are used to visualising tables as list of rows. Rather, the pandas DataFrames use a list of columns. That’s the first concept to come to grips with. There are other methods to create DataFrame but this is the most common one. The other ones consist in reading files:
pd.read_excel('myExcelFile.xlsx')
(don’t forget to close the file before opening it, otherwise you’ll get an error)
or
pd.read_csv('myFile.csv')
Combining DataFrames
Combining DataFrames is really simple: use either .join(), .append(), pd.concat() or .merge().
See the following (it’s ugly but it works)
With .join():
pd.DataFrame({
'age':[10, 45, 13]
}).join(
pd.DataFrame({
'name': ['George', 'Damien', 'Ingrid']
})
).join(
pd.DataFrame({
'gender': ['male', 'male', 'female']
})
)
Returns
With .append():
pd.DataFrame({
'age':[10, 45, 13]
}).append(
pd.DataFrame({
'age': ['George', 'Damien', 'Ingrid']
})
).append(
pd.DataFrame({
'age': ['male', 'male', 'female']
})
)
Returns
(See the ugly indexing: 0,1,2,0,1,2,0,1,2)
Note append cannot join DataFrame on columns, it has to be joined by rows (one above the other, not one next to the other). You have to have the same column names among your DataFrames, otherwise you’ll get a warning.
With pd.concat():
pd.concat([
pd.DataFrame({
'age':[10, 45, 13]
}),
pd.DataFrame({
'name': ['George', 'Damien', 'Ingrid']
}),
pd.DataFrame({
'gender': ['male', 'male', 'female']
})
], axis=1)
Returns
Note that pd.concat() is to be used with a list of DataFrames, not DataFrames. Use pd.concat([df1, df2, …]). Note the ‘, axis=1’ which is very common in pandas. Basically, axis=0 is columns, axis=1 is rows.
With .merge():
pd.DataFrame({
'age':[10, 45, 13],
'name': ['George', 'Damien', 'Ingrid']
}).merge(
pd.DataFrame({
'name': ['George', 'Damien', 'Ingrid'],
'gender': ['male', 'male', 'female']
}), on=['name']
)
Returns
‘.merge()’ is the SQL ‘JOIN’. You need at least one column to match them.
Running operations on the DataFrame
Sum, max, min, mean, median
pd.DataFrame({
'age':[10, 45, 13],
'name': ['George', 'Damien', 'Ingrid'],
'gender': ['male', 'male', 'female']
}).mean()
Returns
See that it averages only the numeric column.
pd.DataFrame({
'age':[10, 45, 13],
'name': ['George', 'Damien', 'Ingrid'],
'gender': ['male', 'male', 'female']
}).sum()
Returns
See that it works for all columns?!! It concatenates text together as well as summing the ‘age’ column. That’s because you can sum text in Python:
'Damien' + 'George'
Here we don’t care about the sum of the text, but in some cases it can be interesting.
pd.DataFrame({
'age':[10, 45, 13],
'name': ['George', 'Damien', 'Ingrid'],
'gender': ['male', 'male', 'female']
}).min()
Returns
.min() returns the minimum value for numeric but also returns the minimum value for strings in the alphabetical order! That’s because you can compare text in Python:
'Damien' > 'George'
pivot_table
No need to introduce this if you’re an Excel user. Pivot table simply aggregates data together in a way that’s easy to read.
pd.DataFrame({
'age':[10, 45, 13],
'name': ['George', 'Damien', 'Ingrid'],
'gender': ['male', 'male', 'female']
}).pivot_table(index='gender', aggfunc='min')
Returns
Same behaviour as before, the ‘min’ operator works for both numeric and string columns.
Custom functions
To scale the columns of a dataset between 0 and 1, one could write the following:
for column in df.columns:
df[column] = \
(df[column] — df[column].min()) / \
(df[column].max() — df[column].min())
Remember that backslash is for line breaks.
The more advanced way to write this is by using the .apply() method. As the name suggests, it applies a function to all columns or all rows at the same time.
df = df.apply(
lambda column: (column-column.min()) / (column.max()-column.min())
)
To do custom functions on 1 specific column, use the .map() method.
df['column_1_squared'] = df['column_1'].map(
lambda x: x * x
)
Conclusion
Practice these as much as you can. They’re going to be super duper useful.
Exercises
Creating DataFrames
Create DataFrames that looks like the following images. Oh, and find a clever way to create them, don’t just hard code the values in a dictionary:
1)
2)
3)
Combining DataFrames
1)
customers1 = pd.DataFrame({
'age':[10, 45, 13],
'name': ['George', 'Damien', 'Ingrid'],
'gender': ['male', 'male', 'female']
})customers2 = pd.DataFrame({
'age':[12, 23, 25],
'name': ['Albert', 'Jane', 'Angela'],
'gender': ['male', 'female', 'female']
})
How to write in one line the table that concatenates the two tables of customers?
2) difficult
Imagine you have on your hard drive some files in the folder named cust_transactions. Each file contains the following information about some customer transactions:
file_482.csv contains the 4 columns below:
How can you read all the files into a single DataFrame named ‘transactions’?
Final one
very difficult ;)
You have a 5 sklearn models for binary classification named model_1, model_2, …, model_5. You want to measure the AUC, LogLoss, accuracy and f1-score of all these 5 models on both the train and test sets named X_train and X_test. The actual values for the target are y_train and y_test respectively.
Each row represents the performance of a model.
Can you write a clean code to create the following table?
Reminder:
These are the 4 measures of accuracy:
from sklearn.metrics import log_loss, roc_auc_score, accuracy_score, f1_score
log_loss and roc_auc_score take the predicted probability as input, accuracy_score and f1_score take the predicted class as input.
For instance, if m is the name of the model:
log_loss(y_true=y_train, y_pred=m.predict_proba(X_train)[:,1])
roc_auc_score(y_true=y_train, y_score=m.predict_proba(X_train)[:,1])
Note one uses y_pred, the other uses y_score.
Corrections
Creating DataFrames
1)
pd.DataFrame({
'n': range(3,11),
'n_squared': [n*n for n in range(3,11)]
})
2)
pd.DataFrame({'number_'+str(n):[n] for n in range(10)})
Don’t forget the [] next to the n.
3)
pd.DataFrame({'column_'+str(n):[n*m for m in range(10)] for n in range(10)})
Two list comprehensions intertwined.
Combining DataFrames
1)
customers1.append(customers2, ignore_index=True)
Note that ignore_index=True isn’t necessary, it’s good to get a clean new index.
2)
import ostransactions = pd.concat([
pd.read_csv(f)
for f in os.listdir('cust_transactions')
if f.endswith('.csv')
])
The ‘import os’ is necessary to call ‘os.listdir(‘cust_transactions’)’ which lists all the files in the folder ‘cust_transactions’. I added the ‘if f.endswith(‘.csv’)’ in case there are other types of files in the folder.
Using multiple .append() here looks like a good choice too. However, when reading thousands or millions of files together, it will be very slow. Using pd.concat([]) with pd.read_csv() will be the fastest option.
Final one
models = [model_1, model_2, model_3, model_4, model_5]pd.concat(
[
pd.DataFrame({
sample+'_log_loss': [log_loss(y_true=y, y_pred=m.predict_proba(X)[:,1]) for m in models],
sample+'_AUC': [roc_auc_score(y_true=y, y_score=m.predict_proba(X)[:,1]) for m in models],
sample+'_accuracy': [accuracy_score(y_true=y, y_pred=m.predict(X)) for m in models],
sample+'_f1score': [f1_score(y_true=y, y_pred=m.predict(X)) for m in models]
})
for (sample,X,y) in zip(['train', 'test'], [X_train, X_test], [y_train, y_test])
],
axis=1
)