Getting used to the Pandas package — lesson, exercises and corrections

Félix Revert
Towards Data Science
7 min readApr 19, 2020

--

Learning pandas

pandas and DataFrames

Since you’re using pandas, the first line you will have to write in (almost) all your notebooks is

import pandas as pd

In pandas, a DataFrame is a table. You can do many things on a DataFrame: sum over a column, plot a histogram, do a pivot table…

As a good start, simply write ‘pd.’ in an empty cell and use the keyboard shortcut Tab to see all the available functions within pandas. You’ll the

Creating DataFrames

To create a DataFrame, use:

pd.DataFrame({
'age':[10, 45, 13]
})

The function ‘pd.DataFrame()’ uses a dictionary:

{
'age':[10, 45, 13]
}

where ‘age’ is the name of the column, [10, 45, 13] are the values of the rows.

Note you could have multiple columns:

{
'age':[10, 45, 13],
'name': ['George', 'Damien', 'Ingrid'],
'gender': ['male', 'male', 'female']
}

You can see this is a little bit unintuitive since we are used to visualising tables as list of rows. Rather, the pandas DataFrames use a list of columns. That’s the first concept to come to grips with. There are other methods to create DataFrame but this is the most common one. The other ones consist in reading files:

pd.read_excel('myExcelFile.xlsx')

(don’t forget to close the file before opening it, otherwise you’ll get an error)

or

pd.read_csv('myFile.csv')

Combining DataFrames

Combining DataFrames is really simple: use either .join(), .append(), pd.concat() or .merge().

See the following (it’s ugly but it works)

With .join():

pd.DataFrame({
'age':[10, 45, 13]
}).join(
pd.DataFrame({
'name': ['George', 'Damien', 'Ingrid']
})
).join(
pd.DataFrame({
'gender': ['male', 'male', 'female']
})
)

Returns

With .append():

pd.DataFrame({
'age':[10, 45, 13]
}).append(
pd.DataFrame({
'age': ['George', 'Damien', 'Ingrid']
})
).append(
pd.DataFrame({
'age': ['male', 'male', 'female']
})
)

Returns

(See the ugly indexing: 0,1,2,0,1,2,0,1,2)

Note append cannot join DataFrame on columns, it has to be joined by rows (one above the other, not one next to the other). You have to have the same column names among your DataFrames, otherwise you’ll get a warning.

With pd.concat():

pd.concat([
pd.DataFrame({
'age':[10, 45, 13]
}),
pd.DataFrame({
'name': ['George', 'Damien', 'Ingrid']
}),
pd.DataFrame({
'gender': ['male', 'male', 'female']
})
], axis=1)

Returns

Note that pd.concat() is to be used with a list of DataFrames, not DataFrames. Use pd.concat([df1, df2, …]). Note the ‘, axis=1’ which is very common in pandas. Basically, axis=0 is columns, axis=1 is rows.

With .merge():

pd.DataFrame({
'age':[10, 45, 13],
'name': ['George', 'Damien', 'Ingrid']
}).merge(
pd.DataFrame({
'name': ['George', 'Damien', 'Ingrid'],
'gender': ['male', 'male', 'female']
}), on=['name']
)

Returns

‘.merge()’ is the SQL ‘JOIN’. You need at least one column to match them.

Running operations on the DataFrame

Sum, max, min, mean, median

pd.DataFrame({
'age':[10, 45, 13],
'name': ['George', 'Damien', 'Ingrid'],
'gender': ['male', 'male', 'female']
}).mean()

Returns

See that it averages only the numeric column.

pd.DataFrame({
'age':[10, 45, 13],
'name': ['George', 'Damien', 'Ingrid'],
'gender': ['male', 'male', 'female']
}).sum()

Returns

See that it works for all columns?!! It concatenates text together as well as summing the ‘age’ column. That’s because you can sum text in Python:

'Damien' + 'George'

Here we don’t care about the sum of the text, but in some cases it can be interesting.

pd.DataFrame({
'age':[10, 45, 13],
'name': ['George', 'Damien', 'Ingrid'],
'gender': ['male', 'male', 'female']
}).min()

Returns

.min() returns the minimum value for numeric but also returns the minimum value for strings in the alphabetical order! That’s because you can compare text in Python:

'Damien' > 'George'

pivot_table

No need to introduce this if you’re an Excel user. Pivot table simply aggregates data together in a way that’s easy to read.

pd.DataFrame({
'age':[10, 45, 13],
'name': ['George', 'Damien', 'Ingrid'],
'gender': ['male', 'male', 'female']
}).pivot_table(index='gender', aggfunc='min')

Returns

Same behaviour as before, the ‘min’ operator works for both numeric and string columns.

Custom functions

To scale the columns of a dataset between 0 and 1, one could write the following:

for column in df.columns:
df[column] = \
(df[column] — df[column].min()) / \
(df[column].max() — df[column].min())

Remember that backslash is for line breaks.

The more advanced way to write this is by using the .apply() method. As the name suggests, it applies a function to all columns or all rows at the same time.

df = df.apply(
lambda column: (column-column.min()) / (column.max()-column.min())
)

To do custom functions on 1 specific column, use the .map() method.

df['column_1_squared'] = df['column_1'].map(
lambda x: x * x
)

Conclusion

Practice these as much as you can. They’re going to be super duper useful.

Exercises

Creating DataFrames

Create DataFrames that looks like the following images. Oh, and find a clever way to create them, don’t just hard code the values in a dictionary:

1)

Dataframe #1: squares from 3 to 10

2)

Dataframe #2: multiple columns

3)

DataFrame #3: squared numbers

Combining DataFrames

1)

customers1 = pd.DataFrame({
'age':[10, 45, 13],
'name': ['George', 'Damien', 'Ingrid'],
'gender': ['male', 'male', 'female']
})customers2 = pd.DataFrame({
'age':[12, 23, 25],
'name': ['Albert', 'Jane', 'Angela'],
'gender': ['male', 'female', 'female']
})

How to write in one line the table that concatenates the two tables of customers?

2) difficult

Imagine you have on your hard drive some files in the folder named cust_transactions. Each file contains the following information about some customer transactions:

file_482.csv contains the 4 columns below:

How can you read all the files into a single DataFrame named ‘transactions’?

Final one

very difficult ;)

You have a 5 sklearn models for binary classification named model_1, model_2, …, model_5. You want to measure the AUC, LogLoss, accuracy and f1-score of all these 5 models on both the train and test sets named X_train and X_test. The actual values for the target are y_train and y_test respectively.

Each row represents the performance of a model.

Can you write a clean code to create the following table?

Reminder:

These are the 4 measures of accuracy:

from sklearn.metrics import log_loss, roc_auc_score, accuracy_score, f1_score

log_loss and roc_auc_score take the predicted probability as input, accuracy_score and f1_score take the predicted class as input.

For instance, if m is the name of the model:

log_loss(y_true=y_train, y_pred=m.predict_proba(X_train)[:,1])
roc_auc_score(y_true=y_train, y_score=m.predict_proba(X_train)[:,1])

Note one uses y_pred, the other uses y_score.

Corrections

Creating DataFrames

1)

pd.DataFrame({
'n': range(3,11),
'n_squared': [n*n for n in range(3,11)]
})

2)

pd.DataFrame({'number_'+str(n):[n] for n in range(10)})

Don’t forget the [] next to the n.

3)

pd.DataFrame({'column_'+str(n):[n*m for m in range(10)] for n in range(10)})

Two list comprehensions intertwined.

Combining DataFrames

1)

customers1.append(customers2, ignore_index=True)

Note that ignore_index=True isn’t necessary, it’s good to get a clean new index.

2)

import ostransactions = pd.concat([
pd.read_csv(f)
for f in os.listdir('cust_transactions')
if f.endswith('.csv')
])

The ‘import os’ is necessary to call ‘os.listdir(‘cust_transactions’)’ which lists all the files in the folder ‘cust_transactions’. I added the ‘if f.endswith(‘.csv’)’ in case there are other types of files in the folder.

Using multiple .append() here looks like a good choice too. However, when reading thousands or millions of files together, it will be very slow. Using pd.concat([]) with pd.read_csv() will be the fastest option.

Final one

models = [model_1, model_2, model_3, model_4, model_5]pd.concat(
[
pd.DataFrame({
sample+'_log_loss': [log_loss(y_true=y, y_pred=m.predict_proba(X)[:,1]) for m in models],
sample+'_AUC': [roc_auc_score(y_true=y, y_score=m.predict_proba(X)[:,1]) for m in models],
sample+'_accuracy': [accuracy_score(y_true=y, y_pred=m.predict(X)) for m in models],
sample+'_f1score': [f1_score(y_true=y, y_pred=m.predict(X)) for m in models]
})
for (sample,X,y) in zip(['train', 'test'], [X_train, X_test], [y_train, y_test])
],
axis=1
)

--

--

Product Manager @Doctolib after 5 years as data scientist. Loves when ML joins Products 🤖👨‍💻