
Whether you have a messy dataset to clean and analyze for insights, or you want to prepare data for a machine learning model, pandas is the library for you. It is simple to use, fast, and highly intuitive.
Pandas was built specifically for data science and bundles functionality from other libraries, which means you can often work in pandas alone without importing anything else, as we’ll see in this article.
A big disadvantage of pandas is that it struggles with big data. This is because pandas holds its data structures in RAM, which can run out as the data size grows. For such datasets, consider PySpark or Dask.
To begin using pandas in a Jupyter notebook, first import the library into the current runtime. We provide an alias pd that we’ll refer to in code.
import pandas as pd
We’ll be using the Cereals dataset from Kaggle for demonstration. Let’s first define some common words repeatedly used in the tutorial.
- DataFrame – A 2-dimensional table with labeled columns and rows. It is one of pandas’ data structures.
- Series – A column of values and their indexes. It’s also a data structure. A DataFrame is a collection of Series objects.
- Index – The label of every row (observation) in a DataFrame or Series. We’ll refer to index as the row labels, which are usually in the first column in a DataFrame or Series.
- column names – The headers of each column in a DataFrame.
- Feature – A column in the DataFrame.
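To make these terms concrete, here is a minimal sketch (the cereal names and values below are made up for illustration):

```python
import pandas as pd

# A tiny hypothetical DataFrame: two features, three rows
df = pd.DataFrame(
    {'calories': [70, 120, 70], 'rating': [68.4, 34.0, 59.4]},
    index=['100% Bran', '100% Natural Bran', 'All-Bran']  # the index (row labels)
)

col = df['calories']          # a single column is a Series
print(type(df).__name__)      # DataFrame
print(type(col).__name__)     # Series
print(list(df.columns))       # the column names
```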
Part 1. Data Understanding
1. Reading data into the Jupyter notebook
Pandas can read a whole range of text, binary, and SQL files and return a pandas object, which is usually a DataFrame.
We read text files using pandas.read_csv(). The documentation explains several optional parameters, such as how many rows to skip at the top (skiprows), which columns to load (usecols), and the column to use as the index (index_col).
pd.read_csv('file.csv')
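As a sketch of those optional parameters, the snippet below uses a tiny in-memory CSV in place of a real file (the data is made up):

```python
import pandas as pd
from io import StringIO

# A tiny in-memory CSV stands in for a file on disk (hypothetical data)
csv_data = StringIO(
    "exported from kaggle\n"       # a junk first line, skipped below
    "name,mfr,calories\n"
    "100% Bran,N,70\n"
    "All-Bran,K,70\n"
)

df = pd.read_csv(
    csv_data,
    skiprows=1,                    # skip the first line of the file
    usecols=['name', 'calories'],  # load only these two columns
    index_col='name',              # use 'name' as the row index
)
print(df)
```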
Other data loading functions are pd.read_excel(), pd.read_json() and pd.read_sql().
2. Viewing the data
Once your dataset is loaded, you can view its contents.
df.head(), df.tail() and df.sample() display the first 5 observations, the last 5, and one randomly selected observation respectively. Each function takes an optional n parameter for the number of rows, for example df.sample(5).
df.head()   # the first 5 rows
df.tail()   # the last 5
df.sample() # one random row

df.columns and df.index return the column names and the row index values respectively, each as an Index object.
df.columns
###Results
Index(['name', 'mfr', 'type', 'calories', 'protein', 'fat', 'sodium', 'fiber', 'carbo', 'sugars', 'potass', 'vitamins', 'shelf', 'weight', 'cups', 'rating'], dtype='object')
df.index
###Results
RangeIndex(start=0, stop=77, step=1)
Note that our index is a RangeIndex. This is the default index created when you load a DataFrame without specifying a column to be the index. It consists of integers starting at 0 and running up to (but not including) the number of rows: 77 in our case, so the labels run from 0 to 76.
In the code below, we explicitly set the ‘name’ column as the index using pd.read_csv(filename, index_col). Alternatively, use df.set_index('name').
df2 = pd.read_csv('cereal.csv',
index_col='name')
print(df2.index)
###Results
Index(['100% Bran', '100% Natural Bran', 'All-Bran',
'All-Bran with Extra Fiber', 'Almond Delight',...
Now the index values are the cereal names.
3. Learning about the data
Pandas allows us to learn about the structure of the DataFrame and the type of information it holds.
df.shape
returns a tuple to represent the number of rows and columns. In our case, there are 77 rows and 16 columns.
df.shape
(77, 16)
df.dtypes returns the data type of each column. We see that we have a mix of categorical (object) and numeric (int and float) features.

df.info()
returns information about the data such as column count, number of values in each column (non-null means not blank or missing), and their datatypes.

df.describe()
returns the descriptive statistics of each numeric feature. Here, we get the number of values (count), the mean, standard deviation, and the percentiles for each column.

Use df.describe(include='object') to return descriptive statistics of the categorical (object) columns. We get the number of unique categories per column, the most frequent category, and its number of occurrences.

You can also get individual statistics about a DataFrame or Series object. The documentation demonstrates aggregation functions such as max() and mean(), and element-wise operations such as cumsum().
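For instance, on a single Series (made-up values):

```python
import pandas as pd

# A hypothetical calories column
calories = pd.Series([70, 120, 50, 110], name='calories')

print(calories.max())              # aggregation: a single value, 120
print(calories.mean())             # 87.5
print(calories.cumsum().tolist())  # element-wise running total: [70, 190, 240, 350]
```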
4. Viewing data by selection
Selecting column(s) using square brackets []
df['rating'] selects the ‘rating’ column. A Series object is returned.
0 68.402973
1 33.983679
2 59.425505
3 93.704912
4 34.384843
...
df[['name', 'calories', 'rating']] selects three columns. Note the double brackets [[: the inner brackets hold the list of columns, while the outer ones index the DataFrame and return those columns. A DataFrame is returned.
df.loc[] to select data by name
df.loc allows you to select rows and columns by label. The parameters are the name(s) of the index and/or column(s).
df.loc[0, 'name'] displays the value at index 0 under the column ‘name’.
df.loc[1:4, 'protein':'sugars'] returns rows from index 1 to 4, and all columns between ‘protein’ and ‘sugars’. The use of a colon implies a range. Note: both values in the range (start and finish) are included in the results.

df.loc[[1,4], ['name','calories']] returns only rows 1 and 4, and the columns ‘name’ and ‘calories’. The rows and columns are given as lists, so only those exact labels are returned, not a range.

df.loc[:, :] returns all rows and all columns. Note: a full colon alone returns everything from end to end. Think of the empty spot before or after a colon as meaning ‘end’. For example, [:5] means from one ‘end’ to index 5, and [5:] means from index 5 to the ‘end’.
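A quick sketch of this inclusive behaviour on a default RangeIndex (made-up values):

```python
import pandas as pd

df = pd.DataFrame({'calories': [70, 120, 50, 110, 90, 100]})

print(df.loc[:2])   # rows 0, 1 AND 2: .loc ranges include the end label
print(df.loc[3:])   # rows 3 to the end
```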
df.iloc[] to select data by position
Here, we provide the position of the row and(or) column. Think of it as ‘index location’ and all positions are from 0. Note: In the case of a range, the end value (finish) is excluded from the result.
df.iloc[[3,5,16], [3,4,5]] returns the rows at positions 3, 5 and 16, and the columns at positions 3, 4 and 5.

df.iloc[1:3, 2:4] returns a range from row 1 to row 3 and from column 2 to column 4, excluding row 3 and column 4.
type calories
1 C 120
2 C 70
df.iloc[[0,2], -3:] returns rows 0 and 2, and the last 3 columns. -3: means ‘from the third-last column to the end’.

Selecting columns based on datatype
df.select_dtypes() returns columns of specific datatypes based on the parameters include=dtype or exclude=dtype.
df.select_dtypes(include='object') returns only the columns with string values. include='number' returns both integers and floats.

df.select_dtypes(exclude=['float','int']) returns all columns except those whose values are floats or ints. More options are in the documentation.
Selecting data based on a condition
This enables you to select the data that satisfies certain criteria, for example all students aged between 18 and 24. This is also called boolean indexing.
First you create a mask, which is the condition to be satisfied. Then apply the mask to the DataFrame inside square brackets [], like we did when selecting columns. Here we index with the condition, and only the rows where it holds are returned.
The code below displays all cereals with a rating above 70.
mask = df['rating'] > 70
df[mask]

These are the cereals made by the manufacturer ‘G’ (General Mills in this dataset). You can also use df['mfr'] == 'G' as the mask.
mask = df['mfr'].isin(['G'])
df[mask]
These are the cereals with calories between 100 and 150. Here, you can also use (df['calories'] >= 100) & (df['calories'] <= 150).
mask = df['calories'].between(100,150)
df[mask]
Note: To negate a condition and return the opposite, include the tilde (~) symbol before the mask.
Here are all cereals without the word ‘wheat’ in the name.
mask = ~df['name'].str.contains('wheat', case=False)
df[mask]
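Masks can also be combined with & (and) and | (or); each condition needs its own parentheses. A small sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['100% Bran', 'Cheerios', 'Corn Pops', 'Shredded Wheat'],
    'calories': [70, 110, 110, 80],
    'rating': [68.4, 50.8, 35.8, 68.2],
})

# Low-calorie AND highly rated cereals
mask = (df['calories'] < 100) & (df['rating'] > 60)
print(df[mask]['name'].tolist())   # ['100% Bran', 'Shredded Wheat']
```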
Part 2: Visualize the data
Data visualization involves plotting charts and graphs from the data. Subtle insights like outliers, sparse classes, and skewed features are easier to spot in charts than in the plain numbers we’ve been working with so far.
You can plot Series or DataFrame objects using pandas. Pandas uses matplotlib in the backend by default, but you don’t need to import the matplotlib library for the following plots.
5. Making basic plots
Line plot. This is the default pandas plot and it displays how the data changes across the dataset for each column. The x-axis is the row index of all the observations, in our case 0 to 76. Then, for each numeric feature, the points are plotted and a line is drawn through them.
df.plot()

We see that sodium and potassium have the highest values, getting to over 300 for certain cereals.
Box plot. This shows the distribution of values in a numeric feature, or how varied the values are. We include the parameter kind='box' and also rotate the axis labels.
df.plot(kind='box', rot=45)

Again, we see that sodium and potassium vary greatly among the cereals with some cereals having 300 while others have 0 sodium content.
Scatter plot. This shows how two numeric features move together by plotting their relationship. We have to provide the x, y, and kind parameters.
df.plot(x='calories', y='sugars', kind='scatter')

We see a positive relationship between sugars and calories.
Bar plot. This shows how numeric features are distributed for a categorical feature. In the code below, we modified our DataFrame by setting the index to ‘name’, then flipping the columns and rows (transposing) using df.T, which is covered shortly.

df_t = df.set_index('name').T
df_t.iloc[[2,4,6,8,10,-1], :5].plot(kind='bar')
Next, we used df.iloc to select the rows (minerals and rating) to analyze for the first 5 cereals. Notice how the cereal with the highest rating had the least calories, fat and sugars, and the highest fiber content.

Other plots you can create using pandas are area plots, hexbin plots, and pie charts.
Pandas provides basic formatting options such as removing the legend, changing the x and y labels, rotating the axis labels for readability, and creating subplots.
However, these plots are not amazing to look at and you can further extend pandas capabilities using matplotlib, seaborn, and other libraries mentioned here.
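As a sketch of those formatting options (the data is made up, and the Agg backend just lets the snippet run without a display):

```python
import matplotlib
matplotlib.use('Agg')   # non-interactive backend, so this runs headless
import pandas as pd

df = pd.DataFrame({'sugars': [6, 8, 5, 0], 'fiber': [10, 2, 1, 3]})

ax = df.plot(
    kind='bar',
    legend=False,                      # remove the legend
    rot=45,                            # rotate the x-axis labels for readability
    title='Sugar and fiber content',
)
ax.set_xlabel('cereal')                # change the x and y labels
ax.set_ylabel('grams')
```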
Part 3: Manipulating the data
6. Renaming columns and row indexes
Sometimes you want to change some or all column names to make them readable. A DataFrame loaded with no header will need a whole new set of column names.
Renaming columns
To change specific column names, use df.rename(columns=dict). The dict parameter contains the ‘old-name’ to ‘new-name’ mapping for the columns that need renaming.
df.rename(columns =
{'mfr': 'manufacturer',
'potass': 'potassium'})
To change all column names, use df.columns = list. The list contains names for all the columns.
df3.columns = ['name', 'mfr', 'type', 'calories', 'protein', 'fat', 'sodium', 'fiber', 'carbo', 'sugars', 'potass', 'vitamins', 'shelf', 'weight', 'cups', 'rating']
Renaming the row index
The same function, df.rename(index=dict), is used, but with the index parameter. The dict is a dictionary of ‘old-index-name’ : ‘new-index-name’ pairs.
df.rename(index={0:1000,
1:1001})
To rename the index and column names simultaneously, include both the index and columns parameters in the df.rename function.
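For example (a made-up two-column frame):

```python
import pandas as pd

df = pd.DataFrame({'mfr': ['N', 'Q'], 'potass': [280, 135]})

# Rename a row label and a column label in one call
df2 = df.rename(index={0: 1000}, columns={'potass': 'potassium'})
print(df2.index.tolist())     # [1000, 1]
print(df2.columns.tolist())   # ['mfr', 'potassium']
```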
7. Removing and inserting data
Adding a column
We’ll look at four ways, but first let’s create a list with the values that will be in the new column. The number of values must equal the number of rows being filled, otherwise a ValueError results.
import numpy as np
rand_list = np.random.randint(0, 20, len(df))
We use the numpy library to create random integers ranging from 0 up to (but not including) 20, with as many values as there are rows.
Method 1: df['col'] = values
Here, we reference the new column name inside square brackets then use the equal sign (=) to assign the values. This method can also be used to replace the values of an existing column.
df['new_col'] = rand_list
By default, this new column appears as the last column.
Method 2: df.loc[:, column_name] = values
Remember the .loc accessor we looked at earlier? The first part references the rows (: means all rows). The second part, after the comma, references the new column. We then assign the values, and the new column appears at the last position.
df.loc[:, 'new_col'] = rand_list
Method 3: df.insert(location, column_name, values)
Here, we use .insert and provide the location (position) where we want to insert the column (counted from 0), the column name, and the value(s) to fill with. If one value is provided, the whole column is filled with it.
df.insert(1, 'new_col', rand_list)

Note that this method modifies the original DataFrame in-place without making a copy.
Method 4: df.assign(col_name=values)
This method returns a modified DataFrame without altering the original; you then assign this modified copy back to the df variable. The new column appears at the last position.
df = df.assign(new_col = rand_list)
Inserting a row
Use df.loc[index_name] = values, similar to method two of column creation above. First we create the list of values to fill the new row with.
row_list = ['Awesome Oats', 'K', 'C', 100, 3, 1, 180, 2.0, 15.0, 5, 40, 25, 3, 1.0, 0.63, 75.345423]
Then we use df.loc[name] to point to that index, creating it if not present. We then assign the new values using the assignment operator (=).
df.loc[77] = row_list

To add several rows at a time, you can create a new DataFrame with the new rows and use pd.concat([df, df_new]). (df.append(df_new) also works in older versions of pandas, but append has since been deprecated and removed.)
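A minimal sketch of concatenating new rows (made-up values):

```python
import pandas as pd

df = pd.DataFrame({'name': ['100% Bran'], 'calories': [70]})
df_new = pd.DataFrame({'name': ['Awesome Oats', 'Super Rice'],
                       'calories': [100, 90]})

# pd.concat takes a LIST of DataFrames; ignore_index rebuilds the RangeIndex
combined = pd.concat([df, df_new], ignore_index=True)
print(combined)
```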
Removing (deleting) a column
Use df.drop(columns=cols) to remove one or several columns.
df.drop(columns='sodium', inplace=True)
This code removes the 'sodium' column. inplace=True modifies the original DataFrame in place; otherwise the original is preserved and a modified copy is returned.
Removing (deleting) a row
Use df.drop(index=row_index) to delete a row. Check the documentation for additional parameters.
df.drop(index=76)
Change the column order
Method 1: df[[col2, col1, col5, col3]]
The inner square brackets contain the column names in the desired order.
col_list = list(df.columns)
new_order = np.random.choice(col_list, 16, replace=False)
df = df[new_order]

Method 2: df.reindex(columns=col_list).
This function takes in a list of column names to return.
df.reindex(columns = new_order)
You can use either of these two methods to slice a DataFrame and return fewer columns.
Flip columns with rows (Transpose)
We use the df.transpose() or df.T functions to swap the rows and columns. For example, let’s say we want the cereal names to become the features and the features to become the row indexes.
First we use df.set_index to set the names as the index.

Then we call df2.T to transpose the index and columns.

8. Sorting data
You can sort values in a column (Series object) or the whole Dataframe by a column(s).
Sorting a column
df['name'].sort_values() sorts the ‘name’ column and returns a Series object. The parameter ascending=False sorts in descending order.

df.sort_values('name') returns the whole DataFrame sorted by the 'name' column. The by parameter (here passed positionally) must be provided.

Sorting the index:
df['name'].sort_index() sorts the ‘name’ column by its index (0 to 4 below).

df2.sort_index(ascending=False)
sorts the entire DataFrame by the index in descending order. This is especially useful when the index is a date.
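You can also sort by several columns at once by passing a list to by (and, optionally, a matching list to ascending). A sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({
    'mfr': ['K', 'G', 'K', 'G'],
    'rating': [60.0, 40.0, 30.0, 50.0],
})

# Sort by manufacturer, then by rating (descending) within each manufacturer
out = df.sort_values(by=['mfr', 'rating'], ascending=[True, False])
print(out['rating'].tolist())   # [50.0, 40.0, 60.0, 30.0]
```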

Conclusion
In this tutorial, we used the pandas library for data analysis and manipulation. We also looked at some visualization techniques and how you can extend their capabilities using additional libraries.
This is by no means an exhaustive view of pandas, and the documentation offers in-depth code and functionality. I encourage you to practice with several variations of the code snippets shared here and to try out different parameters. Only by doing this can you appreciate the full power of pandas.
All the code used is here on GitHub. Thank you for reading!