Quick dive into Pandas for Data Science

Ehi Aigiomawu
Towards Data Science
14 min read · Apr 25, 2018


Pandas — Data Science Fundamentals

Pandas is an open-source Python library built on top of NumPy. It allows you to do fast analysis as well as data cleaning and preparation. An easy way to think of Pandas is to see it as Python’s version of Microsoft Excel.

One cool thing I like about Pandas is that it works well with data from a wide variety of sources, such as an Excel sheet, a CSV file, a SQL database, or even a webpage.

Installing Pandas

If you have Anaconda, you can simply install Pandas from your terminal or command prompt using:

conda install pandas

If you do not have Anaconda on your computer, install Pandas from your terminal using:

pip install pandas

Pandas Data Structures

Series

A Series is a one-dimensional array which is very similar to a NumPy array. As a matter of fact, Series are built on top of NumPy array objects. What differentiates a Series from a NumPy array is that a Series can have axis labels with which it can be indexed.

import numpy as np
import pandas as pd

Here is the basic syntax for creating a pandas Series:

my_series = pd.Series(data, index)

From the above, data can be any object type such as a dictionary, a list, or even a NumPy array, while index signifies the axis labels with which the Series will be indexed.

We’re going to create a Series from various object types.

Note: It is important to note that the index argument is optional, meaning you can decide not to pass it in. If no index is passed, Pandas will create a default index having values [0, ..., len(data) - 1]:
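For concreteness, here is a small sketch of the different ways to build a Series (the sample values and variable names are mine, not from the original screenshots):

```python
import numpy as np
import pandas as pd

labels = ['a', 'b', 'c']
data = [10, 20, 30]

# No index passed: Pandas assigns the default index 0, 1, 2
s_default = pd.Series(data)

# Explicit labels
s_labeled = pd.Series(data, index=labels)

# From a dictionary: keys become the index, values the data points
s_dict = pd.Series({'a': 10, 'b': 20, 'c': 30})

# From a NumPy array
s_array = pd.Series(np.array(data), index=labels)
```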

Whenever you create a Series from a Python dictionary, Pandas sets the keys as the index of the Series and sets the values as the corresponding data points.

What differentiates a Pandas Series from a NumPy array is that Pandas Series can hold a variety of object types.

Grabbing information from Series

We can grab information from a Series the same way we do with a Python dictionary:
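A quick sketch of label-based lookup, using made-up country data of my own:

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'Italy', 'Japan'])

# Index into the Series with a label, just like a dictionary key
usa_value = ser['USA']
japan_value = ser['Japan']
```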

Performing Arithmetic operations on Series

Operations on Series are done based on the index. When we use any of the mathematical operations such as -, +, /, *, Pandas does the computation by aligning values on matching index labels. The result is converted to a float so that you do not lose any information.

From the above, wherever Pandas was unable to find a match between the two Series, it gives the result a NaN value.
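As an illustration (the data here is my own), adding two Series that only partly share their labels:

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'Italy', 'Japan'])
ser2 = pd.Series([1, 2, 5, 4], index=['USA', 'Germany', 'USSR', 'Japan'])

# Matching labels are added together; labels present in only
# one of the two Series come out as NaN
total = ser1 + ser2
```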

DataFrames

A DataFrame is a two-dimensional data structure in which the data is aligned in a tabular form i.e. in rows and columns. Pandas DataFrames make manipulating your data easy. You can select, replace columns and rows and even reshape your data.

Here is the basic syntax for creating a DataFrame:

pd.DataFrame(data,index)

Let’s create a DataFrame of random numbers having 5 rows and 4 columns:
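A sketch of what that could look like (the seed and labels are my own choices):

```python
import numpy as np
import pandas as pd

np.random.seed(101)  # seed the generator so the "random" numbers are reproducible
df = pd.DataFrame(np.random.randn(5, 4),
                  index=['a', 'b', 'c', 'd', 'e'],
                  columns=['W', 'X', 'Y', 'Z'])
```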

As we can see, each of the columns above is actually just a Pandas Series, and they all share a common index. Hence, it is safe to say a DataFrame is a collection of Series sharing the same index. In the example below, we will create a DataFrame from Series:
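A minimal version of that idea, with two hand-made Series as the columns:

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
ser2 = pd.Series([4, 5, 6], index=['a', 'b', 'c'])

# Each Series becomes a column; the shared labels become the row index
df = pd.DataFrame({'col1': ser1, 'col2': ser2})
```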

Selecting Columns from DataFrames

Using bracket notation [], we can easily grab objects from a DataFrame the same way it’s done with Series. Let’s grab a column by name:

Because we grabbed a single column, it returns a Series. Go ahead and confirm the data type returned using type():

If we grab multiple columns, the result will be a DataFrame:
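To make the distinction concrete (using a small random DataFrame of my own):

```python
import numpy as np
import pandas as pd

np.random.seed(101)
df = pd.DataFrame(np.random.randn(5, 4),
                  index=['a', 'b', 'c', 'd', 'e'],
                  columns=['W', 'X', 'Y', 'Z'])

single = df['W']          # one column -> a Series
several = df[['W', 'Z']]  # a list of columns -> a DataFrame
```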

Adding Columns to a DataFrame

When creating a new column, you define it as if it already exists. Having a DataFrame:

We can easily add new columns either by creating them afresh or from existing columns like shown below:
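For example (a sketch with my own column names):

```python
import numpy as np
import pandas as pd

np.random.seed(101)
df = pd.DataFrame(np.random.randn(5, 4),
                  index=['a', 'b', 'c', 'd', 'e'],
                  columns=['W', 'X', 'Y', 'Z'])

# Define the new column as if it already exists
df['new'] = df['W'] + df['Y']
```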

Removing rows/columns from a DataFrame

We can remove a row or a column using the .drop() method. In doing this, we have to specify axis=0 for a row and axis=1 for a column.

It is important to note that Pandas does not permanently drop a row/column whenever you use the .drop() method unless you tell it to do so. This is because Pandas doesn’t want you to accidentally lose your data.

Go ahead and confirm by calling df. To permanently delete a row/column, you have to set inplace=True like so:
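A small sketch of the difference:

```python
import pandas as pd

df = pd.DataFrame({'W': [1, 2], 'X': [3, 4]}, index=['a', 'b'])

# Without inplace=True the drop is not permanent: df keeps column 'X'
temp = df.drop('X', axis=1)

# With inplace=True the column is really gone
df.drop('X', axis=1, inplace=True)
```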

Selecting Rows in a DataFrame

To select rows, we have to call the location of the rows using .loc[] which takes in the label name or .iloc[] which takes in the index position of the row.

We can also select a subset of rows and columns by specifying both the row and the column we want, like we did in NumPy. Say we want 'Name' at row 'c':
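A sketch of both selectors on a made-up table (names and ages are invented):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Ada', 'Ben', 'Cal'], 'Age': [30, 25, 41]},
                  index=['a', 'b', 'c'])

row_by_label = df.loc['b']       # select by label
row_by_position = df.iloc[1]     # select by integer position (the same row)
name_at_c = df.loc['c', 'Name']  # row 'c', column 'Name'
```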

Conditional selection

Pandas allows you to perform conditional selection using bracket notation []. The example below returns the rows where 'W' > 0:

Say we want to return only the values of column 'X' where 'W'>0:

Go ahead and try this: df[df['W']>0][['X','Y']]. You should get this:

The above one line abstraction can also be achieved by breaking it down into multiple steps:

my_series = df['W']>0
result = df[my_series]
my_cols = ['X','Y']
result[my_cols]

Using multiple conditions, we can get values from the DataFrame by combining conditions with the logical operators & (AND) and | (OR). For example, to return the values where 'W' > 0 and 'X' > 1, use:
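Putting the single- and multi-condition cases together on a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'W': [1, -2, 3], 'X': [0, 2, 5], 'Y': [4, 1, -1]},
                  index=['a', 'b', 'c'])

positive_w = df[df['W'] > 0]              # rows where W > 0
x_given_w = df[df['W'] > 0]['X']          # only column X of those rows
both = df[(df['W'] > 0) & (df['X'] > 1)]  # & is AND, | is OR
```

Note the parentheses around each condition — Python’s plain `and`/`or` keywords do not work on Series.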

Resetting the Index of a DataFrame

When your index doesn’t look entirely the way you want it to, you can easily reset it using .reset_index(). This moves the old index into a column named index, with Pandas’ default index values [0, ..., len(data) - 1] becoming our new index.

As expected, .reset_index() does not permanently reset the index until you specify it with .reset_index(inplace=True).

Setting the Index of a DataFrame

Similarly, we can specify which column we want to use as our index by using the .set_index() function. To do this, we will create a new column called 'ID' and set it as our new index:

Unlike .reset_index(), which moves the old index into a column and uses the default Pandas index as the new one, .set_index() completely overrides the old index.
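Both operations side by side, sketched on a tiny DataFrame of my own:

```python
import pandas as pd

df = pd.DataFrame({'W': [1, 2, 3]}, index=['a', 'b', 'c'])

# reset_index: the old index moves into a column named 'index' and the
# default 0..n-1 index takes its place (not permanent without inplace=True)
reset = df.reset_index()

# set_index: an existing column becomes the new index; the old index is discarded
df['ID'] = ['x1', 'x2', 'x3']
df.set_index('ID', inplace=True)
```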

Multi-level index (MultiIndex) and Index Hierarchy

A MultiIndex is simply an array of tuples where each tuple is unique. It can be created from a list of arrays (using MultiIndex.from_arrays), an array of tuples (using MultiIndex.from_tuples), or a crossed set of iterables (using MultiIndex.from_product).

Let’s create a MultiIndex from a list of tuples:

We can go ahead to grab rows from our MultiIndex using the .loc[] like so:

As we can see, the two index levels of our DataFrame df do not have names. We can give them names using .index.names like so:

Cross section of Rows and Columns

Using the .xs() method, we can easily select data at a particular level of a MultiIndex. For example, say we want to grab all levels where Num = 22:
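The whole workflow — building the MultiIndex, naming its levels, and taking a cross section — might look like this (the group labels and numbers are my own):

```python
import numpy as np
import pandas as pd

# Build a MultiIndex from a list of tuples
outside = ['G1', 'G1', 'G2', 'G2']
inside = [11, 22, 11, 22]
hier_index = pd.MultiIndex.from_tuples(list(zip(outside, inside)))

np.random.seed(101)
df = pd.DataFrame(np.random.randn(4, 2), index=hier_index, columns=['A', 'B'])
df.index.names = ['Group', 'Num']  # name the two index levels

g1 = df.loc['G1']               # all rows under group G1
num22 = df.xs(22, level='Num')  # cross section: every group where Num == 22
```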

Missing Data

A lot of times, when you use Pandas to read in data and there are missing points, Pandas automatically fills in those missing points with a NaN or null value. Hence, we can either drop those auto-filled values using .dropna() or fill them using .fillna().

Say we have a DataFrame:

Using the .dropna() method, we are telling Pandas to drop any row or column with one or more missing values. To drop rows, use .dropna(axis=0), and .dropna(axis=1) to drop columns.

Note that if you do not specify the axis, Pandas assumes you are dropping a row by default.

Similarly, using the .fillna() method, we can replace missing values in our DataFrame with a desired value. Let’s replace all the NaN values with 10:

We could also choose to fill only the missing values of a specific column or row. Let’s replace the missing values in column 'A' with its mean value:

From the above, the mean of column 'A' is 2.0, hence it replaces the second row with 2.0.

Similarly, .dropna() and .fillna() do not take effect permanently until you set inplace=True.
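All of the above in one sketch (the sample frame is mine):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 3],
                   'B': [4, 5, np.nan],
                   'C': [7, 8, 9]})

drop_rows = df.dropna()        # axis=0 is the default: drop rows containing NaN
drop_cols = df.dropna(axis=1)  # drop columns containing NaN
filled = df.fillna(10)         # replace every NaN with 10
a_filled = df['A'].fillna(df['A'].mean())  # fill only column A with its mean
```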

GroupBy

GroupBy allows you to group rows together based on a column so that you can perform aggregate functions (such as sum, mean, median, standard deviation, etc.) on them.

Using the .groupby() method, we can group rows based on the 'Company' column and call the aggregate function .mean() on it:

As you see, our result is the mean of each company.
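A sketch with some made-up sales data (the company names and figures are invented):

```python
import pandas as pd

df = pd.DataFrame({'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT'],
                   'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa'],
                   'Sales': [200, 120, 340, 124]})

by_company = df.groupby('Company')
mean_sales = by_company['Sales'].mean()    # one mean per company
sales_count = by_company['Sales'].count()  # how many rows per company
```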

Count

Using the count() method, we can get the number of times an item occurs in a DataFrame.

Describe

The .describe() method is used to get an overview of what a DataFrame looks like. It gives us summary statistics for each numeric column of the DataFrame.

If you don’t like that layout, you can use the .transpose() method to get a different one.
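For example (on a single made-up numeric column):

```python
import pandas as pd

df = pd.DataFrame({'Sales': [200, 120, 340, 124]})

summary = df.describe()              # count, mean, std, min, quartiles, max per column
flipped = df.describe().transpose()  # same numbers with rows and columns swapped
```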

Concatenating, Merging, and Joining DataFrames

Concatenation

Concatenation basically glues DataFrames together. When concatenating DataFrames, keep in mind that dimensions should match along the axis you are concatenating on. Having, a list of DataFrames:

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                   index=[4, 5, 6, 7])

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']},
                   index=[8, 9, 10, 11])

We can concatenate them using the pd.concat() method:

Since we did not specify the axis on which we want the concatenation to occur, Pandas uses the row axis by default and joins the rows together. If we want the concatenation to occur along the columns, we have to set axis=1:

Notice here that we have a bunch of missing values. This is because the DataFrames did not have values for all the indexes we concatenated on. Hence, when working with pd.concat(), make sure your data lines up correctly along the axis you are concatenating on.
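A scaled-down sketch of both cases (two 2-row frames instead of the three above):

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[2, 3])

rows = pd.concat([df1, df2])          # default axis=0: stack the rows
cols = pd.concat([df1, df2], axis=1)  # axis=1: glue the columns side by side
# df1 has no rows 2-3 and df2 has no rows 0-1, so 'cols' is padded with NaN
```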

Merging

Using the pd.merge() function, Pandas allows you to merge DataFrames together the same way you merge SQL tables. The syntax for merging DataFrames is pd.merge(left, right, how='inner', on='Key'). left signifies the DataFrame that should appear on the left side, right signifies the DataFrame that should appear on the right side, and how='inner' signifies whether the merge should be handled as an intersection (inner) or a union (outer). Lastly, on='Key' signifies the key column on which the merge occurs. This key column has to exist in all the DataFrames before the merge can happen.

Having two DataFrames which both have a key column, we can merge them like this:

We can also merge on multiple keys by simply passing the keys:
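A sketch of a single-key merge (the key values are invented):

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K3'], 'B': ['B0', 'B1', 'B3']})

inner = pd.merge(left, right, how='inner', on='key')  # intersection of the keys
outer = pd.merge(left, right, how='outer', on='key')  # union of the keys
```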

Joining

Joining is a more convenient method for combining the columns of two potentially differently-indexed DataFrames into a single DataFrame. Joining is similar to merging, but differs in that the join happens on the index instead of on a column.
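A sketch of the index-based behaviour (labels are my own):

```python
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1']}, index=['K0', 'K1'])
right = pd.DataFrame({'B': ['B0', 'B2']}, index=['K0', 'K2'])

joined = left.join(right)                     # left join on the index by default
outer_joined = left.join(right, how='outer')  # keep every index label
```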

Finding unique values in a DataFrame

Unique values are simply the distinct values in a DataFrame. There are three main methods for finding unique values in a DataFrame.

Using .unique(), we can find all the unique values in col2 of the DataFrame below:

Instead of displaying the unique values, we can choose to only count the number of unique values in the DataFrame using .nunique():

Lastly, we could decide to only return the number of times each unique value shows up in a column using .value_counts():
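All three methods on one small column (the values are my own):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4],
                   'col2': [444, 555, 666, 444],
                   'col3': ['abc', 'def', 'ghi', 'xyz']})

distinct = df['col2'].unique()           # array of the distinct values
how_many = df['col2'].nunique()          # just the number of distinct values
frequencies = df['col2'].value_counts()  # each value with its frequency
```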

The apply() Method

The .apply() method is used to call custom functions on a DataFrame. Imagine we have a function:

In the example above, we broadcast the function to each element in the column. We can also apply built-in functions to DataFrames. Let’s say we want to get the length of the strings in col3:

Sometimes, you may define a function which you end up using only once. Instead of defining such a function over several lines of code, you can simply use a lambda expression, which is a shorthand version of the function. For example, we could rewrite the square function above as a lambda expression:
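The three flavours together — custom function, built-in, lambda — on data of my own:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col3': ['abc', 'de', 'f', 'ghij']})

def times2(x):
    return x * 2

doubled = df['col1'].apply(times2)            # broadcast a custom function
lengths = df['col3'].apply(len)               # built-in functions work too
squared = df['col1'].apply(lambda x: x ** 2)  # one-off lambda version
```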

Getting DataFrame Attributes

Attributes are the column and index names of a DataFrame. Say we are not so sure whether the column names in our DataFrame contain space characters; we could simply get their attributes:

Sorting and Ordering DataFrame

Imagine we wanted to display the DataFrame with a certain column in ascending order; we could easily sort it using .sort_values():

As you can see, the values in col2 are displayed from lowest to highest. Also note how the index stays attached to each row, so that no information is lost.
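For instance (with made-up values):

```python
import pandas as pd

df = pd.DataFrame({'col1': [2, 1, 2, 1], 'col2': [444, 555, 666, 444]})

ordered = df.sort_values(by='col2')  # ascending by default
# The original index labels travel with their rows
```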

Finding Null Values

Say you have a large dataset; Pandas makes it very easy to locate null values using .isnull():
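A quick sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [np.nan, 5, 6]})

mask = df.isnull()                   # True wherever a value is missing
missing_per_col = df.isnull().sum()  # count of nulls in each column
```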

Pivot Table

You may already be familiar with pivot tables in Excel. A pivot table is a table that summarizes the data in another table. It enables you to automatically group, slice, filter, sort, count, total, or average the data stored in one table. Having a DataFrame:

We can create a pivot table from it using the syntax df.pivot_table(values='', index=[''], columns=['']), where values represents the column from which the data points should be taken, index is the column(s) you’d like to group your data by, and columns represents the column(s) by which the resulting DataFrame’s columns should be defined. You can read more on pivot tables in the Pandas documentation.

In the example above, we got some NaN values because there were no values that matched up for those specific points.
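A sketch reproducing that situation (the data is invented; the default aggregation is the mean):

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
                   'B': ['one', 'one', 'two', 'two', 'one', 'one'],
                   'C': ['x', 'y', 'x', 'y', 'x', 'y'],
                   'D': [1, 3, 2, 5, 4, 1]})

table = df.pivot_table(values='D', index=['A', 'B'], columns=['C'])
# (A, B) pairs that never occur with a given C value come out as NaN
```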

Data Input and Output

Using the pd.read_ methods, Pandas allows you to access data from a wide variety of sources such as an Excel sheet, a CSV file, a SQL database, or HTML.

To reference any of these files, you have to make sure it is in the same directory as your Jupyter notebook.

Reading a CSV file

Using pd.read_csv() we can output the content of a .csv file as a DataFrame like so:

Writing to a CSV file

We can create a DataFrame and store it in a .csv file using .to_csv() like so:

To confirm that the data was saved, go ahead and read the csv file you just created using pd.read_csv('New_dataframe').
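A round trip might look like this (the file name is my own; index=False keeps the row index from being written out as an extra column):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

df.to_csv('New_dataframe.csv', index=False)  # write to disk
loaded = pd.read_csv('New_dataframe.csv')    # read it back
```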

Reading from an Excel file

Excel files are good sources of data. Using pd.read_excel(), we can import data from an Excel file. Keep in mind that Pandas can only import the data in an Excel file; it cannot import macros, images, or formulas. Having macros, images, or formulas in your Excel file can cause the pd.read_excel() method to crash.

Say we have an Excel file 'excel_output.xlsx'; let’s go ahead and output its contents:

Keep in mind that every Excel file contains one or more sheets. In our example, Sheet1 refers to the particular sheet in 'excel_output.xlsx' we want to read from.

Writing to an Excel file

Like we did with the CSV, we can create a DataFrame and store it in a .xlsx file using .to_excel() like so:

Reading from an HTML

To read data from an HTML page, you need to install html5lib, lxml, and BeautifulSoup4. In your terminal, run:

conda install lxml
conda install html5lib
conda install beautifulsoup4

Using .read_html(), we can tell Pandas to read the tables off an HTML page. Since a page can contain several tables, we can access the specific table we want using index numbers [0, ..., len(tables) - 1]. For this example, we will return the second table on the page.

We have reached the end of this tutorial, Congratulations!

If you completed this lesson, you should have a good knowledge of what Pandas is and how to use it for data preparation. All you need to do now is practice, practice, practice!

Got questions, got stuck, or just want to say hi? Kindly use the comment box. If this tutorial was helpful to you in some way, show me some 👏.
