Practical Python Pandas Tricks – Part 2: Data Preview and Subsetting

Useful Pandas functions and tricks to preview and subset a DataFrame

Photo by Jay Mantri on Unsplash

This article is the second part of a series of Pandas tricks. Please stay tuned for future articles on this topic.

Part 1: Import and Create DataFrame

Part 2: Data Preview and Subsetting

Part 3: Data Wrangling

Introduction

At the beginning of any data science project, we want to get familiar with the data as quickly as possible. Reviewing the first n rows and computing basic information such as column names, data types, distributions, and statistical summaries helps us understand the data. Once we have a preliminary understanding of the data, one of the most common data cleaning steps is subsetting.

In this article, I’m going to use "supermarket_sales – Sheet1.csv" (Download Link) as the data source to introduce some Pandas functions that allow us to do quick data exploration and subsetting.

Prerequisite

# Install Pandas library
!pip install pandas
# Import pandas library
import pandas as pd
# Import data source
df = pd.read_csv('supermarket_sales - Sheet1.csv')

Quick Data Preview

head & tail: Instead of reviewing the whole dataset, it is easier for us to digest the information from just a few rows of records. These two functions return the first n and last n rows of the DataFrame, respectively.

df.head(5)
Out[65]: 
    Invoice ID Branch       City  ... gross margin percentage gross income Rating
0  750-67-8428      A     Yangon  ...                4.761905      26.1415    9.1
1  226-31-3081      C  Naypyitaw  ...                4.761905       3.8200    9.6
2  631-41-3108      A     Yangon  ...                4.761905      16.2155    7.4
3  123-19-1176      A     Yangon  ...                4.761905      23.2880    8.4
4  373-73-7910      A     Yangon  ...                4.761905      30.2085    5.3
[5 rows x 17 columns]
df.tail(5)
Out[66]: 
      Invoice ID Branch  ... gross income Rating
995  233-67-5758      C  ...       2.0175    6.2
996  303-96-2227      B  ...      48.6900    4.4
997  727-02-1313      A  ...       1.5920    7.7
998  347-56-2442      A  ...       3.2910    4.1
999  849-09-3807      A  ...      30.9190    6.6
[5 rows x 17 columns]

Trick 1: When we want to preview a huge CSV file (>10GB), we might not want to wait a long time to import the whole dataset. Sometimes a quick snippet of the data is enough to determine what information the file contains and how it is structured. In that case, we can use the skiprows and nrows options in read_csv to specify the number of lines to skip and to read, which can reduce the import time from hours to a few seconds.

skiprows = 0
lenrow = 100
preview_df = pd.read_csv('supermarket_sales - Sheet1.csv', skiprows=skiprows, nrows=lenrow, header=0, engine='python')

columns & info: Knowing the column names and data types is important for an initial review. The columns and info functions output the column names, the number of non-null values, and the data type of each column.

df.columns

Index(['Invoice ID', 'Branch', 'City', 'Customer type', 'Gender',
       'Product line', 'Unit price', 'Quantity', 'Tax 5%', 'Total', 'Date',
       'Time', 'Payment', 'cogs', 'gross margin percentage', 'gross income',
       'Rating'],
      dtype='object')
df.info()
Data columns (total 17 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Invoice ID               1000 non-null   object 
 1   Branch                   1000 non-null   object 
 2   City                     1000 non-null   object 
 3   Customer type            1000 non-null   object 
 4   Gender                   1000 non-null   object 
 5   Product line             1000 non-null   object 
 6   Unit price               1000 non-null   float64
 7   Quantity                 1000 non-null   int64  
 8   Tax 5%                   1000 non-null   float64
 9   Total                    1000 non-null   float64
 10  Date                     1000 non-null   object 
 11  Time                     1000 non-null   object 
 12  Payment                  1000 non-null   object 
 13  cogs                     1000 non-null   float64
 14  gross margin percentage  1000 non-null   float64
 15  gross income             1000 non-null   float64
 16  Rating                   1000 non-null   float64
dtypes: float64(7), int64(1), object(9)
memory usage: 132.9+ KB

Trick 2: You might notice the object data type in the above output. An object column can hold values of multiple different types, such as integers, floats and strings, which are collectively labeled as object. To inspect the data type of a given column one step further, we can use type to output more details.

In the following example, the "Salary" column includes both integer and string values. By using type, we’re able to produce a frequency count and review a subset based on the data type.

data = {'Salary': [30000, 40000, 50000, 85000, '75,000'],
        'Exp': [1, 3, 5, 10, 25],
        'Gender': ['M', 'F', 'M', 'F', 'M']}
df_salary = pd.DataFrame(data)
df_salary.info()
Out[74]:
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Salary  5 non-null      object
 1   Exp     5 non-null      int64 
 2   Gender  5 non-null      object
df_salary.apply(lambda x: type(x['Salary']), axis = 1).value_counts()
Out[75]:
<class 'int'>    4
<class 'str'>    1
dtype: int64
df_salary[df_salary.apply(lambda x: type(x['Salary']), axis = 1)== str]
Out[80]: 
   Salary  Exp Gender
4  75,000   25      M

Trick 3: I’ve mentioned value_counts above. It outputs counts of unique values for a given column. Sometimes, we need to produce a cross tabulation of two columns. crosstab is a handy function for that purpose. By default, it produces a frequency table of any two columns in the DataFrame. You can also pass an aggregate function, such as sum. In the following code, we compute the total "Quantity" by "Product line" and "City".

df['City'].value_counts()
Out[19]: 
Yangon       340
Mandalay     332
Naypyitaw    328
Name: City, dtype: int64
pd.crosstab(df['Product line'], df['City'])
Out[22]: 
City                    Mandalay  Naypyitaw  Yangon
Product line                                       
Electronic accessories        55         55      60
Fashion accessories           62         65      51
Food and beverages            50         66      58
Health and beauty             53         52      47
Home and lifestyle            50         45      65
Sports and travel             62         45      59
pd.crosstab(df['Product line'], df['City'], values = df['Quantity'], aggfunc= 'sum')
Out[98]: 
City                    Mandalay  Naypyitaw  Yangon
Product line                                       
Electronic accessories       316        333     322
Fashion accessories          297        342     263
Food and beverages           270        369     313
Health and beauty            320        277     257
Home and lifestyle           295        245     371
Sports and travel            322        265     333

describe: This function outputs a descriptive statistical summary that includes the number of observations, mean, standard deviation, min, max and percentiles.

df.describe()
Out[18]: 
        Unit price     Quantity  ...  gross income      Rating
count  1000.000000  1000.000000  ...   1000.000000  1000.00000
mean     55.672130     5.510000  ...     15.379369     6.97270
std      26.494628     2.923431  ...     11.708825     1.71858
min      10.080000     1.000000  ...      0.508500     4.00000
25%      32.875000     3.000000  ...      5.924875     5.50000
50%      55.230000     5.000000  ...     12.088000     7.00000
75%      77.935000     8.000000  ...     22.445250     8.50000
max      99.960000    10.000000  ...     49.650000    10.00000
[8 rows x 8 columns]

Trick 4: describe is helpful, but pandas_profiling.ProfileReport is even better. The pandas_profiling library allows us to generate a comprehensive data exploration report. Similar to describe, ProfileReport produces basic information and a descriptive statistical summary of each column.

# Install the library
!pip install pandas-profiling
# Import the library
import pandas_profiling
profile = pandas_profiling.ProfileReport(df, title = "Data Exploration")
profile.to_file(output_file='Data Profile Output.html')

It also produces a quick histogram for each column and allows us to create a scatter plot of any two numeric columns. Moreover, it includes different kinds of correlations between columns and counts of missing values. All of this information can be generated with just a few lines of code.

Another advantage of using this function is that we can save the report as an HTML file that we can send to someone else to review.


Subset DataFrame

For this section, let’s create another sample DataFrame. It has an index of ['r1', 'r2', 'r3'] and columns ['A', 'B', 'C'].

# Create a dataframe
df2 = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
                   index=['r1', 'r2', 'r3'], columns=['A', 'B', 'C'])

Row Selection Using Bracket

df[Beginning Row : Ending Row]: This gives a subset of consecutive rows of data. For example, df2[0:3] produces the first through third rows of the DataFrame.
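
As a quick illustration with the df2 created above:

# Slice rows by position: rows 1 through 3 ('r1', 'r2', 'r3')
df2[0:3]
# Slice the second and third rows only ('r2', 'r3')
df2[1:3]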

Trick 5: df2[0] doesn’t extract the first row. It looks for a column named 0, which doesn’t exist among our columns ['A', 'B', 'C'], so it raises a KeyError. To get the first row, use df2[0:1] instead.
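
A short illustration of the difference, using df2:

# df2[0]    # raises KeyError: 0 is treated as a column label, and no column is named 0
df2[0:1]    # slicing returns the first row ('r1') as a one-row DataFrame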

df[condition (boolean)]: This creates a subset of the DataFrame based on a specified condition. For example, df2[df2['C'] > 10] produces rows where column "C" is greater than 10, and df2[(df2['C'] >= 3) & (df2['A'] > 3)] produces rows where column "C" is greater than or equal to 3 and column "A" is greater than 3 (each condition needs its own parentheses when combined with &).
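
For instance, on df2:

# Rows where column 'C' is greater than 10 (only 'r3' qualifies)
df2[df2['C'] > 10]
# Rows where 'C' >= 3 and 'A' > 3 (again only 'r3')
df2[(df2['C'] >= 3) & (df2['A'] > 3)]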

Trick 6: When specifying more than one condition, we can also use apply inside the boolean condition for more flexibility. For example, axis = 1 allows us to implement the logic in a row-wise fashion, and we can combine bool with a regular expression to output True/False.

import re

df2[df2.apply(lambda x: (x['C'] >= 3) & (x['A'] > 3), axis=1)]
df[df.apply(lambda x: bool(re.findall('beauty', x['Product line'])) & (x['Gender'] == 'Female'), axis=1)]

Trick 7: We can use query to specify a boolean expression. One advantage is that we don’t need to write the DataFrame name multiple times.

# df[df['Gender']=='Female']
df.query('Gender == "Female"') 
# df[df['Quantity']> 7]
df.query('Quantity > 7')
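
That advantage is clearer with compound conditions; a small sketch on the supermarket data:

# df[(df['Gender'] == 'Female') & (df['Quantity'] > 7)]
df.query('Gender == "Female" and Quantity > 7')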

Row Selection Using iloc

df.iloc[Selected Row(s), :]: iloc allows us to extract a subset based on row positions. For example, df.iloc[0:10, :] would produce the first 10 rows of df, and df.iloc[[0, 2, 5], :] would produce a subset that contains the first, third and sixth rows. The : inside the square brackets means that we keep all the columns.
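
For instance, on the supermarket DataFrame df:

# First 10 rows, all columns
df.iloc[0:10, :]
# First, third and sixth rows by position, all columns
df.iloc[[0, 2, 5], :]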

Row Selection Using loc

df.loc[Selected Index, :]: loc allows us to extract a subset based on index labels. For example, df2.loc['r1', :] would produce the row with index 'r1', and df2.loc[['r1', 'r2'], :] would produce a subset with indexes 'r1' and 'r2'.
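
For instance, with df2:

# Row with index label 'r1' (returned as a Series)
df2.loc['r1', :]
# Rows 'r1' and 'r2' (returned as a DataFrame)
df2.loc[['r1', 'r2'], :]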

Column Selection Using Double Bracket

df[[Selected Columns]]: With double brackets, we can select one or more columns of a DataFrame. For example, df[['Unit price', 'Quantity']] extracts the 'Unit price' and 'Quantity' columns.

Trick 8: Single Bracket vs Double Bracket. We can use a single bracket to extract a single column, but the output is stored as a Series, whereas double brackets give us a DataFrame.

test1 = df[['Quantity']]
type(test1)
Out[176]: pandas.core.frame.DataFrame
test2 = df['Quantity']
type(test2)
Out[178]: pandas.core.series.Series

df.iloc[:, Selected Column Number] vs df.loc[:, Selected Columns]: Both iloc and loc can extract a subset of columns from a DataFrame. The difference is that iloc is based on the column position, whereas loc uses the actual column name. For example, df2.iloc[:, 0] extracts the first column without mentioning the column name, and df2.loc[:, 'A'] extracts column "A" from the DataFrame.
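
A short sketch with df2, where both calls select the same column:

# First column by position (column 'A'), returned as a Series
df2.iloc[:, 0]
# Same column selected by name
df2.loc[:, 'A']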

Thank you for reading!!!

If you enjoy this article and would like to Buy Me a Coffee, please click here.

You can sign up for a membership to unlock full access to my articles, and have unlimited access to everything on Medium. Please subscribe if you’d like to get an email notification whenever I post a new article.
