Practical Python Pandas Tricks – Part 2: Data Preview and Subsetting
Useful Pandas functions and tricks to preview and subset a DataFrame

This article is the 2nd part of a series of Pandas tricks. Please stay tuned for more articles on this topic.
Part 1: Import and Create DataFrame
Part 2: Data Preview and Subsetting
Introduction
At the beginning of any data science project, we want to get familiar with the data as quickly as possible. Reviewing the first n rows and computing basic information such as column names, data types, distributions, and statistical summaries helps us understand the data. Once we have a preliminary understanding of the data, one of the most common data cleaning steps is subsetting.
In this article, I'm going to use "supermarket_sales – Sheet1.csv" (Download Link) as the data source to introduce some Pandas functions that allow us to do quick data exploration and subsetting.
Prerequisite
# Install Pandas library
!pip install pandas
# Import pandas library
import pandas as pd
# Import data source
df = pd.read_csv('supermarket_sales - Sheet1.csv')
Quick Data Preview
head & tail: Instead of reviewing the whole dataset, it is easy for us to digest the information with just a few rows of records. These two functions return the first n and last n rows of dataframe respectively.
df.head(5)
Out[65]:
Invoice ID Branch City ... gross margin percentage gross income Rating
0 750-67-8428 A Yangon ... 4.761905 26.1415 9.1
1 226-31-3081 C Naypyitaw ... 4.761905 3.8200 9.6
2 631-41-3108 A Yangon ... 4.761905 16.2155 7.4
3 123-19-1176 A Yangon ... 4.761905 23.2880 8.4
4 373-73-7910 A Yangon ... 4.761905 30.2085 5.3
[5 rows x 17 columns]
df.tail(5)
Out[66]:
Invoice ID Branch ... gross income Rating
995 233-67-5758 C ... 2.0175 6.2
996 303-96-2227 B ... 48.6900 4.4
997 727-02-1313 A ... 1.5920 7.7
998 347-56-2442 A ... 3.2910 4.1
999 849-09-3807 A ... 30.9190 6.6
[5 rows x 17 columns]
Trick 1: When we want to preview a huge CSV file (>10GB), we might not want to wait a long time to import the whole dataset. Sometimes a quick snippet of the data is enough to determine what information the file contains and how it is structured. In that case, we can use the skiprows and nrows options in read_csv to specify the number of lines to skip and to read, which can reduce importing time from hours to a few seconds.
skiprows = 0
nrows = 100
preview_df = pd.read_csv('supermarket_sales - Sheet1.csv', skiprows=skiprows, nrows=nrows, header=0, engine='python')
columns & info: Knowing the column name and data type are important for initial review. columns
and inf
functions could output column names, the number of non-null values and data type of each column.
df.columns
Index(['Invoice ID', 'Branch', 'City', 'Customer type', 'Gender',
'Product line', 'Unit price', 'Quantity', 'Tax 5%', 'Total', 'Date',
'Time', 'Payment', 'cogs', 'gross margin percentage', 'gross income',
'Rating'],
dtype='object')
df.info()
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Invoice ID 1000 non-null object
1 Branch 1000 non-null object
2 City 1000 non-null object
3 Customer type 1000 non-null object
4 Gender 1000 non-null object
5 Product line 1000 non-null object
6 Unit price 1000 non-null float64
7 Quantity 1000 non-null int64
8 Tax 5% 1000 non-null float64
9 Total 1000 non-null float64
10 Date 1000 non-null object
11 Time 1000 non-null object
12 Payment 1000 non-null object
13 cogs 1000 non-null float64
14 gross margin percentage 1000 non-null float64
15 gross income 1000 non-null float64
16 Rating 1000 non-null float64
dtypes: float64(7), int64(1), object(9)
memory usage: 132.9+ KB
Trick 2: You might notice the object data type in the above output. An object column can hold multiple different types, such as integers, floats, and strings, which are labeled as object collectively. To inspect the data type of a given column one step further, we can use type to output more details.
In the following example, the "Salary" column includes both integer and string values. By using type, we're able to produce a frequency count and review a subset based on the data type.

data = {'Salary': [30000, 40000, 50000, 85000, '75,000'],
        'Exp': [1, 3, 5, 10, 25],
        'Gender': ['M', 'F', 'M', 'F', 'M']}
df_salary = pd.DataFrame(data)
df_salary.info()
Out[74]:
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Salary 5 non-null object
1 Exp 5 non-null int64
2 Gender 5 non-null object
df_salary.apply(lambda x: type(x['Salary']), axis = 1).value_counts()
Out[75]:
<class 'int'> 4
<class 'str'> 1
dtype: int64
df_salary[df_salary.apply(lambda x: type(x['Salary']), axis = 1)== str]
Out[80]:
Salary Exp Gender
4 75,000 25 M
Trick 3: I've mentioned value_counts above. It outputs counts of unique values for a given column. Sometimes we need to produce a cross tabulation of two columns. crosstab is a handy function for that purpose. By default, it produces a frequency table from any two columns in the dataframe. You can also pass an aggregate function, such as sum. In the following code, we compute the total "Quantity" by "Product line" and "City".
df['City'].value_counts()
Out[19]:
Yangon 340
Mandalay 332
Naypyitaw 328
Name: City, dtype: int64
pd.crosstab(df['Product line'], df['City'])
Out[22]:
City Mandalay Naypyitaw Yangon
Product line
Electronic accessories 55 55 60
Fashion accessories 62 65 51
Food and beverages 50 66 58
Health and beauty 53 52 47
Home and lifestyle 50 45 65
Sports and travel 62 45 59
pd.crosstab(df['Product line'], df['City'], values = df['Quantity'], aggfunc= 'sum')
Out[98]:
City Mandalay Naypyitaw Yangon
Product line
Electronic accessories 316 333 322
Fashion accessories 297 342 263
Food and beverages 270 369 313
Health and beauty 320 277 257
Home and lifestyle 295 245 371
Sports and travel 322 265 333
describe: This function outputs a descriptive statistical summary that includes the number of observations, mean, standard deviation, min, max, and percentiles.
df.describe()
Out[18]:
Unit price Quantity ... gross income Rating
count 1000.000000 1000.000000 ... 1000.000000 1000.00000
mean 55.672130 5.510000 ... 15.379369 6.97270
std 26.494628 2.923431 ... 11.708825 1.71858
min 10.080000 1.000000 ... 0.508500 4.00000
25% 32.875000 3.000000 ... 5.924875 5.50000
50% 55.230000 5.000000 ... 12.088000 7.00000
75% 77.935000 8.000000 ... 22.445250 8.50000
max 99.960000 10.000000 ... 49.650000 10.00000
[8 rows x 8 columns]
Trick 4: describe is helpful, but pandas_profiling.ProfileReport is even better. The pandas_profiling library allows us to generate a comprehensive data exploration report. Similar to describe, ProfileReport produces basic information and a descriptive statistical summary for each column.
# Install the library
!pip install pandas-profiling
# Import the library
import pandas_profiling
profile = pandas_profiling.ProfileReport(df, title = "Data Exploration")
profile.to_file(output_file='Data Profile Output.html')
It also produces a quick histogram for each column and allows us to create a scatter plot of any two numeric columns. Moreover, it includes several kinds of correlations between columns and counts of missing values. All this useful information can be generated with just a few lines of code.
Another advantage of using this function is that we can save the report in HTML format and send it to someone else to review.
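If you are working in a Jupyter notebook, the report can also be rendered inline instead of being written to a file. A minimal sketch, assuming a pandas-profiling 2.x+ install where to_notebook_iframe is available:
# Render the profiling report inline in a Jupyter notebook
# (assumes pandas-profiling 2.x+; otherwise open the saved HTML file)
profile.to_notebook_iframe()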

Subset DataFrame
For this section, let's create another sample dataframe. This dataframe has an index of ['r1', 'r2', 'r3'] and columns ['A', 'B', 'C'].
# Create a dataframe
df2 = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]], index=['r1', 'r2', 'r3'], columns=['A', 'B', 'C'])
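For reference, printing the new dataframe shows the index labels alongside the columns:
df2
     A   B   C
r1   0   2   3
r2   0   4   1
r3  10  20  30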

Row Selection Using Bracket
df[Beginning Row : Ending Row]: This gives a subset of consecutive rows of data. For example, df2[0:3] returns the first through the third rows of the dataframe.
Trick 5: df2[0] doesn't extract the first row. Instead, it looks for a column named 0, which doesn't exist among our columns ['A', 'B', 'C'], so it raises a KeyError. To get the first row, use df2[0:1].
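A quick check with df2 confirms the slicing behavior:
df2[0:1]
    A  B  C
r1  0  2  3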
df[condition(boolean)]: This creates a subset of the dataframe based on a specified condition. For example, df2[df2['C'] > 10] produces the rows where column "C" is greater than 10, and df2[(df2['C'] >= 3) & (df2['A'] > 3)] produces the rows where column "C" is greater than or equal to 3 and column "A" is greater than 3. Note that each comparison needs its own parentheses, because & binds more tightly than the comparison operators.
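Both conditions happen to select the same single row of df2:
df2[df2['C'] > 10]
     A   B   C
r3  10  20  30
df2[(df2['C'] >= 3) & (df2['A'] > 3)]
     A   B   C
r3  10  20  30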
Trick 6: When specifying more than one condition, we can also use apply inside the boolean condition for more flexibility. For example, axis = 1 allows us to implement the logic in a row-wise fashion, and we can use bool together with a regular expression to output True/False. Note that re must be imported for the regular expression, and each comparison needs its own parentheses.
import re
df2[df2.apply(lambda x: (x['C'] >= 3) and (x['A'] > 3), axis = 1)]
df[df.apply(lambda x: bool(re.findall('beauty', x['Product line'])) and (x['Gender'] == 'Female'), axis = 1)]
Trick 7: We can use query to specify a boolean expression. One advantage is that we don't need to write the dataframe name explicitly multiple times.
# df[df['Gender']=='Female']
df.query('Gender == "Female"')
# df[df['Quantity']> 7]
df.query('Quantity > 7')
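query also handles compound conditions cleanly, using and/or instead of &/|. For example, the following is equivalent to df[(df['Quantity'] > 7) & (df['Gender'] == 'Female')]:
df.query('Quantity > 7 and Gender == "Female"')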
Row Selection Using iloc
df.iloc[Selected Row(s), :]: iloc allows us to extract a subset based on row numbers. For example, df2.iloc[0:2, :] produces the first two rows, and df2.iloc[[0, 2], :] produces a subset that contains the first and third rows. The : inside the square brackets means that we keep all the columns.
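For example, selecting the first and third rows of df2 by position:
df2.iloc[[0, 2], :]
     A   B   C
r1   0   2   3
r3  10  20  30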
Row Selection Using loc
df.loc[Selected Index, :]: loc allows us to extract a subset based on the index labels. For example, df2.loc['r1', :] produces the row with index 'r1', and df2.loc[['r1', 'r2'], :] produces a subset with indexes 'r1' and 'r2'.
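For example, selecting two rows of df2 by their index labels:
df2.loc[['r1', 'r2'], :]
    A  B  C
r1  0  2  3
r2  0  4  1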
Column Selection Using Double Bracket
df[[Selected Columns]]: With double brackets, we can select one or more columns of a dataframe. For example, df[['Unit price', 'Quantity']] extracts the 'Unit price' and 'Quantity' columns from the dataframe.
Trick 8: Single Bracket vs Double Bracket. We can use a single bracket to extract a single column, but the output is stored as a Series, whereas double brackets give us a DataFrame.
test1 = df[['Quantity']]
type(test1)
Out[176]: pandas.core.frame.DataFrame
test2 = df['Quantity']
type(test2)
Out[178]: pandas.core.series.Series
df.iloc[:, Selected Column Number] vs df.loc[:, Selected Columns]: Both iloc and loc can extract a subset of columns from a dataframe. The difference is that iloc selects by column number, whereas loc uses the actual column name. For example, df2.iloc[:, 0] extracts the first column without mentioning its name, and df2.loc[:, 'A'] extracts column "A" from the dataframe.
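Both expressions return the same column of df2, as a Series:
df2.iloc[:, 0]
r1     0
r2     0
r3    10
Name: A, dtype: int64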
Thank you for reading!!!