
6 Pandas Tricks That You Might Not Know, But Should

Helpful for your daily tasks

Photo by Vlada Karpovich from Pexels

If you want to learn data science, Pandas is one of the most important libraries to pick up.

Pandas is the industry-standard library for organizing, cleaning, and manipulating data, and it is used by data scientists across the globe. It is powerful out of the box, but a few tricks can make your workflow more efficient.

In this article, we are going to discuss 6 such tricks that can be beneficial to you whether you are a beginner or an experienced programmer.


1. Adjusting rows and columns of a data frame

When inspecting an unprocessed dataset, we often run into display problems: too many rows or columns to show, cells of irregular width, and floating-point numbers printed with too many digits.

We can overcome these issues by setting display options: the maximum number of rows and columns shown, the maximum column width, and the precision of floating-point numbers.

After importing the Pandas library, you can set things up with the lines of code below.

import pandas as pd

pd.options.display.max_columns = 50    # show at most 50 columns
pd.options.display.max_rows = 200      # show at most 200 rows at a time
pd.options.display.max_colwidth = 100  # show at most 100 characters per cell
pd.options.display.precision = 3       # show floats with 3 decimal places
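
If you only want these settings for one part of a notebook, a temporary override may be more convenient than changing them globally. The sketch below uses pd.option_context for that; the small dataframe and the variable names are made up for illustration.

```python
import pandas as pd

# Illustrative data only; any dataframe with long floats would do.
df = pd.DataFrame({"x": [1.23456789, 2.3456789], "y": [3.456789, 4.56789]})

before = pd.get_option("display.precision")

# Options passed to option_context apply only inside the `with` block.
with pd.option_context("display.precision", 3, "display.max_rows", 200):
    inside = pd.get_option("display.precision")
    print(df)

# On exit, the previous values are restored automatically.
after = pd.get_option("display.precision")
```

Here inside is 3, while after is whatever the precision was before the block ran.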

2. Selecting specific columns/rows

Let’s say you want a few specific rows and columns from a large dataset. You can do that easily using df.iloc and df.loc in Pandas. df.iloc selects rows and columns by integer position, while df.loc selects them by label and can also perform Boolean selections based on a condition.

Have a look at the lines of code below.

df.iloc[4:7, 2:5]                                # rows at positions 4-6 and columns at positions 2-4 (counting from 0)
df.loc[:, 'column_x':]                           # all rows, every column from 'column_x' onward
df.loc[df['value'] < 100, ['name', 'location']]  # rows where 'value' < 100, columns 'name' and 'location' only
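
To try these selections yourself, here is a small self-contained sketch. The data and the column names ('name', 'location', 'value', 'score') are hypothetical and exist only to make the snippet runnable.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame(
    {
        "name": ["a", "b", "c", "d"],
        "location": ["x", "y", "z", "w"],
        "value": [50, 150, 75, 200],
        "score": [1.0, 2.0, 3.0, 4.0],
    }
)

# Position-based: rows at positions 1-2, columns at positions 0-1 (end is excluded).
by_position = df.iloc[1:3, 0:2]

# Label-based: every row, columns from 'value' through the last column.
from_column = df.loc[:, "value":]

# Boolean selection: rows where 'value' < 100, restricted to two columns.
filtered = df.loc[df["value"] < 100, ["name", "location"]]
```

Note that df.iloc slices exclude the end position, as in regular Python slicing, whereas label slices in df.loc include both endpoints.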

3. Sum of all rows and columns in a dataset

Let’s say you want a total for every row and a total for every column of a dataset. You can get both using the df.apply() method in Pandas with the help of a lambda function: with axis=1 the function is applied to each row, and with the default axis=0 it is applied to each column.

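df = pd.DataFrame(dict(A=[1,2,3], B=[4,5,6], C=[7,8,9]))
# axis=1: sum across the columns of each row
df['column total'] = df.apply(lambda x: x.sum(), axis=1)
# default axis=0: sum down each column, including 'column total'
df.loc['row total'] = df.apply(lambda x: x.sum())
df.head()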

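The output is the original dataframe with an extra 'column total' column (12, 15, 18) and an extra 'row total' row (6, 15, 24, and 45 for the grand total).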
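
The same totals can also be computed without apply and a lambda. The sketch below is an alternative to the code above, not a replacement for it; it relies on the built-in DataFrame.sum method.

```python
import pandas as pd

df = pd.DataFrame(dict(A=[1, 2, 3], B=[4, 5, 6], C=[7, 8, 9]))

# axis=1 sums across the columns, giving one total per row.
df["column total"] = df.sum(axis=1)

# The default axis=0 sums down each column, including the new 'column total'.
df.loc["row total"] = df.sum()
```

For a plain sum this tends to be shorter and faster than apply, which is more useful when the function is something Pandas does not already provide.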

4. Masking dataframes with a condition

Let’s say you want to keep only the rows of a dataset that meet a specific requirement and hide the rest. You can do that easily in Pandas by first setting a condition:

con = df['A'] < 2  # Boolean Series: True where A is less than 2

Then index the dataframe with that condition:

df = pd.DataFrame(dict(A=[1,2,3], B=[4,5,6], C=[7,8,9]))
con = df['A'] < 2  # Boolean Series: True where A is less than 2
df2 = df[con]      # keep only the rows where con is True
df2.head()

Output: only the first row has A < 2, so df2 keeps row 0 (A=1, B=4, C=7) and drops the other two.

Before vs. after masking
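
Indexing with a condition removes the rows that fail it. If you would rather keep the shape of the dataframe and blank out the failing rows, df.where is one option. The sketch below compares the two on the same data; the variable names filtered and masked are mine.

```python
import pandas as pd

df = pd.DataFrame(dict(A=[1, 2, 3], B=[4, 5, 6], C=[7, 8, 9]))
con = df["A"] < 2  # True only for the first row

# Boolean indexing: rows where con is False are dropped.
filtered = df[con]

# where: every row is kept, and rows where con is False are filled with NaN.
masked = df.where(con)
```

Here filtered has one row, while masked still has three, with the last two filled with NaN.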

5. "Explode" a dataframe

Suppose a column of your Pandas dataframe holds a list (or another list-like value such as a tuple) in some of its cells, like column C in the following:

df = pd.DataFrame(dict(A=[1,2,3], B=[4,5,6], C=[[7,8],9,10]))
df.head()

To flatten the list we can use the df.explode() method, which takes the column name as its argument and returns a new dataframe where each list element gets its own row and the values of the other columns are repeated:

df.explode('C')
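
One detail worth knowing is that the exploded rows keep the index label of the row they came from. The sketch below shows this, along with the ignore_index parameter (available in pandas 1.1 and later) that renumbers the rows instead.

```python
import pandas as pd

df = pd.DataFrame(dict(A=[1, 2, 3], B=[4, 5, 6], C=[[7, 8], 9, 10]))

# Each element of the list in row 0 becomes its own row;
# both new rows keep the original index label 0.
exploded = df.explode("C")

# ignore_index=True replaces the repeated labels with 0, 1, 2, 3.
renumbered = df.explode("C", ignore_index=True)

print(exploded)
print(renumbered)
```

In both results column C holds 7, 8, 9, 10 and column A holds 1, 1, 2, 3.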

6. Setting all the values of a column to a particular datatype

While working with unprocessed data, you can run into a column that held more than one type of value, which leaves it with the generic object datatype.

Using df.infer_objects() we can give such a column a more specific type. The method inspects each object column and, taking an educated guess, assigns a better datatype when all of the values in the column fit it:

df = pd.DataFrame({"A": ["a", 1, 2, 3]})
df = df.iloc[1:]  # drop the first row, which holds the string 'a'
print(df.head())
df.dtypes         # column A is still reported as object

Here, every remaining value in the column is an integer once the first row is sliced off, but the output still shows the object datatype because the column was created with the character ‘a’ in it.

After applying df.infer_objects() and checking the dtypes again, column A is reported with an integer datatype (int64).
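
The snippet above stops before the conversion itself, so here is a minimal sketch of the full round trip. The variable names are mine, and the last part illustrates a caveat: as far as I can tell, infer_objects only converts a column when every value fits the new type.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", 1, 2, 3]})
df = df.iloc[1:]                # drop the row holding the string 'a'
before = df["A"].dtype          # still object

# infer_objects returns a new dataframe; df itself is left unchanged.
converted = df.infer_objects()
after = converted["A"].dtype    # now an integer dtype

# With the string still present, there is no better type to infer,
# so the column stays object.
mixed = pd.DataFrame({"A": ["a", 1, 2, 3]}).infer_objects()

print(before, after, mixed["A"].dtype)
```

If the stray string cannot simply be dropped, a forced conversion such as pd.to_numeric with errors='coerce' is the usual next step, at the cost of turning the offending values into NaN.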


Conclusion

These were a few Pandas tricks that can save you a lot of time while working with large and unprocessed datasets.

Pandas has many more functions and methods like these. You will get to know them as you play around with the library and read the documentation.

Stay tuned for more Python-based tips and tricks in my upcoming articles.

