
7 Pandas Functions You Might Not Know

These functions are useful but rarely used

Image by Author

The Pandas package is a staple data analysis tool for data scientists who use Python as their main programming language. It covers most of what a data scientist needs, and almost every course teaches it early on.

Even though Pandas is so common, it still contains many functions that people miss, either because they are rarely used or because people don’t know they exist. That is why, in this article, I want to outline some Pandas functions that you might have missed. Let’s get into it!


1. Holiday

Pandas makes it easy to analyze datetime data because a Series can store datetime objects. To make working with dates easier, Pandas provides classes for building custom holiday calendars. We still need to set up the calendar ourselves, but once it is defined we can retrieve every holiday date it contains (or use it to work out business days). Let’s create a custom holiday calendar.

# Import the classes for building a custom calendar, plus the predefined Easter rules
from pandas.tseries.holiday import AbstractHolidayCalendar, Holiday, EasterMonday, Easter
from pandas.tseries.offsets import Day

# Create the business calendar and list the holidays you consider important
class BusinessCalendar(AbstractHolidayCalendar):
    rules = [
        Holiday('New Years Day', month=1, day=1),
        EasterMonday,
        # Easter-based holidays start from a placeholder date and apply the Easter offset
        Holiday('Easter Day', month=1, day=1, offset=[Easter()]),
        Holiday('Ascension Day', month=1, day=1, offset=[Easter(), Day(39)]),
        Holiday('Christmas Day', month=12, day=25)
    ]

The code above creates a BusinessCalendar class and registers the holidays you consider important. If you are not sure when a holiday falls, Pandas also provides offset classes such as Easter to compute the Easter date for each year. Let’s see the holiday dates produced by the custom calendar.

import pandas as pd
from datetime import date
# Set the year we want the holiday dates for
year = 2021
start = date(year, 1, 1)
end = start + pd.offsets.MonthEnd(12)

# Getting the holidays
cal = BusinessCalendar()
cal.holidays(start=start, end=end)
Holiday date (Image by Author)

The dates above are the 2021 holidays defined by the custom calendar. You can change the year to see a different Easter date, since it moves every year. To see all the available Holiday classes, check the documentation.
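I mentioned business days earlier, so here is a minimal sketch of how a custom calendar can drive them. It assumes a smaller, hypothetical calendar called SmallCalendar (not the BusinessCalendar above) and passes it to CustomBusinessDay, which skips weekends and the calendar's holidays. The date range is just an illustration.

```python
import pandas as pd
from pandas.tseries.holiday import AbstractHolidayCalendar, Holiday
from pandas.tseries.offsets import CustomBusinessDay

# Hypothetical calendar with two fixed-date holidays, for illustration only
class SmallCalendar(AbstractHolidayCalendar):
    rules = [
        Holiday('New Years Day', month=1, day=1),
        Holiday('Christmas Day', month=12, day=25),
    ]

# A business-day offset that skips weekends and the calendar's holidays
bday = CustomBusinessDay(calendar=SmallCalendar())

# Business days around New Year 2021 (1 January 2021 was a Friday)
business_days = pd.date_range('2020-12-30', '2021-01-05', freq=bday)
print(business_days)
```

The New Year holiday and the weekend that follows it are left out of the result, so only the working days remain.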


2. Query

The query function lets you select data using an expression that reads close to plain English. It is designed to remove the hassle of writing selection conditions and is less wordy than the usual approach. Let’s try it with an example dataset.

import pandas as pd
import seaborn as sns
mpg = sns.load_dataset('mpg')
mpg.head()
Image by author

Usually, we would select data with boolean indexing, as in the code below.

mpg[(mpg['mpg'] > 15) | (mpg['model_year'] == 70)]

Using the query function, we can select the same data more easily and in a more readable way. Let me show you the example below.

mpg.query('mpg > 15 or model_year == 70')
Image by Author

You pass the condition to query as a string and get back the rows that satisfy it. The result is the same as the usual selection method, but the query condition is less wordy, and we can use English words such as "and" and "or" in it.
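One more thing worth knowing: a query string can refer to a Python variable by prefixing it with @. Here is a minimal sketch on a tiny made-up DataFrame (the values and the min_mpg variable are assumptions for illustration, not the mpg dataset), compared against the equivalent boolean indexing.

```python
import pandas as pd

# Small illustrative dataset; the values are made up
cars = pd.DataFrame({
    'mpg': [18.0, 15.0, 26.0, 32.0],
    'model_year': [70, 70, 72, 80],
})

# Hypothetical threshold kept in a regular Python variable
min_mpg = 20

# The @ prefix lets the query string read the local variable
by_query = cars.query('mpg > @min_mpg and model_year >= 72')

# The same selection written with boolean indexing
by_mask = cars[(cars['mpg'] > min_mpg) & (cars['model_year'] >= 72)]

print(by_query.equals(by_mask))
```

Both selections return the same rows, so which one to use is mostly a matter of readability.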


3. Mask

The mask function is best used on a Pandas Series object, where it replaces values with another value wherever a condition is true, much like an if-else rule. In simpler terms, you replace values according to the condition you set. Let’s use the mask function with the mpg dataset.

mpg['mpg'].mask(mpg['mpg'] < 20, 'Less than Twenty' )
Image by Author

With the mask method, we usually pass two parameters: the condition and the replacement value. In this case, the condition is that the mpg value is less than 20, and matching values are replaced with ‘Less than Twenty’.

If you need more than one condition, you can chain the method calls.

mpg['mpg'].mask(mpg['mpg'] < 20, 'Less than Twenty' ).mask(mpg['mpg'] > 30, 'More than Thirty')
Image by Author

I mentioned previously that the function is meant for the Series object. The method also exists on DataFrame, but if you pass it a condition built from a single column, it replaces every value in each matching row with your replacement value, which is not what we want.
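As a small aside, mask has a mirror-image method called where: mask replaces values where the condition is true, while where keeps values where the condition is true and replaces the rest. A minimal sketch on a made-up Series (the values are illustrative, not taken from the mpg dataset):

```python
import pandas as pd

# Illustrative values only
mpg_values = pd.Series([15.0, 25.0, 35.0], name='mpg')

# mask: replace the values that satisfy the condition
masked = mpg_values.mask(mpg_values < 20, 'Less than Twenty')

# where: keep the values that satisfy the (inverted) condition, replace the others
kept = mpg_values.where(mpg_values >= 20, 'Less than Twenty')

print(masked.tolist())
print(masked.equals(kept))
```

Which of the two to use mostly depends on whether the condition is easier to state for the values you replace or for the values you keep.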


4. Highlight

Working with a Pandas DataFrame doesn’t mean we cannot do anything about its appearance. In fact, you can style the object to create a visually interesting DataFrame. This is what the Style function is for: presenting data, better aesthetics, and more.

You could explore many functions within the Style function, but I will show the ones I use most often: the highlight functions.

#Highlight the Highest and the lowest values
mpg[['mpg', 'weight', 'acceleration']].head(10).style.highlight_max(color='yellow').highlight_min(color = 'lightblue')
Image by Author

Using the Style highlight_max and highlight_min functions, you can highlight the highest and lowest values in each column. This is useful when you are presenting and want to draw attention to your point.

If you want to shade each column with a gradient from the lowest to the highest value, you can use the following code.

mpg[['mpg', 'weight', 'acceleration']].head(10).style.background_gradient(cmap = 'Blues')
Image by Author

The background_gradient function produces a presentation-friendly view of the data that helps the audience spot the insight more easily.


5. Applymap

There are times when you want to apply a function that returns a value to every element in a DataFrame or Series. For a DataFrame, the applymap function does exactly that: it applies your function to every value. Let me show you an example.

# Convert each value to a string and return the length of that string
mpg.applymap(lambda x: len(str(x)))
Image by Author

The result is a DataFrame where the function we passed to applymap has been applied to each value in the dataset. The applymap function is specific to the DataFrame object. For a Series, the map function is the equivalent of the DataFrame applymap method.

mpg['name'].map(lambda x: len(str(x)))
Image by Author

In both applymap and map, you need to pass either a function you defined earlier or a lambda function.
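A note on versions: pandas 2.1 deprecated DataFrame.applymap in favor of an equivalent DataFrame.map method. Below is a minimal sketch that works on either side of that change by checking which method is available. The DataFrame and the text_length helper are made up for illustration.

```python
import pandas as pd

# Small illustrative dataset; the values are made up
df = pd.DataFrame({'name': ['ford pinto', 'bmw 2002'], 'cylinders': [4, 4]})

# Hypothetical helper: length of the string form of a value
def text_length(value):
    return len(str(value))

# Prefer DataFrame.map (pandas 2.1 and later), fall back to applymap on older versions
if hasattr(pd.DataFrame, 'map'):
    lengths = df.map(text_length)
else:
    lengths = df.applymap(text_length)

print(lengths)
```

The output has the same shape as the input, with each cell replaced by the length of its string form.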


6. Method Chaining

Method chaining means calling several methods one after another in a single expression to produce the result. We chain methods to reduce the number of lines we write and to avoid intermediate variables. Let me show you an example of method chaining in the code below.

#Method Chaining
mpg.head().describe()
Image by Author

As you can see in the code above, the methods are chained one after another to produce the result in the image above. But what if you want to chain your own functions? In this case, we can use the pipe function to keep the chain going. Let’s use a code example to get a better understanding. First, I will create two functions.

# Function to extract the car's first name into a new column called car_first_name
def extract_car_first_name(df):
    df['car_first_name'] = df['name'].str.split(' ').str.get(0)
    return df

# Function to append my_name to car_first_name in a new column called car_and_name
def add_car_my_name(df, my_name=''):
    df['car_and_name'] = df['car_first_name'] + my_name
    # Return the DataFrame so that further calls can be chained after pipe
    return df

These functions produce different results, but the second depends on the column created by the first, so they need to run in sequence. Let’s use the pipe function to chain the functions I just created.

mpg.pipe(extract_car_first_name).pipe(add_car_my_name, my_name = 'Cornellius')
mpg.head()
Image by Author

With the pipe function, we chained the functions we created and produced the result above. Why use pipe instead of calling the functions directly on the DataFrame? Because pipe keeps the steps in reading order, from left to right, instead of nesting one function call inside another; the result is the same either way.
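To make the comparison concrete, here is a minimal sketch showing that a pipe chain and nested function calls give the same result. The helper names (add_first_word, add_suffix) and the tiny DataFrame are hypothetical. Unlike the functions above, these helpers use assign, so they return a new DataFrame and leave the input untouched.

```python
import pandas as pd

# Hypothetical helper: store the first word of a text column in a new column
def add_first_word(df, source='name', target='car_first_name'):
    return df.assign(**{target: df[source].str.split(' ').str.get(0)})

# Hypothetical helper: append a suffix to a column and store it in a new column
def add_suffix(df, column, suffix):
    return df.assign(**{column + '_tagged': df[column] + suffix})

# Small illustrative dataset; the values are made up
cars = pd.DataFrame({'name': ['ford pinto', 'bmw 2002']})

# Left-to-right chain with pipe
piped = cars.pipe(add_first_word).pipe(add_suffix, column='car_first_name', suffix='_x')

# The same steps written as nested calls, read from the inside out
nested = add_suffix(add_first_word(cars), column='car_first_name', suffix='_x')

print(piped.equals(nested))
```

Extra arguments after the function name are passed straight through to the function, which is how the column and suffix reach add_suffix.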


7. Plotting

Did you know that the Pandas package lets you plot directly from a DataFrame or Series object, and that it even provides some interesting plotting functions? You might know the simple one, the plot function.

mpg['mpg'].plot(kind = 'hist')
Image by Author

However, did you know that Pandas also has slightly more advanced plotting functions? Let’s take a look at some of them.

  • Radviz Plot
mpg = sns.load_dataset('mpg')
pd.plotting.radviz(mpg.drop(['name'], axis =1), 'origin')
Image by Author

The Radviz plot projects multi-dimensional data into 2D space in a simple way. Basically, the function lets us visualize data with three or more dimensions in a two-dimensional plot.

  • Bootstrap_plot
pd.plotting.bootstrap_plot(mpg['mpg'],size = 50 , samples = 500)
Image by Author

The bootstrap plot estimates the uncertainty of basic statistics such as the mean and median by resampling the data with replacement (the same data point can be drawn multiple times). Take the image above as an example. The mean plot shows that most results are around 23, but they range from about 22.5 to 25. This suggests that the population mean could plausibly lie between 22.5 and 25.
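The resampling idea behind this plot can be sketched in a few lines. This is a simplified illustration of bootstrapping the mean, not the internal implementation of bootstrap_plot. The values, the seed, and the number of resamples are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Fixed seed so the sketch is repeatable
rng = np.random.default_rng(0)

# Small illustrative sample; the values are made up
values = pd.Series([18.0, 15.0, 26.0, 32.0, 21.0, 24.0, 29.0, 17.0, 23.0, 27.0])

# Draw 500 resamples with replacement and record the mean of each one
boot_means = np.array([
    rng.choice(values.to_numpy(), size=len(values), replace=True).mean()
    for _ in range(500)
])

# The middle 95% of the resampled means gives a rough uncertainty range
low, high = np.percentile(boot_means, [2.5, 97.5])
print(round(low, 2), round(high, 2))
```

The spread between low and high plays the same role as the width of the mean histogram in the plot above.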

  • Scatter_matrix
import matplotlib.pyplot as plt

pd.plotting.scatter_matrix(mpg, figsize = (12,12))
plt.show()
Image by Author

The scatter_matrix function produces scatter plots between the numerical columns. As you can see, the function automatically detects the numerical features in the DataFrame we pass and creates a matrix of scatter plots, with the diagonal showing the distribution of each single column.


Conclusion

Pandas is a package commonly used by data scientists for manipulating data. However, there are many functions within this package that many people do not know about.

In this article, I explained 7 functions that I feel are rarely used. They are:

  1. Holiday
  2. Query
  3. Mask
  4. Highlight
  5. Applymap
  6. Method Chaining
  7. Plotting

I hope it helps!



