
14 Pandas Operations That Every Data Scientist Must Know!

A complete guide to fourteen of the most essential Pandas operations that will help developers accomplish most Data Science projects with ease

Photo by Sid Balachandran on Unsplash

Understanding the data that is available to data scientists is the most crucial aspect of Data Science. Only with the correct knowledge of the data, and the right data for the task at hand, will you be able to achieve the best results. In Data Science, the analysis, visualization, and manipulation of data hold high significance.

With the help of Python, much of machine learning and data science is made far simpler. We can make use of some of the best libraries available in Python to accomplish almost any desired action with ease. One such library, which allows us to analyze and manipulate data while reducing complexity and speeding up computation, is Pandas.

The Pandas library is one of the best tools available in Python for data analysis procedures, and you can perform a wide array of tasks with it with ease. In this article, we will look at the different types of operations that every data scientist should make use of to accomplish a given project while utilizing the fewest resources and achieving the highest efficiency.

It is recommended that readers following along with this guide use a Jupyter Notebook to work through most of the coding sections mentioned in this article for the best experience possible. If you are unsure about the concept of Jupyter Notebooks, check out the link provided below to understand almost everything you need to know about them and how to utilize them effectively.

Everything You Need To Know About Jupyter Notebooks!

Let us start exploring each of these functionalities and operations that data scientists can use with the Pandas library accordingly in the upcoming sections.


1. Creating a data frame with pandas:

### Creating a dataframe
import pandas as pd
dataset = {'Fruits': ["Apple", "Mango", "Grapes", "Strawberry", "Oranges"], 'Supply': [30, 15, 10, 25, 20]}
# Create DataFrame
df = pd.DataFrame(dataset)

# Print the output.
df
Image By Author

With the help of the Pandas library, users can easily create new data frames. A data frame presents the data in a clean, tabular form for analysis and visualization, which allows developers to clearly understand the types and formats of the data they are working with.

Although data frames can be created in many ways, one of the simplest is shown in the code above. Here, we import the Pandas framework and define our dataset. The dataset variable contains a dictionary of all the elements, with the keys 'Fruits' and 'Supply'.

Finally, you can use the DataFrame() constructor available in the Pandas module to construct the data frame and store it in the desired variable. The same result can also be achieved with plain lists, as sketched after the link below. To learn more about dictionaries and how you can master them, I would recommend checking out one of my previous articles from the link mentioned below.

Mastering Dictionaries And Sets In Python!
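
As mentioned above, the same data frame can also be built from plain lists instead of a dictionary. Here is a minimal sketch; the variable names are chosen purely for illustration.

# Building the same data frame from two plain lists instead of a dictionary
import pandas as pd
fruits = ["Apple", "Mango", "Grapes", "Strawberry", "Oranges"]
supply = [30, 15, 10, 25, 20]
# zip the lists into rows and name the columns explicitly
df_from_lists = pd.DataFrame(list(zip(fruits, supply)), columns=['Fruits', 'Supply'])
df_from_lists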


2. Reading a CSV File:

# Importing the framework
import pandas as pd
# Reading a random csv file and displaying first five elements
data = pd.read_csv("name.csv")
data.head()

With the help of the Pandas library, it is possible to read data stored in different formats with ease. One of the most common formats in which data science information is stored is the .CSV format. With the help of the read_csv() function available in the Pandas framework, one can easily read all the details stored in this particular format; the head() call then displays the first five entries.

Apart from the Comma-Separated Values (CSV) format, the Pandas library also supports Excel (XLSX) files, zipped files, JSON files, plain text files, HTML tables, SQL query results, and Hierarchical Data Formats (HDF5). You can read all these formats with the Pandas library and manipulate them accordingly; a quick sketch of a few of these readers follows below.
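
As a quick illustration of a few of these readers, here is a minimal sketch. The file names are placeholders for your own files, and reading .xlsx files additionally requires an Excel engine such as openpyxl to be installed.

# Reading a few other common formats (the file names below are placeholders)
import pandas as pd
excel_data = pd.read_excel("name.xlsx")    # Excel workbook
json_data = pd.read_json("name.json")      # JSON file
html_tables = pd.read_html("page.html")    # returns a list of DataFrames, one per table found
zipped_data = pd.read_csv("archive.zip")   # a zipped CSV is decompressed automatically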


3. Read the top element chart:

# print the first couple of elements
df.head(2)
Image By Author

In the next couple of sections, we will understand the details of two basic Pandas operations. One of these is the head() operation, which displays the first five elements by default. You can specify the number of elements you want to view in the function, and you will receive the first "n" entries that you requested. This function is important for understanding the basic structure of your data without delving too deep into its intricate details.


4. Read the Bottom element chart:

# print the last couple of elements
df.tail(2)
Image By Author

Similar to the head function, we can make use of the tail() operation to read the last few data elements of a given dataset. The tail function allows us to quickly grasp a few things. Firstly, since the index of the last row is visible, we can gauge the total number of elements with a quick command. We can also use this command to verify any preceding sorting or appending operations and ensure that the previous procedures were carried out correctly.


5. Understanding the statistical information of the data:

# Understand all the essential features
df.describe()
Image By Author

With the help of the Pandas library framework, it is possible to view the most essential statistical information. The describe() function in Pandas will allow us to receive all the important statistical data of the data frame. Some of these parameters include the total count of elements present in the dataset, the average or mean of the data, the minimum and maximum values, the respective quartiles, and the standard deviation.
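
If you only need one of these figures, the individual statistics can also be pulled out directly. A minimal sketch, reusing the fruits data frame from the earlier sections:

# Pulling individual statistics instead of the full describe() table
df['Supply'].mean()   # average supply
df['Supply'].max()    # largest supply value
df['Supply'].std()    # standard deviation of the supply values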

With the help of all this crucial information that is collected with a single command, we can make use of these statistics to simplify our overall data analysis procedure. With this data, we can also understand which would be the best visualization techniques to explore. For learning more about the eight best visualization methods that you must consider while constructing data science projects, check out the following article below.

8 Best Visualizations To Consider For Your Data Science Projects!


6. Writing a CSV file:

# Importing the framework
import pandas as pd
# Create DataFrame
dataset = {'Fruits': ["Apple", "Mango", "Grapes", "Strawberry", "Oranges"], 'Supply': [30, 15, 10, 25, 20]}
df = pd.DataFrame(dataset)
# Writing the CSV File
df.to_csv('Fruits.csv')
Image By Author

With the help of the Pandas library, it is also possible for users to create new data frames and write them out in the desired file formats. The Pandas library supports saving to formats similar to those discussed in the earlier section on reading CSV files. The code block above will save the created data frame into a .CSV file. Once you open the saved file from the directory where it was stored, you should find the CSV file with the information shown in the image above. Note that, by default, the index column is written to the file as well; a sketch below shows how to omit it.
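
A minimal sketch of writing the same data frame without the automatically generated index column, and reading it back to verify the contents:

# Writing the data frame without the index column and reading it back
df.to_csv('Fruits.csv', index=False)
pd.read_csv('Fruits.csv').head()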


7. Merging the values:

### Creating a dataframe
import pandas as pd
dataset1 = {'Fruits': ["Apple", "Mango", "Grapes", "Strawberry", "Oranges"], 'Supply': [30, 15, 10, 25, 20]}
dataset2 = {'Fruits': ["Melons", "Pear"], 'Supply': [10, 20]}
# Create DataFrame
df1 = pd.DataFrame(dataset1)
df2 = pd.DataFrame(dataset2)

# Print the output.
df1.merge(df2, how = "outer")
Image By Author

Assume that we have two or more datasets and want to combine them into one entity. With the help of the Pandas framework, we can utilize the merge operation available in this library to combine the data frames and their individual elements into a single data frame. The code block shown above is a good representation of how this action can be performed.

In the above code block, we have declared two different data frames, which are two separate entities on their own. With the help of the Pandas merge() function and a specification of how they should be combined, we have created the combined data frame. The how parameter can be varied and experimented with; some of the options are left, right, inner, cross, and the outer join that we have used in the above code. A short sketch below shows a couple of these variations on a shared key column.
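
Here is a minimal sketch of two of those variations, joining on the shared 'Fruits' column. The suppliers frame and its 'Region' column are hypothetical, introduced only for illustration.

# Merging on a shared key with different join strategies (the 'suppliers' frame is hypothetical)
suppliers = pd.DataFrame({'Fruits': ["Apple", "Mango", "Melons"], 'Region': ["North", "South", "East"]})
df1.merge(suppliers, on='Fruits', how='inner')   # keeps only fruits present in both frames
df1.merge(suppliers, on='Fruits', how='left')    # keeps every row of df1, filling missing regions with NaN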


8. Grouping the values:

import pandas as pd
dataset = {'Fruits': ["Apple", "Mango", "Grapes", "Strawberry", "Oranges"], 'Supply': [30, 15, 10, 25, 20]}
df = pd.DataFrame(dataset)
a = df.groupby('Fruits')
a.first()
Image By Author

Another useful functionality of the Pandas library is grouping rows that share a common value so that they can be presented and summarized together. In the data frames we have used several times in the previous sections, we can notice that we have an index column followed by the other columns.

Using the groupby() function in Pandas, we can make the grouping column act as the index and display the elements in a more condensed manner; the first() call then returns the first entry of each group. The above code block and image representation show how this procedure can be completed to achieve these results. Grouping becomes even more powerful when combined with aggregations, as sketched below.
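
A minimal sketch of grouping combined with an aggregation; the sales frame below repeats fruit names so that there is actually something to aggregate, and it is introduced purely for illustration.

# Grouping rows that share a fruit name and aggregating their supply values
import pandas as pd
sales = pd.DataFrame({'Fruits': ["Apple", "Apple", "Mango", "Mango", "Grapes"],
                      'Supply': [30, 10, 15, 5, 10]})
sales.groupby('Fruits')['Supply'].sum()   # total supply per fruit; mean() or count() work the same way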


9. Accessing Specific Rows and Columns:

# Access specific rows and columns by specific positions
df.iloc[1:4, [0]]
Image By Author

With the help of the Pandas library, users can easily access any particular elements they want using the iloc operation. The iloc indexer selects rows and columns of the data frame by their integer positions. In the above code block, we select the rows at positions 1 through 3 (three rows in this case) along with the first column. A few more position-based selections are sketched below.
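
A few more position-based selections on the same fruits data frame, as a quick sketch:

# Further position-based selections with iloc
df.iloc[0]        # first row as a Series
df.iloc[0, 1]     # single value: first row, second column
df.iloc[:, 1]     # every row of the second column
df.iloc[2:, :]    # all rows from the third one onwards, every column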


10. Accessing by labels:

# Access specific rows and columns by specific labels
df.loc[1:3, ['Supply']]
Image By Author

Similar to how you can access a particular element by its position, you can also make use of specific labels to access any rows or columns you want. In the Pandas library, we can make use of the loc indexer, which selects rows and columns by their specified labels. The code block above accesses the rows labeled 1 through 3 (inclusive, since loc slicing includes the end label) together with the column named 'Supply'. A couple of further label-based selections are sketched below.
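
A quick sketch of a couple of further label-based selections, including one driven by a boolean condition:

# Further label-based selections with loc
df.loc[df['Supply'] > 15, ['Fruits', 'Supply']]   # rows where the supply exceeds 15
df.loc[0, 'Fruits']                               # single value by row label and column label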


11. Sort the Values in a data frame:

import pandas as pd
dataset = {'Fruits': ["Apple", "Mango", "Grapes", "Strawberry", "Oranges"], 'Supply': [30, 15, 10, 25, 20]}
df = pd.DataFrame(dataset)
df.sort_values(by = ["Supply"])
Image By Author

The Pandas library also allows users to sort their data, similar to data structures like lists. We can use the sort_values() function available in Pandas and mention the specific column by which we want to sort the entire dataset, arranging it in increasing order by default; a descending, multi-column variant is sketched after the link below. If you want to learn more about mastering lists and most operations related to lists, check out the following article mentioned below.

Mastering Python Lists For Programming!
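
Returning to the sorting example, here is a minimal sketch of a descending sort and a sort on multiple columns:

# Sorting in descending order, and sorting on more than one column
df.sort_values(by=["Supply"], ascending=False)
df.sort_values(by=["Fruits", "Supply"])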


12. Applying a particular computation:

import pandas as pd
import numpy as np
df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])
df.apply(np.mean, axis=1)

The apply function in the Pandas library allows developers to apply a particular computation along a chosen axis, such as axis equal to zero (column-wise) or axis equal to one (row-wise). We can apply numerous NumPy operations, such as mean to compute the average, sum to calculate the total, square roots of the numbers, and many other similar operations. The final return type is inferred from the return type of the applied function; otherwise, it depends on the result_type argument. For further information on this operation, refer to the Pandas documentation for apply(). A couple of additional variations are sketched below.
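
A minimal sketch of a couple of variations on the same data frame, one column-wise and one using a custom lambda per row:

# Column-wise aggregation (axis=0 is the default) and a custom row-wise computation
df.apply(np.sum)                                          # sum of each column
df.apply(lambda row: row['A'] + row['B'] ** 0.5, axis=1)  # custom computation applied to every row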


13. Time expressions:

import pandas as pd
pd.Timedelta('6 days 3 hours 20 minutes 15 seconds')

Output: Timedelta('6 days 03:20:15')

The Pandas library framework allows users to work with multiple expressions related to time. The Timedelta operation specifically represents a difference in times, which can be expressed in different units. The duration can be displayed in terms of days, hours, minutes, and seconds, as shown in the above code block. There are multiple other operations that you can perform with the time functionality available in Pandas, such as adding a time delta to a timestamp; a small sketch follows below. For further information, refer to the Pandas documentation on Timedelta.
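
A minimal sketch of such time arithmetic; the start date below is arbitrary and chosen only for illustration:

# Adding a Timedelta to a Timestamp shifts the point in time accordingly
import pandas as pd
start = pd.Timestamp('2021-01-01 12:00:00')
start + pd.Timedelta('6 days 3 hours 20 minutes 15 seconds')
# -> Timestamp('2021-01-07 15:20:15')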


14. Plotting Graphs with Pandas:

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4),
                  index=pd.date_range('1/1/2000', periods=10),
                  columns=list('ABCD'))
df.plot()
Image By Author

One of the significant operations that the Pandas library can perform is data visualization. Although it does not act as a direct replacement for dedicated plotting libraries such as Matplotlib and Seaborn, the plot function in Pandas is still a useful option that can be used alongside the NumPy library to achieve quick and useful plots for rapid data analytics and visualization.

In the above code block, we create a random NumPy array with ten rows and four columns, declare a date range as the index, and label the columns A through D. Calling plot() then draws a line chart of all four columns against the date index. A few more plot variations are sketched below; for further information on this topic, refer to the Pandas visualization documentation.
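
A quick sketch of a few alternative plot types on the same data frame; these render inline in a Jupyter Notebook when Matplotlib is installed:

# Other plot types on the same data frame
df['A'].plot(kind='hist')    # histogram of a single column
df.plot.bar()                # bar chart of all columns
df.plot(y=['A', 'B'])        # line plot restricted to a subset of columns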


Conclusion:

Photo by Lukas W. on Unsplash

"If you just use the scientific method as a way to approach data-intensive projects, I think you’re more apt to be successful with your outcome."Bob Hayes

The Pandas library framework is one of the best tools that are available in Python Programming. With the help of this module, the life of a data scientist is made much simpler as the handling, processing, analysis, and manipulation of data becomes significantly easier with the proper use of all the operations and functionalities that are available through this particular framework.

The fourteen Pandas functions discussed in this article are a must-know for data scientists of all levels and something that they must have in their arsenal of vast knowledge for addressing any kind of problems that they encounter. With the proper use of these operations and functionalities, most of the common issues can easily be addressed.

If you have any queries related to the various points stated in this article, then feel free to let me know in the comments below. I will try to get back to you with a response as soon as possible.

Check out some of my other articles that you might enjoy reading!

7 Best UI Graphics Tools For Python Developers With Starter Codes

15 Numpy Functionalities That Every Data Scientist Must Know

Best PC Builds For Deep Learning In Every Budget Ranges

17 Must Know Code Blocks For Every Data Scientist

6 Best Projects For Image Processing With Useful Resources

Thank you all for sticking on till the end. I hope all of you enjoyed reading the article. Wish you all a wonderful day!

