The world’s leading publication for data science, AI, and ML professionals.

5 Examples to Explain SQL Versions of Pandas Operations

SQL is much more than a query language

Photo by Nick Fewings on Unsplash
Photo by Nick Fewings on Unsplash

SQL stands for structured query language. However, it offers much more than queries to retrieve data from a relational database. I think of SQL as a complete tool for data analysis and manipulation.

One of the most popular data analysis tools in the Data Science ecosystem is the Pandas library for Python. SQL can perform pretty much any operation that you can do with Pandas.

In this article, we will go over 5 examples that demonstrate the SQL versions of Pandas operations. The examples will include both data manipulation and analysis tasks.

I will do examples that go beyond the very simple tasks because I would like to emphasize how Sql can be used as a data analysis tool.

We will be using the Melbourne housing dataset available on Kaggle. I have previously downloaded the dataset and created an SQL table that contains some the columns from it.

Let’s first take a look the table.

(image by author)
(image by author)

Note: There are several relational database management systems such as MySQL, SQL Server, PostgreSQL, and so on. Although they mostly adapt the same SQL syntax, there might be small differences. In this article, we will use PostgreSQL.


Example 1

The first example is about date manipulation. In some cases, we need to extract a particular information from a date such as month and year.

In Pandas, the month information can be extracted from a date using the month method under the dt accessor.

We can do the same operation with SQL as follows:

SELECT date, EXTRACT(month from date) AS month FROM melb LIMIT 5;
(image by author)
(image by author)

We use the extract function which can be used to get several pieces of information from a date. For instance, year can be extracted by replacing the "month" argument inside the extract function with "year".


Example 2

The second example is about string manipulation. It is a highly common task in data analysis because a substantial amount of data is in textual form.

A typical operation is to split strings. Pandas provides the split function under the str accessor for this task. In PostgreSQL, the split_part function can be used to split strings.

Let’s create a new column that contains the first word in the region name column. It can be used as a region group.

SELECT 
 regionname, 
 SPLIT_PART(regionname,' ', 1) AS regiongroup 
FROM  melb LIMIT 5;
(image by author)
(image by author)

The limit keyword is only used to limit the number of rows to be displayed.


Example 3

In this example, we will replicate the value_counts function of Pandas. It is used for counting the number of occurrences of distinct values in a column. In our case, we can find the number of houses in each region.

SELECT regionname, count(1) AS house_count
FROM melb
GROUP BY regionname
ORDER BY house_count desc;
(image by author)
(image by author)

We use the group by function to group the observations (i.e. rows) based on the distinct values in the region name column. Then, the count function is used for count the rows for each region.


Example 4

In this example, we will implement an if-else statement in a query. Assume we want to classify houses based on their distance to the central business district. We can accomplish this task using a case when function.

While retrieving data from this table, we will create a new column called "distance type". The value it takes will be based on the distance column.

SELECT 
   distance,                                                                                                                    
   CASE WHEN distance > 20 THEN 'far'                                                                                                             
        WHEN distance <= 20 and distance > 10 THEN 'medium'                                                                                          
        ELSE 'close'                                                                                                                                
        END AS distancetype                                                                                                                         FROM melb                                                                                                                                   LIMIT 5;
(image by author)
(image by author)

The first 5 rows have the same distance. To confirm the case when function works as expected, we can check the maximum distance in each distance type.

SELECT 
   MAX(distance),                                                                                                                    
   CASE WHEN distance > 20 THEN 'far'                                                                                                             
        WHEN distance <= 20 and distance > 10 THEN 'medium'                                                                                          
        ELSE 'close'                                                                                                                                
        END AS distancetype                                                                                                                         FROM melb                                                                                                                                   GROUP BY distancetype;
(image by author)
(image by author)

We can complete this task using the cut function in Pandas.


Example 5

The last example is about one of the most frequently used Pandas functions, the groupby. We use it to compare the distinct values or categories in a column based on the values in a different column.

SQL provides the group by statement for this task. Just like with Pandas, we can create nested groups with SQL. We just need to pass the column names to be used for grouping.

Once the groups are created, we can apply aggregate functions on the columns we are interested in.

Let’s group the houses according to the categories in the type and rooms columns. Then, we can calculate the following values for each category:

  • Maximum price
  • Average distance
  • Number of houses in that group
SELECT 
   type, 
   rooms,                                                                                                                 
   max(price/1000000) as max_price,                                                                                                            
   cast(avg(distance) as decimal(5,2)) as avg_distance,                                                                                        
   count(1) as house_count                                                                                                                     FROM melb                                                                                                                                   WHERE rooms <= 4                                                                                                                            GROUP BY type, rooms
ORDER BY type, rooms;
(image by author)
(image by author)

In order to simplify the result set, I have only included the houses that have 4 or less rooms. The results are ordered by the type and rooms columns.

The Pandas code that performs this task is as follows:

melb[melb.Rooms <= 4]
   .groupby(["Type", "Rooms"], as_index=False).agg(
      max_price = ("Price", "max"),
      avg_distance = ("Distance", "mean"),
      house_count = ("Distance", "count")
   ).sort_values(by=["Type","Rooms"])

Conclusion

SQL is a must-have skill for data scientists. Although it is called a query language, it offers much more than retrieving data. It is possible to do almost all the typical data analysis and manipulation tasks with SQL.

Thank you for reading. Please let me know if you have any feedback.


Related Articles