The world’s leading publication for data science, AI, and ML professionals.

SQL Queries for Data Scientists

Explained with examples.

Photo by Miguel A. Amutio on Unsplash
Photo by Miguel A. Amutio on Unsplash

SQL is a Programming language that is used by most relational database management systems (RDBMS) to manage data stored in tabular form (i.e. tables).

SQL is a fundamental skill expected from a data scientist. You might argue that it is the job of a data engineer but the data scientist roles are inclined to being full-stack. Besides, as a data scientist, you wouldn’t want to depend on a data engineer to retrieve data from a database.

In this post, we will write complex queries to retrieve data stored in a table. I have uploaded the customer churn dataset to a table in a MySQL database.

We will start with simple queries and steadily increase the complexity. I will describe the data needed and then write the query to retrieve that data from the table.

Let’s first take a look at the columns in the table.

(image by author)
(image by author)

There are some features about customers and their accounts at a bank. The "Exited" column indicates whether a customer churns (i.e. leaves the bank).

We can now start on the queries.

The first 5 rows of the "CustomerId" and "Surname" columns

SELECT CustomerId, Surname 
FROM CHURN
LIMIT 5;
(image by author)
(image by author)

The ID of the customer with the highest balance

SELECT CustomerId, MAX(Balance)
FROM CHURN;
(image by author)
(image by author)

Instead of retrieving the entire "Balance" column, we have used the MAX aggregate function to select only the maximum value in that column.


The location of the top 5 customers in term of the highest balance

We cannot use the MAX for this query because we need the top 5 customers. What we can do is to sort the customers based on the balance and then select the top 5 using LIMIT.

SELECT Geography, Balance
FROM CHURN
ORDER BY Balance DESC
LIMIT 5;
(image by author)
(image by author)

The average age of customers who do not have a credit card

There is a condition that requires us to use the WHERE statement.

SELECT AVG(Age)
FROM CHURN
WHERE HasCrCard = 0;
39.1121

In case you wonder, the average age of customers with a credit card is 38.8424.


The number of customers in each country who have more than 2 products

We will use another aggregate function to COUNT the number of customers. In order to group customers based on an attribute, the GROUP BY statement is used.

SELECT Geography, COUNT(CustomerId)
FROM CHURN
WHERE NumOfProducts > 2
GROUP BY Geography;
(image by author)
(image by author)

Average estimated salary based on the number of products

We can apply the AVG function to the salary and GROUP BY the number of products.

SELECT NumOfProducts, AVG(EstimatedSalary)
FROM CHURN
GROUP BY NumOfProducts;
(image by author)
(image by author)

I want to introduce another statement related to conditions. The WHERE statement allows us to select entries that fit a condition or multiple conditions. However, it cannot be used with the aggregate functions.

For the query above, consider we are only interested in the number of product categories with an average higher than 100000. Thus, we need to apply a condition on the averages which can be done using the HAVING statement.

SELECT NumOfProducts, AVG(EstimatedSalary)
FROM CHURN
GROUP BY NumOfProducts
HAVING AVG(EstimatedSalary) > 100000;
(image by author)
(image by author)

Customers older than 50 and have a higher balance than the average

We are introducing two new topics here. One is using multiple conditions (age and balance), the other one is the nested SELECT statements.

We can combine the multiple conditions in a WHERE statement using logical operators such as AND and OR. One condition is explicitly given (age > 50) but the other one needs to calculated on the table using another SELECT statement. That’s where we need nested SELECT statements.

SELECT CustomerId, Age, Balance 
FROM CHURN
WHERE Age > 50 AND Balance > (
    SELECT AVG(Balance)
    FROM CHURN );
The first 7 rows of the resultset (image by author)
The first 7 rows of the resultset (image by author)

The condition on the balance is another SELECT statement.


The number of female customers who stayed with the bank more than average (tenure) and churned

It is similar to the previous example with an additional condition. We will count the number of customers based on three conditions:

  • Gender
  • Churned (exited=1)
  • The duration of being a customer (tenure)
SELECT COUNT(CustomerId) AS 'Number of Customers'
FROM CHURN
WHERE Gender = 'Female' AND Exited=1 AND Tenure > (
 SELECT AVG(Tenure)
    FROM CHURN);

We can also adjust the name of the column in the result set using the "AS" keyword.

(image by author)
(image by author)

Conclusion

We have covered some basic and complex queries. What we achieve with the queries is that some calculations and filtering are done at the database level. Thus, we can only retrieve the data we need instead of getting all the data and then applying filtering and calculations.

Since real-life databases contain much more data and many relational tables, it is very important to be able to query the desired data using SQL.

Thank you for reading. Please let me know if you have any feedback.


Related Articles