
All The SQL a Data Scientist Needs to Know

What you need to know, best practices, and where you can practice your skills

Image artificially generated using Grok 2.

Introduction

In my opinion, SQL is one of the most important skills a Data professional should have. Whether you’re a Data Analyst, Data Scientist, or Software Developer, you’re likely going to be interacting with databases daily using SQL.

Speaking from a data scientist perspective, you do not need to be an expert in SQL. Being able to extract, manipulate, and analyse data using SQL should be enough for the majority of data scientists’ tasks. You will often find that you only use SQL for loading data into a Jupyter Notebook prior to implementing some exploratory data analysis (EDA) using Pandas.
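That workflow can be sketched in a few lines of Python, here using an in-memory SQLite table as a stand-in for a real warehouse table (the table and column names are hypothetical):

```python
import sqlite3

# A minimal sketch of the usual workflow, using an in-memory SQLite table
# as a stand-in for a real warehouse table (names are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table orders (order_date text, product_sku text, order_quantity integer)"
)
conn.executemany(
    "insert into orders values (?, ?, ?)",
    [("2024-12-01", "SKU100", 2), ("2024-12-02", "SKU123", 7)],
)

# SQL does the extraction; Python receives only the rows requested.
rows = conn.execute(
    "select order_date, product_sku, order_quantity from orders"
).fetchall()
print(rows)  # [('2024-12-01', 'SKU100', 2), ('2024-12-02', 'SKU123', 7)]
```

In a notebook, you would typically pass the same query to pandas.read_sql to land the result straight into a DataFrame ready for EDA.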

The purpose of this article is to cover the fundamentals of SQL syntax, outline some SQL best practices, and highlight the resources available for you to practice your SQL skills.

What is SQL?

SQL is a domain-specific language created to manage and manipulate relational databases. SQL has been widely adopted by not only Data Scientists but the majority of data professionals as the go-to language whenever interacting with databases.

The acronym SQL stands for:

  • Structured: Data is stored in an organised state, unlike unstructured data (e.g. audio, video, text).
  • Query: How users speak to the database, extracting the information they’re looking for by writing SQL queries.
  • Language: SQL is a programming language, designed to be extremely user-friendly and very easy to read, unlike some traditional programming languages.

SQL comes in many different flavors; the main difference between them is whether they are a paid or a free service. Over the years, several open-source flavors of SQL have been released, the most popular being MySQL and PostgreSQL.

From my experience, Transact-SQL (via MS SQL Server), GoogleSQL (via BigQuery), and PostgreSQL are the most popular. I would focus on Transact-SQL (via MS SQL Server) if I was starting from scratch as most tutorials cover this flavor of SQL.

For more information on SQL, please see here.

SQL Fundamentals

Some professions such as Data Engineers and Database Administrators (DBAs) need to have an advanced knowledge of SQL, but this is not the case for Data Scientists. As you gain experience, you will find that writing SQL scripts becomes quite repetitive, and most of the time you’re just copying previous scripts and making minor amendments.

Most Data Scientists will use SQL to perform basic data transformations before importing into a Python environment. I am going to provide you with all the fundamental commands required to perform 90% of your day-to-day SQL-related tasks as a Data Scientist.

Selecting Data

The most important SQL command is SELECT, which allows you to define which columns you would like to select from the table stated in your query.

select
    order_date,
    product_sku,
    order_quantity
from
    my_store.ecommerce.orders

Columns can either be stated individually, or you can use an asterisk (*) to indicate that you would like to select all columns in that table.

The query above will select all rows from the my_store.ecommerce.orders table regardless of the presence of duplicated rows. To prevent this from happening you can use the DISTINCT command to only return unique rows.

select distinct
    order_date,
    product_sku,
    order_quantity
from
    my_store.ecommerce.orders

Engineering Data

Sometimes your table does not contain the column you would like, but what it does have is the underlying data to create that column. Using something like a CASE command, you can engineer your own feature in your SQL query.

select distinct
    order_date,
    product_sku,
    order_quantity,
    case when order_quantity >= 5 then "High"
         when order_quantity >= 3 then "Medium"
         else "Low" end as order_quantity_status
from
    my_store.ecommerce.orders

In the query above, we have created the column order_quantity_status based on the values in the column order_quantity. The CASE expression acts like an IF-ELSE statement, similar to those you may have come across in other programming languages: conditions are evaluated top to bottom, and the first match wins.
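That same first-match-wins logic can be sketched as an if-elif-else chain in Python (a hypothetical helper, purely for illustration):

```python
def order_quantity_status(order_quantity: int) -> str:
    # Mirrors the CASE expression: conditions are checked in order
    # and the first one that matches wins.
    if order_quantity >= 5:
        return "High"
    elif order_quantity >= 3:
        return "Medium"
    else:
        return "Low"

print(order_quantity_status(7))  # High
print(order_quantity_status(4))  # Medium
print(order_quantity_status(2))  # Low
```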

Note: There are many alternatives to using CASE when engineering new features. More information on these approaches is available via the learning resources at the bottom of this article.

Grouping and Ordering Data

These clauses are very self-explanatory: the GROUP BY clause is used when aggregating columns, whereas the ORDER BY clause is used when you want to order the columns in your output.

select
    order_date,
    count(distinct product_sku) as distinct_product_count
from
    my_store.ecommerce.orders
group by
    order_date
order by
    count(distinct product_sku) desc

In the query above we are grouping by order_date and counting how many unique products were sold each day. After calculating this aggregation, we return the output ordered descending by the newly created distinct_product_count column.

Filtering Data

It is not uncommon to encounter database tables which are terabytes in size. To reduce processing costs and time, including filtering in your queries is essential.

select
    order_date,
    product_sku,
    order_quantity
from
    my_store.ecommerce.orders
where
    order_date >= "2024-12-01"

Including the WHERE clause in your query allows you to take advantage of partitions and/or indexing. By reducing the amount of data your query has to process, your queries will run much faster and at a much lower cost. Your Data Engineers and DBAs will thank you!
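You can see this effect on a small scale with SQLite's EXPLAIN QUERY PLAN (an in-memory, hypothetical table; BigQuery relies on partitions rather than indexes, but the principle of narrowing the scan is the same):

```python
import sqlite3

# Hypothetical stand-in table; real warehouse tables are partitioned or
# indexed on the columns you filter on most.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table orders (order_date text, product_sku text, order_quantity integer)"
)
conn.execute("create index idx_orders_date on orders (order_date)")

# The planner narrows the scan using the index instead of reading
# every row in the table.
plan = conn.execute(
    "explain query plan "
    "select product_sku from orders where order_date >= '2024-12-01'"
).fetchall()
print(plan)
```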

Not only is the WHERE clause good for filtering dates, it can be applied to any column in your table. For example, if we wanted to only include SKUs SKU100, SKU123, and SKU420, and only wanted to see orders of those products with a quantity of less than 3, we could use the following query:

select
    order_date,
    product_sku,
    order_quantity
from
    my_store.ecommerce.orders
where
    order_date >= "2024-12-01"
    and product_sku in ("SKU100", "SKU123", "SKU420")
    and order_quantity < 3

Note: Also spend some time looking at the HAVING clause, which offers an alternative way to filter using aggregated column values. The query below demonstrates this by only returning order dates and the total number of orders where the daily sum is greater than or equal to 100.

select
    order_date,
    sum(order_quantity) as total_orders
from
    my_store.ecommerce.orders
where
    order_date >= "2024-12-01"
group by
    order_date
having
    sum(order_quantity) >= 100

Joining Data

The most adopted database design pattern is the Star Schema which uses fact and dimension tables. Fact tables consist of quantitative data such as metrics and measurements whereas dimension tables provide more descriptive information adding further context to the information provided in the fact table.

As a Data Scientist, it’s your responsibility to identify the tables where the data you require is located, and then to perform the correct join to combine those tables.

select
    o.order_date,
    o.product_sku,
    o.order_quantity,
    p.product_name,
    p.product_weight
from
    my_store.ecommerce.orders o
inner join
    my_store.ecommerce.product_details p
on
    o.product_sku = p.product_sku
where
    o.order_date >= "2024-12-01"

In the query above, we are performing an INNER JOIN on the product_sku column. An INNER JOIN will return only the order rows where the product_sku is successfully matched in the product_details table.

It is important to pay attention to the aliases assigned to each table, as it is not uncommon for more than one table to share a column name. Aliases allow you to state exactly which table's column you are referencing.

Note: Ensure you spend the time researching alternative joins e.g. LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN. For those who are visual learners, check out this link on SQL joins.

Aggregating Data

Aggregating columns is something you should get extremely familiar with when using SQL. The functions you will use most frequently are COUNT(), SUM(), MIN(), MAX(), and AVG().

select
    count(product_sku) as product_count,
    sum(order_quantity) as total_orders,
    min(order_quantity) as minimum_orders,
    max(order_quantity) as maximum_orders,
    avg(order_quantity) as average_orders
from
    my_store.ecommerce.orders
where
    order_date >= "2024-12-01"

These aggregation functions are used to generate descriptive statistics from your data. Although this can be completed using Python, I find it more productive for me to complete this task using SQL, especially if it is answering a stakeholder question on the fly.
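As a sanity check, the same summary can be reproduced in Python from the raw rows; here is a sketch using a hypothetical in-memory table via sqlite3:

```python
import sqlite3
import statistics

conn = sqlite3.connect(":memory:")
conn.execute("create table orders (product_sku text, order_quantity integer)")
conn.executemany(
    "insert into orders values (?, ?)",
    [("SKU100", 2), ("SKU123", 7), ("SKU420", 3)],
)

# One pass in SQL...
sql_stats = conn.execute(
    "select count(product_sku), sum(order_quantity), min(order_quantity), "
    "max(order_quantity), avg(order_quantity) from orders"
).fetchone()

# ...matches the equivalent calculation over the fetched rows in Python.
quantities = [q for (q,) in conn.execute("select order_quantity from orders")]
py_stats = (len(quantities), sum(quantities), min(quantities),
            max(quantities), statistics.mean(quantities))
print(sql_stats)  # (3, 12, 2, 7, 4.0)
```

The difference is that the SQL version computes the summary server-side, so only five numbers travel over the wire rather than every row.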

Next Steps

After mastering the fundamentals, you should expand your knowledge and focus on intermediate SQL. Two constructs that arise frequently in my day job are [common table expressions (CTEs)](https://www.atlassian.com/data/sql/using-common-table-expressions) and window functions.

CTEs

As I do the majority of my SQL using BigQuery via GCP, I use CTEs in almost all my queries. CTEs allow you to create temporary result sets that can later be referenced as part of a larger SQL query.

with total_product_orders_daily as
(
  select
    order_date,
    product_sku,
    sum(order_quantity) as total_orders
  from
    my_store.ecommerce.orders
  where
    order_date >= "2024-12-01"
  group by
    order_date,
    product_sku
)

select
    tpod.order_date,
    tpod.product_sku,
    p.product_name,
    tpod.total_orders
from
    total_product_orders_daily tpod
inner join
    my_store.ecommerce.product_details p
on
    tpod.product_sku = p.product_sku

The query above creates a CTE that first calculates total_orders per order_date and product_sku, before joining the total_product_orders_daily result to the my_store.ecommerce.product_details table. Also note that the WHERE clause occurs as early as possible in the CTE; you should always aim to reduce the amount of data you're working with as soon as possible.

Window Functions

A window function performs a calculation across a set of rows related to the current row, while each row maintains its own identity (unlike GROUP BY, which collapses rows). For example, if you want to rank your data or identify duplicated records, you can do so by implementing window functions.

select
    order_date,
    product_sku,
    order_quantity,
    rank() over (partition by order_date, product_sku order by order_quantity desc) as daily_sku_order_rank
from
    my_store.ecommerce.orders
where
    order_date >= "2024-12-01"

The query above creates the column daily_sku_order_rank which is ranking each product_sku per order_date in descending order.

To identify and remove duplicated records using window functions you can use the following code:

with base_table as 
(
  select
    order_date,
    product_sku,
    order_quantity,
    row_number() over (partition by order_date, product_sku) as daily_sku_row_num
  from
    my_store.ecommerce.orders
  where
    order_date >= "2024-12-01"
)

select
    order_date,
    product_sku,
    order_quantity
from
    base_table
where
    daily_sku_row_num = 1

For instances where daily_sku_row_num is greater than 1 (duplicated records), the final WHERE clause filters them out of the output.

Note: There are more window functions available, such as DENSE_RANK(); more information is available here.

SQL Best Practices

As with any other programming language, when composing a SQL script you should always consider somebody else reusing your code. To make this easier, it is best to follow some SQL best practices. Some standout best practices are:

  1. Use Meaningful Naming Conventions: It is better to have longer, more descriptive column/table names.
  2. Code Formatting: Use consistent indentation throughout your script. There is no right or wrong regarding upper or lowercase text, pick one and stick to it.
  3. Refrain from Selecting All: Select only the specific columns you want to include in your output; do not use the asterisk (*) when selecting from a table.
  4. Indexing Columns: Columns that are frequently used in WHERE, JOIN, or ORDER BY clauses should be indexed, therefore optimising query performance.
  5. Where to Use Functions: Avoid applying functions (e.g. CAST(), LEN()) to columns within WHERE, JOIN, or ORDER BY clauses, as this prevents the engine from using any index on those columns. The same applies to leading wildcards in LIKE patterns.

Note: There are more SQL best practices than those stated above, and they can sometimes depend on the flavor of SQL you're using. I encourage you to enquire within your company to see whether any internal SQL best practices have been established that you could implement in your work.

Resources for Practicing SQL

Your SQL development will always advance at a faster pace when working with real data in a professional setting. For aspiring Data Scientists who are yet to land their first position, there are many online alternatives you can use to maintain and grow your SQL skills.

Some of the top resources I have found for learning SQL are:

W3Schools.com

Solve SQL Code Challenges

Master Coding for Data Science

SQLZoo

SQLBolt – Learn SQL – Introduction to SQL

DataLemur – Ace the SQL & Data Science Interview

Personally, I find StrataScratch to be the best as it allows you to choose between different flavors of SQL, has a great selection of questions, and has a good UI (similar to LeetCode).

For more theoretical learning, W3Schools is the one I would select. I started reading this resource when I was first learning SQL, it is always in my bookmarks should I need to refresh my memory on a specific topic.

One thing I would suggest is not to spend too much time trying to find the right resource, pick one, and start tackling the challenges. Start with the beginner tasks and work your way up, be patient and consistent with your learning. You do not need to complete all the hard challenges before you are deemed interview-ready, as you progress your confidence will grow.

Note: Some of these resources are free, whereas others have free tiers but some of their content sits behind a paywall.

Final Thoughts

All Data Scientists should have at least a foundational knowledge of SQL. Unfortunately, SQL does not get much recognition at an academic level, which often results in graduates lacking this skill when trying to land their first Data Scientist role.

The language isn’t the most attractive and, compared to learning Python, can often be described as quite boring. It is only when you begin working in a professional setting that you understand how important SQL will be in your career.

Not only are there vast amounts of free resources online to teach you, but there is also a great community of engineers and scientists discussing SQL best practices online.

Take the time to learn SQL, having this skill in your toolbox early in your career will definitely make you stand out from the competition.


Disclaimer: I have no affiliation with any of the companies, software, or products discussed in this article. Furthermore, unless stated otherwise, all images included in this article are owned by the author.


If you enjoyed reading this article, please follow me on Medium, X, and GitHub for similar content relating to Data Science, Artificial Intelligence, and Engineering.

Happy learning! 🚀

