The world’s leading publication for data science, AI, and ML professionals.

4 Important SQL Concepts for Beginners

Let's learn some important SQL concepts for beginners to power up your queries

Photo by @chrisliverani - unsplash.com
Photo by @chrisliverani – unsplash.com

SQL is one of the most flexible programming languages around. With its first implementation in 1974, SQL is a querying language that enables developers to work smoothly with Data stored in relational tables.

If you work in an organization, there is a really high chance that you will find some SQL code scattered around your company’s repos. Why? Because SQL is used by roles as diverse as Application Developers, Data Scientists, Data Analysts or Cloud Engineers.

SQL is also highly flexible – you can integrate it with Python, R, Java and most programming languages. Mostly, if your application has a Database layer, you will probably communicate with it using some type of SQL implementation such as PostGres, MySQL or SQLite.

In this post, we’ll check 4 important concepts for SQL Beginners that will help you practice and improve further in the language.


Example Data

For our sandbox example, we’ll use a table called companies with information about several organizations accross different industries.

I’ll create it from scratch so that you can also replicate these examples on some Sql implementation (I’ll use MySQL, so be aware that some query clauses may have a different syntax if you choose other implementation).

To make this code run, make sure you have a sandboxdatabase created – if you don’t, just execute CREATE DATABASE SANDBOX;

The following code creates our companies table structure and inserts some data into it:

create table sandbox.companies (
 company_name varchar(30),
    industry varchar(30),
    ticker varchar(5),
    market_cap bigint,
    num_employees int
    );
insert into sandbox.companies (
 company_name, industry, ticker, market_cap, num_employees
    ) values ("Apple","Technology","AAPL", 2621000000, 154000),
    ("L'oreal", "Consumer Goods", "LRLCY", 214272000, 85087),
             ("Intel", "Semiconductors", "INTC", 196323000, 110600),
             ("Novo Nordisk", "Pharmaceutical", "NVO",246022000,45000),
             ("Bayer", "Pharmaceutical", "BAYRY",56833000,99637),
             ("Abbvie", "Pharmaceutical", "ABBV",280506000,50000),
             ("Facebook", "Technology", "FB",552707000,71970),
             ("Johnson & Johnson", "Pharmaceutical", "JNJ",464855000,130000);

Here is the general look of our data in table format:

Companies Table - Image by Author
Companies Table – Image by Author

This is the table that we will be using throughout the post.


Grouping Data

Grouping and aggregating data is one of the most cool skills to have when building data pipelines. Grouping data enables you to completely transform your table structure and build several aggregations on the raw data based on several columns.

Some examples for some fictional use cases:

  • Grouping daily stock prices for a Ticker to understand the average price for a specific day;
  • Grouping store data to get the sum of total sales per store and month;

Regarding use cases, grouping data is mainly done for two goals:

  • Building data pipelines and creating custom ETL’s with aggregated data for specific needs – for instance, preparing data for dashboards or for machine learning algorithms.
  • Perform quick statistics on the data.

Mostly, grouping data is done using the GROUP BY SQL clause. Combining GROUP BYclauses with aggregator functions is such a powerful SQL feature – for instance, if we want to check the average of the number of employees per industry in our companiestable, we can just:

select industry, avg(num_employees) as avg_n_employees
from sandbox.companies
group by industry;

Notice how simple this code is – almost as if we are talking!

"I want to select the industry and the average of the number of employees from the companies table, grouping the data by industry".

This will yield a result where we have two columns – the industry and the avg_n_employees. How does SQL calculate this avg_n_employees? Let me walk you through the reasoning.

We have 4 pharmaceutical companies in our original table:

Filter Query - Image by Author
Filter Query – Image by Author

The average number of employees for this industry will be:(45000+99637+50000+130000)/4 , which should yield approximately 81.159. This number represents the "average number of employees of companies that work on the Pharmaceutical industry".

If everything is correct, our GROUP BY output must contain the same data for the pharmaceutical industry row:

Average Number of Employees by Company— Image by Author
Average Number of Employees by Company— Image by Author

And, that’s right! For the other industries, SQL automatically calculates the same formula – as semiconductors and consumer goods have only company in our companies table, the value of number of employees for those industries is the same of those companies as n=1 .

Cool! What other tricks can we do with GROUP BY? Another idea is to count the number of companies in each industry— using the count() function:

select industry, count(company_name) as nb_companies
from sandbox.companies
group by industry;

This yields:

Count per Industry - Image by Author
Count per Industry – Image by Author

TheGROUP BY clause normally comes after the FROMin SQL (if you don’t have any join or a filter. If joins or filters exist, those clauses will push the GROUP BY clause for a subsequent part of our query. Just memorize that they always need to come after the FROM clause!

You can learn more about the GROUP BY clause in the first section of my SQL course.


Joins

Another really important concept in SQL are table joins. Joins are mostly done to combine multiple tables into a single one. They are one of the crucial aspects to grasp in SQL and the efficiency of joins (when combines with indexes) is one of the main reasons developers may prefer the language to build data pipelines.

So.. what are some of the most common SQL usages that lean on joins?

Two main examples are:

  • connecting data in a relational data model;
  • building a new combined table for a datamart;

Joins come in multiple flavors. In this section, I’m going to explain some of the most common ones.

For our sandbox example, we need an extra table to cross with our companies one. Let’s create a new table that will contain the market size for each industry:

# Create industries data
create table sandbox.industries (
    industry varchar(30),
    market_size bigint
    );

# Insert some industries data
insert into sandbox.industries (
 industry, market_size
    ) values ("Technology",5200000000),
    ("Consumer Goods", 2000000000),
             ("Semiconductors", 6000000000),
             ("Banking",7500000000);

Our industries table looks like the following:

Industries Table - Image by Author
Industries Table – Image by Author

Notice two things about this table:

  • There is an industry here (banking) that is not on the companies table;
  • There is an industry that is in the companies table and that is not in theindustries(Pharmaceutical).

Having some missing information on your database is a pretty common scenario with real life data. This small detail is essential to understand the different joins we will do.

Let’s perform an inner join – imagine that I would like to add the market_sizecolumn to the companiestable – for example, for each company in a specific industry I would like to have a column with the information of the market size. I can use a join to do that!

Let’s learn the INNER JOIN syntax:

select left.*, right.market_size
from sandbox.companies as left
inner join sandbox.industries as right
on left.industry = right.industry;

For joins, it’s key to define the column that will be used as a connector between the tables. In our case, we are using the industry column because that’s the column that we have on both the left and the right table. Some important details about this query:

  • We are using aliases to refer to tables. The companies table is refered as left and the industries as right.
  • On clause states the columns that will be used to connect both tables – commonly called the join key.
  • left.* means that we want to select all the columns from the companies table.

INNER keyword only returns the rows where the key is present in both tables of the join. For instance, as Pharmaceutical companies are not present on the industries table, these companies will be ommited from the result:

Inner Join Result - Image by Author
Inner Join Result – Image by Author

Sometimes, you may want to override this and have, for example, the market_size as NULL when the industry data is not available – this can be achieved with a LEFT JOIN – meaning that you want to bring every row on the table on the left, regardless of its presence on the table on the right:

# Left join example
select left.*, right.market_size
from sandbox.companies as left
left join sandbox.industries as right
on left.industry = right.industry;

The syntax is super similar to the INNER JOIN – really cool. The only difference is that we now use LEFT JOIN as the joining clause. The result from the query above is the following:

Left Join Result— Image by Author
Left Join Result— Image by Author

In this way, you don’t lose the companies data, even though your don’t have more information about the industry.

You can also do the opposite with RIGHT JOIN:

# Right join example
select left.*, right.market_size
from sandbox.companies as left
right join sandbox.industries as right
on left.industry = right.industry;
Right Join Result - Image by Author
Right Join Result – Image by Author

Notice that, now, we have the market size for the banking industry with all the other rows as NULL . An odd thing is that we don’t have the industry for the "Banking" row – why? Because in the query we selected the industry from table left , our companies table.

Bottom line, joins are really cool concepts that enable us to connect different tables. Other joins that we didn’t cover wereOUTER or FULL but you can study them after getting a good grasp at INNER , RIGHT and LEFT .

You can learn more about the JOIN clauses in the third section of my SQL course.


SubQueries

Subqueries are another SQL feature that will make your queries ultra flexible.

With subqueries, you can nest queries and use query results directly as input for other queries. It sounds confusing but it is actually quite simple!

As an example, let us create a table that only filters Pharmaceutical companies on our companiestable:

select * 
from sandbox.companies
where industry = 'Pharmaceutical';

This query would yield the following result:

Subquery Result - Image by Author
Subquery Result – Image by Author

What if we now want to obtain the maximum number of num_employees from this query result? We can, of course, use a calculated column directly on the query on the where but we can also use subqueries, for instance:

# Subqueries Example
select max(subtable.num_employees) as max_n_employees 
from (
 select * 
 from sandbox.companies
 where industry = 'Pharmaceutical'
    ) as subtable;

Think of subqueries as substituting a physical table in the FROMclause – SQL flexibility enables us to use any query result directly into that same from clause. This has several advantages:

  • It’s particularly relevant when we want to do two sequential group by operations;
  • It’s useful when you want to build a data pipeline without creating a table physically.

You can learn more about subqueries in the fifth section of my SQL course.

Where =/= Having

A common issue beginners face when starting to code in SQL is understanding the difference between the WHERE and HAVING clause.

The main difference is that WHERE is applied before any GROUP BY is applied, while HAVING is applied after the grouping. Let’s check the following two queries:

# Query with WHERE
select industry, avg(market_cap) as market_cap
from sandbox.companies
where market_cap < 1000000000
group by industry;
# Query with HAVING
select industry, avg(market_cap) as market_cap
from sandbox.companies
group by industry
having market_cap < 1000000000;

We will have two different results:

Result from Query with Where - Image by Author
Result from Query with Where – Image by Author
Result from Query with Having - Image by Author
Result from Query with Having – Image by Author

Notice how the row for the industry Technology changes. Why is that? The first query translated into the following sentence:

"I want to select the average of market cap by industry but exclude the companies that have more than 1 trillion of Market Cap"

This is because the WHERE is always applied before the grouping. In this case, Apple is removed before the calculation – so the only company that enters into the calculation of the average is Facebook because it is the only technology company that matches the filter in the where clause.

On the other hand, with the second query we are doing the following:

"I want to select the average of market cap by industry but exclude the industries that have, on average, more than 1 trillion of Market Cap"

Notice the difference? This is because the HAVING clause acts after the grouping is done. In this case, Apple enters the calculation, making the average of the Technology industry a value of over 1 trillion market cap. As the average is over 1 trillion, our having filters it from the result and we no longer have the Technology row with the HAVING clause.

You can learn more about WHEREand HAVINGclauses in the first section of my SQL course.


And we’re done! Thank you for taking the time to read this post. When it comes to manipulate data, SQL has a flexibility that few languages can match. It is so simple to grasp that it is probably the most widely used language throughout the different data landscape roles (Analysts, Scientists, Engineers, etc.).

In this post we’ve explored some cool SQL concepts that will help beginners improve their queries flexibility – know other similar concepts that would make sense to identify? Write down in the comments below!

I’ve set up a course on learning SQL from Scratch on Udemy where I’ve included a deep dive into these concepts and more – the course is suitable for absolute beginners and I would love to have you around!


Related Articles