PostgreSQL: The world’s most advanced open source relational database
Sql provides numerous functions and methods to manage data stored in tabular form. A relational database management system (RDBMS) is a program that uses SQL to manage data stored in a relational database.
A relational database contains a lot of tables that relate to each other by means of shared columns. There are many different RDBMSs such as MySQL, PostgreSQL, SQL Server, and so on.
In this article, we will go over 6 examples that demonstrate how SQL joins are performed. If you’d like to start with a basic introduction to SQL and PostgreSQL, here are the two introductory articles that I have written previously:
SQL joins allow for retrieving data from multiple tables. We basically combine multiple tables based on the values in the common columns of related tables. We can then extract any piece of information from these tables.
I have created two tables by using the data from the Melbourne housing dataset available on Kaggle. The name of the tables are houses and location and they are related based on the "houseid" column.
Example 1
Find the addresses of the houses posted on 2017–04–03.
select location.address, houses.date from location join houses on location.houseid = houses.houseid where houses.date = '2017-04-03' limit 5;
address | date
-----------------+------------
5 Charles St | 2017-04-03
40 Federation La | 2017-04-03
50 Bedford St | 2017-04-03
23 Hart St | 2017-04-03
1/80 Hawker St | 2017-04-03
We first specify the columns to be selected along with their table names. The second and third lines contain the table names. Then, we specify the condition to be used for matching rows between tables. Finally, the where statement is used for filtering rows. Limit keyword, as its name suggests, limits the number of rows to be displayed.
Example 2
Find the number of houses with type h in the Northern Metropolitan region.
SELECT COUNT(1) FROM location JOIN houses ON location.houseid = houses.houseid WHERE type = 'h' AND regionname = 'Northern Metropolitan';
count
-------
2754
(1 row)
There are 2754 houses that fit our condition. You may notice that we do not have to write the table names for the columns used with the where statement. It is allowed unless the tables do not have the columns with the same name.
Example 3
Find the number of houses for each type in the Northern Metropolitan region.
SELECT type, COUNT(1) AS number_of_houses
FROM houses
JOIN location
ON houses.houseid = location.houseid
WHERE regionname = 'Northern Metropolitan'
GROUP BY type;
type | number_of_houses
-----+------------------
t | 307
h | 2754
u | 829
(3 rows)
This example is quite similar to the previous one. Instead of just counting the houses with type h, we use the type column for grouping the observations (i.e. rows) and then count the houses in each group.
Example 4
Find the average house price in each suburb and display the top 5 suburbs in terms of the average price.
SELECT suburb, AVG(Price) AS avg_house_price FROM location JOIN houses ON location.houseid = houses.houseid GROUP BY suburb ORDER BY avg_house_price DESC LIMIT 5;
suburb | avg_house_price
------------+--------------------
Kooyong | 2185000
Canterbury | 2180240.7407407407
Middle Park | 2082529.4117647058
Albert Park | 1941355.072463768
Brighton | 1930158
(5 rows)
It is similar to the group by operation in the previous example. In this example, we also sort the results based on the aggregated column (avg_house_price). The results are sorted in ascending order by default but it can be changed using the desc keyword.
Example 5
Up to this point, we have used the table names in the join clause interchangeably. The order of the tables did not matter. "From houses join location" is the same as "from location join houses".
The reason why order did not matter is that we have used what is called an "inner join". Only the rows that have matching values in both tables are selected.
When we do a left join, we take the entire table on the left (i.e. the first one in the join clause) and then only the matching rows from the right table.
Let’s find the average land size of the houses in each region.
SELECT regionname, ROUND(AVG(landsize)) AS avg_landsize FROM location INNER JOIN houses ON location.houseid = houses.houseid GROUP BY regionname;
regionname | avg_landsize
---------------------------+--------------
Eastern Metropolitan | 634
Western Metropolitan | 494
South-Eastern Metropolitan | 614
Southern Metropolitan | 509
Western Victoria | 656
Eastern Victoria | 2950
Northern Metropolitan | 569
Northern Victoria | 3355
The syntax is the same except that we write "left join" instead of "join". Please keep in mind that the order of the table names in the join clause matters when we use left or right join.
There is no need to explicitly mention the right join because it can be converted to a left join just by switching the table names.
Example 6
We also have the "full outer join" which includes the matching rows from both tables, and also the rows that do not match.
Let’s find the price of the houses that includes the word "park" in its address and has more than 4 rooms.
sonery=# SELECT (price/1000000) AS price_million, address FROM houses FULL OUTER JOIN location ON houses.houseid = location.houseid WHERE address LIKE '%Park%' AND rooms >4;
price_million | address
---------------+-------------
2.1 | 21 Park Rd
2.3 | 119 Park St
2 | 52 Park Rd
In our case, the type of join does not actually matter because both tables contain every house id. However, there will be cases where we need to use left or outer joins. Thus, it is important to know how they work.
Conclusion
We have done 6 examples to demonstrate how joins work. The data we need to retrieve from a relational database is typically spread out to multiple tables. Thus, it is very important to have a comprehensive understanding of SQL joins.
Thank you for reading. Please let me know if you have any feedback.