
In the previous article, we started to understand the basic SQL operations that can be performed. These operations helped us to gain some initial insights into the dataset. They ensured that we optimised the interpreter to efficiently analyse the data.
This article aims to delve a bit deeper and review more Advanced Sql operations.
DISTINCT
The distinct operation aims to display the unique values contained within a column. When reviewing a dataset multiple duplicate values can exist. Some columns may contain a few discrete values and these would produce a short list of unique values. Whereas a column with a wide range of values, such as a continuous column, would have a large number of discrete values. Being aware of this discrepancy helps to understand the cardinality of a column. For a continuous variable a method of bucketing into separate range values could provide more insight.

From code 1.1 and the following output we can see that the county column had a short list of unique values. Performing the same operation on the continuous temperature column would have produced a much larger list.
UNION
A UNION operation aims to combine the results of two or more queries into one result output. Within most databases two options exist. Firstly, the UNION statement combines all unique values from the queries provided. This aims to reduce the output into the smallest number of values. Being careful to put the queries in the correct order can reduce the chance of displaying the incorrect order of results. The second option is the UNION ALL statement, which aims to combine all values together without performing any duplicate checks.
From code 1.2 we have created two queries that are combined to display the summary values in one dataset. We need to ensure that the same columns displayed in each query are in the same order. If a column only exists in one query then the corresponding column should be set to a missing value in the other query.

The output above shows that the two queries have been stacked together and that all unique values remain.
ORDER BY
Having outputs in the correct order helps to display additional information to the end users. With the order by statement we are able to order the results in a similar manner to the group by statement. A variation of columns can be used such as; existing columns, column aliases or aggregation functions.
The code displayed in code 1.3 shows a variation of the same query with the addition of the key word DESC after the column of interest. Including this key word ensures that the results are displayed in descending order, as the default operation is for ascending order.

SQL Output 1.3 highlights the results of a descending operation being applied to the maximum temperature.
LIMIT and OFFSET
When first interacting with a dataset, we are interested in understanding what a sample of the data looks like. By reviewing a sample of the data we can gain valuable insights into how the columns are displayed. It allows the developer to produce Data Analysis ideas quickly. In addition when beginning to produce derived variables we can reduce the overhead of the operations by performed them on the smaller sample. Once we are comfortable that the operations work on the sample then they can be moved onto the full population. However, we must always be aware that a small sample may be much different to the full population, therefore other analysis is required to ensure that strange results are not produced.
From the code shown above we can see the top record being extracted each time, with a slight alteration performed using the offset.

As the Sql Output 1.4 shows the top record has been extracted after the other SQL operations were applied. This example picks out the highest temperature from the entire dataset for the filtered data.

By introducing the offset operation we are able to exclude the first two results and display the next best output. Such analysis maybe useful if you are aware that the output will contain some outlier values which might impact how the column is interpreted.
WINDOW
Aims to create a window of one or more rows from the result set. There are a wide range of methods that can be applied to this operation. We have displayed a few high level methods that can help to quickly summarise columns over the window of interest.

The row_number highlights the number of rows within the partition clause. Whereas the rank shows the relative rank compared to other values. In the output above we seen that similar values for the window operation are shown as the same rank value.
Conclusion
Using advanced SQL techniques can really help to increase the data insights that can be discovered from a dataset. Within this article we have seen how reviewing unique values helps to develop aggregate statistics. However with the addition of an order statement the highest or lowest values can really shine through. The use of union helps to stack dataset analysis which works really well with time series outputs. Introducing a limit can aid the initial discovery process as we peeked inside the dataset. Finally, the brief introduction to the window function helps to view the data in separate windows.
Thanks for reading