Every device on the internet is uniquely addressable via an Internet Protocol (IP) address. The global IP address space is overseen by the Internet Assigned Numbers Authority (IANA). Traditionally, IANA allocates address space in /8 prefix blocks for IPv4, which are subsequently assigned to Internet Service Providers and other organizations. Various databases exist to map these IP blocks to their respective owners, along with information about the originating country and city.
As the Canadian national CSIRT, we, the Canadian Centre for Cyber Security, rely heavily on referencing these databases for looking up a given IP or enhancing entire datasets through SQL JOINs. However, not all use cases necessitate precision up to the city level; sometimes, country information alone suffices.
Within a country, many network blocks are contiguous. Consolidating these into mega blocks can significantly reduce the size of a table mapping mega blocks to countries, leading to improved JOIN operations.
In this article, we will demonstrate how to summarize a geolocation table by merging contiguous network blocks.
Let’s assume our geolocation table contains the following data:
+----------+-------+---------+-----------+-----------+
| start_ip | end_ip| country | city | owner |
+----------+-------+---------+-----------+-----------+
| 1 | 2 | ca | Toronto | Telus |
| 3 | 4 | ca | Quebec | Rogers |
| 5 | 8 | ca | Vancouver | Bell |
| 10 | 14 | ca | Montreal | Telus |
| 19 | 22 | ca | Ottawa | Rogers |
| 23 | 29 | ca | Calgary | Videotron |
+----------+-------+---------+-----------+-----------+
Here, start_ip
represents the lowest number in the network block, and end_ip
represents the largest. Normally, these numbers are much larger. For example, Google’s DNS server 8.8.8.8 is represented by the number 134744072. We use simple synthetic values for illustration purposes.
To start, let’s perform a simple summarization. For example, counting the number of IP addresses assigned to each country. This can be achieved by grouping the data by country and summing the number of IPs in each network block.
SELECT
country,
SUM(end_ip - start_ip + 1) as num_ip
FROM
geo_table
GROUP BY
country
This statement groups rows by country and applies a SUM
aggregation function, calculating the total number of IPs for each country. It’s important to note that the SUM
aggregation is associative, meaning the order in which you sum does not matter, similar to additions in mathematics.
+---------+--------+
| country | num_ip |
+---------+--------+
| ca | 24 |
+---------+--------+
Now, let’s delve into the complexities of aggregating contiguous network blocks. Referring to our original table, we need to fuse together the first 3 rows. Bocks 1–2, 3–4, 5–8 should result in the mega block 1–8. We also need to fuse the last 2 rows. Blocks 19–22 and 23–29 result into 19–29. Our goal is to produce the following table:
+----------+-------+---------+
| start_ip | end_ip| country |
+----------+-------+---------+
| 1 | 8 | ca |
| 10 | 14 | ca |
| 19 | 29 | ca |
+----------+-------+---------+
Detecting contiguous blocks requires inter-row information, and the order of rows becomes crucial. Fortunately, windowed analytical functions provide a solution by offering a mechanism for inter-record referencing. These functions, such as LEAD
and LAG
, enable comparisons with the values of the previous and subsequent rows, facilitating the identification of contiguous IP blocks.
Let’s apply the LEAD
and LAG
windowing functions to our table. Notice that within the OVER
clause we still specify that our data should be grouped by country PARTITION BY country
but additionally we specify the ordering of this data ORDER BY start_ip
.
SELECT
*,
LAG(end_ip) OVER (
PARTITION BY country
ORDER BY start_ip) AS prev_end_ip,
LEAD(start_ip) OVER (
PARTITION BY country
ORDER BY start_ip) AS next_start_ip
FROM
geo_table
The resulting table view_1
is as follows:
+----------+-------+---------+-------------+---------------+
| start_ip | end_ip| country | prev_end_ip | next_start_ip |
+----------+-------+---------+-------------+---------------+
| 1 | 2 | ca | null | 3 |
| 3 | 4 | ca | 2 | 5 |
| 5 | 8 | ca | 4 | 10 |
| 10 | 14 | ca | 8 | 19 |
| 19 | 22 | ca | 14 | 23 |
| 23 | 29 | ca | 22 | null |
+----------+-------+---------+-------------+---------------+
It’s crucial to distinguish between windowing functions and simple GROUP BY
functions. In an OVER()
operation, the results of LEAD
and LAG
are appended to every row, providing context for the previous and next row’s information. This is distinct from functions in a GROUP BY
clause that reduce a group of rows into a single aggregation result.
Now that we have access to the details of both the preceding and succeeding rows, we can facilitate inter-row comparisons. This comparison is crucial for identifying contiguous IP blocks, enabling us to determine when to fuse adjacent blocks together.
Each row can fall into one of four states: 1) Remove: The block is contiguous with both the previous and next blocks. 2) Start: The block is contiguous with only the next block. 3) End: The block is contiguous with only the previous block. 4) Keep: The block is not contiguous with either the previous nor the next block.
Let’s add this state
column to our table.
SELECT
*,
CASE
WHEN (end_ip = next_start_ip - 1)
AND (start_ip = prev_end_ip + 1) THEN 'remove'
WHEN (end_ip = next_start_ip - 1) THEN 'start'
WHEN (start_ip = prev_end_ip + 1) THEN 'end'
ELSE 'keep'
END AS state
FROM
view_1
We obtain the following view_2
result:
+----------+-------+---------+-------------+---------------+-------+
| start_ip | end_ip| country | prev_end_ip | next_start_ip | state |
+----------+-------+---------+-------------+---------------+-------+
| 1 | 2 | ca | null | 3 | start |
| 3 | 4 | ca | 2 | 5 | remove|
| 5 | 8 | ca | 4 | 10 | end |
| 10 | 14 | ca | 8 | 19 | keep |
| 19 | 22 | ca | 14 | 23 | start |
| 23 | 29 | ca | 22 | null | end |
+----------+-------+---------+-------------+---------------+-------+
We can proceed to remove the rows falling between the start and end blocks, specifically those identified with state remove
.
SELECT
start_ip,
end_ip,
country,
state
FROM
view_2
WHERE
state IN ('start', 'end', 'keep')
Resulting in view_3
:
+----------+-------+---------+-------+
| start_ip | end_ip| country | state |
+----------+-------+---------+-------+
| 1 | 2 | ca | start |
| 5 | 8 | ca | end |
| 10 | 14 | ca | keep |
| 19 | 22 | ca | start |
| 23 | 29 | ca | end |
+----------+-------+---------+-------+
We are getting close to our goal! All we have to do now is merge the start
and end
rows, which contain the start_ip
and end_ip
of the mega blocks we are seeking to produce. To achieve this, we once again use a window function. This time to fetch the end_ip
from the end
row.
SELECT
*,
LEAD(end_ip) OVER (
PARTITION BY country
ORDER BY start_ip) AS next_end_ip
FROM
view_3
Result in view_4
:
+----------+-------+---------+-------+-------------+
| start_ip | end_ip| country | state | next_end_ip |
+----------+-------+---------+-------+-------------+
| 1 | 2 | ca | start | 8 |
| 5 | 8 | ca | end | 14 |
| 10 | 14 | ca | keep | 22 |
| 19 | 22 | ca | start | 29 |
| 23 | 29 | ca | end | null |
+----------+-------+---------+-------+-------------+
Notice that the rows with state start
now have a start_ip
and a next_end_ip
, the information necessary to build a mega block.
The rows with state end
are no longer needed and can be removed.
The rows with state keep
already have the correct end_ip
.
We can now determine the final_end
value of the mega blocks. Two cases are possible:
1) for a start
row, we obtain the end value from the next_end_ip
.
2) for a keep
row, we simply use the original end_ip
value.
SELECT
start_ip AS final_start,
CASE
WHEN (state = 'start') THEN next_end_ip
WHEN (state = 'keep') THEN end_ip
ELSE NULL
END AS final_end_ip
FROM
view_4
WHERE
state IN ('start', 'keep')
We thus achieve our goal of fusing contiguous IPv4 blocks into mega blocks.
+----------+-------+---------+-------+-------------+------------+----------+
| start_ip | end_ip| country | state | next_end_ip | final_start| final_end|
+----------+-------+---------+-------+-------------+------------+----------+
| 1 | 2 | ca | start | 8 | 1 | 8 |
| 10 | 14 | ca | keep | 22 | 10 | 14 |
| 19 | 22 | ca | start | 29 | 19 | 29 |
+----------+-------+---------+-------+-------------+------------+----------+
Putting all the statements above together, we obtain a final statement:
SELECT
country,
final_start,
IF(state = 'start', next_end_ip, final_end) AS final_end
FROM (
SELECT
country,
start_ip AS final_start,
end_ip AS final_end,
LEAD(end_ip) OVER (
PARTITION BY country
ORDER BY start_ip) AS next_end_ip
FROM (
SELECT
start_ip,
end_ip,
country,
CASE
WHEN (end_ip = next_start_ip - 1)
AND (start_ip = prev_end_ip + 1) THEN 'remove'
WHEN (end_ip = next_start_ip - 1) THEN 'start'
WHEN (start_ip = prev_end_ip + 1) THEN 'end'
ELSE 'keep'
END AS state
FROM (
SELECT
*,
LAG(end_ip) OVER (
PARTITION BY country
ORDER BY start_ip) AS prev_end_ip,
LEAD(start_ip) OVER (
PARTITION BY country
ORDER BY start_ip) AS next_start_ip
FROM
geo_table
)
WHERE
state IN ('start', 'end', 'keep')
)
)
WHERE
state IN ('start', 'keep')
In conclusion, SQL analytical window functions offer a robust framework for complex data analysis. They empower users to perform aggregations while retaining the context of individual rows, facilitating tasks such as running totals, averages, and percentile calculations. Moreover, these functions play a crucial role in ranking, time-series analysis, and detecting anomalies and outliers within datasets. These functions are indispensable assets in the toolkit of data practitioners.
Analytical window functions are very powerful. In this article, we only scratched the surface; for example, we did not make use of a window_frame
. A window frame allows you to further refine which rows are considered in the aggregation. Window frames are relative to the current row and can be based on the number of rows or time intervals, making these functions indispensable for a wide range of analyses. You can learn more about these features in the Spark documentation: Spark SQL – Window Operations .