The world’s leading publication for data science, AI, and ML professionals.

Graphs 101: Airline Transportation Network

A hands-on overview with a real world dataset

The United States Bureau of Transportation Statistics (BTS) publishes a wealth of dataset on all things transportation related. One of those datasets under the unassuming name of "T-100 Domestic Segment (U.S. Carriers)" is described as:

This table contains domestic non-stop segment data reported by U.S. air carriers, including carrier, origin, destination, aircraft type and service class for transported passengers, freight and mail, available capacity, scheduled departures, departures performed, aircraft hours, and load factor when both origin and destination airports are located within the boundaries of the United States and its territories.

This is essentially a large csv file where each row corresponds to an individual flight, including airline, origin and destination airports, number of passengers embarked, date and time, etc. If we take airports to be our nodes and flights to be edges then we have our very own Network!

Flight Information

When downloading the file containing all the flights for 2019, we find 391,114 unique flights covering the 365 days of the year. When we sum up all the passengers traveling between the same Origin and Destination we find that we are down to 27,302 rows. Sorting by the passenger number we find that the Top 5 airline segments in 2019 were:

with over 1.6M passengers each!

Each row of this reduced dataset now corresponds to an individual edge. This format is known as "Edge List" and is one of the most common forms to represent a Graph, just a list of edges, one per line. If you’ve ever taken a look at a typical network science tutorial or online network data repository this is probably the format of the datasets you worked with.

Using just this simple example we can already introduce a few more concepts. We don’t have any edges where the Origin is the same as the Destination (who would want to fly from back to the airport they departed from without ever touching down?). Edges of that kind of known as "Self Loops" and tend to be discarded in most analyses. Just in case, we check if we have any such edges and find over two hundred self-loops:

Some of these appear to be attributable to air tours (of the Grand Canyon, for example) while others are less obvious. In any case, we remove them from our Edge List, leaving us with E=27,072 edges.

Edges also have a well defined direction, namely, from the Origin to the Destination, making this a "Directed Network". If instead of flights we were considering, say, interstate highways then we would have an "Undirected Network" (since highways typically allow for traffic in both directions).

Our network associates a numerical value with each edge (the number of passengers who flew that route). That means that the Airline Transportation Network, as we have defined it, is a "Weighted Network". In the general case, edge weights can represent any kind of numerical property associated with the edge, such as number of passengers, distance, maximum flow capacity, etc.

Also, as we added up the numbers of passengers across all flights we have only one edge between any pairs of nodes. This means that this network, as we have constructed it, is a "Simple Graph" (no multiple edges and no self-loops). If we were interested in distinguishing between, say, flights by different airline carriers we could have allowed multiple edges (one per airline) between the same pair of nodes. In that case, we would be dealing with a "MultiGraph".

Airport Information

Now that we’ve explored the edges of our network, we turn our attention to the nodes. We extract the list of nodes in our graph by combining the sets of Origins and Destinations. We find that we have N=1,318 unique nodes. Who knew there were that many Airports in the US alone?

Nodes usually have an ID associated with them (in this case the IATA code) and a great deal of information can be associated with the specific ID used, such as location, size, color, weight, etc. In the case of airports, we can download extra information about each airport from another of the BTS Tables, the whimsically named "Master Coordinate" Table that BTS describes as:

This table contains historical (time-based) information on airports used throughout the aviation databases. It provides a list of domestic and foreign airport codes and their associated world area code, country information, state information (if applicable), city name, airport name, city market information, and latitude and longitude information.

We are particularly interested in the Latitude and Longitude coordinate for each airport, as that will make our lives easier when we try to visualize the network. The good folks over at BTS made our lives easier by providing us with Latitude and Longitude columns as well as a flag indicating if these are the most up to date coordinates available:

Who knew airports moved around so much? After we filter out to just the most recent airport coordinates, we find that have information on 6,572 airports scattered around the world.

Since our flight network only includes flights within the US, we extract the set of nodes included in our flight dataset by combining the set of airports used as Origins or as Destinations. In total, we have 1,318 airports within the US. When we filter our Airport Lat/Lon database to just those airports we are able to finally plot them on a map!

As our network is embedded in real space, we were able to easily obtain spatial coordinates for each node. This is not always the case and we must calculate adequate X and Y coordinates for each node before we can visualize the network. In network terms, this is known as a "Graph Layout" and there’s a whole family of algorithms dedicated to calculating them. We’ll cover some of them in a future post.

As we can see, the BTS database is very complete and includes airports in all US territories and islands. To make our lives easier, we restrict ourselves to the 48 contiguous US states. With this further restriction we are down to just N=962 nodes and E=22,691 edges.

We can now visualize not only the nodes, but also the edges of this reduced network:

Network Analysis

With our network fully defined, we can now start exploring it more deeply. One of the most basic characteristics of a node is its Degree, k: The number of edges it is a part of. In the case of directed networks such as ours, we distinguish between the number of incoming and outgoing connections, in/out degree, respectively. For our network we have:

Where we note two important points. First. the average in and out degrees are the same. This is always true and a direct consequence of the fact that one nodes outgoing edge must be another incoming edge. We can also easily compute the overall average degree using the expression:

In our case, the average degree is 47.27, or exactly twice the average in/out degree. Basic math hasn’t failed us yet! 🙂

Second, and perhaps more surprisingly, we note that we have several nodes with 0 in or out degree. This means that there are airports from where flights leave but none come back (or vice versa)! In total, we have 60 such airports. A quick look up of the airport codes on wikipedia reveals that these appear to be military bases and other airports that don’t usually host commercial flights. Perhaps these correspond to emergency landings?

We can also see which are the airports with the largest number of incoming or outgoing flights:

Not surprisingly, these are large airline hubs like Chicago O’Hare, Dallas Fort-Worth or Atlanta’s Hartsfield-Jackson.

From this table, we can also see that while close, the number of incoming and outgoing edges is not the same. When we plot k_in vs k_out we find a strong (although not perfect) correlation with some nodes having higher kin while others have higher values of kout. Overall, the Pearson correlation is 0.986171.

A better way to summarize the connectivity properties of a network is to plot the degree distribution, the relative frequency of each value of kin/kout:

We note that the degree distribution is fairly broad, spanning over two orders of magnitude. This kind of distribution (linear on a log-log scale) is typical of empirical networks and is sometimes referred to as "scale-free". We’ll dig a lot more into this in future posts.

You can find all the code for the analysis in this post in our companion GitHub Repository, And if you enjoyed this post, you should subscribe to the Graphs For Data Science substack!


Related Articles