
We live in the era of big data. We collect vast amounts of data, which allows us to infer meaningful results and make informed business decisions. However, raw data does not offer much unless it is processed and explored. To make the most out of raw data, we need a thorough exploratory data analysis process. Even if we build complex, well-structured machine learning models, we cannot just dump the raw data into them. The models are only as good as the data we feed them. As the amount of data increases, it gets trickier to analyze and explore. This is where the power of data analysis and visualization tools comes in.
We will explore a dataset on flight delays which is available on Kaggle. There are two datasets, one with flight details from January 2019 and the other from January 2020. In this post, we will use the one from January 2019. We start by importing the dataset into a pandas dataframe.
import numpy as np
import pandas as pd
df = pd.read_csv("/content/Jan_2019_ontime.csv")
print(df.shape)
df.columns

The dataset contains information on more than 500k flights (rows), with 22 features (columns) for each flight. Some of the columns seem redundant or repetitive, so we will only include a subset of them in our analysis.
df = df[['DAY_OF_MONTH', 'DAY_OF_WEEK', 'OP_CARRIER_AIRLINE_ID', 'ORIGIN', 'DEST','DEP_TIME', 'DEP_DEL15', 'ARR_TIME', 'ARR_DEL15', 'CANCELLED', 'DIVERTED', 'DISTANCE']]
df.head()

For each flight, the following features are available:
- Day of the month
- Day of the week
- Carrier airline (OP_CARRIER_AIRLINE_ID)
- Departure city and departure time (ORIGIN, DEP_TIME)
- Arrival city and arrival time (DEST, ARR_TIME)
- Delayed on departure and arrival (DEP_DEL15, ARR_DEL15)
- Cancelled, Diverted
- Distance
Missing values
We first handle missing values. As a first step, let’s check the number of missing values in each column.
df.isna().sum()

We have missing values in 4 columns. The numbers of missing values are close, so they might be in the same rows. To confirm, we can visualize the missing values using the missingno module.
%matplotlib inline
import missingno as msno
msno.matrix(df)

White lines indicate missing values. Most of the missing values in those 4 columns are in the same rows. The rows with missing values might belong to cancelled or diverted flights. Let's check the "CANCELLED" and "DIVERTED" columns before handling the missing values.
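One quick way to check is to look at the value counts of these two columns:
print(df['CANCELLED'].value_counts())
print(df['DIVERTED'].value_counts())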

The number of cancelled flights is approximately the same as the number of missing values in the columns with flight information.

Diverted flights also cause missing values. Since the missing values come from cancelled or diverted flights, we can simply drop those rows.
df.dropna(axis=0, how='any', inplace=True)
df.isna().sum().sum()
0

The CANCELLED and DIVERTED columns contain only zeros now, so we can also drop them.
df.drop(['CANCELLED','DIVERTED'], axis=1, inplace=True)
General overview of delays
The columns that indicate delays are binary so a flight is either delayed (1) or not delayed (0).
df['ARR_DEL15'].value_counts()
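
For comparison, we can do the same quick check for departure delays:
df['DEP_DEL15'].value_counts()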

There are more delays on arrival than on departure, which makes sense because departure delays are likely to cause arrival delays. We can compare the delays on departure and arrival using the groupby function of pandas.
df[['ARR_DEL15','DEP_DEL15','DEST']].groupby(['DEP_DEL15','ARR_DEL15']).count()

As expected, arrivals are more likely to be delayed if there is a delay on departure. Please note that we could use any column as the third column in the statement above (I chose 'DEST'); its only purpose is to provide the counts.
Origins and destinations
The ORIGIN column contains the departure locations and the DEST column contains the destination locations. Let's see how many origins and destinations we have in the dataset.
print("There are {} unique origin city".format(df['ORIGIN'].nunique()))
print("There are {} unique destination city".format(df['DEST'].nunique()))

We can sort the origins and destinations in terms of delay rate.
df[['ORIGIN','DEP_DEL15']].groupby('ORIGIN').agg(['mean','count']).sort_values(by=('DEP_DEL15','mean'), ascending=False)[:10]

df[['DEST','ARR_DEL15']].groupby('DEST').agg(['mean','count']).sort_values(by=('ARR_DEL15','mean'), ascending=False)[:10]

We sorted the origins in terms of departure delay ratio and destinations in terms of arrival delay ratio. The locations with high delay rates have very few flights.
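To get a sense of how many flights each location handles, we can check the average number of flights per origin (a quick check using value_counts):
df['ORIGIN'].value_counts().mean()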

The average number of flights per location is 1635. The ones with fewer than 100 flights may be outliers and do not give us an accurate estimate. Let's sort only the origins that have more than 500 flights.
origin_delays = df[['ORIGIN','DEP_DEL15']].groupby('ORIGIN').agg(['mean','count'])
origin_delays[origin_delays[('DEP_DEL15','count')] > 500].sort_values(by=('DEP_DEL15','mean'), ascending=False)[:10]

The whole list changed except for "ASE", which has the highest delay rate by far. Please note that when we filter a dataframe with a multilevel column index, we need to use all levels. For instance, in order to filter the above dataframe based on "count", we use origin_delays[('DEP_DEL15','count')], not origin_delays['count'].
Day of the week and month
The dataset covers Jan 2019. We can check the delay ratio based on day of the week and day of the month. Since we are not checking the delays based on location, we can create a new column by summing delays at departure and arrival.
df['DEL15'] = df['ARR_DEL15'] + df['DEP_DEL15']

The value of the "DEL15" column is 2 if both departure and arrival are delayed, 1 if only one of them is delayed, and 0 otherwise.
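We can quickly check the distribution of this new column:
df['DEL15'].value_counts()
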
Let’s compare the days of the week in terms of average delays.
df[['DAY_OF_WEEK','DEL15']].groupby('DAY_OF_WEEK').agg(['mean','count'])

Most flights are on Wednesdays and Thursdays, which also have slightly higher average delay rates than the other days. This is an indication that as the frequency of flights increases, delays also increase.
Let's also check the average delays based on the day of the month. This time we will create an informative visualization using the Plotly Express API of the plotly library, but we first need to calculate the average delay for each day.
day_of_month = df[['DAY_OF_MONTH','DEL15']].groupby('DAY_OF_MONTH').agg(['mean','count'])
day_of_month.columns = ['mean','count']

We can now plot it.
import plotly.express as px
fig = px.bar(day_of_month, x=day_of_month.index, y='mean', color='count', height=400, title="Delays in Jan 2019")
fig.show()

The color of each bar indicates the number of flights on that day, based on the count scale on the right side. The height of the bar shows the delay ratio. There is not a strong pattern across days, but the days with more flights tend to have higher delay ratios.
Carrier airline
Operational issues of a carrier might also cause flight delays.
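Let's first check how many distinct carriers appear in the dataset:
df['OP_CARRIER_AIRLINE_ID'].nunique()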

There are 17 different carriers in the dataset. We first sort the carriers based on their average delays.
carrier = df[['OP_CARRIER_AIRLINE_ID','DEL15']].groupby('OP_CARRIER_AIRLINE_ID').agg(['mean','count']).sort_values(by=('DEL15','mean'), ascending=False).reset_index()
carrier.columns = ['carrier_ID','mean','count']
carrier['carrier_ID'] = carrier['carrier_ID'].astype('object')
carrier.head()

Let’s create a bar plot similar to the one we did for days.
fig = px.bar(carrier, x=carrier.index, y='mean', color='count', height=500, title="Delays of Different Airlines")
fig.show()

The lighter colors indicate more flights. We cannot see a trend showing that more flights cause more delays. For instance, the airlines represented by the second, third, and fourth bars are among the lowest in terms of the number of flights, yet they have higher delay ratios than most other carriers. Thus, the carrier is also a determining factor in delays, just like the number of flights.
We have analyzed the flight delays based on location, day, and carriers. Our findings are:
- As the frequency of flights increases, we are more likely to see delays.
- Some carriers have higher delay ratios despite not having more flights than others.
- Although not a very strong factor, the location may also have an effect on delays.
The dataset also includes information on the time and distance of flights, which might also have an effect on delays; these columns can be analyzed with similar methods (see the sketch below). There is no limit to exploratory data analysis. Depending on our task or goal, we can approach the data from a different perspective and dig deep to explore it. However, the tools used in the process are usually similar, and it is very important to practice a lot in order to get good at this process.
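As a small example, one way to explore the distance column is to bin it and compare the average delay rate per bin. This is just a sketch; the bin edges below are arbitrary choices for illustration:
# Bin flight distance and compare the average delay rate in each bin
# (bin edges are arbitrary, for illustration only)
distance_bins = pd.cut(df['DISTANCE'], bins=[0, 500, 1000, 1500, 2500, 6000])
df.groupby(distance_bins)['DEL15'].agg(['mean','count'])
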
Thank you for reading. Please let me know if you have any feedback.