Scalable Log Analytics with Apache Spark — A Comprehensive Case-Study

Data Analysis & Visualization at Scale on Semi-structured Data

Photo by Robin Pierre on Unsplash


One of the most popular and effective enterprise case-studies which leverage analytics today is log analytics. Almost every small and big organization today have multiple systems and infrastructure running day in and day out. To effectively keep their business running, organizations need to know if their infrastructure is performing to its maximum potential. This involves analyzing system and application logs and maybe even apply predictive analytics on log data. The amount of log data is typically massive, depending on the type of organizational infrastructure and applications running on it. Gone are the days when we were limited by just trying to analyze a sample of data on a single machine due to compute constraints.

Source: Doug Henschen

Powered by big data, better and distributed computing, big data processing and open-source analytics frameworks like Spark, we can perform scalable log analytics on potentially millions and billions of log messages daily. The intent of this case-study oriented tutorial is to take a hands-on approach to showcasing how we can leverage Spark to perform log analytics at scale on semi-structured log data. If you are interested in scalable SQL with Spark, feel free to check out SQL at scale with Spark.

We will be covering the following major topics in this article today.

While there are a lot of excellent open-source frameworks and tools out there for log analytics including elasticsearch, the intent of this tutorial is to showcase how Spark can be leveraged for analyzing logs at scale. In the real-world, you are free to choose your toolbox when analyzing log data. Let’s get started!

Main Objective — NASA Log Analytics

Like we mentioned before, Apache Spark is an excellent and ideal open-source framework for wrangling, analyzing and modeling on structured and unstructured data — at scale! In this tutorial, our main objective is focusing on one of the most popular case studies in the industry — log analytics. Typically, server logs are a very common data source in enterprises and often contain a gold mine of actionable insights and information. Log data comes from many sources in an enterprise, such as the web, client and compute servers, applications, user-generated content, flat files. They can be used for monitoring servers, improving business and customer intelligence, building recommendation systems, fraud detection, and much more.

Spark allows you to dump and store your logs in files on disk cheaply, while still providing rich APIs to perform data analysis at scale. This hands-on case study will show you how to use Apache Spark on real-world production logs from NASA and learn data wrangling and basic yet powerful techniques in exploratory data analysis. In this case study, we will analyze log datasets from NASA Kennedy Space Center web server in Florida. The full data set is freely available for download here.

These two datasets contain two months’ worth of all HTTP requests to the NASA Kennedy Space Center WWW server in Florida. You can head over to the website and download the following files as needed (or click on the following links directly).

Make sure both the files are in the same directory as the notebook containing the tutorial which is available on my GitHub.

Setting up Dependencies

The first step is to make sure you have access to a Spark session and cluster. For this you can use your own local setup or a cloud based setup. Typically most cloud platforms will provide a Spark cluster these days and you also have free options including Databricks community edition. This tutorial assumes you already have Spark setup hence we will not be spending additional time configuring or setting up Spark from scratch.

Often pre-configured Spark setups already have the necessary environment variables or dependencies pre-loaded when you start your jupyter notebook server. In my case, I can check them using the following commands in my notebook.


This shows me that my cluster is running Spark 2.4.0 at the moment. We can also check if sqlContext is present using the following code.

<pyspark.sql.context.SQLContext at 0x7fb1577b6400>

Now in case you don’t have these variables pre-configured and get an error, you can load them up and configure them using the following code. Besides this we also load up some other libraries for working with dataframes and regular expressions.

Working with regular expressions will be one of the major aspects of parsing log files. Regular expressions are a really powerful pattern matching technique which can be used to extract and find patterns in semi-structured and unstructured data.

Source: xkcd

Regular expressions can be extremely effective and powerful, yet they can sometimes be overwhelming or confusing. Not to worry though, with more practice, you can really leveraging its maximum potential. The following example showcases a way of using regular expressions in Python.

<_sre.SRE_Match object; span=(0, 25), match="I'm searching for a spark"> 0 25
<_sre.SRE_Match object; span=(25, 36), match=' in PySpark'> 25 36

Let’s move on to the next part of our analysis.

Loading and Viewing the NASA Log Dataset

Given that our data is stored in the following mentioned path (in the form of flat files), let’s load it into a DataFrame. We’ll do this in steps. The following code get’s us the log data file names in our disk.

['NASA_access_log_Jul95.gz', 'NASA_access_log_Aug95.gz']

Now, we’ll use or to read the text file. This will produce a DataFrame with a single string column called value.

|-- value: string (nullable = true)

This allows us to see the schema for our log data which apparently looks like text data which we shall inspect soon. You can view the type of data structure holding our log data using the following code.


We will be using Spark DataFrames throughout our tutorial. However if you want, you can also convert a dataframe into an RDD if needed, Spark’s original data structure (resilient distributed datasets).

base_df_rdd = base_df.rdd

Let’s now take a peek at the actual log data in our dataframe., truncate=False)

This definitely looks like standard server log data which is semi-structured and we will definitely need to do some data processing and wrangling before this can be useful. Do remember accessing data from RDDs is slightly different as seen below.


Now that we have loaded up and viewed our log data, let’s process and wrangle it.

Data Wrangling

In this section, we will try and clean and parse our log dataset to really extract structured attributes with meaningful information from each log message.

Log Data Understanding

If you’re familiar with web server logs, you’ll recognize that the above displayed data is in Common Log Format.

The fields are: remotehost rfc931 authuser [date] "request" status bytes

We will need to use some specific techniques to parse, match and extract these attributes from the log data.

Data Parsing and Extraction with Regular Expressions

Next, we have to parse our semi-structured log data into individual columns. We’ll use the special built-in regexp_extract() function to do the parsing. This function matches a column against a regular expression with one or more capture groups and allows you to extract one of the matched groups. We’ll use one regular expression for each field we wish to extract.

You must have heard or used a fair bit of regular expressions by now. If you find regular expressions confusing (and they certainly can be), and you want to learn more about them, we recommend checking out the RegexOne web site. You might also find Regular Expressions Cookbook, by Goyvaerts and Levithan, to be useful as a reference.

Let’s take a look at the total number of logs we are working with in our dataset.

print((base_df.count(), len(base_df.columns)))
(3461613, 1)

Looks like we have a total of approximately 3.46 million log messages. Not a small number! Let’s extract and take a look at some sample log messages.

Extracting host names

Let’s try and write some regular expressions to extract the host name from the logs.


Extracting timestamps

Let’s now try and use regular expressions to extract the timestamp fields from the logs

['01/Jul/1995:00:00:01 -0400',
'01/Jul/1995:00:00:06 -0400',
'01/Jul/1995:00:00:09 -0400',
'01/Jul/1995:00:00:14 -0400',
'01/Jul/1995:00:00:15 -0400',
'01/Jul/1995:00:00:15 -0400']

Extracting HTTP Request Method, URIs and Protocol

Let’s now try and use regular expressions to extract the HTTP request methods, URIs and Protocol patterns fields from the logs.

[('GET', '/history/apollo/', 'HTTP/1.0'),
('GET', '/shuttle/countdown/', 'HTTP/1.0'),
('GET', '/shuttle/countdown/count.gif', 'HTTP/1.0'),
('GET', '/images/NASA-logosmall.gif', 'HTTP/1.0')]

Extracting HTTP Status Codes

Let’s now try and use regular expressions to extract the HTTP status codes from the logs.

['200', '200', '200', '304', ..., '200', '200']

Extracting HTTP Response Content Size

Let’s now try and use regular expressions to extract the HTTP response content size from the logs.

['6245', '3985', '4085', '0', ..., '1204', '40310', '786']

Putting it all together

Let’s now try and leverage all the regular expression patterns we previously built and use the regexp_extract(...) method to build our dataframe with all the log attributes neatly extracted in their own separate columns.

Finding Missing Values

Missing and null values are the bane of data analysis and machine learning. Let’s see how well our data parsing and extraction logic worked. First, let’s verify that there are no null rows in the original dataframe.


All good! Now, if our data parsing and extraction worked properly, we should not have any rows with potential null values. Let’s try and put that to test!


Ouch! Looks like we have over 33K missing values in our data! Can we handle this?

Do remember, this is not a regular pandas dataframe which you can directly query and get which columns have null. Our so-called big dataset is residing on disk which can potentially be present in multiple nodes in a spark cluster. So how do we find out which columns have potential nulls?

Finding Null Counts

We can typically use the following technique to find out which columns have null values.

(Note: This approach is adapted from an excellent answer on StackOverflow.)

Well, looks like we have one missing value in the status column and everything else is in the content_size column. Let's see if we can figure out what's wrong!

Handling nulls in HTTP status

Our original parsing regular expression for the status column was:

regexp_extract('value', r'\s(\d{3})\s', 1).cast('integer')
.alias( 'status')

Could it be that there are more digits making our regular expression wrong? or is the data point itself bad? Let’s try and find out!

Note: In the expression below, ~ means "not".


Let’s look at what this bad record looks like!

Looks like a record with a lot of missing information! Let’s pass this through our log data parsing pipeline.

Looks like the record itself is an incomplete record with no useful information, the best option would be to drop this record as follows!

Handling nulls in HTTP content size

Based on our previous regular expression, our original parsing regular expression for the content_size column was:

regexp_extract('value', r'\s(\d+)$', 1).cast('integer')

Could there be missing data in our original dataset itself? Let’s try and find out! We first try to find out the records in our base dataframe with potential missing content sizes.


The number seems to match the number of missing content size values in our processed dataframe. Let’s take a look at the top ten records of our data frame having missing content sizes.


It is quite evident that the bad raw data records correspond to error responses, where no content was sent back and the server emitted a “-" for the content_size field.

Since we don’t want to discard those rows from our analysis, let’s impute or fill them to 0.

Fix the rows with null content_size

The easiest solution is to replace the null values in logs_df with 0 like we discussed earlier. The Spark DataFrame API provides a set of functions and fields specifically designed for working with null values, among them:

  • fillna(), which fills null values with specified non-null values.
  • na, which returns a DataFrameNaFunctions object with many functions for operating on null columns.

There are several ways to invoke this function. The easiest is just to replace all null columns with known values. But, for safety, it’s better to pass a Python dictionary containing (column_name, value) mappings. That’s what we’ll do. A sample example from the documentation is depicted below

>>>{'age': 50, 'name': 'unknown'}).show()
|age|height| name|
| 10| 80| Alice|
| 5| null| Bob|
| 50| null| Tom|
| 50| null|unknown|

Now we use this function and fill all the missing values in the content_size field with 0!

Look at that, no missing values!

Handling Temporal Fields (Timestamp)

Now that we have a clean, parsed DataFrame, we have to parse the timestamp field into an actual timestamp. The Common Log Format time is somewhat non-standard. A User-Defined Function (UDF) is the most straightforward way to parse it.

Let’s now use this function to parse our time column in our dataframe.

Things seem to be looking good! Let’s verify this by checking the schema of our dataframe.

|-- host: string (nullable = true)
|-- method: string (nullable = true)
|-- endpoint: string (nullable = true)
|-- protocol: string (nullable = true)
|-- status: integer (nullable = true)
|-- content_size: integer (nullable = false)
|-- time: timestamp (nullable = true)

Let’s now cache logs_df since we will be using it extensively for our data analysis section in the next part!


Data Analysis on our Web Logs

Now that we have a DataFrame containing the parsed and cleaned log file as a data frame, we can perform some interesting exploratory data analysis (EDA) to try and get some interesting insights!

Content Size Statistics

Let’s compute some statistics about the sizes of content being returned by the web server. In particular, we’d like to know what are the average, minimum, and maximum content sizes.

We can compute the statistics by calling .describe() on the content_size column of logs_df. The .describe() function returns the count, mean, stddev, min, and max of a given column.

Alternatively, we can use SQL to directly calculate these statistics. You can explore many useful functions within the pyspark.sql.functions module in the documentation.

After we apply the .agg() function, we call toPandas() to extract and convert the result into a pandas dataframe which has better formatting on Jupyter notebooks.

We can validate the results and see they are the same as expected.

HTTP Status Code Analysis

Next, let’s look at the status code values that appear in the log. We want to know which status code values appear in the data and how many times. We again start with logs_df, then group by the status column, apply the .count() aggregation function, and sort by the status column.

Total distinct HTTP Status Codes: 8

Looks like we have a total of 8 distinct HTTP status codes. Let’s take a look at their occurrences in the form of a frequency table.

Looks like status code 200 OK is the most frequent code which is a good sign that things have been working normally most of the time. Let’s visualize this.

HTTP Status Code occurrences

Not too bad! But several status codes are almost not visible due to the huge skew in the data. Let’s take a log transform and see if things improve.

The results definitely look good and seem to have handled the skewness, let’s verify this by visualizing this data.

HTTP Status Code occurrences — Log Transformed

This definitely looks much better and less skewed!

Analyzing Frequent Hosts

Let’s look at hosts that have accessed the server frequently. We will try to get the count of total accesses by each host and then sort by the counts and display only the top ten most frequent hosts.

This looks good but let’s inspect the blank record in row number 9 more closely.

host_sum_pd_df = host_sum_df.toPandas()

Looks like we have some empty strings as one of the top host names! This teaches us a valuable lesson to not just check for nulls but also potentially empty strings when data wrangling.

Display the Top 20 Frequent EndPoints

Now, let’s visualize the number of hits to endpoints (URIs) in the log. To perform this task, we start with our logs_df and group by the endpointcolumn, aggregate by count, and sort in descending order like the previous question.

Not surprisingly GIFs, the home page and some CGI scripts seem to be the most accessed assets.

Top Ten Error Endpoints

What are the top ten endpoints requested which did not have return code 200 (HTTP Status OK)? We create a sorted list containing the endpoints and the number of times that they were accessed with a non-200 return code and show the top ten.

Looks like GIFs (animated\static images) are failing to load the most. Do you know why? Well given that these logs are from 1995 and given the internet speed we had back then, I’m not surprised!

Total number of Unique Hosts

What were the total number of unique hosts who visited the NASA website in these two months? We can find this out with a few transformations.


Number of Unique Daily Hosts

For an advanced example, let’s look at a way to determine the number of unique hosts in the entire log on a day-by-day basis. This computation will give us counts of the number of unique daily hosts.

We’d like a DataFrame sorted by increasing day of the month which includes the day of the month and the associated number of unique hosts for that day.

Think about the steps that you need to perform to count the number of different hosts that make requests each day. Since the log only covers a single month, you can ignore the month. You may want to use the dayofmonthfunction in the pyspark.sql.functions module (which we have already imported as F.

host_day_df : A DataFrame with two columns

There will be one row in this DataFrame for each row in logs_df. Essentially, we are just transforming each row of logs_df. For example, for this row in logs_df: - - [01/Aug/1995:00:35:41 -0400] "GET /shuttle/missions/sts-73/news HTTP/1.0" 302 -

your host_day_df should have: 1

host_day_distinct_df : This DataFrame has the same columns as host_day_df, but with duplicate (day, host) rows removed.

daily_unique_hosts_df : A DataFrame with two columns:

This gives us a nice dataframe showing the total number of unique hosts per day. Let’s visualize this!

Unique Hosts per Day

Average Number of Daily Requests per Host

In the previous example, we looked at a way to determine the number of unique hosts in the entire log on a day-by-day basis. Let’s now try and find the average number of requests being made per Host to the NASA website per day based on our logs. We’d like a DataFrame sorted by increasing day of the month which includes the day of the month and the associated number of average requests made for that day per Host.

We can now visualize the average daily requests per host.

Average Daily Requests per Host

Looks like Day 13 got the maximum number of requests per host.

Counting 404 Response Codes

Create a DataFrame containing only log records with a 404 status code (Not Found). We make sure to cache() the not_found_df dataframe as we will use it in the rest of the examples here. How many 404 records do you think are in the logs?

Total 404 responses: 20899

Listing the Top Twenty 404 Response Code Endpoints

Using the DataFrame containing only log records with a 404 response code that we cached earlier, we will now print out a list of the top twenty endpoints that generate the most 404 errors. Remember, top endpoints should be in sorted order.

Listing the Top Twenty 404 Response Code Hosts

Using the DataFrame containing only log records with a 404 response code that we cached earlier, we will now print out a list of the top twenty hosts that generate the most 404 errors. Remember, top hosts should be in sorted order.

Gives us a good idea which hosts end up generating the most 404 errors for the NASA webpage.

Visualizing 404 Errors per Day

Let’s explore our 404 records temporally (by time) now. Similar to the example showing the number of unique daily hosts, we will break down the 404 requests by day and get the daily counts sorted by day in errors_by_date_sorted_df.

Let’s visualize the total 404 errors per day now.

Total 404 Error per Day

Top Three Days for 404 Errors

Based on the earlier plot, what are the top three days of the month having the most 404 errors? We can leverage our previously created errors_by_date_sorted_df for this.

Visualizing Hourly 404 Errors

Using the DataFrame not_found_df we cached earlier, we will now group and sort by hour of the day in increasing order, to create a DataFrame containing the total number of 404 responses for HTTP requests for each hour of the day (midnight starts at 0). Then we will build a visualization from the DataFrame.

Total 404 Error per Hour

Looks like total 404 errors occur the most in the afternoon and the least in the early morning. We can now reset the maximum rows displayed by pandas to the default value since we had changed it earlier to display a limited number of rows.

pd.set_option('max_rows', def_mr)


We took a hands-on approach to data wrangling, parsing, analysis and visualization at scale on a very common yet essential case-study on Log Analytics. While the data we worked on here may not really be traditionally ‘Big Data’ from a size or volume perspective, the techniques and methodologies are generic enough to scale on larger volumes of data. I hope this case-study gives you a good idea about how open-source frameworks like Apache Spark can be easily leveraged to work with structured and semi-structured data at scale!

All the code and analyses accompanying this article are available in my GitHub Repository.

You can find a step-by-step approach in this Jupyter Notebook.

I solve real-world problems leveraging data science, artificial intelligence, machine learning and deep learning. I also do some consulting, research and mentoring in my spare time. If you need focused consulting, training sessions, want me to speak at events or if you want to publish an article on TDS, feel free to reach out to me on LinkedIn.