
Processing SCADA Alarm Data Offline with ELK

Open-Source Tools for Industrial Automation

Steag, Germany, via Wikimedia Commons

This article kicks off a series where we will use open-source Data Science tools to analyse alarm and event logs produced by SCADA systems. But first, what are SCADA systems and why do their alarms need to be analysed?

Industrial control systems (ICS, a.k.a. SCADA systems) are used to control manufacturing and process facilities such as power stations, water treatment plants, bakeries, and breweries. An industrial automation system consists of sensors, controllers, and an HMI (Human Machine Interface).

SCADA systems generate large volumes of alarm and event data (the system analysed in preparing this article generated over 57 million alarms and events in 12 months). Data science tools are ideally suited to dealing with these large datasets and this article explores the use of the ELK stack for SCADA alarm data analysis.

What is SCADA?

SCADA is a generic term for a computer-based industrial control system. A SCADA (Supervisory Control and Data Acquisition) system can be used to monitor and control industrial processes such as water treatment, manufacturing, power generation, and food and beverage production; in fact, any industrial process with automated equipment.

Typical System Architecture

The purpose of an industrial control or SCADA system is to monitor or control a physical process. The lowest layer of our system, the I/O (input/output) layer, comprises sensors and actuators.

Sensors measure physical parameters such as temperature, pressure, and level, while actuators are devices such as motors, valves, and solenoids that effect physical movement.

Sensors and actuators communicate with process controllers or PLCs (Programmable Logic Controllers). Process controllers are generally proprietary computer hardware running software that controls the process parameters.

SCADA servers communicate with the process controllers using industrial protocols and perform a number of functions which include,

  • Maintaining a database of sensor and actuator parameter values (typically referred to as point values)
  • Storing historical values of process parameters (trends)
  • Creating alarms and events by comparing parameter values to thresholds
  • Providing an HMI (Human Machine Interface), typically a graphical representation of the process including dynamic sensor and actuator data. The HMI allows process operators to interact with the control system (starting and stopping equipment and processes, acknowledging alarms etc.)
Image by the Author using resources from Flaticon.com

HMI (Human Machine Interface)

The HMI of a SCADA system allows operators to monitor the process and remotely operate equipment. An important part of the HMI is the alarm system, which alerts process operators to abnormal process conditions that require their intervention.

hhdgomez, via Wikimedia Commons

What are Alarms and Events?

Alarms

The purpose of alarms is to alert operators to unusual or dangerous process conditions that require intervention.

Each measured parameter in an ICS will typically have thresholds configured, and when the parameter crosses the threshold value, an event, known as an alarm, is generated.

The most common way to implement a process alarm is to compare a process variable x(t) to a constant high and/or low threshold value x_tp [1],

Image by the Author
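
As a concrete illustration, here is a minimal Python sketch of this comparison (the function name and signature are my own; real alarm systems also apply deadbands and on/off delay timers to avoid chattering alarms):

def check_alarm(x, high_tp=None, low_tp=None):
    """Return True if the process variable x breaches a configured threshold."""
    if high_tp is not None and x >= high_tp:
        return True  # high alarm: x(t) >= x_tp
    if low_tp is not None and x <= low_tp:
        return True  # low alarm: x(t) <= x_tp
    return False

# e.g. a tank level with a high alarm threshold of 90%
check_alarm(93.5, high_tp=90.0)  # True - an alarm is raised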

Alarm systems are part of all modern SCADA systems. These systems provide features that allow for alarm event annunciation, acknowledgment, display, and filtering.

Events

Events are similar to alarms, the key distinction being that alarms are unexpected and might require corrective action, while events typically require no operator action and are captured to support audits, maintenance, and incident investigations.

Some SCADA systems have separate alarm and event systems while others store events in the alarm handling system with a low priority.

Why is Alarm Management Important?

Alarm management and alarm floods¹ [2], in particular, have been a problem for several years [3]. Responses to the problem of alarm floods have led to the development of several national and international standards [4–6].

Poorly performing alarm systems have been linked to many significant process industry disasters; Esso Longford, Three Mile Island, Texaco Milford Haven, and Deepwater Horizon, to name but a few [7]. In addition to significant incidents, there is evidence to suggest that poor alarm system performance leads to significant ongoing losses to facility owners. Honeywell Process Solutions has estimated that improving alarm system performance at a typical 100,000-BPD refinery would yield annual savings of $5.89 million [8].

It has been suggested [9] that the primary cause of alarm floods is a process state change because alarm systems are typically designed for steady-state conditions. During a process state change, the first one or two alarms are critical and indicate the root cause; subsequent alarms are consequential and merely noise. This noise can be a significant distraction for the process operator to the extent that the root cause of the event cannot be determined.

Image by the Author using the following images: (Left) United States Department of Energy, public domain, via Wikimedia Commons; (Centre) Colin Bell, Milford Haven oil refinery seen from Pembroke Dock, via Wikimedia Commons; (Right) United States Coast Guard, public domain, via Wikimedia Commons

¹ Defined in multiple standards as a 10-minute period in which more than 10 new alarms are triggered (per console operator).
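
To make this definition concrete, here is a small pandas sketch (my own illustration, not taken from the system analysed here) that flags trailing 10-minute windows containing more than 10 new alarms. Note that the standards also specify that a flood only ends once the rate falls back below 5 alarms per 10 minutes, which this simple version ignores:

import pandas as pd

def flood_windows(timestamps, threshold=10, window="10min"):
    """Return the times at which the trailing window contains more than
    `threshold` alarm activations (a simple alarm flood indicator)."""
    # one event per alarm activation, indexed by activation time
    s = pd.Series(1, index=pd.DatetimeIndex(timestamps).sort_values())
    # count of alarms in the trailing window ending at each activation
    counts = s.rolling(window).sum()
    return counts[counts > threshold]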

What is ELK?

The ELK (Elasticsearch, Logstash and Kibana) stack is an open-source data collection and visualisation stack, primarily used to collect and monitor server and application logs, although it accommodates many other use cases such as business intelligence and web analytics. The principal components of the stack are,

Elasticsearch

An open-source, full-text search and analytics engine based on the Apache Lucene search engine.

Elasticsearch supports very large databases through sharding and clustering.

Logstash

A log collection, transformation, and aggregation tool.

Kibana

A web-based analysis and visualisation tool for Elasticsearch.

Beats

Lightweight data forwarding agents or log shippers.

Installing ELK

Windows

  1. Download Elasticsearch and Kibana (we will not need Logstash for the analysis we are performing here) from the Elastic downloads page
  2. Unzip the files to a folder of your choice (the files are quite large, particularly Kibana, so this could take some time)
  3. Ensure that you have Java installed and the ES_JAVA_HOME environment variable set (note: Elasticsearch will fall back to the JAVA_HOME variable if it is set, but this is deprecated and will stop working in a future release)
  4. Start Elasticsearch by opening a command prompt, navigating to the folder where you unzipped the installation files, change directory to bin and run elasticsearch.bat
  5. Similarly, start Kibana by opening a command prompt, navigating to the folder where you unzipped the installation files, change directory to bin and run kibana.bat
  6. Open a web browser at the following address, http://localhost:5601/ and you should see the Kibana home page,
Image by the Author
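
For example, steps 4 and 5 look like this in a command prompt (the install folder and version number below are placeholders; substitute your own):

cd C:\elastic\elasticsearch-8.8.0\bin
elasticsearch.bat

rem in a second command prompt
cd C:\elastic\kibana-8.8.0\bin
kibana.bat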

Other Platforms

For installation on other platforms, see the official Elastic installation documentation.

Preparing and Loading the Alarm Data

Data Format

The alarm log files that I have available have been generated by the Schneider Electric CitectSCADA product.

The files are named AlarmLog.NNN, where NNN is a zero-padded integer (001 to 365 in our case), with one file per day.

The files are fixed-width (space-padded) text files; below is an example of the first five records of a typical file,

Image by the Author
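
The records have the following general shape (this line is illustrative only, not taken from the real log; the field widths are 12, 12, 4, 20, 12, 32, and 64 characters):

12/06/2021  11:30:02 AM 1   ON                  UNACK       AIT0411203_aH                   Media Filter 5 turbidity high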

Data Preparation

Data preparation is straightforward, involving transforming the fixed-width files to CSV and cleaning some invalid records.

A Python script is used for the initial transformation and cleaning.

The first step is to import the required libraries,

  • pandas – a data analysis library which we will use to read and write data files and to manipulate data in dataframes
  • glob – to retrieve the filenames of the input data files
  • os – to extract base filenames from the input paths
  • datetime – for date and time value manipulation
  • tqdm – a progress bar utility to display the progress of long-running jobs (like processing multiple GB of data files)
import pandas as pd
import glob
import os
from datetime import datetime
from tqdm import tqdm

The widths list defines the character width of each of the fixed-width fields in the input file. The names list contains the field names which will be used as the column headers in the CSV file.

widths = [
    12, #date
    12, #time
    4,  #priority
    20, #status
    12, #state
    32, #tag
    64, #description
]
names = [
    'date',
    'time',
    'priority',
    'status',
    'state',
    'tag',
    'description'
]

In the code below we process all of the raw log files in the folder ./1-raw/, create CSV files and write them to ./2-csv/.

A collection of the raw input file paths is obtained with the glob function.

We use a for loop to iterate over each file. Since this is a long running process, the tqdm function is used in the for loop to display a progress bar.

Image by the Author

For each file,

  • Read the text file using the pandas read_fwf function. This function reads a fixed-width file and returns a dataframe
  • The log files contained some invalid records with "0" values for the date and/or time fields. Records with zero values for date or time are removed
  • An ISO format timestamp field is created from the separate date and time fields and added to the dataframe
  • The CSV file is written to disk
files = glob.glob("./1-raw/AlarmLog.*")
for f in tqdm(files):
    # build the output path from the input file's base name
    d = "./2-csv/" + os.path.basename(f) + ".csv"
    log = pd.read_fwf(f, widths=widths, names=names)
    # Some records in the log files are invalid, with "0" values for the
    # date and/or time columns; drop them
    log = log.drop(log[log.date == "0"].index)
    log = log.drop(log[log.time == "0"].index)
    # combine the date and time fields into an ISO format timestamp
    log['@timestamp'] = log.apply(
        lambda row: datetime.strptime(
            row.date + " " + row.time, '%d/%m/%Y %I:%M:%S %p').isoformat(),
        axis=1)
    log.drop(['date', 'time'], axis=1).to_csv(d, index=False)

Load into Kibana

Data can be loaded directly into Kibana from the user interface (http://localhost:5601). This was used to test the pre-processing scripts before loading a full 12 months of data (over 8 GB, which requires another upload method, described later).

To manually upload data through Kibana, first open the left-side menu by clicking on the menu (hamburger) icon. From the side menu select Analytics > Machine Learning.

Image by the Author

Click on the Select file button,

Image by the Author

Select or drag one of the generated CSV files to the target on the screen below,

Image by the Author

Press Import to accept the default options,

Image by the Author

The next screen requires that we enter an index name into which the data will be loaded. Enter an index name and press Import

Image by the Author

After a few seconds the data will be loaded; press the View index in Discover tile,

Image by the Author

We have just uploaded alarm events for a single day. In the Discover app we see a summary of the uploaded data, which covers the time range from June 12 00:00:01 to June 13 00:00:02. The summary chart has sliced the data into 30-minute intervals.

Image by the Author

If we hover the mouse over any of the bars, we can see the count of log entries in that bucket. Below we see that from 11:30 to 12:00 there were 2,765 alarm events.

Image by the Author

Kibana has a query language (KQL, Kibana Query Language) that allows us to apply complex filters to the data in our Elasticsearch database. Below we have applied a KQL filter that limits the records to priority 1 alarms for "Media Filter 5".

Image by the Author
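
The filter in the screenshot takes roughly the following form in KQL (the field names come from our CSV headers; the exact description text is an assumption about this site's naming):

priority : 1 and description : "Media Filter 5"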

Delete the Data

We have proven that our CSV data uploaded correctly. As we have 365 files, we do not wish to upload them into the database manually. We will remove the manually added data now and create a Python script to load the entire directory of data programmatically.

In the search bar enter ‘index’ and click on ‘Data / Index Management’ from the drop-down list.

Image by the Author

Select the checkbox next to the index that we need to delete.

Image by the Author

Open the drop-down list from the Manage index button and select Delete index.

Image by the Author

Load Data into Elasticsearch using Python

Previously we processed the 365 Citect AlarmLog files into CSV. These files are quite large (365 files totalling 6.15 GB). In the previous section we loaded a single day of data into Elasticsearch to verify the file format. We will now create a Python script to load the entire dataset.

Loading data directly into Elasticsearch with Python is straightforward, as Elastic provides an official Python client which can be installed using pip,

python -m pip install elasticsearch
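
Before running the loader it is worth confirming the client can reach the local node. Here is a minimal check, assuming a default local installation with security disabled (recent Elasticsearch releases enable TLS and authentication by default, in which case a URL and credentials must be passed to the client):

from elasticsearch import Elasticsearch

client = Elasticsearch()  # defaults to http://localhost:9200
print(client.info())      # prints cluster name and version if reachable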

The complete script is shown below and consists of,

  1. Creating an ‘alarm-log’ index in Elasticsearch
  2. Opening each CSV file in the data folder and uploading it to Elasticsearch using the bulk upload function. The body parameter defines the mapping of the CSV column names to Elasticsearch data types.
import pandas as pd
import glob
from tqdm import tqdm
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

indexName = "alarm-log"
# index settings and the mapping of CSV columns to Elasticsearch data types
body = {
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "properties": {
        '@timestamp': {'type': 'date'},
        'description': {'type': 'text'},
        'priority': {'type': 'long'},
        'state': {'type': 'keyword'},
        'status': {'type': 'keyword'},
        'tag': {'type': 'keyword'}
    }
  }
}
client = Elasticsearch()
# delete any existing index so the script can be re-run from a clean state
# (ignore "not found" errors; elasticsearch-py 7.x syntax)
client.indices.delete(index=indexName, ignore=[400, 404])
client.indices.create(index=indexName, body=body)

files = glob.glob("./2-csv/AlarmLog.*")
for f in tqdm(files):
    df = pd.read_csv(f)
    # convert the dataframe to a list of dicts, one document per alarm event
    documents = df.to_dict(orient='records')
    bulk(client, documents, index=indexName, raise_on_error=True)

First Analysis

Structure of Log Entries

We will use the Kibana Discover tool to examine the structure of the index we created when bulk loading our alarm entries with the previous Python script.

Open Kibana (http://localhost:5601), open the left menu and select Discover from the Analytics group.

Image by the Author

Select alarm-log as the index pattern. Once we have done this all of the attributes of the alarm-log index are displayed in the panel below the index selector.

Image by the Author

Ad-hoc Analysis

The data that we have collected from the SCADA system contains both alarms and events. Alarms have a priority of 1–3, while events are priority 4 and 5. Process alarms have a lifecycle that results in multiple events being generated and logged due to state changes and operator acknowledgement (this will be explained in more detail in a future post, where we will combine multiple alarm lifecycle events into a single alarm record covering the entire alarm lifecycle, allowing for more detailed analysis).

The KQL query below filters the alarm events to display only alarm activations,

Image by the Author
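
The query takes roughly this shape (the state value shown is an assumption about how CitectSCADA logs state transitions; inspect the state field values in Discover for your own system):

priority <= 3 and state : "ON"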

Dashboard

In addition to allowing you to query and analyse your data interactively, Discover allows queries to be saved and displayed in Dashboards using various visualisations. In Kibana terminology, the saved visualisations are known as a Lens.

The dashboard below contains four Lens components (described in clockwise order starting at the top left),

  • Metric – displays a single number, in this case the count of alarms in the selected period
  • Vertical Bar Chart – in this case displaying the daily alarm count
  • Treemap – in this case displaying the top ten alarm count by tag and the alarm priority (1–3)
  • Horizontal Bar Chart – in this case displaying the top ten alarm counts per tag (an alarm tag is the unique ID used to identify the device that generates the alarm)
Image by the Author

Insights

Now that we have created our dashboard, what does it tell us?

Daily Alarm Count

From the daily alarm count chart, we can see that there are between 500 and 3,000 alarms per day. As the SCADA system that generated these alarm logs is monitored by one or two operators, this represents an unrealistic workload: industry guidance such as EEMUA 191 [5] considers around 150 alarms per operator per day to be the maximum manageable rate.

Image by the Author

Top 10

The Top 10 chart is used to identify alarms that are triggering frequently. We can see that the AIT0411203_aH alarm was activated approximately 18,000 times in 90 days (approx. 200 times per day). This alarm is clearly misconfigured and requires investigation and remediation.

Image by the Author

Priority Distribution

From the Treemap we can see that the ‘Other’ alarms are distributed 22% Priority 1, 61% Priority 2, and 16% Priority 3. Against the commonly cited distribution target of roughly 5% / 15% / 80% for priorities 1 / 2 / 3 [5], this indicates that the system most likely has too many Priority 1 and 2 alarms configured (this will be explained further in a future post on alarm system management and performance metrics).

Image by the Author

Conclusion

ELK has proven to be a useful tool for analysing the alarm and event log data available from SCADA systems. While the insights gained in this initial analysis have been useful, they are limited by the fact that we are working with the raw alarm records.

More advanced analysis will be possible when the raw alarm events are converted into alarm records covering the alarm lifecycle prior to being loaded into ELK.

In future articles we will explore converting the raw alarm events into lifecycle-based alarm records and assessing alarm system performance against industry metrics.

References

  1. Jiandong, W., et al., An Overview of Industrial Alarm Systems: Main Causes for Alarm Overloading, Research Status, and Open Problems. IEEE Transactions on Automation Science and Engineering, 2016. 13(2): p. 1045–1061.
  2. Vancamp, K., Alarm Management By the Numbers. Chemical Engineering, 2016. 123(3): p. 50–55.
  3. ASM, ASM Consortium.
  4. ISA, ANSI/ISA‐18.2‐2009: Management of Alarm Systems for the Process Industries. 2009: International Society of Automation.
  5. EEMUA, EEMUA-191: Alarm systems – a guide to design, management and procurement. 2013: Engineering Equipment and Materials Users Association London, UK.
  6. IEC, IEC 62682 Management of Alarm Systems for the Process Industries. IEC, Geneva, Switzerland. 2014: International Electrotechnical Commission. 24–26.
  7. Goel, P., A. Datta, and M.S. Mannan, Industrial alarm systems: Challenges and opportunities. Journal of Loss Prevention in the Process Industries, 2017. 50(PA): p. 23–36.
  8. Ayral, T., J. Bilyk, and K. Brown, Case history: Quantifying the benefits of alarm management. Hydrocarbon Processing, 2013. 92(2013).
  9. Beebe, D., S. Ferrer, and D. Logerot, The Connection of Peak Alarm Rates to Plant Incidents and What You Can Do to Minimize. Process Safety Progress, 2013. 32(1): p. 72–77.
  10. Lucke, M., et al., Advances in alarm data analysis with a practical application to online alarm flood classification. Journal of Process Control, 2019. 79: p. 56–71.

Thanks for reading; I hope you enjoyed this article.