The basics of deploying Logstash pipelines to Kubernetes

Danny Kay
Towards Data Science
11 min read · Jan 20, 2019


Do you have a long list of things you want to learn floating around in your brain?

I did, until I wrote that list down on a piece of paper and decided to do something about it. Towards the end of 2018 I started to wrap up things I’d been learning and decided to put some structure into my learning for 2019.

2018 had been an interesting year, I’d moved jobs 3 times and felt like my learning was all over the place. One day I was learning Scala and the next I was learning Hadoop. Looking back, I felt like I didn't gain much ground.

I decided to do something about it going into 2019. I wrote the 12 months of the year down on one side of a piece of paper and what I wanted to learn that month on the other side, with the idea that my learning for that one month would be focused on a particular product/stack/language/thing.

The Elastic stack, previously referred to as ELK, was at the top of this list for a few reasons.

I’ve worked with the Elastic stack before, more specifically ElasticSearch and Kibana, but I felt like there was so much more I could learn from these two products. I also wanted to gain an understanding of Logstash and see what problems it could help me solve.

When I start learning something new I set a bunch of small, achievable objectives. One of the objectives I’d written was to have a fully functional Logstash pipeline running in Kubernetes, ingesting data from somewhere, performing some action on it and then sending it to ElasticSearch.

I’m a huge fan of Kubernetes. There has been a huge shift in the past few years to containerize applications and I fully embrace this shift. I had no interest in running this pipeline locally, it was Kubernetes or bust!

I had a bit of a Google and struggled to find anything concrete regarding deployment, best practices etc., which led me to write this basic article about how to get a Filebeat, Logstash and ElasticSearch pipeline running in Kubernetes.

So what’s Filebeat? It’s a shipper that runs as an agent and forwards log data onto the likes of ElasticSearch, Logstash etc.

Say you are running Tomcat. Filebeat would run on that same server, read the logs generated by Tomcat and send them on to a destination, which more often than not is ElasticSearch or Logstash. Filebeat can also run as a DaemonSet on Kubernetes to ship Node logs into ElasticSearch, which I think is really cool. Fluentd also does this, but that’s for another day.

For more information about Filebeat check it out here.

Right, so in our scenario we have Filebeat reading a log of some sort and sending it to Logstash, but what is Logstash?

Logstash is a server side application that allows us to build config-driven pipelines that ingest data from a multitude of sources simultaneously, transform it and then send it to your favorite destination.

We can write a configuration file that contains instructions on where to get the data from, what operations we need to perform on it, such as filtering, grok and formatting, and where the data needs to be sent to. We use this configuration in combination with the Logstash application and we have a fully functioning pipeline. The beautiful thing about Logstash is that it can consume from a wide range of sources, including RabbitMQ, Redis and various databases among others, using special plugins. We can then stash that data in S3, HDFS and many more! And this is all driven by a single config file and a bunch of plugins…amazing, isn’t it?

For more information about Logstash check it out here.

Let’s get on to some code and exciting stuff!

I was following the Logstash tutorial on the Elastic site and had come across the perfect candidate for my pipeline…with some small modifications.

I’d followed the tutorial step by step: Filebeat was running, it was reading the log file mentioned in the tutorial, it was all good. I had to edit the Filebeat configuration file filebeat.yml to point at the Kubernetes NodePort I’d exposed (that’s covered a little later), and I also moved the log file provided in the tutorial into the Filebeat application folder.

filebeat.inputs:
- type: log
  enabled: true
  paths:
    - logstash-tutorial.log

output.logstash:
  hosts: ["localhost:30102"]

Just Logstash and Kubernetes to configure now. Let’s have a look at the pipeline configuration.
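This is apache-log-es.conf, the same configuration you’ll see inside the ConfigMap output further down:

input {
  beats {
    port => "5044"
  }
}

filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}

output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    index => "%{[@metadata][beat]}-%{[@metadata][version]}-%{+YYYY.MM.dd}"
  }
  stdout {
    codec => rubydebug
  }
}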

Every configuration file is split into 3 sections: input, filter and output. They’re the 3 stages of most, if not all, ETL processes.

Firstly, we specify where our data is coming from; in our case we are using the Beats plugin and specify the port to receive beats on.

So what’s this Beats plugin? It enables Logstash to receive events from applications in the Elastic Beats framework. As we are running Filebeat, which is part of that framework, the log lines it reads can be received and read by our Logstash pipeline.

Next we specify filters. The filter section is optional; you don’t have to apply any filter plugins if you don’t want to. If that’s the case, data will be sent to Logstash and then on to the destination with no formatting, filtering etc. In our case we are using the Grok plugin, one of the cooler plugins. It enables you to parse unstructured log data into something structured and queryable.

Grok is looking for patterns in the data it’s receiving, so we have to configure it to identify the patterns that interest us. Grok comes with some built-in patterns. The pattern we are using in this case is %{COMBINEDAPACHELOG}, which can be used when Logstash is receiving log data from Apache HTTP Server.

Lastly, we specify our outputs. This is where our data will end up once it has been filtered. You can specify more than one output if your data needs to go to multiple places. In the example the data is being output to ElasticSearch but also printed to the console, just to be on the safe side. Inside the ElasticSearch block we specify the ElasticSearch cluster URL and the index name, which is a string built from a pattern of metadata fields.

Now that we’ve walked through the config of our pipeline we can move onto Kubernetes.

What we have to do first of all is create a ConfigMap. A ConfigMap allows us to store key-value pairs of configuration data that is accessible by our Pods. So we could have a ConfigMap that stores a directory full of configuration files, or it could store a single configuration file.

First we create a ConfigMap. We are naming it apache-log-pipeline and referencing the pipeline configuration file from earlier.

> kubectl create configmap apache-log-pipeline --from-file apache-log-es.conf

We can examine the ConfigMap we’ve created by running kubectl with describe.

> kubectl describe cm/apache-log-pipeline

If the command has been run correctly, you should see the key of apache-log-pipeline and the value of the configuration file from earlier. If that’s the case you’re doing great!

Name:         apache-log-pipeline
Namespace:    default
Labels:       <none>
Annotations:  <none>

Data
====
apache-log-es.conf:
----
input {
  beats {
    port => "5044"
  }
}
filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}"}
  }
}
output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    index => "%{[@metadata][beat]}-%{[@metadata][version]}-%{+YYYY.MM.dd}"
  }
  stdout {
    codec => rubydebug
  }
}

Events:  <none>
So now we have our ConfigMap, we need to put together a Deployment for our Logstash service and reference the apache-log-pipeline as a mounted volume.

Let’s walk through some parts of the Deployment, sketched below. Apart from the usual stuff, we specify 2 ports for the container, 5044 and 9600. Port 5044 is used to receive beats from the Elastic Beats framework, in our case Filebeat, and port 9600 allows us to retrieve runtime metrics about Logstash. More information about that here.

We also specify that we want to mount the config volume and the path we want to mount it to, /usr/share/logstash/pipeline. We mount the volume into this particular directory because it’s the directory that Logstash reads configurations from by default. This allows us to just run logstash as the command, as opposed to passing a flag saying where the configuration file is.

We then have our volume, which is called apache-log-pipeline-config and is of type configMap. For more information on Volumes have a look here. We specify the ConfigMap we wish to use, apache-log-pipeline being the one we created earlier. As our ConfigMap is made up of key-value pairs, we add the key which contains our pipeline configuration, apache-log-es.conf.

We then go on to the Service. The only real talking point here is that we are using a NodePort. This is for two reasons. First, Filebeat needs to speak to Logstash, which is running in Kubernetes, so we need a port for this to happen on; I’ve specified this to be 30102, as filebeat.yml needs configuring with this port number in order to send beats to Logstash. Second, I wanted to check out the Logstash Monitoring API, which uses port 9600 as mentioned earlier.
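Putting the pieces above together, apache-log-pipeline.yaml looks roughly like this. It’s a sketch rather than the exact manifest; the image tag, labels and port names are assumptions on my part, so adjust them to match your cluster and Elastic stack version.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: apache-log-pipeline
  labels:
    app: apache-log-pipeline
spec:
  replicas: 1
  selector:
    matchLabels:
      app: apache-log-pipeline
  template:
    metadata:
      labels:
        app: apache-log-pipeline
    spec:
      containers:
        - name: apache-log-pipeline
          image: docker.elastic.co/logstash/logstash:6.5.4  # assumed tag, match it to your Filebeat version
          command: ["logstash"]                             # config is picked up from the default pipeline directory
          ports:
            - containerPort: 5044   # beats from Filebeat
            - containerPort: 9600   # Logstash monitoring API
          volumeMounts:
            - name: apache-log-pipeline-config
              mountPath: /usr/share/logstash/pipeline  # default directory Logstash reads configs from
      volumes:
        - name: apache-log-pipeline-config
          configMap:
            name: apache-log-pipeline
            items:
              - key: apache-log-es.conf
                path: apache-log-es.conf
---
apiVersion: v1
kind: Service
metadata:
  name: apache-log-pipeline
spec:
  type: NodePort
  selector:
    app: apache-log-pipeline
  ports:
    - name: beats
      port: 5044
      targetPort: 5044
      nodePort: 30102   # matches the hosts entry in filebeat.yml
    - name: monitoring
      port: 9600
      targetPort: 9600  # Kubernetes will assign the NodePort for this one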

So with our deployment and service ready to rock and roll we can deploy it.

> kubectl create -f apache-log-pipeline.yaml

If the Pod has been created correctly, you should be able to get Pods and see it running.

> kubectl get pods
================================================
NAME                                   READY   STATUS    RESTARTS   AGE
apache-log-pipeline-5cbbc5b879-kbkmb   1/1     Running   0          56s
================================================

The Pod has been created correctly, but is it actually up and running? The quickest way to tell is by tailing the logs of the Pod.

> kubectl logs -f pod/apache-log-pipeline-5cbbc5b879-kbkmb

If the pipeline is running correctly the last log line you should see says that the Logstash API has been created successfully.

[2019-01-20T11:12:03,409][INFO ][logstash.agent] Successfully started Logstash API endpoint {:port=>9600}
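Since that endpoint is now up, you can poke it directly. One quick way from your machine, a sketch using the Pod name from above rather than a required step, is to port-forward to the Pod and curl the monitoring API:

> kubectl port-forward pod/apache-log-pipeline-5cbbc5b879-kbkmb 9600:9600
> curl "http://localhost:9600/_node/pipelines?pretty"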

When I initially built this pipeline I came across two errors. The first was when the pipeline configuration file wasn’t readable because of its formatting; the error was printed and visible when tailing the Pod logs. The second was when the ConfigMap wasn’t mounted correctly and the pipeline would run, stop, then restart; again, this was printed and visible by tailing the logs.

So, we have a fully functioning Logstash pipeline running in Kubernetes.

But it’s not actually doing anything yet. What we need to do now is run Filebeat. To make sure this worked correctly I had two Terminal windows open, one tailing the logs of the Pod and the other for the Filebeat command I was about to run.

sudo ./filebeat -e -c filebeat.yml -d "publish" -strict.perms=false

When this command is run, Filebeat will come to life and read the log file specified in the filebeat.yml configuration file. The other flags are talked about in the tutorial mentioned at the beginning of the article.

In the pipeline configuration file we included the stdout plugin so messages received are printed to the console. With this in mind, we should see messages printed out in the Terminal window tailing the Pod’s logs, and something similar in the window running the Filebeat command.

{
  "@timestamp" => 2019-01-20T11:35:36.042Z,
  "request" => "/style2.css",
  "prospector" => {
      "type" => "log"
  },
  "response" => "200",
  "httpversion" => "1.1",
  "offset" => 18005,
  "bytes" => "4877",
  "tags" => [
      [0] "beats_input_codec_plain_applied"
  ],
  "timestamp" => "04/Jan/2015:05:24:57 +0000",
  "clientip" => "81.220.24.207",
  "referrer" => "\"http://www.semicomplete.com/blog/geekery/ssl-latency.html\"",
  "message" => "81.220.24.207 - - [04/Jan/2015:05:24:57 +0000] \"GET /style2.css HTTP/1.1\" 200 4877 \"http://www.semicomplete.com/blog/geekery/ssl-latency.html\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.73.11 (KHTML, like Gecko) Version/7.0.1 Safari/537.73.11\"",
  "ident" => "-",
  "agent" => "\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.73.11 (KHTML, like Gecko) Version/7.0.1 Safari/537.73.11\"",
  "beat" => {
      "version" => "6.5.4",
      "name" => "local",
      "hostname" => "local"
  },
  "auth" => "-",
  "@version" => "1",
  "host" => {
      "name" => "local"
  },
  "verb" => "GET",
  "input" => {
      "type" => "log"
  },
  "source" => "/filebeat-6.5.4-darwin-x86_64/logstash-tutorial.log"
}

If we’ve seen the messages printed in the console we can almost guarantee that the messages have been delivered into ElasticSearch. There are two ways to check this: call the ElasticSearch API with some parameters, or use Kibana.
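For the API route, a rough sketch is to port-forward to the ElasticSearch Service (assuming it’s called elasticsearch, as in the output section of our pipeline configuration) and list the Filebeat indices:

> kubectl port-forward svc/elasticsearch 9200:9200
> curl "http://localhost:9200/_cat/indices/filebeat-*?v"

If the pipeline has been delivering data you should see a filebeat-* index with a growing document count.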

I’m a big fan of Kibana, so that’s the route we’re heading down.

Fire up Kibana and head to the Discover section.

In our pipeline configuration, more specifically the ElasticSearch output, we specify that the Index to be created is named from a pattern of metadata which includes the Filebeat version and the date. We use this index pattern to retrieve the data from ElasticSearch.

In this example the Index pattern I defined was filebeat-6.5.4-2019.01.20, as this was the Index created by Logstash.

Next, we configure the Time Filter field. This field is used when we want to filter our data by time. Some logs will have multiple time fields so that’s why we have to specify it.

Once these steps have been carried out we should be able to view the logs: if we go back into the Discover section now that we have defined the Index, the logs should be visible.

If the logs are visible, give yourself a pat on the back! Well done and good effort! Have a celebratory dab if you want :)

So, a couple of cool Kibana-related things before I wrap up.

We can monitor our Logstash pipelines from the Monitoring section of Kibana. We can get insights into event rates such as emitted and received. We also get Node information such as CPU utilization and JVM metrics.

Last but not least is the Logs section of Kibana. I think this is my favourite section of Kibana at the moment. It allows you to view streaming logs in near-real time and look back at historical logs.

To see the Logs section in action, head into the Filebeat directory and run sudo rm data/registry; this will reset the registry for our logs. Once this has been done we can start Filebeat up again.

sudo ./filebeat -e -c filebeat.yml -d "publish" -strict.perms=false

If you place the Terminal you’re running Filebeat in next to the browser you have Kibana in you’ll see the logs streaming in near-real time, cool eh?

As the title says, this is a basic article. It guides you on how to get something up and working quickly, so there are bound to be improvements and changes that could be made to make this better on all fronts.

Many thanks as always for reading my articles, it’s really appreciated. Any thoughts, comments or questions drop me a tweet.

Cheers 👍🏻

Danny

https://twitter.com/danieljameskay
