
Troubleshooting OpenShift Clusters and Workloads

Collection of commands that every OpenShift user should know

If you are a cluster admin, a cluster operator, or the only developer on the team who actually knows what's going on in your OpenShift cluster, then you will know that terrible things happen from time to time. It's inevitable, and it's better to be prepared for the moment when shit hits the fan. So, this is a collection of commands that should be part of your arsenal when it comes to debugging broken deployments, resource consumption, missing privileges, unreachable workloads and more…

Original Photo by Nathan Anderson on Unsplash

Don’t Use The Web Console

Yeah, it looks nice, it's easy to navigate, and it makes some tasks easy to perform… and it will also become unreachable if there's a problem with the router, a deployment loses its Endpoint, or an operator has some issues…

If you get a little too used to doing everything in the web console, you might end up unable to solve problems with the CLI at the moment when every minute matters. Therefore, my recommendation is to get comfortable with the oc tool and, at least for the duration of this article, forget that the web console even exists.

Monitoring Node Resources

It's good to check the memory and CPU available on your worker nodes from time to time, especially if your pods are stuck in Pending state or are being OOMKilled. You can do that with:
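A minimal form of the command looks like this (it relies on the cluster metrics being available):

```bash
oc adm top nodes
```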

This displays CPU and memory stats of the nodes. In case you want to filter out master nodes and see only workers, you can use -l node-role.kubernetes.io/worker, like so:
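For instance, something along these lines, using the standard worker node-role label:

```bash
oc adm top nodes -l node-role.kubernetes.io/worker
```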

Troubleshooting Node Units

If you are running a bare metal cluster, then there is a good chance that you will eventually run into problems related to things running on the nodes themselves. To see what's happening with specific systemd units (e.g. crio or kubelet) running on worker nodes, you can use:
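The general form would be the following, with the node name and unit name as placeholders:

```bash
oc adm node-logs <node-name> -u <unit-name>
```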

This command retrieves the logs of a specific systemd unit. Running it with -u crio, for example, shows the journal of the CRI-O container runtime on that node.
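A concrete invocation might look like this (the node name is made up for the example; the output is the node's CRI-O journal, so it will differ on every cluster):

```bash
oc adm node-logs my-worker-node -u crio
```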

When logs are not good enough and you need to actually poke around inside the worker node, you can run oc debug nodes/node-name. This gives you a shell inside that specific node by creating a privileged pod on it. Here is an example of this kind of interactive session:
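A rough sketch of such a session could look like this, with the node name being a placeholder:

```bash
oc debug nodes/my-worker-node
# inside the debug pod, switch to the host's root filesystem to use host binaries
chroot /host
# list the containers running directly on the node
crictl ps
# check on system services if needed
systemctl status kubelet
```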

In the session above we use crictl to inspect containers running directly on the worker node. This is the place where we could start/restart/delete some containers or system services if needed. Needless to say, be very careful when touching things running on the nodes themselves.

As a side note, if you are running your cluster on a managed public cloud, then you most likely will not have permission for direct access to the nodes, as that would be a security issue, so the last command might fail on you.

Monitoring Cluster Updates

When you decide that it's time to update your cluster to a newer version, you will probably want to monitor the progress. Alternatively, if some operators are breaking without any clear reason, you might also want to check on your clusterversion operator to see whether it's progressing towards a newer version, which might be the reason for temporary service degradation:
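The two commands in question would be:

```bash
oc get clusterversion
oc get clusteroperators
```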

Both commands above retrieve the version and information on whether the cluster is currently upgrading, as well as the state of cluster operators in general.

All The Ways to Debug Pods

The thing that's going to break most often is, of course, a pod and/or a deployment (DeploymentConfig). There are quite a few commands that can give you insight into what is wrong with your application, so let's start with the most high-level ones:
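Starting with the most basic one:

```bash
oc status
```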

oc status is the easiest way to get an overview of the resources deployed in a project, including their relationships and state.

The next one you already know – oc describe. The reason I mention it is the Events: section at the bottom, which shows only events related to the specific resource, which is much nicer than trying to find anything useful in the output of oc get events.
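For example (the pod name here is just a placeholder):

```bash
oc describe pod my-app-pod
```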

Another common command is oc logs. One thing you might not know about it, though, is that it can target a specific container of the pod using the -c argument. For example:
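A possible invocation, with the pod and container names being placeholders:

```bash
oc logs my-app-pod -c sidecar-container
```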

This might come in handy if you are debugging a single container in a multi-container pod and want to see only the relevant logs.

Now, for a slightly less known command. oc debug was already shown in the section about debugging nodes, but it can be used to debug deployments or pods too:
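For example, assuming a Deployment or DeploymentConfig named my-app:

```bash
oc debug deployment/my-app
# or, for a DeploymentConfig:
oc debug deploymentconfig/my-app
```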

Unlike the example with nodes, this won't give you a shell into the running pod, but rather create an exact replica of the existing pod in debug mode. Meaning that labels will be stripped and the command changed to /bin/sh.

One reason you might need to debug a pod in OpenShift is an issue with security policies. In that case you can add --as-root to the command to stop it from crashing during startup.
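That would look something like this (again with a placeholder name):

```bash
oc debug deployment/my-app --as-root
```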

A nice thing about this command is that it can be used with any OpenShift resource that creates pods, for example Deployment, Job, ImageStreamTag, etc.

Running Ad-hoc Commands Inside Pods and Containers

Even though creating debugging pods can be very convenient, sometimes you just need to poke around in the actual pods. You can use oc exec for that. These are the variants you could take advantage of:
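The two basic variants, with pod and container names as placeholders, would be:

```bash
# run a one-off command inside the pod
oc exec my-app-pod -- ls -la /tmp
# open a shell in a specific container of the pod
oc exec -it my-app-pod -c sidecar-container -- /bin/sh
```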

The first of the commands above runs a one-off command inside the pod, with extra options if necessary. The second one gets you a shell into a specific container in the pod, though you should probably use the shorthand for that – oc rsh.

One command that you could use for troubleshooting, but should never use in production environments, is oc cp, which copies files to or from a pod. This command can be useful if you need to get some file out of a container so you can analyze it further. Another use case would be to copy files into the pod (container) to quickly fix some issue during testing, before fixing it properly in the Docker image (Dockerfile) or source code.
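Both directions might look roughly like this, with all names and paths made up for the example:

```bash
# copy a file out of the pod for further analysis
oc cp my-app-pod:/var/log/app.log ./app.log
# copy a quick fix into the pod
oc cp ./config.yaml my-app-pod:/etc/app/config.yaml
```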

Inspect Broken Images

I think that was enough for debugging pods and containers, but what about debugging application images? For that you should turn to skopeo:
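The two commands referenced below could look like this (the image used here is just an example):

```bash
skopeo inspect docker://docker.io/library/nginx:latest
skopeo list-tags docker://docker.io/library/nginx
```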

The first command inspects an image repository, which can be useful, for example, when an image can't be pulled, which might happen if the tag doesn't exist or the image name got misspelled. The second one gives you a list of available tags without you needing to open the registry website, which is pretty convenient.

Gathering All Information Available

When everything else fails, you might try running oc adm must-gather to get all the information available from the cluster that could be useful for debugging. The data produced by this command can be used for your own debugging or can be sent to Red Hat support in case you need assistance.
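A basic run, with the destination directory as an optional choice:

```bash
oc adm must-gather --dest-dir=./must-gather
```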

Debugging Unreachable Applications

It's not uncommon (at least for me) that applications/deployments seem to work fine but cannot reach each other. There are a couple of reasons why that might be. Let's take the following scenario – you have an application and a database. Both are running just fine, but for some reason your application can't communicate with the database. Here's one way you could go about troubleshooting this:
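A sketch of such a session could look like the following. All pod names, Service names, IPs and ports are made up for the example:

```bash
# look up the pod IPs
oc get pods -o wide

# create a debug copy of the application and try to reach the database pod
oc debug deployment/my-app
curl 10.128.2.15:5432        # database pod IP - connection succeeds

# repeat the test in the opposite direction
oc debug deployment/my-db
curl 10.128.2.20:8080        # application pod IP - times out

# check the Services and their Endpoints
oc get svc
oc get endpoints

# finally, inspect the Service selector and compare it with the pod labels
oc get svc my-app -o yaml
```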

The snippet above assumes that we already have the application and database running, as well as their respective Services. We can start debugging by trying to access the database from the application. The first thing we need for that, though, is the IP of the database, which we look up using the first command. Next, we create a carbon copy of the application using oc debug and try reaching the database pod with curl, which is successful.

After that we repeat the test the other way around and we can see that curl times out, meaning that the database cannot reach the application IP. We then check the previously created Services – nothing weird there. Finally, we check the Endpoints and we can see that the application pod doesn't have one. This is most likely caused by a misconfiguration of the respective Service, as shown by the last command, where we clearly have the wrong selector. After fixing this mistake (with oc edit svc/...), the Endpoint gets created automatically and the application becomes reachable.

Fix Missing Security Context Constraints

If your pod is failing with any kind of issue related to copying/accessing files, running binaries, modifying resources on the node, etc., then it's most likely a problem with Security Context Constraints (SCC). Based on the specific error you are getting, you should be able to determine the right SCC for your pod. If it's not clear though, then there are a few pointers that might help you decide:

If your pod can't run because of the UID/GID it's using, then you can check the UID and GID range for each SCC:
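One possible way to list those ranges, assuming the standard field names of the SecurityContextConstraints resource:

```bash
oc get scc -o custom-columns='NAME:.metadata.name,UID_MIN:.runAsUser.uidRangeMin,UID_MAX:.runAsUser.uidRangeMax,FSGROUP:.fsGroup.type'
```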

If these fields are set to <none> though, you should go look at the project annotations:
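For example (the project name is a placeholder):

```bash
oc describe project my-project
# or look directly at the annotations on the namespace
oc get namespace my-project -o yaml | grep sa.scc
```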

These annotations tell you that the effective UID of your pod will fall into the range starting at 1001490000 with a size of 10000 (the openshift.io/sa.scc.uid-range annotation uses the <start>/<size> format). If that doesn't satisfy your needs, you would have to set spec.securityContext.runAsUser: SOME_UID to force a specific UID. If your pod fails after these changes, then you will have to switch to a different SCC or modify it to allow a different UID range.

One neat trick to determine which SCC a Service Account needs to be able to run a pod is to use the oc adm policy scc-subject-review command:
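Assuming the pod definition is saved in a local YAML file (the filename is a placeholder):

```bash
oc adm policy scc-subject-review -f my-pod.yaml
```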

What this command does is check whether the user or Service Account can create the pod passed in as its YAML representation. When the output shows <none>, it means that the resource is not allowed. If the name of an SCC is displayed instead, for example anyuid, then the resource can be created thanks to that SCC.

To use this command with some Service Account instead of the current user, add the -z parameter, e.g. oc adm policy scc-subject-review -z builder.

When the output of this command shows anything but <none>, you know that you are good to go.

Conclusion

The biggest takeaway from this article should be that if something doesn't work in your OpenShift cluster, it's probably RBAC; if not, then it's SCC. If that's also not the case, then it's networking (DNS). In all seriousness, I hope at least some of these commands will save you some time the next time you need to troubleshoot something in OpenShift. Also, it's good to know the more common ones by heart, because you never know when you're really gonna need them. 😉


_This article was originally posted at martinheinz.dev_
