on container lifetimes

My project uses containerization extensively, and during development I’ve formed some opinions. In particular, I’ve learned that my views on process management differ from what seems to be the accepted standard.

The stack I’m working with consists of Keycloak for authentication, MySQL for database storage, Apache for HTTP, and Django to run the backend, with React on the frontend.

I’ve recently been working on orchestration, the “infrastructure as code” part of making my project run with templatable YAML files. I have a reasonably fast computer that I do primary development on, but even with that speed it takes a while to build containers and do other tasks.

My deployment/alpha stack is a pair of 10-year-old enterprise servers with lots of cores and threads, but at the expense of single-threaded performance. Keycloak takes in the neighborhood of 30s to start. It takes a good 15s before MySQL shows up in Docker DNS and Keycloak can connect.
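A dependent service can tolerate that kind of delay by polling until the database is reachable instead of failing on the first attempt. The following is a minimal sketch of that idea, not something taken from my stack: the function name `wait_for_tcp` and the hostname `mysql` are assumptions for illustration, and 3306 is simply MySQL's default port.

```python
import socket
import time


def wait_for_tcp(host: str, port: int, timeout_s: float = 60.0,
                 interval_s: float = 1.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # create_connection resolves the name and connects in one step,
            # so a missing DNS entry and a closed port are both retried.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # socket.gaierror (DNS failure), refused and timed-out
            # connections are all subclasses of OSError.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


if __name__ == "__main__":
    # "mysql" is a hypothetical service name on the container network.
    ready = wait_for_tcp("mysql", 3306, timeout_s=60)
    print("database reachable" if ready else "gave up waiting")
```

A successful TCP connect only shows the port is open, not that the database is ready to serve queries, so a real health check would likely go one step further.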

The direction of containerization seems to be toward a “tear it down and build a new one” philosophy instead of reusing existing container instances. This is made even more obvious by some of the choices the Keycloak team made surrounding SSL certificates. They only run the certificate-to-keystore conversion script at container creation; subsequent restarts bypass this step. That means they expect you to tear down the container instance and build it back up every time an SSL certificate gets renewed.

I think the concept of tearing down a relatively heavyweight process (Java, JBoss, etc.) for the convenience of orchestration is wasteful. Let’s talk some numbers to help illustrate my point:

The aforementioned Keycloak container takes about 30s to start. If I have 100 endpoints, let’s figure out how long it takes to rotate certificates by redeploying containers. Keycloak itself does not require any persistent storage, so we can destroy and recreate the container without concern for persistent volumes.

If you use k8s, you could use kubectl rollout restart to simply perform a rolling restart. If you use the RollingUpdate strategy, k8s adds and subtracts pods, making rolling updates of Keycloak more difficult. However, if you use Recreate, it will terminate the existing instances before creating new ones.

Keycloak itself is “stateless”, but it requires a persistent, stateful data store. In my case I use MySQL, so I have a stack defined with Keycloak being dependent on MySQL. With the design pattern of ephemeral, stateless pods, you run into challenges orchestrating simple things like “sudo restart my keycloak service”.

Let’s say you use k8s and do a rolling restart, and it takes 30s for each container to reach the healthy state. If you have 100 containers and restart them one at a time, it will take 50 minutes (3000 seconds) to do a full rolling restart and deploy a new certificate. That may seem like nothing, but what if someone forgets to renew your certificate and you suddenly start serving an expired certificate?
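The arithmetic is easy to play with. Here is a small sketch that assumes every instance takes the same time to become healthy and that each batch must be healthy before the next one starts; the function name and the `batch_size` parameter are mine, added only to show how the total changes if you restart more than one instance at a time.

```python
import math


def rollout_seconds(instances: int, startup_s: float, batch_size: int = 1) -> float:
    """Wall-clock time for a rolling restart done in sequential batches.

    Assumes uniform startup time and that a batch must be fully healthy
    before the next batch begins.
    """
    if instances < 0 or startup_s < 0 or batch_size < 1:
        raise ValueError("instances/startup_s must be >= 0 and batch_size >= 1")
    batches = math.ceil(instances / batch_size)
    return batches * startup_s


if __name__ == "__main__":
    total = rollout_seconds(100, 30)
    print(f"{total:.0f} s = {total / 60:.0f} min")  # prints "3000 s = 50 min"
```

Restarting in larger batches shortens the total, but it also takes more capacity offline at once, which is its own trade-off.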

You might think my expired certificate scenario is a straw man, but I can assure you that far worse has happened in production environments! Letting the certificate expire results in a hard outage that requires a hard restart of the entire deployment. Did your organization ever test what happens when you do a hard restart? How does your infrastructure handle starting all instances simultaneously? Was that part of your DR plan?

Now let’s look at the alternative: Dynamically reload the SSL certificate at runtime.

Keycloak doesn’t come with the ability to reload certificates out of the box. You have to search high and low to find the magic incantation you must utter to the JBoss CLI to make it reload the Java keystore. Then you have to deal with the fact that the keystore is only generated when the container is created.

I’ll detail my solution in another post, but the essence is that you need some glue to detect a certificate change, recreate the keystore, then execute some admin commands via the JBoss CLI to get Keycloak to reload the keystore. The end result is that it takes about 2 seconds to reload the keystore and there is zero downtime or interruption of services.
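The detection half of that glue can be sketched generically. The following is only an illustration and not the solution promised above: it fingerprints the certificate file and calls a hook when the bytes change. The class name `CertWatcher` and the `on_change` hook are hypothetical, and the keystore rebuild and JBoss CLI commands are deliberately left as a placeholder inside that hook.

```python
import hashlib
from pathlib import Path
from typing import Callable, Optional


def fingerprint(path: Path) -> str:
    """SHA-256 of the file contents, used to tell whether the cert changed."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


class CertWatcher:
    """Calls on_change(cert_path) whenever the certificate bytes change."""

    def __init__(self, cert_path: Path, on_change: Callable[[Path], None]) -> None:
        self.cert_path = cert_path
        self.on_change = on_change
        # Remember the current certificate so the first check() is a no-op.
        self._last: Optional[str] = (
            fingerprint(cert_path) if cert_path.exists() else None
        )

    def check(self) -> bool:
        """Return True if a change was detected and the hook ran."""
        if not self.cert_path.exists():
            return False
        current = fingerprint(self.cert_path)
        if current == self._last:
            return False
        # Placeholder hook: rebuilding the keystore and asking the server
        # to reload it would happen here.
        self.on_change(self.cert_path)
        self._last = current  # record only after the hook succeeded
        return True
```

Calling `check()` from a timer or a loop is enough for a sketch; because the fingerprint is stored only after the hook returns, a failed reload gets retried on the next check.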

With those same 100 instances of Keycloak, I can do a full SSL certificate rotation in less than 20 seconds across the entire deployment with zero interruption to services.
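Note what that number implies: at about 2 seconds per reload, 100 reloads done strictly one after another would take around 200 seconds, so finishing in under 20 seconds means running a dozen or more in parallel. Because a reload does not take the instance out of service, fanning out like that is safe in a way that parallel restarts are not. Here is a rough sketch of such a fan-out, where `reload_all` and `fake_reload` are hypothetical names and the actual reload call is stubbed with a short sleep.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def reload_all(hosts: Iterable[str], reload_one: Callable[[str], None],
               max_workers: int = 20) -> Dict[str, bool]:
    """Call reload_one(host) for every host concurrently; map host -> success."""
    def attempt(host: str) -> bool:
        try:
            reload_one(host)
            return True
        except Exception:
            # One failed instance should not stop the rest of the rotation.
            return False

    host_list = list(hosts)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each host with its result.
        return dict(zip(host_list, pool.map(attempt, host_list)))


def fake_reload(host: str) -> None:
    """Stand-in for the real keystore reload (scaled down from ~2 s)."""
    time.sleep(0.02)


if __name__ == "__main__":
    hosts = [f"keycloak-{i}" for i in range(100)]
    start = time.monotonic()
    results = reload_all(hosts, fake_reload)
    elapsed = time.monotonic() - start
    print(f"{sum(results.values())}/{len(hosts)} reloaded in {elapsed:.2f} s")
```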

I believe in the principle of “aim small, miss small”. If you design your system to be frugal and support online updates, when you scale that system to very large deployments, you aren’t wasting time solving low-level problems. Worry about the small things as well as the big things or you might be refactoring your entire application in a different (language|runtime|platform|OS|stack).

By Phantom

Coder, sysadmin, maker, human