Leader Election for Distributed Workloads in Kubernetes

Introduction

The Challenge

With Kubernetes, running application workloads resiliently and reliably is easy, as long as the workload is stateless. Think of a file format converter microservice or a simple web app serving static content to users. It becomes more complicated if the individual instances of your distributed application need to share a common state.

Such a common state could, for example, be an agreement between all application instances about which of them is acting as the leading instance of the application cluster. Some example scenarios in which distributed apps need to agree on a single leader are:

  • Task Allocation: All members of the app cluster work on different pieces of a bigger task in order to parallelize work. This could be an analytics application generating reports or a use case in which a large number of files need to be scanned for security vulnerabilities.
  • Exclusive Processing: There are actions which could cause inconsistencies or inefficiencies if they are executed by all of the cluster members instead of only one. Examples would be sending a notification to end users or triggering a costly API call.
  • Data Replication: A single instance needs to act as a single-point-of-truth to keep track of data being synchronized between multiple data stores. A typical example would be replication of data in a database cluster.

Typically people solve these issues with one of the following concepts:

  • Distributed Databases: Works, but you will either have to subscribe to a reliable service from a cloud vendor or operate one yourself (naaaah …).
  • Shared File Systems: A network file system can be used by multiple clients to share state, for example in text files. But: you need to implement a file-based locking mechanism all by yourself to avoid chaos. Don’t do this. I’ve rarely seen it work reliably.
  • Consensus Algorithms: People implement a battle-proven consensus algorithm like Paxos or Raft. If you ask me: these algorithms are very powerful, because you can achieve consistent leader election without an external data store, but in most cases this would be like shooting sparrows with cannons.

How Does Kubernetes Come Into Play?

When operating applications on Kubernetes clusters, the individual instances of your app will be so-called Pods. These Pods can communicate with each other through Services, but they can also access the Kubernetes cluster’s API, if they have the required permissions assigned.

Why would the Pods want to talk to the Kubernetes API? The reason is the key-value store backing Kubernetes: a core component of every Kubernetes cluster out there is the kube-apiserver, which uses etcd as a consistent and highly-available key-value store. It is basically the brain of the cluster, storing all configuration and state of the Pods running in the cluster.

How does this help us with our need to implement a distributed system? We can use it as a state store to manage leader election: the Kubernetes API guarantees sequential consistency. If multiple instances (Pods) compete for a resource (in our case “registering for leadership”), we will never face a situation in which an instance overwrites a registration of another instance without their knowledge. In computer science terms: we can avoid split-brain situations.

A Proof-of-Concept Using Kubectl

Before we will start to create actual code and run Pods, let’s illustrate this concept using some simple kubectl commands. What is going to happen?

  • A ConfigMap will be used to store the fact about who is the leader. Think of it as a blackboard into which someone enters their name to indicate that it’s their turn to take out the trash or buy candy for the team.
  • We will pretend to have two independently acting Pods (john and jane) trying to write their name into the ConfigMap in order to state their leadership.
  • We want to achieve that John and Jane autonomously attempt to register themselves as leaders. At the same time we want to ensure that they do not overwrite each other’s registration. The latter could happen, for example, if John and Jane see an empty ConfigMap at nearly the same point in time, but Jane is quicker and enters her name first, only to be overwritten by John shortly after, because he is unaware that Jane forestalled him.

Ok, let’s go!

Step 1: We will prepare a ConfigMap known to John and Jane. Let’s call it blackboard.

kubectl create configmap blackboard --from-literal=leader=''

As you can see, we will start with a state in which no one is the leader.

All members of the cluster will now attempt to register for leadership at nearly the exact same point in time. John will be the fastest one.

Step 2: John’s first action is to check if someone has already registered for leadership.

JOHN> kubectl get configmap blackboard -o yaml

The output says:

apiVersion: v1
data:
  leader: ""
kind: ConfigMap
metadata:
  creationTimestamp: "2022-04-30T17:48:32Z"
  name: blackboard
  namespace: default
  resourceVersion: "31719"
  uid: d0ec5f7f-aae1-4fd0-83fb-8bbddf5082c7

Good for John! He is able to register for leadership, as no one else has done so before him (what a surprise for us). As you can see in the output, Kubernetes returns a resourceVersion along with the ConfigMap. This resourceVersion will be important for us, because John will later send it along with the API request when he attempts to enter his name in the blackboard ConfigMap.

Step 3: Jane is somehow a little quicker than John. While John found out that the leadership slot is free, Jane got the same information, but we will pretend that she is faster to react to it.

JANE> kubectl get configmap blackboard -o yaml

The resource content returned to Jane shows the same resourceVersion:

[...]
  leader: ""
[...]
  resourceVersion: "31719"
[...]

Step 4: Both instances, John and Jane, now have an in-memory state indicating that they are free to register for leadership. As Jane is moving faster, her attempt to register for leadership comes in before John’s, because John’s request was, let’s say, stuck in a traffic jam at some network interface (typical daily problems of actors in a distributed system). To register, Jane takes the previous output she received and simply changes the empty string in the leader property to her name:

apiVersion: v1
data:
  leader: "jane"
kind: ConfigMap
metadata:
  name: blackboard
  namespace: default
  resourceVersion: "31719"

This file is saved as jane1.yaml. It is vital that we keep the resourceVersion property with its original value, as our whole approach is based upon this concept.

If we now ran the widely-known kubectl apply command, Jane would be registered without consideration of any changes made to the ConfigMap after the point in time at which she had her look at the blackboard. To avoid such a blind overwrite of the ConfigMap, we will instead use the kubectl replace command:

JANE> kubectl replace -f jane1.yaml

The returned message configmap/blackboard replaced indicates that Jane successfully registered for leadership. This means that the resourceVersion has not changed since she initially requested the ConfigMap.

Step 5: Remember: John has neither a clue that Jane has also attempted to register for leadership, nor does he know that Jane was successful in doing so before him. Now John attempts to register for leadership the same way Jane did. File john1.yaml contains:

apiVersion: v1
data:
  leader: "john"
kind: ConfigMap
metadata:
  name: blackboard
  namespace: default
  resourceVersion: "31719"

And we will run the same replacement as Jane did:

JOHN> kubectl replace -f john1.yaml

Surprise, surprise – but what message does John get?

Error from server (Conflict): error when replacing "john1.yaml":
Operation cannot be fulfilled on configmaps "blackboard":
the object has been modified; please apply your changes to the latest version and try again

Oopsie. It works as designed. John cannot register, because his knowledge of the state was outdated and his request would have overwritten Jane’s rightful registration. The rule here is: first come, first served.

What was all the fuss about the resourceVersion? Jane attempted the registration before John, hence the older resourceVersion provided in John’s later request did not match the current resourceVersion of the resource, which had been incremented by Jane’s earlier ConfigMap change. John’s attempt to register fails.

(At this point we are essentially done, but to make it entirely clear, let’s check what the ConfigMap looks like after all this.)

Step 6: John wants to know who the leader is. This is important to him, because he will need to accept orders or even request them from the leader:

JOHN> kubectl get configmap blackboard -o yaml

Here he sees that Jane is the leader:

apiVersion: v1
data:
  leader: jane
kind: ConfigMap
metadata:
  creationTimestamp: "2022-04-30T17:48:32Z"
  name: blackboard
  namespace: default
  resourceVersion: "32387"
  uid: d0ec5f7f-aae1-4fd0-83fb-8bbddf5082c7

The resourceVersion is “32387” instead of the earlier returned “31719”. This is the reason why his attempt to enter his name in the ConfigMap was rejected.
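The manual steps John and Jane performed can be condensed into a small script. The following is a minimal sketch, not a production implementation: the function name try_claim_leadership is our own invention, and it assumes kubectl and jq are available and the blackboard ConfigMap exists. It reads the blackboard, backs off if someone else is already registered, and otherwise lets kubectl replace enforce the resourceVersion check:

```shell
# Hypothetical sketch of the procedure shown above: read the blackboard,
# then try to claim it only if the leader slot is still empty.
try_claim_leadership() {
  me="$1"
  # Step 1: read the current state, including its resourceVersion.
  state=$(kubectl get configmap blackboard -o json) || return 1
  leader=$(printf '%s' "$state" | jq -r '.data.leader // ""')
  if [ -n "$leader" ]; then
    echo "leader is already: $leader"
    return 1
  fi
  # Step 2: write our name while keeping the original resourceVersion.
  # "kubectl replace" fails with a Conflict if the ConfigMap changed
  # since we read it -- exactly the behaviour our scheme relies on.
  printf '%s' "$state" | jq --arg me "$me" '.data.leader = $me' \
    | kubectl replace -f - >/dev/null
}

# Usage: try_claim_leadership "john" && echo "I am the leader"
```

Note that a Conflict on the replace step is not an error condition for the caller, it simply means someone else won the race and the function should be retried later against the fresh state.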

How to Replace Resources With API Calls

We have demonstrated a working principle within Kubernetes to implement leader election based on a “first come, first served” paradigm for an application workload. The demonstration was conducted using kubectl. We all love kubectl, but for a real application we would ideally implement this by communicating with the Kubernetes API directly, instead of using kubectl.

The crucial part of the whole procedure is to not apply a resource (and thereby overwrite the existing one), but to request a replacement of the resource, which includes the comparison of the resourceVersion field.

Let’s build a quick example in which we test the required API calls from a Pod within the cluster instead of using kubectl from outside the cluster.

Create a ServiceAccount

To read from and write to ConfigMaps from within a Pod, we need to assign the Pod a ServiceAccount that is bound to a Role allowing the specific API access. The following manifest creates the required Role and uses a RoleBinding to bind it to a ServiceAccount:

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: blackboard
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: blackboard
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  resourceNames: ["blackboard"]
  verbs: ["get", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: blackboard
subjects:
- kind: ServiceAccount
  name: blackboard
roleRef:
  kind: Role
  name: blackboard
  apiGroup: rbac.authorization.k8s.io

Create the resources by applying the file: kubectl apply -f rbac.yaml

With this ServiceAccount, we can allow a Pod to make API requests that encompass reading from and updating a ConfigMap called blackboard.
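Before running anything in the Pod, you can let the API server itself confirm that the RBAC setup works, using kubectl auth can-i with ServiceAccount impersonation. The helper function below is our own illustration (it assumes the default namespace; adjust the ServiceAccount reference if you deployed elsewhere):

```shell
# Hypothetical helper: verify the RBAC setup by impersonating the
# ServiceAccount (assumes the "default" namespace).
check_blackboard_rbac() {
  sa="system:serviceaccount:default:blackboard"
  kubectl auth can-i get configmaps/blackboard --as="$sa" &&
    kubectl auth can-i update configmaps/blackboard --as="$sa"
}

# Usage: check_blackboard_rbac && echo "RBAC looks good"
```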

Make sure to create a new and empty ConfigMap:

kubectl delete configmap blackboard
kubectl create configmap blackboard

Run a Pod

We will use the docker.io/ubuntu:latest container image, into which we will install the tools needed to access the Kubernetes API. The most important part is to configure the previously created ServiceAccount for the Pod:

apiVersion: v1
kind: Pod
metadata:
  name: blackboard-test
spec:
  serviceAccountName: blackboard
  containers:
    - name: blackboard
      image: "docker.io/ubuntu:latest"
      command: [ "/bin/bash", "-c", "--" ]
      args: [ "while true; do sleep 30; done;" ]

Run this Pod using: kubectl apply -f pod.yaml

Setup API Access

Let’s enter the Pod:

kubectl exec -it blackboard-test -- bash

For our temporary testing purposes, we need to install jq in the running container to make it easier to manipulate API output, and also cURL to be able to make API requests at all:

apt update
apt install -y curl jq

For your actual application container, you would use the respective HTTP and JSON libraries of the programming language you use to implement the procedure.

To be able to test API calls from within the Pod, we need to set up API access as described in the official Kubernetes documentation. Run these commands:

# Point to the internal API server hostname
APISERVER=https://kubernetes.default.svc

# Path to ServiceAccount token
SERVICEACCOUNT=/var/run/secrets/kubernetes.io/serviceaccount

# Read this Pod's namespace
NAMESPACE=$(cat ${SERVICEACCOUNT}/namespace)

# Read the ServiceAccount bearer token
TOKEN=$(cat ${SERVICEACCOUNT}/token)

# Reference the internal certificate authority (CA)
CACERT=${SERVICEACCOUNT}/ca.crt

Note that you have to re-run the commands above each time you open a new shell in the Pod.

Test Modifying the ConfigMap

Now we send a GET request to the Kubernetes API that is equivalent to a kubectl get configmap blackboard -o json command:

curl \
  --cacert ${CACERT} \
  --header "Authorization: Bearer ${TOKEN}" \
  -X GET ${APISERVER}/api/v1/namespaces/${NAMESPACE}/configmaps/blackboard \
  | tee /tmp/state1.json

The command prints out the currently empty ConfigMap, but also stores it in a JSON file on disk for further use. We should see output similar to this:

{
  "kind": "ConfigMap",
  "apiVersion": "v1",
  "metadata": {
    "name": "blackboard",
    "namespace": "default",
    "uid": "b904d494-74aa-44e6-8e6d-9e89c6db6a6a",
    "resourceVersion": "52444",
    "creationTimestamp": "2022-05-21T13:16:57Z"
  }
}

We know that the ConfigMap is empty, because there is no data map in the JSON output.
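Since everything in this scheme revolves around the resourceVersion, it is convenient to pull it (and the current leader, if any) out of such a stored response. The two helper functions below are our own illustration, not part of any tool; the // "" fallback covers a ConfigMap that has no data map yet:

```shell
# Hypothetical helpers to read fields from a stored ConfigMap response.
blackboard_version() {
  jq -r '.metadata.resourceVersion' "$1"
}
blackboard_leader() {
  # The "// empty string" fallback handles a ConfigMap without a data map.
  jq -r '.data.leader // ""' "$1"
}

# Usage:
#   blackboard_version /tmp/state1.json
#   blackboard_leader  /tmp/state1.json
```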

To demonstrate the equivalent of kubectl replace as an API call, we will now modify the ConfigMap and add a leader entry. We could do this using a text editor, but to make it easier to reproduce, we will modify the JSON output using jq:

jq \
  --arg leader "jane" \
  '. += { data: { leader: $leader } }' \
  /tmp/state1.json \
  | tee /tmp/state2.json

Our new manifest in /tmp/state2.json now looks like this:

{
  "kind": "ConfigMap",
  "apiVersion": "v1",
  "metadata": {
    "name": "blackboard",
    "namespace": "default",
    "uid": "b904d494-74aa-44e6-8e6d-9e89c6db6a6a",
    "resourceVersion": "52444",
    "creationTimestamp": "2022-05-21T13:16:57Z"
  },
  "data": {
    "leader": "jane"
  }
}

Now let’s push this into the cluster:

curl \
  --cacert ${CACERT} \
  --header "Authorization: Bearer ${TOKEN}" \
  --header "Content-Type: application/json" \
  -X PUT ${APISERVER}/api/v1/namespaces/${NAMESPACE}/configmaps/blackboard \
  --data @/tmp/state2.json

Now let’s retry the same procedure for John, also using the “old” state of the ConfigMap which we stored in /tmp/state1.json previously:

jq \
  --arg leader "john" \
  '. += { data: { leader: $leader } }' \
  /tmp/state1.json \
  | tee /tmp/state3.json

And now we will try to send this payload to the Kubernetes API. The HTTP PUT operation is the way to go in order to replace instead of to apply:

curl \
  --cacert ${CACERT} \
  --header "Authorization: Bearer ${TOKEN}" \
  --header "Content-Type: application/json" \
  -X PUT ${APISERVER}/api/v1/namespaces/${NAMESPACE}/configmaps/blackboard \
  --data @/tmp/state3.json

As expected, the API refuses to apply this modification due to a mismatch of the resourceVersion values:

{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "Operation cannot be fulfilled on configmaps \"blackboard\": the object has been modified; please apply your changes to the latest version and try again",
  "reason": "Conflict",
  "details": {
    "name": "blackboard",
    "kind": "configmaps"
  },
  "code": 409
}
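A real application would of course not read this response by eye: it would inspect the HTTP status code and treat 409 as “someone beat me to it, re-read the blackboard”. Using curl’s -w '%{http_code}' option, which prints the status code after the transfer, that logic can be sketched as follows. The function name replace_blackboard is ours, and the sketch assumes APISERVER, NAMESPACE, TOKEN and CACERT are set as shown above:

```shell
# Hypothetical sketch: attempt the replacement and react to the outcome.
replace_blackboard() {
  payload="$1"
  # -w '%{http_code}' makes curl print the status code to stdout;
  # the response body itself goes to a file for later inspection.
  status=$(curl -s -o /tmp/put-response.json -w '%{http_code}' \
    --cacert "${CACERT}" \
    --header "Authorization: Bearer ${TOKEN}" \
    --header "Content-Type: application/json" \
    -X PUT "${APISERVER}/api/v1/namespaces/${NAMESPACE}/configmaps/blackboard" \
    --data @"$payload")
  case "$status" in
    200) echo "registration succeeded" ;;
    409) echo "conflict: someone else registered first, re-read the blackboard"
         return 1 ;;
    *)   echo "unexpected status: $status"
         return 1 ;;
  esac
}
```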

And that’s all there is to it!

Conclusion

In this post we demonstrated how to make use of built-in Kubernetes capabilities to extend workloads with simple leader election. I personally had to use this in a scenario that required Pods to register themselves in a ConfigMap to enable appropriate communication between the application instances in the Pods.

If you plan to deploy your application to Kubernetes, it may be preferable to implement leader election by leveraging the Kubernetes API, instead of implementing a consensus algorithm or having to operate your own key-value store inside or outside the cluster.

Kubernetes can do this for you!
