I set up a Kubernetes cluster with a single master node and two worker nodes using `kubeadm`, and I am trying to figure out how to recover from node failure.

When a worker node fails, recovery is straightforward: I create a new worker node from scratch, run `kubeadm join`, and everything works.

However, I cannot figure out how to recover from master node failure (without interrupting the deployments running on the worker nodes). Do I need to back up and restore the original certificates, or can I just run `kubeadm init` to create a new master from scratch? How do I join the existing worker nodes?
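For reference, this is roughly what my worker recovery looks like (the API server address, token, and CA hash below are placeholders, not my real values):

```sh
# Run on the freshly provisioned worker node; all values are placeholders.
kubeadm join 10.0.0.10:6443 \
  --token abcdef.0123456789abcdef \
  --discovery-token-ca-cert-hash sha256:<hash-of-cluster-ca-certificate>
```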
`kubeadm init` will definitely not work out of the box, as that will create a new cluster altogether, with new credentials, a new IP space, etc.
At a minimum, restoring the master node will require a backup of your etcd data. This typically lives in the /var/lib/etcd directory.

You will also need the kubeadm config from the cluster; `kubeadm config view` should output this (kubeadm v1.8 and later).
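Roughly, those two steps could look like the sketch below (paths are kubeadm defaults; adjust to your setup, and note that copying the data directory of a running etcd is only safe if etcd is stopped, otherwise prefer an etcdctl snapshot as in the other answers):

```sh
# Dump the cluster's kubeadm configuration (kubeadm v1.8+).
kubeadm config view > kubeadm-config.yaml

# Cold copy of the etcd data directory (take it while etcd is not writing).
sudo tar czf etcd-backup.tar.gz -C /var/lib etcd
```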
The step-by-step procedure for restoring a master node really isn't clean-cut, which is why Kubernetes introduced HA (High Availability). This is a much safer way of maintaining redundancy and uptime, particularly because restoring anything from etcd can be a real pain (in my humble opinion and experience).
If I may go a bit off topic from your question: if you are still getting started with Kubernetes and not deeply invested in kubeadm, I would suggest you consider creating your cluster with kops instead. It supports HA already, and I found kops to be more robust and easier to use than either kubeadm or kube-aws (the CoreOS cluster builder).
https://kubernetes.io/docs/getting-started-guides/kops/
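As a rough sketch of what that could look like (cluster name, S3 state store, and zones below are placeholders, and an HA control plane needs multiple master zones):

```sh
# Placeholder state store and cluster name; see the kops docs for real values.
export KOPS_STATE_STORE=s3://my-kops-state-bucket

kops create cluster \
  --name=my-cluster.example.com \
  --zones=us-east-1a,us-east-1b,us-east-1c \
  --master-zones=us-east-1a,us-east-1b,us-east-1c \
  --yes
```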
Regarding your question about backing up the master: backup procedures in the traditional/legacy sense (backup tools and techniques) aren't covered directly in the official documentation (as far as I know), but you can take precautions with some options/workarounds (see the etcd snapshot sketch after the list below):
- Set up HA masters (only for GCE): Set up High-Availability Kubernetes Masters
- Set up an HA etcd cluster / master load balancer: Setting up an HA etcd cluster, Set up master Load Balancer, Operating etcd clusters for Kubernetes
- OS file system snapshot/backup
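As one concrete precaution along those lines, you can take an etcd snapshot with etcdctl. The endpoint and certificate paths below are the kubeadm defaults and may differ on your cluster (the healthcheck-client cert/key pair also works as a client identity):

```sh
# Snapshot a kubeadm-managed etcd; adjust cert paths if yours differ.
ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```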
I ended up writing a Kubernetes `CronJob` that backs up the etcd data. If you are interested, I wrote a blog post about it: https://labs.consol.de/kubernetes/2018/05/25/kubeadm-backup.html
In addition to that, you may want to back up all of `/etc/kubernetes/pki` to avoid issues with secrets (tokens) having to be renewed.

For example, kube-proxy uses a secret to store a token, and this token becomes invalid if only the etcd certificate is backed up.
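A minimal sketch of that certificate backup (the destination path is just an example):

```sh
# Back up the whole kubeadm PKI directory so certificates and service-account
# keys survive a master rebuild.
sudo tar czf /var/backups/kubernetes-pki.tar.gz -C /etc/kubernetes pki
```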