I set up a Kubernetes cluster with a single master node and two worker nodes using `kubeadm`, and I am trying to figure out how to recover from node failure.

When a worker node fails, recovery is straightforward: I create a new worker node from scratch, run `kubeadm join`, and everything works.

However, I cannot figure out how to recover from master node failure (without interrupting the deployments running on the worker nodes). Do I need to back up and restore the original certificates, or can I just run `kubeadm init` to create a new master from scratch? How do I join the existing worker nodes?
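For reference, this is roughly what my worker recovery looks like (the API server address, token, and CA hash below are placeholders, not my real values):

```sh
# Run on the freshly provisioned worker node; all values are placeholders.
kubeadm join 10.0.0.10:6443 \
  --token abcdef.0123456789abcdef \
  --discovery-token-ca-cert-hash sha256:<hash-of-cluster-ca-certificate>
```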
`kubeadm init` will definitely not work out of the box, as that will create a new cluster altogether, with new credentials, a new IP space, etc.
At a minimum, restoring the master node will require a backup of your etcd data. This typically lives in the /var/lib/etcd directory.

You will also need the kubeadm config from the cluster; `kubeadm config view` should output this (kubeadm v1.8 and later).
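Roughly, those two steps could look like the sketch below (paths are kubeadm defaults; adjust to your setup, and note that copying the data directory of a running etcd is only safe if etcd is stopped, otherwise prefer an etcdctl snapshot as in the other answers):

```sh
# Dump the cluster's kubeadm configuration (kubeadm v1.8+).
kubeadm config view > kubeadm-config.yaml

# Cold copy of the etcd data directory (take it while etcd is not writing).
sudo tar czf etcd-backup.tar.gz -C /var/lib etcd
```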
The step-by-step procedure for restoring a master node really isn't clean-cut, which is why Kubernetes introduced HA (High Availability). This is a much safer way of maintaining redundancy and uptime, particularly because restoring anything from etcd can be a real pain (in my humble opinion and experience).
If I may go a bit off topic from your question: if you are still getting started with Kubernetes and not deeply invested in kubeadm, I would suggest you consider creating your cluster with kops instead. It supports HA already, and I found kops to be more robust and easier to use than either kubeadm or kube-aws (the CoreOS cluster builder).
https://kubernetes.io/docs/getting-started-guides/kops/
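As a rough sketch of what that could look like (cluster name, S3 state store, and zones below are placeholders, and an HA control plane needs multiple master zones):

```sh
# Placeholder state store and cluster name; see the kops docs for real values.
export KOPS_STATE_STORE=s3://my-kops-state-bucket

kops create cluster \
  --name=my-cluster.example.com \
  --zones=us-east-1a,us-east-1b,us-east-1c \
  --master-zones=us-east-1a,us-east-1b,us-east-1c \
  --yes
```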
Regarding your question about backing up the master: backup procedures in the traditional/legacy sense (backup tools and techniques) aren't covered directly in the official documentation (as far as I know), but you can take precautions with some options/workarounds (see the etcd snapshot sketch after the list below):
- Set up HA masters (only for GCE): Set up High-Availability Kubernetes Masters
- Set up an HA etcd cluster / master load balancer: Setting up an HA etcd cluster, Set up master Load Balancer, Operating etcd clusters for Kubernetes
- OS file system snapshot/backup
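As one concrete precaution along those lines, you can take an etcd snapshot with etcdctl. The endpoint and certificate paths below are the kubeadm defaults and may differ on your cluster (the healthcheck-client cert/key pair also works as a client identity):

```sh
# Snapshot a kubeadm-managed etcd; adjust cert paths if yours differ.
ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```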
I ended up writing a Kubernetes `CronJob` that backs up the etcd data. If you are interested, I wrote a blog post about it: https://labs.consol.de/kubernetes/2018/05/25/kubeadm-backup.html
In addition to that, you may want to back up all of `/etc/kubernetes/pki` to avoid issues with secrets (tokens) having to be renewed.

For example, kube-proxy uses a secret to store a token, and this token becomes invalid if only the etcd certificate is backed up.
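A minimal sketch of that certificate backup (the destination path is just an example):

```sh
# Back up the whole kubeadm PKI directory so certificates and service-account
# keys survive a master rebuild.
sudo tar czf /var/backups/kubernetes-pki.tar.gz -C /etc/kubernetes pki
```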