I set up a Kubernetes cluster with a single master node and two worker nodes using kubeadm, and I am trying to figure out how to recover from node failure.

When a worker node fails, recovery is straightforward: I create a new worker node from scratch, run kubeadm join, and everything's fine.

However, I cannot figure out how to recover from master node failure (without interrupting the deployments running on the worker nodes). Do I need to back up and restore the original certificates, or can I just run kubeadm init to create a new master from scratch? How do I join the existing worker nodes?

kubeadm init will definitely not work out of the box, as that will create a new cluster altogether, with new credentials, IP space, etc.

At a minimum, restoring the master node will require a backup of your etcd data. This typically lives in the /var/lib/etcd directory.
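A minimal sketch of such a backup, assuming the data directory is /var/lib/etcd as mentioned above and that a plain tar archive is acceptable. The function name backup_dir and the destination path in the usage line are hypothetical. Copying the directory of a running etcd may capture an inconsistent state, so for a live member `ETCDCTL_API=3 etcdctl snapshot save` is generally the safer route.

```shell
#!/bin/sh
# Archive a directory into a timestamped tarball and print the archive path.
# Arguments: $1 = directory to back up (for example /var/lib/etcd)
#            $2 = destination directory for the archive (hypothetical path)
# Assumption: etcd is stopped, or an inconsistent copy is acceptable.
backup_dir() {
    src="$1"
    dest="$2"
    [ -d "$src" ] || { echo "not a directory: $src" >&2; return 1; }
    mkdir -p "$dest" || return 1
    archive="$dest/$(basename "$src")-$(date +%Y%m%d-%H%M%S).tar.gz"
    # -C stores paths relative to the parent of $src, e.g. "etcd/member/..."
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    echo "$archive"
}
```

Usage, as root on the master: `backup_dir /var/lib/etcd /root/backups`. The printed path can then be copied off the node, since a backup that only lives on the master is lost together with it.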

You will also need the kubeadm config from the cluster; kubeadm config view should output this (v1.8 and later).

The step-by-step procedure to restore a master node really isn't clear-cut, which is why HA (High Availability) was introduced. It is a much safer way of maintaining redundancy and uptime, particularly because restoring anything from etcd can be a real pain (in my humble opinion and experience).

If I may go a bit off topic from your question: if you are still getting started with Kubernetes and not deeply invested in kubeadm, I would suggest you consider creating your cluster with kops instead. It supports HA already, and I found kops to be more robust and easier to use than either kubeadm or kube-aws (the CoreOS cluster builder). https://kubernetes.io/docs/getting-started-guides/kops/

Regarding master backups: if you mean backup procedures in the traditional sense (legacy backup tools and techniques), they aren't covered directly in the official documentation, as far as I know. You can still take precautions with the following options and workarounds:

  • Set up HA masters (GCE only)
    Set up High-Availability Kubernetes Masters

  • Set up an HA etcd cluster / master load balancer
    Setting-up-an-ha-etcd-cluster
    Set up master Load Balancer
    Operating etcd clusters for Kubernetes

  • OS file system snapshot/backup

I ended up writing a Kubernetes CronJob that backs up the etcd data. If you are interested, I wrote a blog post about it: https://labs.consol.de/kubernetes/2018/05/25/kubeadm-backup.html

In addition to that, you may want to back up all of /etc/kubernetes/pki to avoid issues with secrets (tokens) having to be renewed.

For example, kube-proxy uses a secret to store a token and this token becomes invalid if only the etcd certificate is backed up.
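One way to sanity-check a copied pki directory before relying on it is sketched below. It assumes the default kubeadm layout, in which the cluster CA (ca.crt, ca.key) and the service-account signing key pair (sa.key, sa.pub) sit directly under /etc/kubernetes/pki. The function name check_pki_backup and the backup path in the usage line are hypothetical, and the file list is a minimum rather than a complete inventory.

```shell
#!/bin/sh
# Verify that a pki backup directory contains the cluster CA and the
# service-account key pair. File names assume kubeadm defaults.
# Argument: $1 = directory holding the copied pki files
# Returns 0 if all files are present, 1 otherwise (missing files on stderr).
check_pki_backup() {
    pki_dir="$1"
    missing=0
    for f in ca.crt ca.key sa.key sa.pub; do
        if [ ! -f "$pki_dir/$f" ]; then
            echo "missing: $pki_dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Usage: `check_pki_backup /root/backups/pki && echo "pki backup looks complete"`. Service-account tokens are signed with sa.key, so a restore that regenerates this key leaves the tokens stored in etcd unverifiable.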

I tried following these steps and "simulated" a master crash with kubeadm reset, but after I do the steps to restore and run kubeadm init with the appropriate flags, the network seems to be broken. I can delete and then recreate the calico pods, so everything seems to be working, but the pods cannot reach anything over the network... Have you encountered this kind of issue before? – EyfI Aug 30 '18 at 8:34

I am using flannel and it works fine. I don't know much about calico, but there could be multiple reasons for this. Maybe calico updates the etcd state very frequently and cannot deal with resetting etcd to the state of the previous backup. Or maybe calico stores state somewhere else on the master, outside of etcd. – fabstab Sep 2 '18 at 19:00

Thank you for the response. Looks like it's not a calico issue. I looked into the logs and most pods were spitting out unauthorized errors. This led me to this: github.com/rancher/rancher/issues/8388 and I tried the described steps of deleting serviceaccount tokens for calico and other components that had these errors, and it helped. I'm not sure why this happens though, and it gets annoying since I had to do this for quite a few apps. I don't suppose you know what might be causing such trouble? – EyfI Sep 3 '18 at 8:24

Wow, you made that! I thought that was very cool when I saw it (found it before this post). Btw, another cool thing to check out is Heptio Ark (a unique way of backing up etcd and PVs). – neokyle Oct 8 '18 at 3:51

