Backup and Restore etcd in Kubernetes Cluster for CKA v1.19

The final module of the Cluster Architecture, Installation, and Configuration domain is “Implement etcd backup and restore.” Let’s quickly walk through the actions needed to complete this step for the exam.

Perform a Backup of etcd

While it’s still early and details of the CKA v1.19 environment aren’t known yet, I’m anticipating a small change to how etcd backup and restore is performed. If you’ve been preparing for the CKA before the September 2020 change to Kubernetes v1.19, you may be familiar with exporting the environment variable ETCDCTL_API=3 to ensure you’re using version 3 of etcd’s API, which has the backup and restore capability. However, Kubernetes v1.19 ships with etcd 3.4.9, and in etcd 3.4.x the default API version is 3, so this step is no longer necessary! If etcdctl version returns a version lower than 3.4.x, you will still need to set the API version to 3 before performing backup and restore operations.
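
To see which version you’re working with (and to cover yourself on an older cluster), a quick check like the one below does the trick; the exact output varies a little between etcd builds:

etcdctl version        # on older builds where the v2 API is the default, use: etcdctl --version
export ETCDCTL_API=3   # only needed if the reported API version is older than 3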

Get The Info You Need First

When you type the etcd backup command, you’re going to need to specify the location of a few certificates, a key, and the client endpoint. Let’s grab those really quick! First, get the name of our etcd pod:

kubectl get pods -A

Get the details of our etcd pod:

kubectl describe pods etcd-controlplane -n kube-system

The output that we’re interested in is under the Command section. You will need to copy the locations of:

  • cert-file
  • key-file
  • trusted-ca-file
  • listen-client-urls
Command: 
etcd
--advertise-client-urls=https://172.17.0.54:2379
--cert-file=/etc/kubernetes/pki/etcd/server.crt
--client-cert-auth=true
--data-dir=/var/lib/etcd
--initial-advertise-peer-urls=https://172.17.0.54:2380
--initial-cluster=master=https://172.17.0.54:2380
--key-file=/etc/kubernetes/pki/etcd/server.key
--listen-client-urls=https://127.0.0.1:2379,https://172.17.0.54:2379
--listen-metrics-urls=http://127.0.0.1:2381
--listen-peer-urls=https://172.17.0.54:2380
--name=master
--peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
--peer-client-cert-auth=true
--peer-key-file=/etc/kubernetes/pki/etcd/peer.key
--peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
--snapshot-count=10000
--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
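
If you’d rather not scan the whole describe output, a quick grep pulls out just those flags (assuming the pod is named etcd-controlplane as above):

kubectl describe pod etcd-controlplane -n kube-system | grep -E -- '--(cert-file|key-file|trusted-ca-file|listen-client-urls)'
# the leading -- stops grep from treating the pattern as an option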

Now here’s the fun part: the names of the options that etcd uses aren’t the same as the ones etcdctl uses for the backup, but they’re close enough to match up. Here’s how they map:

etcd option             etcdctl option
--cert-file             --cert
--key-file              --key
--trusted-ca-file       --cacert
--listen-client-urls    --endpoints

Your backup command should look like this:

etcdctl snapshot save etcd.db \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--endpoints=https://127.0.0.1:2379 \
--key=/etc/kubernetes/pki/etcd/server.key

Output:

{"level":"info","ts":1603021662.1152575,"caller":"snapshot/v3_snapshot.go:119","msg":"created temporary db file","path":"etcd.db.part"}{"level":"info","ts":"2020-10-18T11:47:42.129Z","caller":"clientv3/maintenance.go:200","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":1603021662.1302097,"caller":"snapshot/v3_snapshot.go:127","msg":"fetching snapshot","endpoint":"https://127.0.0.1:2379"}
{"level":"info","ts":"2020-10-18T11:47:42.173Z","caller":"clientv3/maintenance.go:208","msg":"completed snapshot read; closing"}
{"level":"info","ts":1603021662.198739,"caller":"snapshot/v3_snapshot.go:142","msg":"fetched snapshot","endpoint":"https://127.0.0.1:2379","size":"1.8 MB","took":0.083223978}
{"level":"info","ts":1603021662.199425,"caller":"snapshot/v3_snapshot.go:152","msg":"saved","path":"etcd.db"}
Snapshot saved at etcd.db

That’s it! The etcd database is backed up and we’re ready to restore!
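
If you want to sanity-check the snapshot before moving on, etcdctl can report its status:

etcdctl snapshot status etcd.db --write-out=table   # prints hash, revision, total keys, and size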

Perform a Restore of etcd

With a restore, we’re going to specify those same 4 parameters from the backup operation but add a few more that are needed to initialize the restore as a new etcd store:

etcdctl snapshot restore etcd.db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--name=controlplane \
--data-dir /var/lib/etcd-from-backup \
--initial-cluster=controlplane=https://127.0.0.1:2380 \
--initial-cluster-token=etcd-cluster-1 \
--initial-advertise-peer-urls=https://127.0.0.1:2380

What are these extra parameters doing?

  • Giving the etcd cluster a new name
  • Restoring the etcd snapshot to the /var/lib/etcd-from-backup directory
  • Re-initializing the etcd cluster token since we are creating a new cluster
  • Specifying the IP:Port for etcd-to-etcd communication

Output:

{"level":"info","ts":1603021679.8156757,"caller":"snapshot/v3_snapshot.go:296","msg":"restoring snapshot","path":"etcd.db","wal-dir":"/var/lib/etcd-from-backup/member/wal","data-dir":"/var/lib/etcd-from-backup","snap-dir":"/var/lib/etcd-from-backup/member/snap"}
{"level":"info","ts":1603021679.8793259,"caller":"membership/cluster.go:392","msg":"added member","cluster-id":"7581d6eb2d25405b","local-member-id":"0","added-peer-id":"e92d66acd89ecf29","added-peer-peer-urls":["https://127.0.0.1:2380"]}
{"level":"info","ts":1603021679.9166896,"caller":"snapshot/v3_snapshot.go:309","msg":"restored snapshot","path":"etcd.db","wal-dir":"/var/lib/etcd-from-backup/member/wal","data-dir":"/var/lib/etcd-from-backup","snap-dir":"/var/lib/etcd-from-backup/member/snap"}

For the CKA exam, this is all that’s necessary to complete the task! However, in production you wouldn’t want to stop here, since this process doesn’t modify the existing etcd pod in any way. The restore process expands the snapshot into the specified directory and makes some changes to the data to reflect the new name and cluster token, but that’s it!

In production, you would modify the etcd pod’s manifest in /etc/kubernetes/manifests/etcd.yaml to use the new data directory and the initial cluster token. Upon saving the file, the etcd pod will be destroyed and recreated, and after a minute or so the pod will be running and ready. You can also check on it by viewing the logs with kubectl logs <etcd-pod-name> -n kube-system.
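
As a rough sketch of what that production step might look like (not required on the exam, and assuming the restore above went to /var/lib/etcd-from-backup), the changes boil down to pointing the manifest at the new directory:

# Sketch only: keep a copy of the original manifest outside the manifests
# directory (the /root path here is arbitrary), then point every
# /var/lib/etcd reference (--data-dir, the volumeMounts path, and the
# hostPath volume) at the restored directory.
cp /etc/kubernetes/manifests/etcd.yaml /root/etcd.yaml.bak
sed -i 's#/var/lib/etcd#/var/lib/etcd-from-backup#g' /etc/kubernetes/manifests/etcd.yaml
# You would also add --initial-cluster-token=etcd-cluster-1 to the etcd command
# in the manifest, then wait for kubelet to recreate the pod and verify with:
# kubectl get pods -n kube-system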

21 thoughts on “Backup and Restore etcd in Kubernetes Cluster for CKA v1.19”

  1. Satya

    Hi
    Can we execute the restore from the edge node if I have the “db” file on the edge node, instead of SSHing to the master node?

    1. Brandon Willmott (post author)

      Hi Satya, while I think it’s technically possible, I don’t think that will pass for the exam. Since etcd should only run on control plane nodes, they will likely be looking to see if the file was restored to the control plane node.

      1. Thomas

        Hello Satya,

        I remember that there was no context given, so I don’t know which node is the master. How can I solve the task without a context?

        Thank you,
        Thomas

  2. Biswanath Roy

    I have a similar question regarding this etcd restore. I went to the master node, edited the etcd.yaml file under /etc/kubernetes/manifests, and then moved this file to a different location to stop the static pod. Then I ran the command kubectl get pods -n kube-system but could not find any pods in the exam. Am I doing something wrong?

    1. Brandon Willmott (post author)

      Hi Biswanath, the etcd restore step on the exam can be performed without modifying the existing etcd pod. I know it seems counterintuitive compared to a restore operation in production, but on the exam they’re just looking to see the backup file restored to a new directory.

  3. Gabriel

    Hi Brandon, in the CKA exam what is the best and easiest way to do a restore?
    I am confused: do I need to restore to another data-dir location, or can I use the same one?

    1. Brandon Willmott (post author)

      Hi Gabriel, yes you will need to restore the backup to a different directory with --data-dir. You don’t need to create the directory first. In fact, etcdctl won’t restore to a directory that already exists. Use the full etcdctl snapshot restore command in the post and you’ll be perfect!

      1. Deepak

        Hi Brandon,

        Regarding the restore, do we have to run the backup and restore commands from the edge node/jump server, or do we have to SSH to the master node of that cluster in order to take the backup?
        Also, how will the test script verify that the restore was done properly if we have not restarted the etcd service with the restored data dir?

      2. Brandon Willmott (post author)

        Hi Deepak, you will need to ssh to a master/control plane node. From my understanding of the etcdctl restore process, it reports that it successfully expanded the snapshot to a data directory. In production you would need to tell etcd about the new data-dir but the exam doesn’t ask you to do that.

  4. Jose

    Hi Brandon,

    Please excuse me if this is obvious, but I’m new to Kubernetes. Where is this snapshot actually saved? I ran find on my system and found this location, but I’m not sure if this is the right place: /var/lib/docker/overlay2/c166fe1d0b3eee9275f84845c7ae3c0014648ebbddf27e37303358058a45a5a5/diff/etcd.db

    Is it safe to move this file or off-load it?

    For restore, suppose I’ve created a snapshot, located the file, and off-loaded it to external storage. Then disaster happens and I’ve lost all etcd members and the dbs. When I want to restore, where do I put the backup file so that it can be restored, or how do I tell etcdctl to restore from a specific path?

    Thanks

      1. Emil

        Hi,
        I had the exam today and the situation was:
        1. controlplane does not have etcdctl installed and there is no folder for the backup
        2. node-1 contains the backup data and corresponding certs
        3. the certs located on node-1 are not the proper ones for backing up the active etcd

        I don’t know how to resolve this question in a proper way 😦

  5. Srinivas

    They are asking to back up and restore on the student node, not on the control plane node. They started asking this question recently. Quite confusing.

    1. Halil

      Exactly, you are right, guys. They asked me to do it on the student node too and I was very confused. Also, the endpoint IP in the question is a localhost IP, so I don’t understand: does the student node have an etcd? I opened a case with LF but they say there isn’t any mistake in the question.

      1. Bilal

        Exactly, I just took the exam and had the same scenario. The certs are provided in the question, but the certs within the existing etcd pod definition are quite different. I proceeded with the backup and restore process as normal on the student node itself. But after restoring with the certs provided, the etcd pod on the control plane started crashing. And between the two nodes it’s difficult to understand which endpoint to use: the control plane node IP or the student node?

  6. samrj

    Can you please explain more about these parameters: the names used (controlplane, etcd-cluster-1) and the port (2380)?
    --initial-cluster=controlplane=https://127.0.0.1:2380 \
    --initial-cluster-token=etcd-cluster-1 \
    --initial-advertise-peer-urls=https://127.0.0.1:2380

  7. CNCF

    The etcd manifest must be changed after restoring to another folder. Also, if etcdctl is missing on the host, feel free to use the one available in the etcd container.
