I have implemented an HTTP health check and a separate HTTP liveness check for my pod. For both, Kubernetes works as expected if my pod delays before responding. However, when the checks respond immediately with status 500, Kubernetes treats that as a success. This is after the pod is up and running OK, before the checks start returning status 500.

In fact, returning status 500 actually resets the failure count, so my pod is treated as healthy again.

Am I doing something wrong? How do I get Kubernetes to react when my pod is unhealthy?

$ k version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.8", GitCommit:"9f2892aab98fe339f3bd70e3c470144299398ace", GitTreeState:"clean", BuildDate:"2020-08-13T16:12:48Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1", GitCommit:"c4d752765b3bbac2237bf87cf0b1c2e307844666", GitTreeState:"clean", BuildDate:"2020-12-18T12:00:47Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}

To investigate this problem, I added test endpoints to my pod so that I can change its behaviour at runtime: pass (200), fail (500), or delayed fail (wait 15 seconds, then return 500). I also separated the health and liveness endpoints.

From describe pod:

Liveness:   exec [curl http://localhost:30030/livez] delay=10s timeout=1s period=10s #success=1 #failure=5
Readiness:  exec [curl http://localhost:30030/healthz] delay=10s timeout=1s period=10s #success=1 #failure=3

I tested the endpoints by exec'ing into the pod and curling them from there (details below).
Then I cycled both the liveness check and the health check through the 3 modes and monitored how Kubernetes responded.
Health check: expect the pod to be restarted after the health check fails 5 times in a row.
Liveness check: describe the service and expect the pod's IP address to be removed from the list of endpoints.

Success case:

bash-4.4$ curl http://localhost:30030/unfailhealth
unfailhealth: REMOVE force all health checks to fail, was failHealth=false, delayFailHealth=false
bash-4.4$ curl http://localhost:30030/healthz -v
*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 30030 (#0)
> GET /healthz HTTP/1.1
> Host: localhost:30030
> User-Agent: curl/7.61.1
> Accept: */*
< HTTP/1.1 200 OK
< X-Powered-By: Express
< Content-Type: text/html; charset=utf-8
< Content-Length: 3
< ETag: W/"3-CftlTBfMBbEe9TvTWqcB9tVQ6OE"
< Date: Fri, 05 Feb 2021 13:30:59 GMT
< Connection: keep-alive
< Keep-Alive: timeout=5
* Connection #0 to host localhost left intact

Failure case:

bash-4.4$ curl http://localhost:30030/failhealth
failhealth: force all health checks to fail, was failHealth=true, delayFailHealth=false
bash-4.4$ curl http://localhost:30030/healthz -v
*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 30030 (#0)
> GET /healthz HTTP/1.1
> Host: localhost:30030
> User-Agent: curl/7.61.1
> Accept: */*
< HTTP/1.1 500 Internal Server Error
< X-Powered-By: Express
< Content-Type: text/html; charset=utf-8
< Content-Length: 26
< ETag: W/"1a-yI5D4Rtao1KH34GZVYKKvxZoEVo"
< Date: Fri, 05 Feb 2021 13:29:14 GMT
< Connection: keep-alive
< Keep-Alive: timeout=5
FAKE HEALTH CHECK FAILURE
* Connection #0 to host localhost left intact

Delayed failure case:

bash-4.4$ curl http://localhost:30030/delayfailhealth
delayfailhealth: force all health checks to sleep 15sec, then fail, was failHealth=false, delayFailHealth=true
bash-4.4$ date; curl http://localhost:30030/healthz -v
Fri Feb  5 13:33:08 UTC 2021
*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 30030 (#0)
> GET /healthz HTTP/1.1
> Host: localhost:30030
> User-Agent: curl/7.61.1
> Accept: */*
< HTTP/1.1 500 Internal Server Error
< X-Powered-By: Express
< Content-Type: text/html; charset=utf-8
< Content-Length: 47
< ETag: W/"2f-n+Ix8oU/09OT9+cpPVm1/EejE9Y"
< Date: Fri, 05 Feb 2021 13:33:23 GMT
< Connection: keep-alive
< Keep-Alive: timeout=5
FAKE HEALTH CHECK FAILURE - AFTER 15 SEC DELAY
* Connection #0 to host localhost left intact

Test Results

Default to SUCCESS for both health and liveness endpoints, return status 200 -> pod starts and works OK.

Set liveness check to FAIL, return status 500 -> no change, pod IP still in service, requests still dispatched to the pod.
Set liveness check to DELAY before responding (then 500) -> pod is removed from the Kubernetes service (yippee).
Set liveness check to FAIL (quickly) again -> pod is restored to the service (treated like a success).

Set health check to FAIL (return status 500) -> no effect, pod continues without restart.
Set health check to DELAY before responding (then 500) -> pod is restarted after 5 failed probes.

Thanks for any help with this. I guess I can change my code to delay before responding in the failure case, but that seems like a workaround.

(a) Pod liveness and readiness have httpGet: available which avoids the need to spawn curl for that action, thus avoiding simple bugs such as (b) running curl without -f will cause it to exit 0 no matter what the server response code is (c) this is not a programming question and thus belongs on ServerFault.com – mdaniel Feb 5, 2021 at 17:01

(a), (b) Understood - testing that now. (c) I followed the directions in the Kubernetes documentation to ask the question here kubernetes.io/docs/tasks/debug-application-cluster/… > The Kubernetes team will also monitor posts tagged Kubernetes. If there aren't any existing questions that help, please ask a new one! – Dave Deasy Feb 5, 2021 at 18:46

Problem solved thanks to the comment from @mdaniel. I am expanding on it here because it took me a while to fully understand the comment.

The problem was in the configuration of the health and liveness checks in the pod spec.

        readinessProbe:
          exec:
            command:
            - curl
            - http://localhost:30030/healthz
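            # without -f, curl exits 0 even on an HTTP 500, so this probe never reports a failure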
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1

This relies on the exit code of the curl command in the exec clause, not on the HTTP status it receives. Without -f, curl exits 0 even when the server responds with an error status such as 500, so the probe is reported as a success. If you want to keep using curl, add -f (--fail); curl then exits non-zero (exit code 22) when the HTTP status is 400 or above.
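
If you do want to stay with curl, a minimal sketch of the corrected exec probe might look like this (same path, port and thresholds as above; the -s flag, suggested in the comment further down, just keeps curl's progress output out of kubectl describe):

        readinessProbe:
          exec:
            command:
            - curl
            - -sf                   # -f: exit non-zero (22) when the HTTP status is 400 or above; -s: suppress progress output
            - http://localhost:30030/healthz
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1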

But it is better to use httpGet in the pod spec, like this:

        readinessProbe:
          httpGet:
            path: /healthz
            port: 30030
            scheme: HTTP
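            # the kubelet treats any status from 200 to 399 as success, so a 500 now counts as a failure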
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1

I tested both and both work. I will go with httpGet as suggested - the right tool for the job.

Note that the reason for using exec/curl instead of httpGet in the first place was that the pod uses TLS, which stops a plain-HTTP httpGet probe from the kubelet from working. Ref. https://medium.com/cloud-native-the-gathering/kubernetes-liveness-probe-for-scratch-image-with-istio-mtls-enabled-90543e4bae34
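
As an aside, if the pod terminated TLS itself rather than having it enforced by a mesh sidecar (the situation in the linked article), httpGet can also probe over TLS, and the kubelet skips certificate verification for such probes. A sketch, assuming the same path and port as above:

        readinessProbe:
          httpGet:
            path: /healthz
            port: 30030
            scheme: HTTPS           # the kubelet does not verify the certificate for HTTPS probes
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1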

Thanks!

I'm glad it was something simple; be aware you can accept your own answer to indicate that this answer solved your problem :-) For those who insist on using curl -f, they will benefit from using curl -sf also, which keeps the chatty curl responses out of the kubectl describe output – mdaniel Feb 5, 2021 at 20:31
