I have implemented an HTTP health check and a separate HTTP liveness check for my pod. For both, Kubernetes works as expected if my pod delays before responding. However, when the checks respond immediately with status 500, Kubernetes treats that as a success. This is after the pod is up and running OK, before the checks start returning status 500.

In fact, returning status 500 actually resets the failure count, so my pod is treated as healthy again.

Am I doing something wrong? How do I get Kubernetes to react when my pod is unhealthy?

$ k version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.8", GitCommit:"9f2892aab98fe339f3bd70e3c470144299398ace", GitTreeState:"clean", BuildDate:"2020-08-13T16:12:48Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1", GitCommit:"c4d752765b3bbac2237bf87cf0b1c2e307844666", GitTreeState:"clean", BuildDate:"2020-12-18T12:00:47Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}

To investigate this problem, I added test endpoints to my pod so that I can change its behaviour at runtime: pass (200), fail (500), or delayed fail (wait 15 seconds, then return 500). I also separated the health and liveness endpoints.

From describe pod:

Liveness:   exec [curl http://localhost:30030/livez] delay=10s timeout=1s period=10s #success=1 #failure=5
Readiness:  exec [curl http://localhost:30030/healthz] delay=10s timeout=1s period=10s #success=1 #failure=3

I tested the endpoints by exec'ing into the pod and curling them from there (details below).
Then I cycled both the liveness check and the health check through the 3 modes and monitored how Kubernetes responded.
Health check: expect the pod to be restarted after the health check fails 5 times in a row.
Liveness check: describe the service and expect the pod's IP address to be removed from the list of endpoints.

Success case:

bash-4.4$ curl http://localhost:30030/unfailhealth
unfailhealth: REMOVE force all health checks to fail, was failHealth=false, delayFailHealth=false
bash-4.4$ curl http://localhost:30030/healthz -v
*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 30030 (#0)
> GET /healthz HTTP/1.1
> Host: localhost:30030
> User-Agent: curl/7.61.1
> Accept: */*
< HTTP/1.1 200 OK
< X-Powered-By: Express
< Content-Type: text/html; charset=utf-8
< Content-Length: 3
< ETag: W/"3-CftlTBfMBbEe9TvTWqcB9tVQ6OE"
< Date: Fri, 05 Feb 2021 13:30:59 GMT
< Connection: keep-alive
< Keep-Alive: timeout=5
* Connection #0 to host localhost left intact

Failure case:

bash-4.4$ curl http://localhost:30030/failhealth
failhealth: force all health checks to fail, was failHealth=true, delayFailHealth=false
bash-4.4$ curl http://localhost:30030/healthz -v
*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 30030 (#0)
> GET /healthz HTTP/1.1
> Host: localhost:30030
> User-Agent: curl/7.61.1
> Accept: */*
< HTTP/1.1 500 Internal Server Error
< X-Powered-By: Express
< Content-Type: text/html; charset=utf-8
< Content-Length: 26
< ETag: W/"1a-yI5D4Rtao1KH34GZVYKKvxZoEVo"
< Date: Fri, 05 Feb 2021 13:29:14 GMT
< Connection: keep-alive
< Keep-Alive: timeout=5
FAKE HEALTH CHECK FAILURE
* Connection #0 to host localhost left intact

Delayed failure case:

bash-4.4$ curl http://localhost:30030/delayfailhealth
delayfailhealth: force all health checks to sleep 15sec, then fail, was failHealth=false, delayFailHealth=true
bash-4.4$ date; curl http://localhost:30030/healthz -v
Fri Feb  5 13:33:08 UTC 2021
*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 30030 (#0)
> GET /healthz HTTP/1.1
> Host: localhost:30030
> User-Agent: curl/7.61.1
> Accept: */*
< HTTP/1.1 500 Internal Server Error
< X-Powered-By: Express
< Content-Type: text/html; charset=utf-8
< Content-Length: 47
< ETag: W/"2f-n+Ix8oU/09OT9+cpPVm1/EejE9Y"
< Date: Fri, 05 Feb 2021 13:33:23 GMT
< Connection: keep-alive
< Keep-Alive: timeout=5
FAKE HEALTH CHECK FAILURE - AFTER 15 SEC DELAY
* Connection #0 to host localhost left intact

Test Results

Default to SUCCESS for both health and liveness endpoints, return status 200 -> pod starts and works OK.

Set liveness check to FAIL, return status 500 -> no change, pod IP still in service, requests still dispatched to the pod.
Set liveness check to DELAY before responding (then 500) -> pod is removed from the Kubernetes service (yippee).
Set liveness check to FAIL (quickly) again -> pod is restored to the service (treated like a success).

Set health check to FAIL (return status 500) -> no effect, pod continues without restart.
Set health check to DELAY before responding (then 500) -> pod is restarted after 5 failed probes.

Thanks for any help with this. I guess I can change my code to delay before responding in the failure case, but that seems like a workaround.

(a) Pod liveness and readiness have httpGet: available which avoids the need to spawn curl for that action, thus avoiding simple bugs such as (b) running curl without -f will cause it to exit 0 no matter what the server response code is (c) this is not a programming question and thus belongs on ServerFault.com – mdaniel Feb 5, 2021 at 17:01

(a), (b) Understood - testing that now. (c) I followed the directions in the Kubernetes documentation to ask the question here kubernetes.io/docs/tasks/debug-application-cluster/… > The Kubernetes team will also monitor posts tagged Kubernetes. If there aren't any existing questions that help, please ask a new one! – Dave Deasy Feb 5, 2021 at 18:46

Problem solved thanks to the comment from @mdaniel. I am expanding on it here because it took me a while to fully understand the comment.

The problem was in the configuration of the health and liveness checks in the pod spec.

        readinessProbe:
          exec:
            command:
            - curl
            - http://localhost:30030/healthz
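            # without -f, curl exits 0 even on an HTTP 500, so this probe never reports a failure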
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1

This relies on the exit code of the curl command in the exec clause, not on the HTTP status it receives. Without -f, curl exits 0 even when the server responds with an error status such as 500, so the probe is reported as a success. If you want to keep using curl, add -f (--fail); curl then exits non-zero (exit code 22) when the HTTP status is 400 or above.
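
If you do want to stay with curl, a minimal sketch of the corrected exec probe might look like this (same path, port and thresholds as above; the -s flag, suggested in the comment further down, just keeps curl's progress output out of kubectl describe):

        readinessProbe:
          exec:
            command:
            - curl
            - -sf                   # -f: exit non-zero (22) when the HTTP status is 400 or above; -s: suppress progress output
            - http://localhost:30030/healthz
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1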

But it is better to use httpGet in the pod spec, like this:

        readinessProbe:
          httpGet:
            path: /healthz
            port: 30030
            scheme: HTTP
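            # the kubelet treats any status from 200 to 399 as success, so a 500 now counts as a failure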
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1

I tested both and both work. I will go with httpGet as suggested - the right tool for the job.

Note that the reason for using exec/curl instead of httpGet in the first place was that the pod uses TLS, which stops a plain-HTTP httpGet probe from the kubelet from working. Ref. https://medium.com/cloud-native-the-gathering/kubernetes-liveness-probe-for-scratch-image-with-istio-mtls-enabled-90543e4bae34
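
As an aside, if the pod terminated TLS itself rather than having it enforced by a mesh sidecar (the situation in the linked article), httpGet can also probe over TLS, and the kubelet skips certificate verification for such probes. A sketch, assuming the same path and port as above:

        readinessProbe:
          httpGet:
            path: /healthz
            port: 30030
            scheme: HTTPS           # the kubelet does not verify the certificate for HTTPS probes
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1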

Thanks!

I'm glad it was something simple; be aware you can accept your own answer to indicate that this answer solved your problem :-) For those who insist on using curl -f, they will benefit from using curl -sf also, which keeps the chatty curl responses out of the kubectl describe output – mdaniel Feb 5, 2021 at 20:31
