I am struggling to find any answer to this in the Kubernetes documentation. The scenario is the following:

  • Kubernetes version 1.4 on AWS
  • 8 pods running a NodeJS API (Express), deployed as a Kubernetes Deployment
  • One of the pods gets restarted for no apparent reason late at night (no traffic, no CPU spikes, no memory pressure, no alerts...). The restart count increases as a result.
  • Logs don't show anything abnormal (ran kubectl logs -p to see the previous container's logs; no errors at all in there)
  • Resource consumption is normal; I cannot see any events about Kubernetes rescheduling the pod onto another node or similar
  • Describing the pod shows the last state as TERMINATED, with reason COMPLETED and exit code 0. I don't have the exact output from kubectl as this pod has been replaced multiple times now.
  • The pods are NodeJS server instances; they cannot "complete", they are always running, waiting for requests.

    Would this be Kubernetes internally rearranging pods? Is there any way to know when this happens? Shouldn't there be an event somewhere saying why it happened?
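    For reference, these are roughly the commands used to inspect the pod; the pod and namespace names below are placeholders:

    # Placeholder pod/namespace names
    kubectl describe pod <pod-name> -n <namespace>                 # state, last state, restart count
    kubectl logs <pod-name> -n <namespace> --previous              # logs of the previous (terminated) container
    kubectl get events -n <namespace> --sort-by=.lastTimestamp     # recent events (by default retained only ~1 hour)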

    Update

    This just happened in our prod environment. The result of describing the offending pod is:

    Container ID:    docker://7a117ed92fe36a3d2f904a882eb72c79d7ce66efa1162774ab9f0bcd39558f31
    Image:           1.0.5-RC1
    Image ID:        docker://sha256:XXXX
    Ports:           9080/TCP, 9443/TCP
    State:           Running
      Started:       Mon, 27 Mar 2017 12:30:05 +0100
    Last State:      Terminated
      Reason:        Completed
      Exit Code:     0
      Started:       Fri, 24 Mar 2017 13:32:14 +0000
      Finished:      Mon, 27 Mar 2017 12:29:58 +0100
    Ready:           True
    Restart Count:   1

    Update 2

    Here is the deployment.yaml file used:

    apiVersion: "extensions/v1beta1"
    kind: "Deployment"
    metadata:
      namespace: "${ENV}"
      name: "${APP}${CANARY}"
      labels:
        component: "${APP}${CANARY}"
    spec:
      replicas: ${PODS}
      minReadySeconds: 30
      revisionHistoryLimit: 1
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 1
          maxSurge: 1
      template:
        metadata:
          labels:
            component: "${APP}${CANARY}"
        spec:
          serviceAccount: "${APP}"
    ${IMAGE_PULL_SECRETS}
          containers:
          - name: "${APP}${CANARY}"
            securityContext:
              capabilities:
                add:
                  - IPC_LOCK
            image: "134078050561.dkr.ecr.eu-west-1.amazonaws.com/${APP}:${TAG}"
            env:
            - name: "KUBERNETES_CA_CERTIFICATE_FILE"
              value: "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
            - name: "NAMESPACE"
              valueFrom:
                fieldRef:
                  fieldPath: "metadata.namespace"
            - name: "ENV"
              value: "${ENV}"
            - name: "PORT"
              value: "${INTERNAL_PORT}"
            - name: "CACHE_POLICY"
              value: "all"
            - name: "SERVICE_ORIGIN"
              value: "${SERVICE_ORIGIN}"
            - name: "DEBUG"
              value: "http,controllers:recommend"
            - name: "APPDYNAMICS"
              value: "true"
            - name: "VERSION"
              value: "${TAG}"
            ports:
            - name: "http"
              containerPort: ${HTTP_INTERNAL_PORT}
              protocol: "TCP"
            - name: "https"
              containerPort: ${HTTPS_INTERNAL_PORT}
              protocol: "TCP"
    

    The Dockerfile of the image referenced in the above Deployment manifest:

    FROM ubuntu:14.04
    ENV NVM_VERSION v0.31.1
    ENV NODE_VERSION v6.2.0
    ENV NVM_DIR /home/app/nvm
    ENV NODE_PATH $NVM_DIR/v$NODE_VERSION/lib/node_modules
    ENV PATH      $NVM_DIR/v$NODE_VERSION/bin:$PATH
    ENV APP_HOME /home/app
    RUN useradd -c "App User" -d $APP_HOME -m app
    RUN apt-get update; apt-get install -y curl
    USER app
    # Install nvm with node and npm
    RUN touch $HOME/.bashrc; curl https://raw.githubusercontent.com/creationix/nvm/${NVM_VERSION}/install.sh | bash \
        && /bin/bash -c 'source $NVM_DIR/nvm.sh; nvm install $NODE_VERSION'
    ENV NODE_PATH $NVM_DIR/versions/node/$NODE_VERSION/lib/node_modules
    ENV PATH      $NVM_DIR/versions/node/$NODE_VERSION/bin:$PATH
    # Create app directory
    WORKDIR /home/app
    COPY . /home/app
    # Install app dependencies
    RUN npm install
    EXPOSE 9080 9443
    CMD [ "npm", "start" ]
    

    npm start is an alias for a regular node app.js command that starts a NodeJS server on port 9080.
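    For clarity, assuming the start script in package.json maps to node app.js as described above, the container's main process is effectively:

    # Assumed equivalent of "npm start" in this image (per the description above)
    node app.js    # Express server listening on port 9080 (9443 is exposed as "https" in the manifest)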

    Please post your deployment or rc yaml file. Also, have a look at this troubleshooting guide: kubernetes.io/docs/tasks/debug-application-cluster/… – jaxxstorm Mar 27, 2017 at 12:23

    It is posted. At this point we cannot modify the pods running to print the termination message, but we will definitely do that in future deployments. – David Fernandez Mar 27, 2017 at 14:18

    Sorry @jaxxstorm, I did not realise that the Deployment manifest did not contain the actual Dockerfile; it just references the image to use. I've just posted the Dockerfile. Thank you. – David Fernandez Mar 27, 2017 at 14:46

    Check the version of docker you run, and whether the docker daemon was restarted during that time.

    If the docker daemon was restarted, all the containers would be terminated (unless you use the "live restore" feature introduced in Docker 1.12). In some docker versions, docker may incorrectly report "exit code 0" for all containers terminated in this situation. See https://github.com/docker/docker/issues/31262 for more details.
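    A quick way to check this on the affected node (a sketch; the exact docker info fields vary by Docker version):

    # On the Kubernetes node hosting the restarted pod:
    docker version                          # client and daemon versions
    docker info | grep -i 'live restore'    # whether live restore is enabled (Docker 1.12+; empty if unsupported)
    systemctl status docker                 # "Active: active (running) since ..." shows when the daemon last started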

    If this is still relevant, we just had a similar problem in our cluster.

    We managed to find more information by inspecting the logs from Docker itself. SSH onto your k8s node and run the following:

    sudo journalctl -fu docker.service
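    If you already know roughly when the pod was terminated, the same log can be narrowed to that window (the timestamps here are only illustrative):

    # Daemon log entries around the termination time reported by kubectl describe
    sudo journalctl -u docker.service --since "2017-03-27 12:00" --until "2017-03-27 13:00"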

    I had a similar problem when we upgraded Airflow to version 2.x: pods got restarted even after the DAGs ran successfully.

    After a long time of debugging, I resolved it by overriding the pod template and specifying it in the airflow.cfg file:

    [kubernetes]
    pod_template_file = {{ .Values.airflow.home }}/pod_template.yaml
    # pod_template.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: dummy-name
    spec:
      serviceAccountName: default
      restartPolicy: Never
      containers:
        - name: base
          image: dummy_image
          imagePullPolicy: IfNotPresent
          ports: []
          command: []
            
