Symptom
When you run TKGI CLI commands, the TKGI API times out or is slow to respond.
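You can confirm the slow response from the command line by timing a simple request; a minimal check, assuming you are already logged in with the TKGI CLI:
time tkgi clusters
If the command takes noticeably long to return or times out, the TKGI API VM is likely under-resourced.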
Explanation
The TKGI API VM requires more resources.
Solution
Navigate to https://YOUR-OPS-MANAGER-FQDN/ in a browser to log in to the Ops Manager Installation Dashboard.
Select the Tanzu Kubernetes Grid Integrated Edition tile.
Select the Resource Config page.
For the TKGI API job, select a VM Type with greater CPU and memory resources.
Click Save.
Click the Installation Dashboard link to return to the Installation Dashboard.
Symptom
All TKGI CLI cluster operations fail, including attempts to create or delete clusters with tkgi create-cluster and tkgi delete-cluster.
The output of tkgi cluster CLUSTER-NAME contains Last Action State: error, and the output of bosh -e ENV-ALIAS -d SERVICE-INSTANCE vms indicates that the Process State of at least one deployed node is failing.
Explanation
If any deployed control plane or worker nodes run out of disk space in /var/vcap/store, all cluster operations, such as the creation or deletion of clusters, fail.
Diagnostics
To confirm that there is a disk space issue, check recent BOSH activity for any disk space error messages.
BOSH logs are used for error diagnostics. If the issue you see in the BOSH logs relates to using or managing Kubernetes, consult the Kubernetes documentation to troubleshoot it.
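For example, you can list recent BOSH tasks and check the disk directly; a minimal sketch, reusing the ENV-ALIAS and SERVICE-INSTANCE placeholders from above and assuming the worker instance group is named worker:
bosh -e ENV-ALIAS tasks --recent=10
bosh -e ENV-ALIAS -d SERVICE-INSTANCE ssh worker -c "df -h /var/vcap/store"
A recent task that failed with a disk space error, or a /var/vcap/store mount at or near 100% use, confirms the issue.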
Symptom
When deleting a cluster in a large-scale NSX-T environment, the tkgi delete-cluster operation becomes stuck.
Explanation
A TKGI API process has timed out, and the cluster deletion is stuck.
Solution
To avoid the TKGI-API process time out, increase the TKGI Operation Timeout:
SSH to the TKGI Control Plane VM.
Change directory to /var/vcap/jobs/pks-nsx-t-osb-proxy.
Run the following command:
time ./bin/ncp_cleanup test-read ROUTER-ID
Where ROUTER-ID is your NSX-T Tier-0 Router ID.
For example:
pivotal-container-service/88d4bf76-d3967d53b4c4:/var/vcap/jobs/pks-nsx-t-osb-proxy# time ./bin/ncp_cleanup test-read 8dc31113-64e8-40bb-83fb-1af75857d5ae
real    1m28.057s
user    0m13.121s
sys     0m0.629s
Collect the returned real value.
Add 30 seconds to the real value and convert the sum from minutes and seconds to seconds, rounding up. For example, 1m28.057s plus 30 seconds is 1m58.057s, which converts to roughly 118 seconds and rounds up to 120.
Convert the summed value to milliseconds. This is your calculated Operation Timeout value.
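Continuing the example, 120 seconds converts to 120000 milliseconds; this is the value to use for the TKGI Operation Timeout.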
Symptom
After cluster creation fails, you cannot re-run tkgi create-cluster to attempt creating the cluster again.
Explanation
Tanzu Kubernetes Grid Integrated Edition does not automatically clean up the failed BOSH deployment. Running tkgi create-cluster using the same cluster name creates a name clash error in BOSH.
Solution
Log in to the BOSH Director and delete the BOSH deployment manually, then retry the tkgi delete-cluster operation. After cluster deletion succeeds, re-create the cluster.
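For example, a minimal sketch of the manual BOSH cleanup, assuming ENV-ALIAS is your BOSH environment alias and that the failed cluster's deployment is named service-instance_UUID, where UUID matches the UUID reported by tkgi cluster CLUSTER-NAME:
bosh -e ENV-ALIAS deployments
bosh -e ENV-ALIAS -d service-instance_UUID delete-deployment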
Run the following TKGI command:
tkgi delete-cluster CLUSTER-NAME
Where CLUSTER-NAME is the name of your Tanzu Kubernetes Grid Integrated Edition cluster.
Note: Use only lowercase characters in your TKGI-provisioned Kubernetes cluster names if you manage your clusters with Tanzu Mission Control (TMC). Clusters with names that include an uppercase character cannot be attached to TMC.
To re-create the cluster, run the following TKGI command:
tkgi create-cluster CLUSTER-NAME
Where CLUSTER-NAME is the name of your Tanzu Kubernetes Grid Integrated Edition cluster.
Note: Use only lowercase characters when naming your cluster if you manage your clusters with Tanzu Mission Control (TMC). Clusters with names that include an uppercase character cannot be attached to TMC.
Symptom
The stembuild construct command fails with the error:
Cannot complete login due to an incorrect user name or password.
Explanation
Your vCenter login contains special characters, or you have GOVC environment variables set locally.
Solution
For special characters, see Authentication Error with Special Characters in stembuild Commands in the TAS for VMs [Windows] documentation.
For GOVC variables, follow the steps to unset the variables in Step 4: Construct the BOSH Stemcell in the TAS for VMs [Windows] documentation.
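As a minimal sketch of clearing the variables in your current shell, where the variable names shown are the common govc settings and yours might differ:
env | grep GOVC
unset GOVC_URL GOVC_USERNAME GOVC_PASSWORD GOVC_INSECURE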
Symptom
You cannot access a feature or function provided by a Kubernetes add-on.
For example, pods cannot resolve DNS names, and error messages report that the CoreDNS service is invalid. If CoreDNS is not deployed, the cluster typically fails to start.
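To confirm that the add-on is missing or unhealthy, you can check for its pods; a minimal check, assuming kubectl targets the affected cluster and that CoreDNS carries the standard k8s-app=kube-dns label:
kubectl get pods -n kube-system -l k8s-app=kube-dns
No pods, or pods that are not Running, point to a missing or failed add-on deployment.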
Explanation
Kubernetes features and functions are provided by Tanzu Kubernetes Grid Integrated Edition add-ons. DNS resolution, for example, is provided by the CoreDNS service.
To activate these add-ons, Ops Manager must run scripts after deploying Tanzu Kubernetes Grid Integrated Edition. You must configure Ops Manager to automatically run these post-deploy scripts.
Solution
Perform the following steps to configure Ops Manager to run the post-deploy scripts that deploy the missing add-ons to your cluster.
Navigate to https://YOUR-OPS-MANAGER-FQDN/ in a browser to log in to the Ops Manager Installation Dashboard.
Click the BOSH Director tile.
Select Director Config.
Select Enable Post Deploy Scripts.
Note: This setting activates post-deploy scripts for all tiles in your Ops Manager installation.
Click Save.
Click the Installation Dashboard link to return to the Installation Dashboard.
Click Review Pending Changes. Review the changes that you made. For more information, see Reviewing Pending Product Changes.
Click Apply Changes.
After Ops Manager finishes applying changes, enter tkgi delete-cluster on the command line to delete the cluster. For more information, see Deleting Clusters.
On the command line, enter tkgi create-cluster to recreate the cluster. For more information, see Creating Clusters.
Symptoms
Output resulting from the bosh vms command alternates between showing that the VMs are failing and showing that the VMs are running. The operator must run the bosh vms command multiple times to see this cycle.
Explanation
The VMs' permissions are altered when a VM restarts, so operators must reset the permissions every time a VM reboots or is redeployed.
VMs cannot be successfully resurrected if the resurrection state of your VM is set to off or if vSphere HA restarts the VM before BOSH is aware that the VM is down. For more information about VM resurrection, see Resurrection in the BOSH documentation.
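If resurrection is deactivated, you can turn it back on at the Director level; a minimal sketch, reusing the BOSH-DIRECTOR-NAME placeholder from the command below:
bosh -e BOSH-DIRECTOR-NAME update-resurrection on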
Solution
Run the following command on all of your control plane and worker VMs:
bosh -environment BOSH-DIRECTOR-NAME -deployment DEPLOYMENT-NAME ssh INSTANCE-GROUP-NAME -c "sudo /var/vcap/jobs/kube-controller-manager/bin/pre-start; sudo /var/vcap/jobs/kube-apiserver/bin/post-start"
Where:
BOSH-DIRECTOR-NAME is your BOSH Director name.
DEPLOYMENT-NAME is the name of your BOSH deployment.
INSTANCE-GROUP-NAME is the name of the BOSH instance group you are referencing.
The above command, when applied to each VM, gives your VMs the correct permissions.
Symptoms
After making your selection in the Upgrade all clusters errand section, the worker node might hang indefinitely. For more information about monitoring the Upgrade all clusters errand using the BOSH CLI, see Upgrade the TKGI Tile in Upgrading Tanzu Kubernetes Grid Integrated Edition (Flannel Networking).
Explanation
During the Tanzu Kubernetes Grid Integrated Edition tile upgrade process, worker nodes are cordoned and drained. The drain depends on Kubernetes being able to unschedule all pods. If Kubernetes is unable to unschedule a pod, the drain hangs indefinitely. Kubernetes might be unable to unschedule a pod if the PodDisruptionBudget object has been configured to permit zero disruptions and only a single instance of the pod has been scheduled.
In your spec file, the .spec.replicas configuration sets the total number of replicas available in your app. PodDisruptionBudget objects specify the number of replicas, proportional to the total, that must remain available in your app, regardless of downtime. Operators can configure PodDisruptionBudget objects for each app using their spec file.
Some apps deployed using Helm charts might have a default PodDisruptionBudget set. For more information on configuring PodDisruptionBudget objects using a spec file, see Specifying a PodDisruptionBudget in the Kubernetes documentation.
If .spec.replicas is configured correctly, you can also configure the default node drain behavior to prevent cluster upgrades from hanging or failing.
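To find PodDisruptionBudget objects that currently permit zero disruptions, you can list them; a minimal check, assuming kubectl targets the cluster being upgraded:
kubectl get poddisruptionbudgets --all-namespaces
A PodDisruptionBudget whose ALLOWED DISRUPTIONS column shows 0 blocks the node drain.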
Solution
To resolve this issue, do one of the following:
Configure .spec.replicas to be greater than the number of replicas set in the PodDisruptionBudget object.
When the number of replicas configured in .spec.replicas is greater than the number of replicas set in the PodDisruptionBudget object, disruptions can occur.
For more information, see How Disruption Budgets Work in the Kubernetes documentation.
For more information about workload capacity and uptime requirements in Tanzu Kubernetes Grid Integrated Edition, see Prepare to Upgrade in Upgrading Tanzu Kubernetes Grid Integrated Edition (Antrea and Flannel Networking).
Configure the default node drain behavior by doing the following:
Navigate to Ops Manager Installation Dashboard > Tanzu Kubernetes Grid Integrated Edition > Plans.
Set the default node drain behavior by configuring the following fields:
Node Drain Timeout: Enter a timeout in minutes for the node to drain pods. You must enter a valid integer between 0 and 1440. If you set this value to 0, the node drain does not terminate.
Pod Shutdown Grace: Enter a timeout in seconds for the node to wait before it forces the pod to terminate. You must enter a valid integer between -1 and 86400. If you set this value to -1, the timeout is set to the default timeout specified by the pod.
Force node to drain even if it has running pods not managed by a ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: If you activate this configuration, the node still drains when it has running pods not managed by a ReplicationController, ReplicaSet, Job, DaemonSet, or StatefulSet.
Force node to drain even if it has running DaemonSet-managed pods: If you activate this configuration, the node still drains when it has running pods managed by a DaemonSet.
Force node to drain even if it has running pods using emptyDir: If you activate this configuration, the node still drains when it has running pods using an emptyDir volume.
Force node to drain even if pods are still running after timeout: If you activate this configuration and pods fail to drain from the worker node within the timeout, the node forces running pods to terminate and the upgrade or scale operation continues.
Warning: If you select Force node to drain even if pods are still running after timeout, the node halts all running workloads on its pods. Before activating this configuration, set Node Drain Timeout to a value greater than 0.
Warning: If you deselect Force node to drain even if it has running DaemonSet-managed pods while Enable Metric Sink Resources, Enable Log Sink Resources, or Enable Node Exporter is selected, the upgrade fails, because each of these options deploys a DaemonSet in the pks-system namespace.
Navigate to Ops Manager Installation Dashboard > Review Pending Changes, select the Upgrade all clusters errand, and click Apply Changes. The new behavior takes effect during the next upgrade, not immediately after applying your changes.
Note: You can also use the TKGI CLI to configure node drain behavior. To configure the default node drain behavior with the TKGI CLI, run tkgi update-cluster with an action flag. You can view the current node drain behavior with tkgi cluster --details. For more information, see Configure Node Drain Behavior in Upgrade Preparation Checklist for Tanzu Kubernetes Grid Integrated Edition v1.9.
Warning: Do not use tkgi update-cluster on clusters configured with a network profile CNI configuration.
Symptom
When you authenticate to an OpenID Connect-enabled cluster using an existing kubeconfig file, you see an authentication or authorization error.
Explanation
The users.user.auth-provider.config.id-token and users.user.auth-provider.config.refresh-token values contained in the kubeconfig file for the cluster might have expired.
Solution
Upgrade the TKGI CLI to v1.2.0 or later.
To download the TKGI CLI, navigate to VMware Tanzu Network. For more information, see Installing the TKGI CLI.
Obtain a kubeconfig file that contains the new tokens by running the following command:
tkgi get-credentials CLUSTER-NAME
Where CLUSTER-NAME is the name of your cluster.
For example:
$ tkgi get-credentials tkgi-example-cluster
Fetching credentials for cluster tkgi-example-cluster.
Context set for cluster tkgi-example-cluster.

You can now switch between clusters by using:
$kubectl config use-context <cluster-name>
Note: If your operator has configured Tanzu Kubernetes Grid Integrated Edition to use a SAML identity provider, you must include an additional SSO flag to use the above command. For information about the SSO flags, see the section for the above command in TKGI CLI. For information about configuring SAML, see Connecting Tanzu Kubernetes Grid Integrated Edition to a SAML Identity Provider.
Connect to the cluster using kubectl.
If you continue to see an authentication or authorization error, verify that you have sufficient access permissions for the cluster.
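For example, you can probe your permissions directly; a minimal check, assuming kubectl uses the context set by tkgi get-credentials:
kubectl auth can-i get pods --namespace default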
Symptom
Your NSX-T load balancer disconnects sessions for apps deployed to your clusters that use WebSocket. These apps are inaccessible or non-functional.
Explanation
Tanzu Kubernetes Grid Integrated Edition on vSphere with NSX-T fully supports WebSocket. The most likely cause of this behavior is a connectivity issue specific to supporting WebSocket.
Solution
Review your configuration to locate the source of the connectivity issue:
Review the connectivity to the NSX-T LB instance.
Confirm that the devices between your NSX-T LB and your app are not blocking WebSocket traffic.
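You can also test whether a WebSocket handshake completes through the load balancer by issuing an upgrade request manually; a minimal sketch, where APP-HOST and the /ws path are hypothetical placeholders for your app's WebSocket endpoint:
curl -i -N -H "Connection: Upgrade" -H "Upgrade: websocket" -H "Sec-WebSocket-Version: 13" -H "Sec-WebSocket-Key: x3JJHMbDL1EzLkh9GBhXDw==" http://APP-HOST/ws
An HTTP/1.1 101 Switching Protocols response means the handshake succeeded; any other response suggests a device in the path is blocking the upgrade.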
Symptom
The tkgi login command fails with the error: “Credentials were rejected, please try again.”
Explanation
You might experience this issue when a large number of pods are running continuously in your Tanzu Kubernetes Grid Integrated Edition deployment. As a result, the persistent disk on the TKGI Database VM runs out of space.
Solution
Check the total number of pods in your Tanzu Kubernetes Grid Integrated Edition deployments.
If there is a large number of pods, such as more than 1,000, check the amount of available persistent disk space on the TKGI Database VM.
If available disk space is low, increase the amount of persistent disk storage on the TKGI Database VM depending on the number of pods in your Tanzu Kubernetes Grid Integrated Edition deployment. Refer to the table in the following section.
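For example, you can read the persistent disk usage from BOSH; a minimal sketch, assuming ENV-ALIAS is your BOSH environment alias and that your TKGI deployment name begins with pivotal-container-service-:
bosh -e ENV-ALIAS -d pivotal-container-service-GUID vms --vitals
The Persistent Disk Usage column shows how full the TKGI Database VM's disk is.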
Storage Requirements for Large Numbers of Pods
If you expect the cluster workload to run a large number of pods continuously, then increase the size of persistent disk storage allocated to the TKGI Database VM as follows:
Symptom
You encounter an error similar to one of the following when running a kubectl or cluster command:
“Error: You must be logged in to the server (Unauthorized)”
“Error: You are not currently authenticated. Please log in to continue”
Explanation
You might experience this issue when your authentication server or a host has the incorrect time.
Workaround
To refresh your credentials, run the following:
tkgi get-credentials
Solution
To resolve the problem permanently, correct the time on the server with the incorrect time.
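To locate the host with the incorrect time, you can compare UTC clocks across your VMs; a minimal sketch, using the same bosh ssh form as elsewhere in this topic:
bosh -e ENV-ALIAS -d DEPLOYMENT-NAME ssh INSTANCE-GROUP-NAME -c "date -u"
Compare each result against a trusted time source and correct any host that drifts.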
Symptom
In stdout or log files, you see an error message referencing post-start scripts failed or Failed Jobs.
Explanation
After deploying Tanzu Kubernetes Grid Integrated Edition, Ops Manager runs scripts to start a number of jobs. You must configure Ops Manager to automatically run these post-deploy scripts.
Solution
Perform the following steps to configure Ops Manager to run post-deploy scripts.
Navigate to https://YOUR-OPS-MANAGER-FQDN/ in a browser to log in to the Ops Manager Installation Dashboard.
Click the BOSH Director tile.
Select Director Config.
Select Enable Post Deploy Scripts.
Note: This setting activates post-deploy scripts for all tiles in your Ops Manager installation.
Click Save.
Click the Installation Dashboard link to return to the Installation Dashboard.
Click Review Pending Changes. Review the changes that you made. For more information, see Reviewing Pending Product Changes.
Click Apply Changes.
(Optional) If it is a new deployment of Tanzu Kubernetes Grid Integrated Edition, follow the steps below:
On the command line, enter tkgi delete-cluster to delete the cluster. For more information, see Deleting Clusters.
Enter tkgi create-cluster to recreate the cluster. For more information, see Creating Clusters.
Symptom
In stdout or log files, you see an error message that includes lookup vm-WORKER-NODE-GUID on IP-ADDRESS: no such host.
Explanation
This error occurs on GCP when the Ops Manager Director tile uses 8.8.8.8 as the DNS server. When this DNS server is in use, the control plane node cannot locate the route to the worker nodes.
Solution
Use the Google internal DNS range, 169.254.169.254, as the DNS server.
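To verify that worker-node hostnames resolve through the internal DNS server, you can query it directly; a minimal check, with vm-WORKER-NODE-GUID standing in for an actual worker hostname from the error message:
nslookup vm-WORKER-NODE-GUID 169.254.169.254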
Symptom
In Kubernetes log files, you see a Warning event from kubelet with FailedMount as the reason.
Explanation
A persistent volume fails to connect to the Kubernetes cluster worker VM.
Diagnostics
In your cloud provider console, verify that volumes are being created and attached to nodes.
From the Kubernetes cluster control plane node, check the controller manager logs for errors attaching persistent volumes.
From the Kubernetes cluster worker node, check kubelet for errors attaching persistent volumes.
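For example, you can surface the failing mounts with kubectl; a minimal sketch, assuming kubectl targets the affected cluster and POD-NAME is the pod reporting the event:
kubectl get events --all-namespaces --field-selector reason=FailedMount
kubectl describe pod POD-NAME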
Symptom
Plan not found error when an active plan is deactivated.
Explanation
You might receive the error “plan UUID not found” if, after creating a cluster using a plan (such as Plan 1), you deactivate that plan from the TKGI tile in Ops Manager and then save and apply changes with the Upgrade all clusters errand selected.
Ops Manager does not have the capability to check which clusters are using a particular plan. Only when a user saves the plan does the deployment process check whether the plan can be deactivated. The error message “plan UUID not found” is displayed in the Ops Manager logs.
Solution
Do not deactivate a plan that is in use by one or more clusters.
Run the command tkgi cluster my-cluster --details to view which plan the cluster is using.