Symptom
When you run TKGI CLI commands, the TKGI API times out or is slow to respond.
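You can confirm the slow response from the command line by timing a simple request; a minimal check, assuming you are already logged in with the TKGI CLI:
time tkgi clusters
If the command takes noticeably long to return or times out, the TKGI API VM is likely under-resourced.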
Explanation
The TKGI API VM requires more resources.
Solution
Navigate to https://YOUR-OPS-MANAGER-FQDN/ in a browser to log in to the Ops Manager Installation Dashboard.
Select the Tanzu Kubernetes Grid Integrated Edition tile.
Select the Resource Config page.
For the TKGI API job, select a VM Type with greater CPU and memory resources.
Click Save.
Click the Installation Dashboard link to return to the Installation Dashboard.
Symptom
All TKGI CLI cluster operations fail, including attempts to create or delete clusters with tkgi create-cluster and tkgi delete-cluster.
The output of tkgi cluster CLUSTER-NAME contains Last Action State: error, and the output of bosh -e ENV-ALIAS -d SERVICE-INSTANCE vms indicates that the Process State of at least one deployed node is failing.
Explanation
If any deployed control plane or worker nodes run out of disk space in /var/vcap/store, all cluster operations, such as the creation or deletion of clusters, fail.
Diagnostics
To confirm that there is a disk space issue, check recent BOSH activity for any disk space error messages.
BOSH logs are used for error diagnostics. If the issue you see in the BOSH logs relates to using or managing Kubernetes, consult the Kubernetes documentation to troubleshoot it.
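For example, you can list recent BOSH tasks and check the disk directly; a minimal sketch, reusing the ENV-ALIAS and SERVICE-INSTANCE placeholders from above and assuming the worker instance group is named worker:
bosh -e ENV-ALIAS tasks --recent=10
bosh -e ENV-ALIAS -d SERVICE-INSTANCE ssh worker -c "df -h /var/vcap/store"
A recent task that failed with a disk space error, or a /var/vcap/store mount at or near 100% use, confirms the issue.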
Symptom
When deleting a cluster in a large-scale NSX-T environment, the tkgi delete-cluster operation becomes stuck.
Explanation
A TKGI API process has timed out, and the cluster deletion is stuck.
Solution
To avoid the TKGI-API process time out, increase the TKGI Operation Timeout:
SSH to the TKGI Control Plane VM.
Change directory to /var/vcap/jobs/pks-nsx-t-osb-proxy.
Run the following command:
time ./bin/ncp_cleanup test-read ROUTER-ID
Where ROUTER-ID is your NSX-T Tier-0 Router ID.
For example:
pivotal-container-service/88d4bf76-d3967d53b4c4:/var/vcap/jobs/pks-nsx-t-osb-proxy# time ./bin/ncp_cleanup test-read 8dc31113-64e8-40bb-83fb-1af75857d5ae
real    1m28.057s
user    0m13.121s
sys     0m0.629s
Collect the returned real value.
Add 30 seconds to the real value and convert the sum from minutes and seconds to seconds, rounding up. For example, 1m28.057s plus 30 seconds is 1m58.057s, which converts to roughly 118 seconds and rounds up to 120.
Convert the summed value to milliseconds. This is your calculated Operation Timeout value.
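Continuing the example, 120 seconds converts to 120000 milliseconds; this is the value to use for the TKGI Operation Timeout.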
Symptom
After cluster creation fails, you cannot re-run tkgi create-cluster to attempt creating the cluster again.
Explanation
Tanzu Kubernetes Grid Integrated Edition does not automatically clean up the failed BOSH deployment. Running tkgi create-cluster using the same cluster name creates a name clash error in BOSH.
Solution
Log in to the BOSH Director and delete the BOSH deployment manually, then retry the tkgi delete-cluster operation. After cluster deletion succeeds, re-create the cluster.
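For example, a minimal sketch of the manual BOSH cleanup, assuming ENV-ALIAS is your BOSH environment alias and that the failed cluster's deployment is named service-instance_UUID, where UUID matches the UUID reported by tkgi cluster CLUSTER-NAME:
bosh -e ENV-ALIAS deployments
bosh -e ENV-ALIAS -d service-instance_UUID delete-deployment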
Run the following TKGI command:
tkgi delete-cluster CLUSTER-NAME
Where CLUSTER-NAME is the name of your Tanzu Kubernetes Grid Integrated Edition cluster.
Note: Use only lowercase characters in your TKGI-provisioned Kubernetes cluster names if you manage your clusters with Tanzu Mission Control (TMC). Clusters with names that include an uppercase character cannot be attached to TMC.
To re-create the cluster, run the following TKGI command:
tkgi create-cluster CLUSTER-NAME
Where CLUSTER-NAME is the name of your Tanzu Kubernetes Grid Integrated Edition cluster.
Note: Use only lowercase characters when naming your cluster if you manage your clusters with Tanzu Mission Control (TMC). Clusters with names that include an uppercase character cannot be attached to TMC.
Symptom
The stembuild construct command fails with the error:
Cannot complete login due to an incorrect user name or password.
Explanation
Your vCenter login contains special characters, or you have GOVC environment variables set locally.
Solution
For special characters, see Authentication Error with Special Characters in stembuild Commands in the TAS for VMs [Windows] documentation.
For GOVC variables, follow the steps to unset the variables in Step 4: Construct the BOSH Stemcell in the TAS for VMs [Windows] documentation.
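As a minimal sketch of clearing the variables in your current shell, where the variable names shown are the common govc settings and yours might differ:
env | grep GOVC
unset GOVC_URL GOVC_USERNAME GOVC_PASSWORD GOVC_INSECURE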
Symptom
You cannot access a feature or function provided by a Kubernetes add-on.
For example, pods cannot resolve DNS names, and error messages report that the CoreDNS service is invalid. If CoreDNS is not deployed, the cluster typically fails to start.
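To confirm that the add-on is missing or unhealthy, you can check for its pods; a minimal check, assuming kubectl targets the affected cluster and that CoreDNS carries the standard k8s-app=kube-dns label:
kubectl get pods -n kube-system -l k8s-app=kube-dns
No pods, or pods that are not Running, point to a missing or failed add-on deployment.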
Explanation
Kubernetes features and functions are provided by Tanzu Kubernetes Grid Integrated Edition add-ons. DNS resolution, for example, is provided by the CoreDNS service.
To activate these add-ons, Ops Manager must run scripts after deploying Tanzu Kubernetes Grid Integrated Edition. You must configure Ops Manager to automatically run these post-deploy scripts.
Solution
Perform the following steps to configure Ops Manager to run the post-deploy scripts that deploy the missing add-ons to your cluster.
Navigate to https://YOUR-OPS-MANAGER-FQDN/ in a browser to log in to the Ops Manager Installation Dashboard.
Click the BOSH Director tile.
Select Director Config.
Select Enable Post Deploy Scripts.
Note: This setting activates post-deploy scripts for all tiles in your Ops Manager installation.
Click Save.
Click the Installation Dashboard link to return to the Installation Dashboard.
Click Review Pending Changes. Review the changes that you made. For more information, see Reviewing Pending Product Changes.
Click Apply Changes.
After Ops Manager finishes applying changes, enter tkgi delete-cluster on the command line to delete the cluster. For more information, see Deleting Clusters.
On the command line, enter tkgi create-cluster to recreate the cluster. For more information, see Creating Clusters.
Symptoms
Output resulting from the bosh vms command alternates between showing that the VMs are failing and showing that the VMs are running. The operator must run the bosh vms command multiple times to see this cycle.
Explanation
The VMs' permissions are altered when a VM restarts, so operators must reset the permissions every time a VM reboots or is redeployed.
VMs cannot be successfully resurrected if the resurrection state of your VM is set to off or if vSphere HA restarts the VM before BOSH is aware that the VM is down. For more information about VM resurrection, see Resurrection in the BOSH documentation.
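If resurrection is deactivated, you can turn it back on at the Director level; a minimal sketch, reusing the BOSH-DIRECTOR-NAME placeholder from the command below:
bosh -e BOSH-DIRECTOR-NAME update-resurrection on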
Solution
Run the following command on all of your control plane and worker VMs:
bosh -environment BOSH-DIRECTOR-NAME -deployment DEPLOYMENT-NAME ssh INSTANCE-GROUP-NAME -c "sudo /var/vcap/jobs/kube-controller-manager/bin/pre-start; sudo /var/vcap/jobs/kube-apiserver/bin/post-start"
Where:
BOSH-DIRECTOR-NAME is your BOSH Director name.
DEPLOYMENT-NAME is the name of your BOSH deployment.
INSTANCE-GROUP-NAME is the name of the BOSH instance group you are referencing.
The above command, when applied to each VM, gives your VMs the correct permissions.
Symptoms
After making your selection in the Upgrade all clusters errand section, the worker node might hang indefinitely. For more information about monitoring the Upgrade all clusters errand using the BOSH CLI, see Upgrade the TKGI Tile in Upgrading Tanzu Kubernetes Grid Integrated Edition (Flannel Networking).
Explanation
During the Tanzu Kubernetes Grid Integrated Edition tile upgrade process, worker nodes are cordoned and drained. The drain depends on Kubernetes being able to unschedule all pods. If Kubernetes is unable to unschedule a pod, the drain hangs indefinitely. Kubernetes might be unable to unschedule a pod if the PodDisruptionBudget object has been configured to permit zero disruptions and only a single instance of the pod has been scheduled.
In your spec file, the .spec.replicas configuration sets the total number of replicas available in your app. PodDisruptionBudget objects specify the number of replicas, proportional to the total, that must remain available in your app, regardless of downtime. Operators can configure PodDisruptionBudget objects for each app using their spec file.
Some apps deployed using Helm charts might have a default PodDisruptionBudget set. For more information on configuring PodDisruptionBudget objects using a spec file, see Specifying a PodDisruptionBudget in the Kubernetes documentation.
If .spec.replicas is configured correctly, you can also configure the default node drain behavior to prevent cluster upgrades from hanging or failing.
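To find PodDisruptionBudget objects that currently permit zero disruptions, you can list them; a minimal check, assuming kubectl targets the cluster being upgraded:
kubectl get poddisruptionbudgets --all-namespaces
A PodDisruptionBudget whose ALLOWED DISRUPTIONS column shows 0 blocks the node drain.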
Solution
To resolve this issue, do one of the following:
Configure .spec.replicas to be greater than the number of replicas set in the PodDisruptionBudget object.
When the number of replicas configured in .spec.replicas is greater than the number of replicas set in the PodDisruptionBudget object, disruptions can occur.
For more information, see How Disruption Budgets Work in the Kubernetes documentation.
For more information about workload capacity and uptime requirements in Tanzu Kubernetes Grid Integrated Edition, see Prepare to Upgrade in Upgrading Tanzu Kubernetes Grid Integrated Edition (Antrea and Flannel Networking).
Configure the default node drain behavior by doing the following:
Navigate to Ops Manager Installation Dashboard > Tanzu Kubernetes Grid Integrated Edition > Plans.
Set the default node drain behavior by configuring the following fields:
Node Drain Timeout: Enter a timeout in minutes for the node to drain pods. You must enter a valid integer between 0 and 1440. If you set this value to 0, the node drain does not terminate.
Pod Shutdown Grace: Enter a timeout in seconds for the node to wait before it forces the pod to terminate. You must enter a valid integer between -1 and 86400. If you set this value to -1, the timeout is set to the default timeout specified by the pod.
Force node to drain even if it has running pods not managed by a ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: If you activate this configuration, the node still drains when it has running pods not managed by a ReplicationController, ReplicaSet, Job, DaemonSet, or StatefulSet.
Force node to drain even if it has running DaemonSet-managed pods: If you activate this configuration, the node still drains when it has running pods managed by a DaemonSet.
Force node to drain even if it has running pods using emptyDir: If you activate this configuration, the node still drains when it has running pods using an emptyDir volume.
Force node to drain even if pods are still running after timeout: If you activate this configuration and pods fail to drain from the worker node within the timeout, the node forces running pods to terminate and the upgrade or scale operation continues.
Warning: If you select Force node to drain even if pods are still running after timeout, the node halts all running workloads on its pods. Before activating this configuration, set Node Drain Timeout to a value greater than 0.
Warning: If you deselect Force node to drain even if it has running DaemonSet-managed pods while Enable Metric Sink Resources, Enable Log Sink Resources, or Enable Node Exporter is selected, the upgrade fails, because each of these options deploys a DaemonSet in the pks-system namespace.
Navigate to Ops Manager Installation Dashboard > Review Pending Changes, select the Upgrade all clusters errand, and click Apply Changes. The new behavior takes effect during the next upgrade, not immediately after applying your changes.
Note: You can also use the TKGI CLI to configure node drain behavior. To configure the default node drain behavior with the TKGI CLI, run tkgi update-cluster with an action flag. You can view the current node drain behavior with tkgi cluster --details. For more information, see Configure Node Drain Behavior in Upgrade Preparation Checklist for Tanzu Kubernetes Grid Integrated Edition v1.9.
Warning: Do not use tkgi update-cluster on clusters configured with a network profile CNI configuration.
Symptom
When you authenticate to an OpenID Connect-enabled cluster using an existing kubeconfig file, you see an authentication or authorization error.
Explanation
The users.user.auth-provider.config.id-token and users.user.auth-provider.config.refresh-token values contained in the kubeconfig file for the cluster might have expired.
Solution
Upgrade the TKGI CLI to v1.2.0 or later.
To download the TKGI CLI, navigate to VMware Tanzu Network. For more information, see Installing the TKGI CLI.
Obtain a kubeconfig file that contains the new tokens by running the following command:
tkgi get-credentials CLUSTER-NAME
Where CLUSTER-NAME is the name of your cluster.
For example:
$ tkgi get-credentials tkgi-example-cluster
Fetching credentials for cluster tkgi-example-cluster.
Context set for cluster tkgi-example-cluster.

You can now switch between clusters by using:
$kubectl config use-context <cluster-name>
Note: If your operator has configured Tanzu Kubernetes Grid Integrated Edition to use a SAML identity provider, you must include an additional SSO flag to use the above command. For information about the SSO flags, see the section for the above command in TKGI CLI. For information about configuring SAML, see Connecting Tanzu Kubernetes Grid Integrated Edition to a SAML Identity Provider.
Connect to the cluster using kubectl.
If you continue to see an authentication or authorization error, verify that you have sufficient access permissions for the cluster.
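For example, you can probe your permissions directly; a minimal check, assuming kubectl uses the context set by tkgi get-credentials:
kubectl auth can-i get pods --namespace default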
Symptom
Your NSX-T load balancer disconnects sessions for apps deployed to your clusters that use WebSocket. These apps are inaccessible or non-functional.
Explanation
Tanzu Kubernetes Grid Integrated Edition on vSphere with NSX-T fully supports WebSocket. The most likely cause of this behavior is a connectivity issue specific to supporting WebSocket.
Solution
Review your configuration to locate the source of the connectivity issue:
Review the connectivity to the NSX-T LB instance.
Confirm that the devices between your NSX-T LB and your app are not blocking WebSocket traffic.
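You can also test whether a WebSocket handshake completes through the load balancer by issuing an upgrade request manually; a minimal sketch, where APP-HOST and the /ws path are hypothetical placeholders for your app's WebSocket endpoint:
curl -i -N -H "Connection: Upgrade" -H "Upgrade: websocket" -H "Sec-WebSocket-Version: 13" -H "Sec-WebSocket-Key: x3JJHMbDL1EzLkh9GBhXDw==" http://APP-HOST/ws
An HTTP/1.1 101 Switching Protocols response means the handshake succeeded; any other response suggests a device in the path is blocking the upgrade.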
Symptom
The tkgi login command fails with the error: “Credentials were rejected, please try again.”
Explanation
You might experience this issue when a large number of pods are running continuously in your Tanzu Kubernetes Grid Integrated Edition deployment. As a result, the persistent disk on the TKGI Database VM runs out of space.
Solution
Check the total number of pods in your Tanzu Kubernetes Grid Integrated Edition deployments.
If there is a large number of pods, such as more than 1,000, check the amount of available persistent disk space on the TKGI Database VM.
If available disk space is low, increase the amount of persistent disk storage on the TKGI Database VM depending on the number of pods in your Tanzu Kubernetes Grid Integrated Edition deployment. Refer to the table in the following section.
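For example, you can read the persistent disk usage from BOSH; a minimal sketch, assuming ENV-ALIAS is your BOSH environment alias and that your TKGI deployment name begins with pivotal-container-service-:
bosh -e ENV-ALIAS -d pivotal-container-service-GUID vms --vitals
The Persistent Disk Usage column shows how full the TKGI Database VM's disk is.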
Storage Requirements for Large Numbers of Pods
If you expect the cluster workload to run a large number of pods continuously, then increase the size of persistent disk storage allocated to the TKGI Database VM as follows:
Symptom
You encounter an error similar to one of the following when running a kubectl or cluster command:
“Error: You must be logged in to the server (Unauthorized)”
“Error: You are not currently authenticated. Please log in to continue”
Explanation
You might experience this issue when your authentication server or a host has the incorrect time.
Workaround
To refresh your credentials, run the following:
tkgi get-credentials
Solution
To resolve the problem permanently, correct the time on the server with the incorrect time.
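To locate the host with the incorrect time, you can compare UTC clocks across your VMs; a minimal sketch, using the same bosh ssh form as elsewhere in this topic:
bosh -e ENV-ALIAS -d DEPLOYMENT-NAME ssh INSTANCE-GROUP-NAME -c "date -u"
Compare each result against a trusted time source and correct any host that drifts.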
Symptom
In stdout or log files, you see an error message referencing post-start scripts failed or Failed Jobs.
Explanation
After deploying Tanzu Kubernetes Grid Integrated Edition, Ops Manager runs scripts to start a number of jobs. You must configure Ops Manager to automatically run these post-deploy scripts.
Solution
Perform the following steps to configure Ops Manager to run post-deploy scripts.
Navigate to https://YOUR-OPS-MANAGER-FQDN/ in a browser to log in to the Ops Manager Installation Dashboard.
Click the BOSH Director tile.
Select Director Config.
Select Enable Post Deploy Scripts.
Note: This setting activates post-deploy scripts for all tiles in your Ops Manager installation.
Click Save.
Click the Installation Dashboard link to return to the Installation Dashboard.
Click Review Pending Changes. Review the changes that you made. For more information, see Reviewing Pending Product Changes.
Click Apply Changes.
(Optional) If it is a new deployment of Tanzu Kubernetes Grid Integrated Edition, follow the steps below:
On the command line, enter tkgi delete-cluster to delete the cluster. For more information, see Deleting Clusters.
Enter tkgi create-cluster to recreate the cluster. For more information, see Creating Clusters.
Symptom
In stdout or log files, you see an error message that includes lookup vm-WORKER-NODE-GUID on IP-ADDRESS: no such host.
Explanation
This error occurs on GCP when the Ops Manager Director tile uses 8.8.8.8 as the DNS server. When this DNS server is in use, the control plane node cannot locate the route to the worker nodes.
Solution
Use the Google internal DNS range, 169.254.169.254, as the DNS server.
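To verify that worker-node hostnames resolve through the internal DNS server, you can query it directly; a minimal check, with vm-WORKER-NODE-GUID standing in for an actual worker hostname from the error message:
nslookup vm-WORKER-NODE-GUID 169.254.169.254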
Symptom
In Kubernetes log files, you see a Warning event from kubelet with FailedMount as the reason.
Explanation
A persistent volume fails to connect to the Kubernetes cluster worker VM.
Diagnostics
In your cloud provider console, verify that volumes are being created and attached to nodes.
From the Kubernetes cluster control plane node, check the controller manager logs for errors attaching persistent volumes.
From the Kubernetes cluster worker node, check kubelet for errors attaching persistent volumes.
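For example, you can surface the failing mounts with kubectl; a minimal sketch, assuming kubectl targets the affected cluster and POD-NAME is the pod reporting the event:
kubectl get events --all-namespaces --field-selector reason=FailedMount
kubectl describe pod POD-NAME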
Symptom
Plan not found error when an active plan is deactivated.
Explanation
You might receive the error “plan UUID not found” if, after creating a cluster using a plan (such as Plan 1), you deactivate that plan from the TKGI tile in Ops Manager and then save and apply changes with the Upgrade all clusters errand selected.
Ops Manager does not have the capability to check which clusters are using a particular plan. Only when a user saves the plan does the deployment process check whether the plan can be deactivated. The error message “plan UUID not found” is displayed in the Ops Manager logs.
Solution
Do not deactivate a plan that is in use by one or more clusters.
Run the command tkgi cluster my-cluster --details to view which plan the cluster is using.