Note: If you plan to upgrade the Cloud Pak for Data cluster, you must bring the WML operator out of maintenance mode by setting the ignoreForMaintenance field to false in wml-cr.
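If it helps to see this concretely, the following is a minimal sketch of setting that field with the Kubernetes Python client. The resource group, version, plural, and namespace values are assumptions that you must verify for your cluster; you can apply the equivalent change with whatever tool you normally use to edit the wml-cr custom resource.

from kubernetes import client, config

# Load credentials from the current kubeconfig context
# (use config.load_incluster_config() when running inside the cluster).
config.load_kube_config()

custom_api = client.CustomObjectsApi()

# The group, version, plural, and namespace values below are assumptions;
# confirm them for your cluster (for example, with oc api-resources) before use.
custom_api.patch_namespaced_custom_object(
    group="wml.cpd.ibm.com",
    version="v1beta1",
    namespace="cpd-instance",
    plural="wmlbases",
    name="wml-cr",
    body={"spec": {"ignoreForMaintenance": False}},
)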
Configuring a runtime definition for a specific GPU node fails
Applies to: 5.1.0 and later
When you configure a runtime definition to use a specific GPU node with the nodeaffinity property, the runtime definition fails.
As a workaround, if MIG is enabled for even a single GPU node, you must enable the MIG configuration for all GPU nodes. You must also use the Single profile type for all GPU nodes. Mixed profiling is not supported.
To learn more about single and mixed profiling strategies, see NVIDIA documentation.
Online backup with NetApp causes Watson Machine Learning to enter InMaintenance mode
Applies to: 5.1.1
Problem description: After performing a NetApp backup, Watson Machine Learning enters the InMaintenance mode. You might see the following message:
wml WmlBase wml-cr 2025-02-08T02:15:55Z zen 5.1.1 5.1.1 5.1.1-1625 100% Completed wml install/upgrade/restart The last reconciliation was completed successfully. InMaintenance
Root cause: The issue is caused by the pre-hooks and post-hooks configuration in the backup-meta, which puts the Watson Machine Learning CR into maintenance mode during the backup process. The Watson Machine Learning CR eventually reconciles and reaches the Completed state, but this can take longer than the default timeout value of 1800 seconds.
Workaround: No changes to the configmap are required. If you encounter this issue, wait longer than 1800 seconds for the Watson Machine Learning CR to reconcile and reach the Completed state. The CR transitions to the Completed state automatically when reconciliation is complete.
Hyperparameter tuning fails when using 2 parallel jobs
Applies to: 5.1.1
When you run a hyperparameter tuning workload with 2 parallel jobs, the workload might fail.
As a workaround, run your hyperparameter tuning workload with a single job.
Limitations for AutoAI experiments
AutoAI file gets pushed to the Git repository in default Git projects
After you create an AutoAI experiment in a default Git project and then create a commit, a file that includes your experiment name appears in the list of files that can be committed. There are no consequences to including this file in your commit.
The AutoAI experiment will not appear in the asset list for any other user who pulls the file into their local clone using Git. Additionally, other users won’t be prevented from creating an AutoAI experiment with the same name.
Maximum number of feature columns in AutoAI experiments
The maximum number of feature columns for a classification or regression experiment is 5000.
No support for Cloud Pak for Data authentication with storage volume connection
You cannot use a storage volume connection with the 'Cloud Pak for Data authentication' option enabled as a data source in an AutoAI experiment. AutoAI does not currently support the user authentication token. Instead, disable the 'Cloud Pak for Data authentication' option in the storage volume connection to use the connection as a data source in your AutoAI experiment.
Limitations for Watson Machine Learning
Deep Learning experiments with storage volumes in a Git enterprise project are not supported
If you create a Git project with assets in storage volumes and then create a Deep Learning experiment, running the experiment fails. This use case is not currently supported.
Deep Learning jobs are not supported on IBM Power (ppc64le) or Z (s390x) platforms
If you submit a Deep Learning training job on an IBM Power (ppc64le) or IBM Z (s390x) platform, the job fails with an InvalidImageName error. This is expected behavior because Deep Learning jobs are not supported on IBM Power (ppc64le) or IBM Z (s390x) platforms.
Deploying a model on an s390x cluster might require retraining
Training an AI model on a different platform, such as x86 or ppc, and then deploying the model on s390x with Watson Machine Learning might fail because of an endianness issue. In such cases, retrain and deploy the model on the s390x platform to resolve the problem.
Limits on size of model deployments
Limits on the size of models that you deploy with Watson Machine Learning depend on factors such as the model framework and type. In some instances, when you exceed a threshold, you are notified with an error when you try to store a model in the Watson Machine Learning repository, for example: OverflowError: string longer than 2147483647 bytes. In other cases, the failure might be indicated by a more general error message, such as The service is experiencing some downstream errors, please re-try the request or There's no available attachment for the targeted asset. Any of these results indicates that you exceeded the allowable size limits for that type of deployment.
Automatic mounting of storage volumes is not supported by online and batch deployments
You cannot use automatic mounts for storage volumes with Watson Machine Learning online and batch deployments. Watson Machine Learning does not support this feature for Python-based runtimes, including R-script, SPSS Modeler, Spark, and Decision Optimization. You can use automatic mounts for storage volumes only with Watson Machine Learning Shiny app deployments and notebook runtimes.
As a workaround, you can use the download method from the data assets library, which is part of the ibm-watson-machine-learning Python client.
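For illustration, the following is a minimal sketch of that workaround; the credential values, space ID, asset ID, and file name are placeholders for your own environment.

from ibm_watson_machine_learning import APIClient

# Placeholder credentials; replace with values for your own Cloud Pak for Data instance.
client = APIClient({
    "url": "https://<cpd-host>",
    "username": "<username>",
    "apikey": "<api-key>",
    "instance_id": "openshift",
    "version": "5.1",
})
client.set.default_space("<space-id>")

# Download the data asset to the local file system of the runtime
# instead of relying on an automatically mounted storage volume.
client.data_assets.download("<data-asset-id>", filename="training_data.csv")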
Batch deployment jobs that use a large inline payload might get stuck in the starting or running state
If you provide a large asynchronous payload for your inline batch deployment, the runtime manager process can run out of heap memory.
In the following example, a 92 MB payload was passed inline to the batch deployment, which caused the heap to run out of memory.
Uncaught error from thread [scoring-runtime-manager-akka.scoring-jobs-dispatcher-35] shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[scoring-runtime-manager]
java.lang.OutOfMemoryError: Java heap space
at java.base/java.util.Arrays.copyOf(Arrays.java:3745)
at java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:172)
at java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:538)
at java.base/java.lang.StringBuilder.append(StringBuilder.java:174)
This can result in concurrent jobs getting stuck in the starting or running state. The starting state can be cleared only by deleting the deployment and creating a new deployment. The running state can be cleared without deleting the deployment.
As a workaround, use data references instead of inline payloads when you provide large payloads to batch deployments.
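As a rough sketch of that workaround with the ibm-watson-machine-learning Python client, you can pass the input as a data asset reference rather than inline; the credential values, asset href format, output location, and IDs below are placeholders and assumptions to adapt to your environment.

from ibm_watson_machine_learning import APIClient

# Placeholder credentials; replace with values for your own Cloud Pak for Data instance.
client = APIClient({
    "url": "https://<cpd-host>",
    "username": "<username>",
    "apikey": "<api-key>",
    "instance_id": "openshift",
    "version": "5.1",
})
client.set.default_space("<space-id>")

# Reference the input data as a data asset instead of sending it inline,
# so the large payload is not embedded in the job request itself.
job_metadata = {
    client.deployments.ScoringMetaNames.INPUT_DATA_REFERENCES: [{
        "type": "data_asset",
        "location": {"href": "/v2/assets/<input-asset-id>?space_id=<space-id>"},
    }],
    client.deployments.ScoringMetaNames.OUTPUT_DATA_REFERENCE: {
        "type": "data_asset",
        "location": {"name": "batch_output.csv"},
    },
}
job = client.deployments.create_job("<deployment-id>", meta_props=job_metadata)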
Setting environment variables in a conda yaml file does not work for deployments
Setting environment variables in a conda yaml file does not work for deployments. This means that you cannot override existing environment variables, for example LD_LIBRARY_PATH, when deploying assets in Watson Machine Learning.
As a workaround, if you're using a Python function, consider setting default parameters. For details, see Deploying Python functions.
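For example, the following is a minimal sketch of the default-parameters pattern for a deployable Python function; the parameter names and values are purely illustrative.

# Values captured in the default `params` dictionary are baked in at deployment
# time, so they can take the place of settings that you would otherwise have
# passed as environment variables.
def my_deployable_function(params={"model_dir": "/opt/models", "log_level": "INFO"}):

    def score(payload):
        # payload follows the Watson Machine Learning scoring input schema;
        # echo one of the configured values to show that it is available.
        return {"predictions": [{"fields": ["log_level"],
                                 "values": [[params["log_level"]]]}]}

    return score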
Deploying assets on IBM Z and LinuxONE fails
Deploying assets on IBM Z and LinuxONE fails because Watson Machine Learning for Cloud Pak for Data version 5.1.1 does not support deployments on the s390x architecture.
Hyperparameter tuning runs with a maximum of 2 parallel jobs
A hyperparameter optimization (HPO) tuning job runs with a maximum of 2 parallel jobs, even if you set the max_parallel_job_num parameter of hyper_parameters_optimization in training_reference to a value larger than 2.
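As a hedged sketch, the following training_reference fragment pins hyperparameter optimization to a single parallel job, which also serves as the workaround for the failure with 2 parallel jobs described earlier in this topic. Only the max_parallel_job_num path comes from this section; the other field names and values are illustrative assumptions.

training_reference = {
    "name": "hpo-training-run",        # illustrative name
    "hyper_parameters_optimization": {
        "max_parallel_job_num": 1,     # run HPO jobs one at a time
        # The remaining fields are illustrative placeholders; use the
        # hyperparameter definitions that your experiment actually needs.
        "hyper_parameters": [
            {"name": "learning_rate",
             "double_range": {"min_value": 0.001, "max_value": 0.1}},
        ],
    },
}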
Parent topic: Service issues