Hello community,

I am new to Azure. I have some scripts in a working environment in Google Colab, and since I am working on my thesis I tried to use the Azure for Students promo.
I have set up a Standard_NC6 with the PyTorch and TensorFlow kernel, and I am getting the following error:

RuntimeError Traceback (most recent call last)
Input In [9], in <cell line: 64>()
76 loss_critic = -(torch.mean(critic_real) - torch.mean(critic_fake))
77 critic.zero_grad()
---> 78 loss_critic.backward(retain_graph=True)
79 opt_critic.step()
81 # clip critic weights between -0.01, 0.01

File /anaconda/envs/azureml_py38_PT_TF/lib/python3.8/site-packages/torch/_tensor.py:396, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
387 if has_torch_function_unary(self):
388 return handle_torch_function(
389 Tensor.backward,
390 (self,),
(...)
394 create_graph=create_graph,
395 inputs=inputs)
--> 396 torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)

File /anaconda/envs/azureml_py38_PT_TF/lib/python3.8/site-packages/torch/autograd/__init__.py:173, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
168 retain_graph = create_graph
170 # The reason we repeat same the comment below is that
171 # some Python versions print out the first line of a multi-line function
172 # calls in the traceback and some print out the last line
--> 173 Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
174 tensors, grad_tensors_, retain_graph, create_graph, inputs,
175 allow_unreachable=True, accumulate_grad=True)

RuntimeError: Unable to find a valid cuDNN algorithm to run convolution

I tried different versions of PyTorch, with both the cu113 and cu116 builds.
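A quick way to confirm which build is actually active in the kernel (a minimal check; the comments show what each line reports):

import torch

print(torch.__version__)                    # installed PyTorch build
print(torch.version.cuda)                   # CUDA version the wheel was compiled against
print(torch.backends.cudnn.version())       # bundled cuDNN version
print(torch.cuda.is_available())            # whether PyTorch can see the GPU
print(torch.cuda.get_device_capability(0))  # compute capability; (3, 7) for a Tesla K80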

The nvidia-smi output is:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03 Driver Version: 470.141.03 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 00000001:00:00.0 Off | 0 |
| N/A 41C P0 70W / 149W | 880MiB / 11441MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 11425 C ...eml_py38_PT_TF/bin/python 877MiB |
+-----------------------------------------------------------------------------+

I guess that the problem is about drivers and versions, since the same code works in the Google Colab environment.
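To check whether this is specific to my model or a general problem with the cuDNN setup, a minimal convolution can be run in isolation (a small sketch with arbitrary layer sizes; if the environment is misconfigured, the same RuntimeError should surface on the forward or backward pass):

import torch
import torch.nn as nn

device = torch.device("cuda")
conv = nn.Conv2d(3, 8, kernel_size=3).to(device)  # any small conv layer
x = torch.randn(1, 3, 64, 64, device=device)
y = conv(x)          # cuDNN convolution happens here
y.sum().backward()   # and its backward counterpart here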

Thanks,

@dp-5741 Thanks for the question. Using one of the containers/environments in AML that is correctly configured for GPU should be sufficient, provided the ML framework being used supports GPU acceleration. The GPU images contain Miniconda, OpenMPI, CUDA, cuDNN, and NCCL. You can use these images for your environments, or use their corresponding Dockerfiles as a reference when building your own custom images.

For the set of base images and their corresponding Dockerfiles, see the AzureML-Containers repo.

https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-environments-v2?tabs=cli#create-an-environment
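For example, with the Python SDK v2 an environment can be registered on top of one of those GPU base images roughly like this (a minimal sketch; the workspace details are placeholders and the image tag is illustrative, so check the AzureML-Containers repo for current tags):

from azure.ai.ml import MLClient
from azure.ai.ml.entities import Environment
from azure.identity import DefaultAzureCredential

# Connect to the workspace (all three names below are placeholders)
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# Environment built on a GPU base image from the AzureML-Containers repo;
# the tag is illustrative; pick one matching the CUDA/cuDNN versions you need
env = Environment(
    name="gpu-pytorch-env",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.1-cudnn8-ubuntu20.04",
    conda_file="conda.yml",  # optional conda spec pinning your PyTorch build
)
ml_client.environments.create_or_update(env)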