Hello community,

I am new to Azure. I have some scripts in a working environment in Google Colab, and since I am working on my thesis I tried to use the Azure for Students promo.
I have set up a Standard_NC6 with the PyTorch and TensorFlow kernel, and I am getting the following error:

RuntimeError Traceback (most recent call last)
Input In [9], in <cell line: 64>()
76 loss_critic = -(torch.mean(critic_real) - torch.mean(critic_fake))
77 critic.zero_grad()
---> 78 loss_critic.backward(retain_graph=True)
79 opt_critic.step()
81 # clip critic weights between -0.01, 0.01

File /anaconda/envs/azureml_py38_PT_TF/lib/python3.8/site-packages/torch/_tensor.py:396, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
387 if has_torch_function_unary(self):
388 return handle_torch_function(
389 Tensor.backward,
390 (self,),
(...)
394 create_graph=create_graph,
395 inputs=inputs)
--> 396 torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)

File /anaconda/envs/azureml_py38_PT_TF/lib/python3.8/site-packages/torch/autograd/__init__.py:173, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
168 retain_graph = create_graph
170 # The reason we repeat same the comment below is that
171 # some Python versions print out the first line of a multi-line function
172 # calls in the traceback and some print out the last line
--> 173 Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
174 tensors, grad_tensors_, retain_graph, create_graph, inputs,
175 allow_unreachable=True, accumulate_grad=True)

RuntimeError: Unable to find a valid cuDNN algorithm to run convolution

I tried different versions of PyTorch, with both the cu113 and cu116 builds.
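A quick way to confirm which build is actually active in the kernel (a minimal check; the comments show what each line reports):

import torch

print(torch.__version__)                    # installed PyTorch build
print(torch.version.cuda)                   # CUDA version the wheel was compiled against
print(torch.backends.cudnn.version())       # bundled cuDNN version
print(torch.cuda.is_available())            # whether PyTorch can see the GPU
print(torch.cuda.get_device_capability(0))  # compute capability; (3, 7) for a Tesla K80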

The nvidia-smi output is:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03 Driver Version: 470.141.03 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 00000001:00:00.0 Off | 0 |
| N/A 41C P0 70W / 149W | 880MiB / 11441MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 11425 C ...eml_py38_PT_TF/bin/python 877MiB |
+-----------------------------------------------------------------------------+

I guess that the problem is about drivers and versions, since the same code works in the Google Colab environment.
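To check whether this is specific to my model or a general problem with the cuDNN setup, a minimal convolution can be run in isolation (a small sketch with arbitrary layer sizes; if the environment is misconfigured, the same RuntimeError should surface on the forward or backward pass):

import torch
import torch.nn as nn

device = torch.device("cuda")
conv = nn.Conv2d(3, 8, kernel_size=3).to(device)  # any small conv layer
x = torch.randn(1, 3, 64, 64, device=device)
y = conv(x)          # cuDNN convolution happens here
y.sum().backward()   # and its backward counterpart here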

Thanks,

@dp-5741 Thanks for the question. Using one of the containers/environments in AML that is correctly configured for GPU should be sufficient, provided the ML framework being used supports GPU acceleration. The GPU images contain Miniconda, OpenMPI, CUDA, cuDNN, and NCCL. You can use these images for your environments, or use their corresponding Dockerfiles as a reference when building your own custom images.

For the set of base images and their corresponding Dockerfiles, see the AzureML-Containers repo.

https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-environments-v2?tabs=cli#create-an-environment
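For example, with the Python SDK v2 an environment can be registered on top of one of those GPU base images roughly like this (a minimal sketch; the workspace details are placeholders and the image tag is illustrative, so check the AzureML-Containers repo for current tags):

from azure.ai.ml import MLClient
from azure.ai.ml.entities import Environment
from azure.identity import DefaultAzureCredential

# Connect to the workspace (all three names below are placeholders)
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# Environment built on a GPU base image from the AzureML-Containers repo;
# the tag is illustrative; pick one matching the CUDA/cuDNN versions you need
env = Environment(
    name="gpu-pytorch-env",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.1-cudnn8-ubuntu20.04",
    conda_file="conda.yml",  # optional conda spec pinning your PyTorch build
)
ml_client.environments.create_or_update(env)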