When calling pytorch_pretrained_bert on multiple GPUs, the following error appears: StopIteration: Caught StopIteration in replica 0 on device 0. The full traceback is below.

File "/home/yuangen_yu/CLUE/baselines/models_pytorch/classifier_pytorch/run_classifier.py", line 569, in <module>
main()
File "/home/yuangen_yu/CLUE/baselines/models_pytorch/classifier_pytorch/run_classifier.py", line 504, in main
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
File "/home/yuangen_yu/CLUE/baselines/models_pytorch/classifier_pytorch/run_classifier.py", line 113, in train
outputs = model(**inputs)
File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
output.reraise()
File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/yuangen_yu/CLUE/baselines/models_pytorch/classifier_pytorch/transformers/modeling_bert.py", line 897, in forward
head_mask=head_mask)
File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/yuangen_yu/CLUE/baselines/models_pytorch/classifier_pytorch/transformers/modeling_bert.py", line 606, in forward
extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
StopIteration

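My PyTorch version is 1.5. On a single GPU, printing next(self.parameters()).dtype always gives torch.float32, so this looks like a version issue: under multi-GPU DataParallel the same call finds no parameters in the replica, and next() raises StopIteration. Replacing next(self.parameters()).dtype with torch.float32 directly makes the error go away.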

This solves the problem, but it is not a long-term fix.