Description of the problem
The error occurs whenever num_workers > 0. When I set num_workers = 0 the error disappears, but that slows training down considerably. I think multiprocessing is the key factor here. How can I solve this problem?
Environment: Docker, Python 3.8, PyTorch 1.11.0+cu113
error output
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/resource_sharer.py", line 149, in _serve
    send(conn, destination_pid)
  File "/opt/conda/lib/python3.8/multiprocessing/resource_sharer.py", line 50, in send
    reduction.send_handle(conn, new_fd, pid)
  File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 184, in send_handle
    sendfds(s, [handle])
  File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 149, in sendfds
    sock.sendmsg([msg], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])
OSError: [Errno 9] Bad file descriptor

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/resource_sharer.py", line 151, in _serve
    close()
  File "/opt/conda/lib/python3.8/multiprocessing/resource_sharer.py", line 52, in close
    os.close(new_fd)
OSError: [Errno 9] Bad file descriptor

Traceback (most recent call last):
  File "save_disp.py", line 85, in <module>
    test()
  File "save_disp.py", line 55, in test
    for batch_idx, sample in enumerate(TestImgLoader):
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1207, in _next_data
    idx, data = self._get_data()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1173, in _get_data
    success, data = self._try_get_data()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1011, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 116, in get
    return _ForkingPickler.loads(res)
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 295, in rebuild_storage_fd
    fd = df.detach()
  File "/opt/conda/lib/python3.8/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 189, in recv_handle
    return recvfds(s, 1)[0]
  File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 159, in recvfds
    raise EOFError
EOFError
Dataloader
TrainImgLoader = DataLoader(train_dataset, args.batch_size, shuffle=True, num_workers=0, drop_last=True)
TestImgLoader = DataLoader(test_dataset, args.test_batch_size, shuffle=False, num_workers=0, drop_last=False)
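Note: the two lines above show num_workers=0 only because that is the workaround; the error appears as soon as workers are enabled, e.g. like this (8 is just an example value, any value > 0 triggers it):

# same loaders, with multiprocessing workers enabled -- this is the configuration that fails
TrainImgLoader = DataLoader(train_dataset, args.batch_size, shuffle=True, num_workers=8, drop_last=True)
TestImgLoader = DataLoader(test_dataset, args.test_batch_size, shuffle=False, num_workers=8, drop_last=False)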
def __getitem__(self, index):
    try:
        left_img = self.load_image(os.path.join(self.datapath, self.left_filenames[index]))
        right_img = self.load_image(os.path.join(self.datapath, self.right_filenames[index]))
        disparity = self.load_disp(os.path.join(self.datapath, self.disp_filenames[index]))
        roi = self.load_mask(os.path.join(self.datapath, self.mask_filenames[index]))

        if self.training:
            w, h = left_img.size
            crop_w, crop_h = 512, 256
            x1 = random.randint(0, w - crop_w)
            y1 = random.randint(0, h - crop_h)

            # random crop
            left_img = left_img.crop((x1, y1, x1 + crop_w, y1 + crop_h))
            right_img = right_img.crop((x1, y1, x1 + crop_w, y1 + crop_h))
            disparity = disparity[y1:y1 + crop_h, x1:x1 + crop_w]
            roi = roi[y1:y1 + crop_h, x1:x1 + crop_w]

            # to tensor, normalize
            processed = get_transform()
            left_img = processed(left_img)
            right_img = processed(right_img)

            return {"left": left_img,
                    "right": right_img,
                    "disparity": disparity,
                    "left_filename": self.left_filenames[index],
                    "right_filename": self.right_filenames[index],
                    "roi": roi}
        else:
            w, h = left_img.size
            # crop_w, crop_h = 1024, 1024
            # left_img = left_img.crop((w - crop_w, h - crop_h, w, h))
            # right_img = right_img.crop((w - crop_w, h - crop_h, w, h))
            # disparity = disparity[h - crop_h:h, w - crop_w: w]
            # roi = roi[h - crop_h:h, w - crop_w: w]

            # to tensor, normalize
            processed = get_transform()
            left_img = processed(left_img)
            right_img = processed(right_img)

            return {"left": left_img,
                    "right": right_img,
                    "disparity": disparity,
                    "top_pad": 0,
                    "right_pad": 0,
                    "left_filename": self.left_filenames[index],
                    "right_filename": self.right_filenames[index],
                    "roi": roi}
    except Exception as e:
        # print the per-item loading error; note the method then implicitly returns None
        print(e.args)
        print(str(e))
        print(repr(e))
        print('here is get_item error')
File format
TIFF image and txt file
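The load_image / load_disp / load_mask helpers are not shown in this post; for TIFF inputs they would plausibly look something like the sketch below (the PIL/NumPy choices and dtypes are assumptions, made only so the crop/slice code in __getitem__ above type-checks, not the real implementation):

import numpy as np
from PIL import Image

def load_image(self, filename):
    # RGB image kept as a PIL Image so .size and .crop work as used above
    return Image.open(filename).convert('RGB')

def load_disp(self, filename):
    # disparity map as a float32 numpy array, sliced with [y1:y2, x1:x2] above
    return np.array(Image.open(filename), dtype=np.float32)

def load_mask(self, filename):
    # ROI mask as a numpy array, sliced the same way as the disparity
    return np.array(Image.open(filename))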
From what I can see, it's an EOFError, and even before that a bad file descriptor error, which in this case looks like it stems from an attempt to close a file descriptor that wasn't open in the first place. See this:

  File "/opt/conda/lib/python3.8/multiprocessing/resource_sharer.py", line 52, in close
    os.close(new_fd)
OSError: [Errno 9] Bad file descriptor

this might help. Multiprocessing on Windows is error-prone for many reasons (pickling, etc.). Potentially, check this.
Not sure, though, whether these links would solve your error.
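If it does turn out to be related to torch's file-descriptor sharing, one quick check is which sharing strategies your build supports and which one is currently active. A minimal sketch (the available set and the default depend on the platform):

import torch.multiprocessing as mp

print(mp.get_all_sharing_strategies())  # e.g. {'file_descriptor', 'file_system'} on Linux
print(mp.get_sharing_strategy())        # 'file_descriptor' is the default on Linux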
I'm using torch 1.12.1 with pytorch-lightning 1.7.6 and am getting exactly the same issue, sporadically. I have 96 vCPUs available, and when I allow the DataLoader to use them, training takes ~1 hour, but I get bad file descriptor errors no matter which sharing strategy I set my workers to use (file_system or file_descriptor).
It seems to occur with any num_workers > 0, and when num_workers == 0 the training time explodes (4+ hours, up from ~1 hour with multiprocessing).
I have solved this problem. Adding this configuration to the dataset script works:
import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')
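A minimal sketch of where the call goes, in case it helps (the worker count is just an example; I put it at module level in the dataset script, before any DataLoader is created, though I haven't pinned down the minimal placement):

# top of the dataset / training script, before any DataLoader is created
import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')  # switch away from the default 'file_descriptor' strategy on Linux

from torch.utils.data import DataLoader
TestImgLoader = DataLoader(test_dataset, args.test_batch_size, shuffle=False, num_workers=8, drop_last=False)  # workers enabled again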