Description of the problem
The error occurs whenever num_workers > 0. When I set num_workers = 0 the error disappears, but that slows training down considerably. I think multiprocessing is the key factor here. How can I solve this problem?
Environment: Docker, Python 3.8, PyTorch 1.11.0+cu113
error output
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/resource_sharer.py", line 149, in _serve
    send(conn, destination_pid)
  File "/opt/conda/lib/python3.8/multiprocessing/resource_sharer.py", line 50, in send
    reduction.send_handle(conn, new_fd, pid)
  File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 184, in send_handle
    sendfds(s, [handle])
  File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 149, in sendfds
    sock.sendmsg([msg], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])
OSError: [Errno 9] Bad file descriptor

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/resource_sharer.py", line 151, in _serve
    close()
  File "/opt/conda/lib/python3.8/multiprocessing/resource_sharer.py", line 52, in close
    os.close(new_fd)
OSError: [Errno 9] Bad file descriptor

Traceback (most recent call last):
  File "save_disp.py", line 85, in <module>
    test()
  File "save_disp.py", line 55, in test
    for batch_idx, sample in enumerate(TestImgLoader):
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1207, in _next_data
    idx, data = self._get_data()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1173, in _get_data
    success, data = self._try_get_data()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1011, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 116, in get
    return _ForkingPickler.loads(res)
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 295, in rebuild_storage_fd
    fd = df.detach()
  File "/opt/conda/lib/python3.8/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 189, in recv_handle
    return recvfds(s, 1)[0]
  File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 159, in recvfds
    raise EOFError
EOFError
Dataloader
TrainImgLoader = DataLoader(train_dataset, args.batch_size, shuffle=True, num_workers=0, drop_last=True)
TestImgLoader = DataLoader(test_dataset, args.test_batch_size, shuffle=False, num_workers=0, drop_last=False)
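Note: the two lines above show num_workers=0 only because that is the workaround; the error appears as soon as workers are enabled, e.g. like this (8 is just an example value, any value > 0 triggers it):

# same loaders, with multiprocessing workers enabled -- this is the configuration that fails
TrainImgLoader = DataLoader(train_dataset, args.batch_size, shuffle=True, num_workers=8, drop_last=True)
TestImgLoader = DataLoader(test_dataset, args.test_batch_size, shuffle=False, num_workers=8, drop_last=False)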
def __getitem__(self, index):
    try:
        left_img = self.load_image(os.path.join(self.datapath, self.left_filenames[index]))
        right_img = self.load_image(os.path.join(self.datapath, self.right_filenames[index]))
        disparity = self.load_disp(os.path.join(self.datapath, self.disp_filenames[index]))
        roi = self.load_mask(os.path.join(self.datapath, self.mask_filenames[index]))

        if self.training:
            w, h = left_img.size
            crop_w, crop_h = 512, 256
            x1 = random.randint(0, w - crop_w)
            y1 = random.randint(0, h - crop_h)

            # random crop
            left_img = left_img.crop((x1, y1, x1 + crop_w, y1 + crop_h))
            right_img = right_img.crop((x1, y1, x1 + crop_w, y1 + crop_h))
            disparity = disparity[y1:y1 + crop_h, x1:x1 + crop_w]
            roi = roi[y1:y1 + crop_h, x1:x1 + crop_w]

            # to tensor, normalize
            processed = get_transform()
            left_img = processed(left_img)
            right_img = processed(right_img)

            return {"left": left_img,
                    "right": right_img,
                    "disparity": disparity,
                    "left_filename": self.left_filenames[index],
                    "right_filename": self.right_filenames[index],
                    "roi": roi}
        else:
            w, h = left_img.size
            # crop_w, crop_h = 1024, 1024
            # left_img = left_img.crop((w - crop_w, h - crop_h, w, h))
            # right_img = right_img.crop((w - crop_w, h - crop_h, w, h))
            # disparity = disparity[h - crop_h:h, w - crop_w: w]
            # roi = roi[h - crop_h:h, w - crop_w: w]

            # to tensor, normalize
            processed = get_transform()
            left_img = processed(left_img)
            right_img = processed(right_img)

            return {"left": left_img,
                    "right": right_img,
                    "disparity": disparity,
                    "top_pad": 0,
                    "right_pad": 0,
                    "left_filename": self.left_filenames[index],
                    "right_filename": self.right_filenames[index],
                    "roi": roi}
    except Exception as e:
        # print the per-item loading error; note the method then implicitly returns None
        print(e.args)
        print(str(e))
        print(repr(e))
        print('here is get_item error')
File format
TIFF image and txt file
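The load_image / load_disp / load_mask helpers are not shown in this post; for TIFF inputs they would plausibly look something like the sketch below (the PIL/NumPy choices and dtypes are assumptions, made only so the crop/slice code in __getitem__ above type-checks, not the real implementation):

import numpy as np
from PIL import Image

def load_image(self, filename):
    # RGB image kept as a PIL Image so .size and .crop work as used above
    return Image.open(filename).convert('RGB')

def load_disp(self, filename):
    # disparity map as a float32 numpy array, sliced with [y1:y2, x1:x2] above
    return np.array(Image.open(filename), dtype=np.float32)

def load_mask(self, filename):
    # ROI mask as a numpy array, sliced the same way as the disparity
    return np.array(Image.open(filename))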
From what I can see, it's an EOFError, and even before that a bad file descriptor error, which in this case looks like it stems from an attempt to close a file descriptor that wasn't open in the first place. See this:

  File "/opt/conda/lib/python3.8/multiprocessing/resource_sharer.py", line 52, in close
    os.close(new_fd)
OSError: [Errno 9] Bad file descriptor

this might help. Multiprocessing on Windows is error-prone for many reasons (pickling, etc.). Potentially, check this.
Not sure, though, whether these links would solve your error.
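If it does turn out to be related to torch's file-descriptor sharing, one quick check is which sharing strategies your build supports and which one is currently active. A minimal sketch (the available set and the default depend on the platform):

import torch.multiprocessing as mp

print(mp.get_all_sharing_strategies())  # e.g. {'file_descriptor', 'file_system'} on Linux
print(mp.get_sharing_strategy())        # 'file_descriptor' is the default on Linux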
I'm using torch 1.12.1 with pytorch-lightning 1.7.6 and am getting exactly the same issue, sporadically. I have 96 vCPUs available, and when I allow the DataLoader to use them, training takes ~1 hour, but I get bad file descriptor errors no matter which sharing strategy I set my workers to use (file_system or file_descriptor).
It seems to occur with any num_workers > 0, and when num_workers == 0 the training time explodes (4+ hours, up from ~1 hour with multiprocessing).
I have solved this problem. Adding this configuration to the dataset script works:
import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')
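A minimal sketch of where the call goes, in case it helps (the worker count is just an example; I put it at module level in the dataset script, before any DataLoader is created, though I haven't pinned down the minimal placement):

# top of the dataset / training script, before any DataLoader is created
import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')  # switch away from the default 'file_descriptor' strategy on Linux

from torch.utils.data import DataLoader
TestImgLoader = DataLoader(test_dataset, args.test_batch_size, shuffle=False, num_workers=8, drop_last=False)  # workers enabled again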