Python DistributedSampler: expected a 'cuda' device type for generator when generating indices
When performing distributed training, I have the following code:
training_sampler = DistributedSampler(training_set, num_replicas=2, rank=0)
training_generator = data.DataLoader(training_set, **params, sampler=training_sampler)
for x, y, z in training_generator:  # Error occurs here.
    ...
Altogether, I get the following output:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/ubuntu/VC/ppg_training_extraction/ppg_training_scripts/train_ASR_trim_scp.py", line 336, in train
for local_batch_src, local_batch_tgt, lengths in dataloaders[phase]:
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 352, in __iter__
return self._get_iterator()
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 294, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 827, in __init__
self._reset(loader, first_iter=True)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 857, in _reset
self._try_put_index()
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1091, in _try_put_index
index = self._next_index()
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 427, in _next_index
return next(self._sampler_iter) # may raise StopIteration
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/torch/utils/data/sampler.py", line 227, in __iter__
for idx in self.sampler:
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/torch/utils/data/distributed.py", line 97, in __iter__
indices = torch.randperm(len(self.dataset), generator=g).tolist() # type: ignore
RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'
Stopping at that line, I ran the following statements in pdb:
(Pdb) g = torch.Generator()
(Pdb) g.manual_seed(0)
<torch._C.Generator object at 0x7ff7f8143110>
(Pdb) indices = torch.randperm(4556, generator=g).tolist()
(Pdb) indices = torch.randperm(455604, generator=g).tolist()
*** RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'
Why does the RuntimeError occur when the upper-bound integer is large, but not when it is small enough?
Note that I also ran a clean Python session and found that:
>>> import torch
>>> g = torch.Generator()
>>> g.manual_seed(0)
<torch._C.Generator object at 0x7f9d2dfb39f0>
>>> indices = torch.randperm(455604, generator=g).tolist()
This works. Could the error be related to how I am handling distributed training across multiple GPUs? Any insight would be appreciated.
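
For context, the step that fails in the traceback (`DistributedSampler.__iter__` in `distributed.py`) shuffles all dataset indices and then hands each rank its own slice. Below is a minimal pure-Python sketch of that logic, not the library's actual code: `shard_indices` is a hypothetical helper, and `random.Random` stands in for `torch.Generator` plus `torch.randperm`, so it does not reproduce the CUDA error. It only illustrates what `num_replicas` and `rank` control.

```python
import math
import random


def shard_indices(dataset_len, num_replicas, rank, seed=0, epoch=0):
    """Hypothetical stand-in for the index generation in DistributedSampler."""
    # Every rank seeds identically, so all ranks compute the same permutation.
    rng = random.Random(seed + epoch)
    indices = list(range(dataset_len))
    rng.shuffle(indices)  # stands in for torch.randperm(len(dataset), generator=g)

    # Pad with leading indices so the list divides evenly among replicas.
    num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas
    indices += indices[: total_size - len(indices)]

    # Each rank keeps every num_replicas-th index, starting at its own rank.
    return indices[rank:total_size:num_replicas]


# Same sizes as in the question: 455604 samples split across 2 replicas.
rank0 = shard_indices(455604, num_replicas=2, rank=0)
rank1 = shard_indices(455604, num_replicas=2, rank=1)
```

In this sketch the two ranks receive disjoint halves of the data. With `rank=0` hard-coded as in the snippet in the question, every spawned process would request the same shard, so the rank is usually taken from the process index instead.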