How can I decode a transformers dataset using map or a loop in Python?
I load a dataset that contains a text column, and I want to translate the texts. To speed up the process, I tried to use a Hugging Face dataset:
import torch
from tqdm import tqdm
from datasets import load_dataset
from transformers import MT5Tokenizer, MT5ForConditionalGeneration

model_size = "base"
model_name = f"persiannlp/mt5-{model_size}-parsinlu-translation_en_fa"
tokenizer = MT5Tokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)

# dfname is the path to a CSV file with an 'input_text' column
dataset = load_dataset('csv', data_files=dfname, split='train')
dataset = dataset.map(lambda e: tokenizer(e['input_text'], padding='longest'))
dataset.set_format(type='torch', columns=['input_ids'])

# map for generating translation
#dataset = dataset.map(lambda e: {"trans": model.generate(e['input_ids'])})

dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)
for batch in tqdm(dataloader):
    input_ids = batch["input_ids"]
    res = model.generate(input_ids)
    target = tokenizer.batch_decode(res, skip_special_tokens=True)
First, I tried calling model.generate inside another map (the line commented out in the code above), which gave an error. Then I tried calling it in a loop instead, but the loop gives the following error:
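A likely cause: with a non-batched map, the tokenizer sees one example at a time, so padding='longest' pads each example to its own length, i.e. not at all, and the stored input_ids have different lengths. A minimal sketch of that difference, using a toy character-level "tokenizer" standing in for the real MT5Tokenizer (toy_tokenize is hypothetical, written only to illustrate the behavior):

```python
def toy_tokenize(texts, padding):
    """Toy tokenizer: one id per character, optionally padded to the longest text."""
    ids = [[ord(c) for c in t] for t in texts]
    if padding == "longest":
        max_len = max(len(x) for x in ids)
        ids = [x + [0] * (max_len - len(x)) for x in ids]
    return ids

# Per-example tokenization (what a non-batched map does): 'longest' is a no-op.
per_example = [toy_tokenize([t], "longest")[0] for t in ["hi", "hello"]]
# Batched tokenization: the tokenizer sees the whole batch and pads to a common length.
batched = toy_tokenize(["hi", "hello"], "longest")
print([len(x) for x in per_example])  # [2, 5] -- unequal lengths
print([len(x) for x in batched])      # [5, 5] -- equal lengths
```

With the real library, passing batched=True to dataset.map (or using padding='max_length' with a fixed max_length) should produce equal-length rows.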
Traceback (most recent call last):
File "prepare_natural.py", line 146, in <module>
for batch in tqdm(dataloader):
File "/home/pouramini/miniconda3/lib/python3.7/site-packages/tqdm/std.py", line 1129, in __iter__
for obj in iterable:
File "/home/pouramini/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
data = self._next_data()
File "/home/pouramini/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 475, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/home/pouramini/miniconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/home/pouramini/miniconda3/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 73, in default_collate
return {key: default_collate([d[key] for d in batch]) for key in elem}
File "/home/pouramini/miniconda3/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 73, in <dictcomp>
return {key: default_collate([d[key] for d in batch]) for key in elem}
File "/home/pouramini/miniconda3/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [17] at entry 0 and [15] at entry 1
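The error says the default collate function tried to stack tensors of lengths 17 and 15 into one batch tensor, which is impossible without padding. One fix is a custom collate_fn passed to the DataLoader that pads each batch to its longest member. A minimal pure-Python sketch of such a function (pad_id=0 is an assumption; in practice the pad id comes from tokenizer.pad_token_id, and transformers also offers DataCollatorWithPadding for this):

```python
def pad_collate(batch, pad_id=0):
    """Pad variable-length token-id lists in a batch to the length of the longest one."""
    max_len = max(len(ids) for ids in batch)
    return [ids + [pad_id] * (max_len - len(ids)) for ids in batch]

batch = [[5, 9, 3], [7, 2]]   # lengths 3 and 2, like the [17] vs [15] in the error
padded = pad_collate(batch)
print(padded)                 # [[5, 9, 3], [7, 2, 0]] -- now stackable
```

A torch version would additionally convert the padded lists to a tensor and build an attention mask so the model ignores the pad positions.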
The loop works fine for me. I think the problem is elsewhere; you should post the whole stack trace. @NatthaphonHongcharoen I added the full stack trace. It may work for some batches and then fail. By the way, it still looks quite slow; feel free to ignore my code and suggest a faster solution (e.g. using the GPU) if you know one. Have you verified that the padding works as expected (i.e. checked the shape of each dataset item)?
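The check the commenter suggests can be sketched as a small helper that confirms every tokenized item has the same length before batching (lengths_uniform is a hypothetical name, not a library function):

```python
def lengths_uniform(token_id_lists):
    """Return True if every item has the same number of token ids."""
    return len({len(ids) for ids in token_id_lists}) <= 1

print(lengths_uniform([[1, 2, 3], [4, 5, 6]]))  # True  -- safe to stack
print(lengths_uniform([[1, 2, 3], [4, 5]]))     # False -- default_collate will fail
```

Running this over dataset['input_ids'] would show whether the map-time padding actually produced uniform lengths; in the setup above it would return False, matching the RuntimeError.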