
Python: How can I decode a transformers dataset using map or a loop?


I load a dataset that contains a text column, and I want to translate the texts.

To speed up the process, I tried to use a transformers dataset:

import torch
from tqdm import tqdm
from datasets import load_dataset
from transformers import MT5Tokenizer, MT5ForConditionalGeneration

model_size = "base"
model_name = f"persiannlp/mt5-{model_size}-parsinlu-translation_en_fa"
tokenizer = MT5Tokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)
dataset = load_dataset('csv', data_files=dfname, split='train')  # dfname: path to the CSV file (defined elsewhere)

# tokenize each example (non-batched map)
dataset = dataset.map(lambda e: tokenizer(e['input_text'], padding='longest'))

dataset.set_format(type='torch', columns=['input_ids'])

# map for generating translation
#dataset = dataset.map(lambda e: {"trans":model.generate(e['input_ids'])})

dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)
for batch in tqdm(dataloader):
    input_ids = batch["input_ids"]
    res = model.generate(input_ids)
    target = tokenizer.batch_decode(res, skip_special_tokens=True)

First, I tried calling model.generate inside another map (the line commented out in the code above), but it gave an error.

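For reference, one way such a map over generate could be set up, shown here only as a sketch and an assumption rather than the attempt that produced the error, is to run it batched on the raw text column (before set_format), letting the tokenizer pad each batch to a common length:

# Sketch (assumption, not the original attempt): generate translations inside
# a batched map; the tokenizer pads each batch so generate receives a
# rectangular tensor of input_ids.
def translate_batch(batch):
    enc = tokenizer(batch["input_text"], padding=True, return_tensors="pt")
    out = model.generate(input_ids=enc["input_ids"],
                         attention_mask=enc["attention_mask"])
    return {"trans": tokenizer.batch_decode(out, skip_special_tokens=True)}

dataset = dataset.map(translate_batch, batched=True, batch_size=32)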
Then I tried calling it in a loop, but that gave the following error:

Traceback (most recent call last):
  File "prepare_natural.py", line 146, in <module>
    for batch in tqdm(dataloader):
  File "/home/pouramini/miniconda3/lib/python3.7/site-packages/tqdm/std.py", line 1129, in __iter__
    for obj in iterable:
  File "/home/pouramini/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/pouramini/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 475, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/pouramini/miniconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/home/pouramini/miniconda3/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 73, in default_collate
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/home/pouramini/miniconda3/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 73, in <dictcomp>
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/home/pouramini/miniconda3/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [17] at entry 0 and [15] at entry 1
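The stack trace points at batch collation rather than at generate: because the map above runs example by example, padding='longest' pads each sequence only against itself, so the stored input_ids keep their individual lengths (17 and 15 here), and the DataLoader's default_collate cannot torch.stack them. A minimal sketch of one possible workaround, assuming the items still hold variable-length input_ids, is to pad each batch in a custom collate_fn (tokenizer.pad is used here as one convenient way to do that):

# Sketch: pad each DataLoader batch to a common length before stacking,
# instead of relying on the default collate function.
def collate(batch):
    # batch is a list of dicts like {"input_ids": <1-D tensor, varying length>}
    return tokenizer.pad(batch, padding="longest", return_tensors="pt")

dataloader = torch.utils.data.DataLoader(dataset, batch_size=32, collate_fn=collate)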



Comments:

The loop works fine for me. I think the problem is somewhere else; you should post the whole stack trace.

@NatthaphonHongcharoen I added the full stack trace. It may work for some batches and then fail with this error. By the way, it still seems slow. You can ignore my code; if you know a fast solution (e.g. one that uses the GPU), please post it.

Did you verify that the padding works as expected (i.e., did you check the shape of each dataset item)?
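Following up on that last comment, here is a quick sketch (not verified against the asker's data) for checking whether the dataset items really ended up with equal-length input_ids, plus a hypothetical GPU move for speed:

# Sketch: collect the distinct input_ids lengths across the dataset; with a
# non-batched map and padding='longest', more than one length is expected,
# which is exactly what makes torch.stack fail in the default collate.
lengths = {len(dataset[i]["input_ids"]) for i in range(len(dataset))}
print(lengths)

# Hypothetical speed-up, assuming a CUDA device is available: move the model
# (and, inside the generation loop, each batch of input_ids) to the GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)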