
Python: How to use 'collate_fn' with a DataLoader?

Tags: python, pytorch, huggingface-transformers, dataloader

I am trying to train a pretrained RoBERTa model using 3 inputs, 3 input masks, and a label as tensors of my training dataset.

I do this with the following code:

from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers import Trainer, TrainingArguments
batch_size = 32
# Create the DataLoader for our training set.
train_data = TensorDataset(train_AT, train_BT, train_CT, train_maskAT, train_maskBT, train_maskCT, labels_trainT)
train_dataloader = DataLoader(train_data, batch_size=batch_size)

# Create the Dataloader for our validation set.
validation_data = TensorDataset(val_AT, val_BT, val_CT, val_maskAT, val_maskBT, val_maskCT, labels_valT)
val_dataloader = DataLoader(validation_data, batch_size=batch_size)

# Pytorch Training
training_args = TrainingArguments(
    output_dir='C:/Users/samvd/Documents/Master/AppliedMachineLearning/FinalProject/results',          # output directory
    num_train_epochs=1,              # total # of training epochs
    per_device_train_batch_size=32,  # batch size per device during training
    per_device_eval_batch_size=32,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='C:/Users/samvd/Documents/Master/AppliedMachineLearning/FinalProject/logs',            # directory for storing logs
)

trainer = Trainer(
    model=model,                     # the instantiated Transformers model to be trained
    args=training_args,              # training arguments, defined above
    train_dataset=train_data,        # training dataset
    eval_dataset=validation_data,    # evaluation dataset
)
trainer.train()
Basically, the collate_fn receives a list of tuples if your __getitem__ function from a Dataset subclass returns a tuple, or just a normal list if your Dataset subclass returns only one element. Its main objective is to create your batch without you spending much time implementing it manually. Try to see it as glue where you specify the way examples stick together in a batch. If you don't use it, PyTorch only puts batch_size examples together as you would using torch.stack (not exactly that, but it is simple like that).
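To make that concrete, here is a minimal sketch (the ToyDataset class and its contents are hypothetical, not from the question) of a Dataset subclass whose __getitem__ returns an (example, label, length) tuple; the collate_fn then receives a list of such tuples, one per sampled index:

import torch
from torch.utils.data import Dataset

class ToyDataset(Dataset):
    """Hypothetical dataset of variable-length sequences."""
    def __init__(self, sequences, labels):
        self.sequences = sequences  # list of tensors, each of shape (seq_len, n_ftrs)
        self.labels = labels        # list of scalar labels

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        # Returning a tuple here means the collate_fn receives a list of tuples,
        # e.g. [(example_0, label_0, length_0), (example_1, label_1, length_1), ...]
        seq = self.sequences[idx]
        return seq, self.labels[idx], seq.size(0)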

Suppose, for example, you want to create batches from a list of varying-dimension tensors. The code below pads sequences with 0 up to the maximum sequence size of the batch; that is why we need the collate_fn: a standard batching algorithm (simply using torch.stack) won't work in this case, and we need to manually pad the different variable-length sequences to the same size before creating the batch.

import torch

def collate_fn(data):
    """
       data: is a list of tuples with (example, label, length)
             where 'example' is a tensor of arbitrary shape
             and label/length are scalars
    """
    _, labels, lengths = zip(*data)
    max_len = max(lengths)
    n_ftrs = data[0][0].size(1)
    features = torch.zeros((len(data), max_len, n_ftrs))
    labels = torch.tensor(labels)
    lengths = torch.tensor(lengths)

    for i in range(len(data)):
        j, k = data[i][0].size(0), data[i][0].size(1)
        features[i] = torch.cat([data[i][0], torch.zeros((max_len - j, k))])

    return features.float(), labels.long(), lengths.long()

This function gets fed to the collate_fn parameter of the DataLoader, as in this example:

DataLoader(toy_dataset, collate_fn=collate_fn, batch_size=5)
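As an end-to-end sketch (the toy_dataset contents are made up for illustration, and this reuses the hypothetical ToyDataset class above), the following builds a few variable-length sequences and pulls one padded batch:

import torch
from torch.utils.data import DataLoader

# Three sequences of different lengths, each with 4 features per time step.
sequences = [torch.randn(n, 4) for n in (2, 5, 3)]
labels = [0, 1, 0]

toy_dataset = ToyDataset(sequences, labels)
loader = DataLoader(toy_dataset, collate_fn=collate_fn, batch_size=5)

feats, labs, lens = next(iter(loader))
print(feats.shape)  # torch.Size([3, 5, 4]) -- padded to the longest sequence
print(lens)         # tensor([2, 5, 3])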

With this collate_fn function, you always get a tensor where all the examples have the same size. So, when you feed your forward() function with this data, you need to use the lengths to get the original data back, and not use those meaningless zeros in your computation.
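One common way to respect those lengths downstream (a sketch, not part of the original answer; it continues with the feats and lens tensors from the batch above) is to build a boolean mask and keep the padded positions out of the computation, for example when averaging over time steps:

# feats: (batch, max_len, n_ftrs); lens: (batch,) from the collate_fn above
mask = torch.arange(feats.size(1))[None, :] < lens[:, None]   # (batch, max_len)
summed = (feats * mask.unsqueeze(-1)).sum(dim=1)              # zero out padding, sum over time
mean_per_example = summed / lens.unsqueeze(-1).float()        # mean over real steps only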


Source:

You have posted the same question three times; I'm not sure that will help you get an answer. I suggest editing your original question instead, which would help readers answer it.

Does this answer your question? It shows how to use collate_fn.