Python 3.x: Unable to create a custom dataset and data loader using torchtext
I have a question about building a custom dataset and iterator with torchtext. I used the following code from this article and modified it for my case:
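# Field and TabularDataset belong to the legacy torchtext API
# (torchtext.data before 0.9, torchtext.legacy.data in 0.9 to 0.11).
from sklearn.model_selection import train_test_split
from torchtext.data import Field, TabularDataset
from transformers import XLNetTokenizer
tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
text_field = Field(sequential=True, eos_token="[CLS]", tokenize=tokenizer)
label_field = Field(sequential=False, use_vocab=False)
data_fields = [("file", None),
               ("text", text_field),
               ("label", label_field)]
train, val = train_test_split(input_dt, test_size=0.1)
train.to_csv("train_output_path", index=False)
val.to_csv("val_output_path", index=False)
train, val = TabularDataset.splits(path="path", train="train.csv", validation="val.csv",
                                   format="csv", skip_header=True, fields=data_fields)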
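When it gets to text_field.build_vocab(train), I get a TypeError.

I think that when you use a predefined tokenizer you don't need to build a vocab. Instead, you can follow the steps below, which show how to do this with the BERT tokenizer.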
sentences: a list of the text data
labels: the labels associated with the sentences
import torch
from transformers import BertTokenizer
# For XLNet instead: tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
# Tokenize all of the sentences and map the tokens to their word IDs.
input_ids = []
attention_masks = []
# For every sentence...
for sent in sentences:
    # `encode_plus` will:
    #   (1) Tokenize the sentence.
    #   (2) Prepend the `[CLS]` token to the start.
    #   (3) Append the `[SEP]` token to the end.
    #   (4) Map tokens to their IDs.
    #   (5) Pad or truncate the sentence to `max_length`
    #   (6) Create attention masks for [PAD] tokens.
    encoded_dict = tokenizer.encode_plus(
        sent,                          # Sentence to encode.
        add_special_tokens = True,     # Add '[CLS]' and '[SEP]'
        max_length = 100,              # Pad & truncate all sentences.
        # `padding` and `truncation` replace the deprecated
        # `pad_to_max_length = True` (transformers >= 3.0).
        padding = 'max_length',
        truncation = True,
        return_attention_mask = True,  # Construct attn. masks.
        return_tensors = 'pt',         # Return pytorch tensors.
    )
    # Add the encoded sentence to the list.
    input_ids.append(encoded_dict['input_ids'])
    # And its attention mask (simply differentiates padding from non-padding).
    attention_masks.append(encoded_dict['attention_mask'])
# Convert the lists into tensors.
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.tensor(labels)
# Print sentence 0, now as a list of IDs.
print('Original: ', sentences[0])
print('Token IDs:', input_ids[0])
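To make steps (5) and (6) from the comments above concrete, here is a minimal pure-Python sketch of padding, truncation, and attention-mask construction. It is only an illustration and not how the transformers library implements these steps; the helper name `pad_and_mask`, the pad ID of 0, and the toy token IDs are all assumptions made for this example.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Pad or truncate token_ids to max_length and build an attention mask.

    Hypothetical helper for illustration only. It truncates naively, so a
    trailing [SEP] ID can be cut off; a real tokenizer truncates the text
    first and then adds the special tokens.
    """
    # Truncate sequences that are longer than max_length.
    ids = list(token_ids[:max_length])
    # Real tokens get mask value 1.
    mask = [1] * len(ids)
    # Fill the remainder with padding; padded positions get mask value 0.
    n_pad = max_length - len(ids)
    ids += [pad_id] * n_pad
    mask += [0] * n_pad
    return ids, mask


# Toy IDs standing in for an encoded sentence (values chosen for illustration).
short_ids, short_mask = pad_and_mask([101, 7592, 2088, 102], max_length=6)
long_ids, long_mask = pad_and_mask([101, 1, 2, 3, 4, 5, 6, 102], max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The mask is what lets the model ignore the padded positions, which is why it is stored alongside the input IDs.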
### Now combine the input IDs, attention masks, and labels, and split the dataset:
from torch.utils.data import TensorDataset, random_split
# Combine the training inputs into a TensorDataset.
dataset = TensorDataset(input_ids, attention_masks, labels)
# Create a 90-10 train-validation split.
# Calculate the number of samples to include in each set.
train_size = int(0.90 * len(dataset))
val_size = len(dataset) - train_size
# Divide the dataset by randomly selecting samples.
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])
print('{:>5,} training samples'.format(train_size))
print('{:>5,} validation samples'.format(val_size))
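`random_split` draws a different split on each run unless the random state is fixed. If a reproducible split is wanted, one option is to pass a seeded generator. The sketch below uses a toy `TensorDataset` in place of the real one; the seed value 42, the tensor shapes, and the `toy_` names are arbitrary assumptions.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Toy stand-ins for input_ids, attention_masks, and labels (20 samples).
toy_ids = torch.arange(20 * 4).reshape(20, 4)
toy_masks = torch.ones(20, 4, dtype=torch.long)
toy_labels = torch.zeros(20, dtype=torch.long)
toy_dataset = TensorDataset(toy_ids, toy_masks, toy_labels)

# Same 90-10 arithmetic as above: the validation set takes the remainder.
train_size = int(0.90 * len(toy_dataset))
val_size = len(toy_dataset) - train_size

# A seeded generator makes the split identical across runs.
generator = torch.Generator().manual_seed(42)
toy_train, toy_val = random_split(
    toy_dataset, [train_size, val_size], generator=generator
)

print(len(toy_train), len(toy_val))
```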
### Now create the data loaders for these datasets:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
# The DataLoader needs to know our batch size for training, so we specify it
# here. For fine-tuning BERT on a specific task, the authors recommend a batch
# size of 16 or 32.
batch_size = 32
# Create the DataLoaders for our training and validation sets.
# We'll take training samples in random order.
train_dataloader = DataLoader(
    train_dataset,                            # The training samples.
    sampler = RandomSampler(train_dataset),   # Select batches randomly
    batch_size = batch_size                   # Trains with this batch size.
)
# For validation the order doesn't matter, so we'll just read them sequentially.
validation_dataloader = DataLoader(
    val_dataset,                              # The validation samples.
    sampler = SequentialSampler(val_dataset), # Pull out batches sequentially.
    batch_size = batch_size                   # Evaluate with this batch size.
)
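Each batch that a `DataLoader` yields from a `TensorDataset` holds the tensors in the order they were given when the dataset was built. Here is a minimal sketch of consuming such a loader, using toy tensors rather than real tokenizer output; the shapes, values, and `toy_` names are assumptions for illustration.

```python
import torch
from torch.utils.data import DataLoader, SequentialSampler, TensorDataset

# Toy data: 10 samples, sequence length 5, binary labels.
toy_ids = torch.randint(0, 1000, (10, 5))
toy_masks = torch.ones(10, 5, dtype=torch.long)
toy_labels = torch.randint(0, 2, (10,))
toy_dataset = TensorDataset(toy_ids, toy_masks, toy_labels)

toy_loader = DataLoader(
    toy_dataset,
    sampler=SequentialSampler(toy_dataset),
    batch_size=4,
)

batch_shapes = []
for batch in toy_loader:
    # Unpack in the same order the tensors were passed to TensorDataset.
    b_input_ids, b_attention_mask, b_labels = batch
    batch_shapes.append((tuple(b_input_ids.shape), tuple(b_labels.shape)))

print(batch_shapes)
```

With 10 samples and a batch size of 4, the last batch is smaller than the others. In a training loop, the three unpacked tensors would typically be moved to the target device before being passed to the model.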