NLP FastBert: BertDataBunch error with multi-label text classification


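I am following the FastBert tutorial from huggingface.

The problem is that the code is not fully reproducible. The main issue I am facing is preparing the dataset. The tutorial uses this dataset.

However, if I set up the folder structure as the tutorial describes and put the dataset files in those folders, creating the databunch raises an error: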

databunch = BertDataBunch(args['data_dir'], LABEL_PATH, args.model_name, train_file='train.csv', val_file='val.csv',
                          test_data='test.csv',
                          text_col="comment_text", label_col=label_cols,
                          batch_size_per_gpu=args['train_batch_size'], max_seq_length=args['max_seq_length'], 
                          multi_gpu=args.multi_gpu, multi_label=True, model_type=args.model_type)
It complains that the file format is wrong. How should I format the dataset and the labels for this dataset so that FastBert accepts them?

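  • First, you can use the FastBert notebook from GitHub.
  • The FastBert README has a short tutorial on how to prepare the dataset before using it.
  • The relevant README section is "Create a DataBunch object":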

    
    
    The databunch object takes training, validation and test csv files and converts the data into internal representation for BERT, RoBERTa, DistilBERT or XLNet. The object also instantiates the correct data-loaders based on device profile and batch_size and max_sequence_length.
    
    from fast_bert.data_cls import BertDataBunch
    
    databunch = BertDataBunch(DATA_PATH, LABEL_PATH,
                              tokenizer='bert-base-uncased',
                              train_file='train.csv',
                              val_file='val.csv',
                              label_file='labels.csv',
                              text_col='text',
                              label_col='label',
                              batch_size_per_gpu=16,
                              max_seq_length=512,
                              multi_gpu=True,
                              multi_label=False,
                              model_type='bert')
    
    File format for train.csv and val.csv
    index   text    label
    0   Looking through the other comments, I'm amazed that there aren't any warnings to potential viewers of what they have to look forward to when renting this garbage. First off, I rented this thing with the understanding that it was a competently rendered Indiana Jones knock-off.    neg
    1   I've watched the first 17 episodes and this series is simply amazing! I haven't been this interested in an anime series since Neon Genesis Evangelion. This series is actually based off an h-game, which I'm not sure if it's been done before or not, I haven't played the game, but from what I've heard it follows it very well     pos
    2   his movie is nothing short of a dark, gritty masterpiece. I may be bias, as the Apartheid era is an area I've always felt for.  pos
    
    In case the column names are different than the usual text and labels, you will have to provide those names in the databunch text_col and label_col parameters.
    
    labels.csv will contain a list of all unique labels. In this case the file will contain:
    
    pos
    neg
    
    For multi-label classification, labels.csv will contain all possible labels:
    
    toxic
    severe_toxic
    obscene
    threat
    insult
    identity_hate
    
    The file train.csv will then contain one column for each label, with each column value being either 0 or 1. Don't forget to change multi_label=True for multi-label classification in BertDataBunch.
    id  text    toxic   severe_toxic    obscene     threat  insult  identity_hate
    0   Why the edits made under my username Hardcore Metallica Fan were reverted?  0   0   0   0   0   0
    0   I will mess you up  1   0   0   1   0   0
    
    label_col will be a list of label column names. In this case it will be:
    
    ['toxic','severe_toxic','obscene','threat','insult','identity_hate']
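A "wrong file format" error often comes down to a column name that does not match `text_col` or `label_col`, or to a label cell that is not 0 or 1. The following is a minimal sketch of a pre-flight check for the multi-label layout described above. It uses only the Python standard library and is not part of fast-bert; the function name `check_multilabel_csv` is a hypothetical helper introduced for illustration.

```python
import csv


def check_multilabel_csv(path, text_col, label_cols):
    """Return a list of problems found in a multi-label CSV (empty if none).

    Hypothetical helper, not a fast-bert API: it only checks that the text
    column and every label column exist and that label cells are "0" or "1".
    """
    problems = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames or []

        # Every column named in text_col / label_col must be in the header.
        missing = [c for c in [text_col, *label_cols] if c not in header]
        if missing:
            return [f"missing columns: {missing}"]

        # Each label cell must be exactly 0 or 1 (short rows yield None).
        for row_no, row in enumerate(reader, start=1):
            bad = [c for c in label_cols if (row[c] or "").strip() not in ("0", "1")]
            if bad:
                problems.append(f"row {row_no}: non-binary value in {bad}")
    return problems
```

For the dataset in the question, this would be called with `text_col="comment_text"` and the six-name `label_col` list shown above; an empty result means the file at least matches the layout the README describes.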
    
    So, just keep train.csv, val.csv (simply a copy of train.csv) and test.csv in the data/ folder.

    In the labels folder, keep a labels.csv file with the following contents:

    toxic
    severe_toxic
    obscene
    threat
    insult
    identity_hate
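
The folder setup described in this answer can be sketched as a small script. This is an illustration under stated assumptions, not code from the tutorial: `prepare_layout` is a hypothetical helper, the folder arguments are whatever you pass as the data and label directories to `BertDataBunch`, and labels.csv is assumed to list exactly the names passed as `label_col`, one per line and without a header.

```python
import shutil
from pathlib import Path

# Assumption: the same six label columns that are passed as label_col.
LABEL_COLS = ["toxic", "severe_toxic", "obscene",
              "threat", "insult", "identity_hate"]


def prepare_layout(source_csv, data_dir, label_dir, label_cols=LABEL_COLS):
    """Create data_dir/train.csv, data_dir/val.csv and label_dir/labels.csv.

    Hypothetical helper: val.csv is a plain copy of the training file, as the
    answer suggests. A real validation split would be preferable.
    """
    source_csv = Path(source_csv)
    data_dir, label_dir = Path(data_dir), Path(label_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    label_dir.mkdir(parents=True, exist_ok=True)

    for name in ("train.csv", "val.csv"):
        target = data_dir / name
        # Skip the copy if the source already is the target file.
        if source_csv.resolve() != target.resolve():
            shutil.copyfile(source_csv, target)

    # labels.csv: one label name per line, no header row.
    (label_dir / "labels.csv").write_text(
        "\n".join(label_cols) + "\n", encoding="utf-8"
    )
    return data_dir, label_dir
```

After running this, the two returned folders are the ones to pass as the first two arguments of `BertDataBunch`, together with `label_col=LABEL_COLS` and `multi_label=True`. test.csv is not handled here and would need to be copied into the data folder separately.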