Python: taking advantage of function objects to speed up training

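I would like to modify the function tokenized_dataset because it creates a very heavy dictionary. The dataset created by this function will be reused for ML training. However, dragging that dictionary around during training slows the training down considerably.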

Note that each document is a list of sentence strings; a sample appears at the end of this post.

from datasets import load_dataset  # Hugging Face datasets
from transformers import BertTokenizer  # Hugging Face transformers

def tokenized_dataset(dataset):
    """ Method that tokenizes each document in the train, test and validation dataset

    Args:
        dataset (DatasetDict): dataset that will be tokenized (train, test, validation)
    
    Returns:
        dict: dataset once tokenized
    """

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    encode = lambda document: tokenizer(document, return_tensors='pt', padding=True, truncation=True)
    train_articles = [encode(document) for document in dataset["train"]["article"]]
    test_articles = [encode(document) for document in dataset["test"]["article"]]
    val_articles = [encode(document) for document in dataset["val"]["article"]]
    train_abstracts = [encode(document) for document in dataset["train"]["abstract"]]
    test_abstracts = [encode(document) for document in dataset["test"]["abstract"]]
    val_abstracts = [encode(document) for document in dataset["val"]["abstract"]]

    return {"train": (train_articles, train_abstracts),
            "test": (test_articles, test_abstracts),
            "val": (val_articles, val_abstracts)}

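if __name__ == "__main__":
    # load_data is not defined in this snippet; presumably it is the asker's
    # own loader returning a DatasetDict with "train", "test" and "val" splits.
    dataset = load_data("./train/", "./test/", "./val/", "./.cache_dir")
    tokenized_data = tokenized_dataset(dataset)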
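
So in the dictionary the keys are just strings, but the values are all lists of strings. Instead of having each value be a list of strings, would it be better to create a list of function objects? That would make the dictionary lighter. How can I do that?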
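One way to read "a list of function objects" is deferred tokenization: store one callable per document and run the tokenizer only when a document is actually needed. The sketch below illustrates that idea under stated assumptions. `fake_encode` and `lazy_split` are hypothetical names, and `fake_encode` is a whitespace splitter standing in for the real `tokenizer(...)` call so that the example runs without downloading a model.

```python
from functools import partial


def fake_encode(document):
    """Hypothetical stand-in for the real tokenizer call: splits on whitespace."""
    return document.split()


def lazy_split(documents, encode):
    """Return one zero-argument callable per document; nothing is encoded yet."""
    return [partial(encode, document) for document in documents]


articles = ["the cat sat", "tesco announced a loss"]
lazy_articles = lazy_split(articles, fake_encode)

# Encoding happens only when a callable is invoked, e.g. inside the training loop.
first_encoded = lazy_articles[0]()
print(first_encoded)  # ['the', 'cat', 'sat']
```

With the code from the question, `encode` would take the place of `fake_encode`. This keeps only the raw strings in memory, at the price of paying the tokenization cost during training, so whether it actually speeds anything up would need to be measured.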

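Edit

To me, there is a difference between copying a list of strings and copying a list of objects. Copying objects only copies references, whereas copying a list of strings copies everything. Copying references is therefore much faster. That is the point of this question.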

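I don't know what you mean by "heavy" and "light" here, nor what burden "dragging that dictionary" would put on the system. If you need to store all those words, then you need to store all those words. Whether it is a list of strings or a list of objects containing lists of strings makes no difference to performance.

@TimRoberts The difference seems to be when I copy a list of strings versus a list of objects. Copying objects only copies references, whereas copying a list of strings copies everything. Copying references is therefore much faster. That is the point of the question.

Copying a list of strings does not copy everything. It is exactly the same: it makes a new reference to the existing list.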
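The last comment can be checked directly with the `is` operator, which tests object identity. The snippet below is a small illustration with made-up sentences, not the asker's data: plain assignment creates another name for the same list, a shallow copy creates a new outer list whose elements are the same objects, and only `copy.deepcopy` builds new inner lists.

```python
import copy

sentences = ["first sentence .", "second sentence ."]
documents = [sentences]

alias = documents                # another name for the same outer list
shallow = list(documents)        # new outer list, same inner list objects
deep = copy.deepcopy(documents)  # new outer list and new inner lists

print(alias is documents)                # True
print(shallow is documents)              # False
print(shallow[0] is documents[0])        # True
print(shallow[0][0] is documents[0][0])  # True: the string itself is not copied
print(deep[0] is documents[0])           # False
```

So strings are handled like any other Python object here: passing the dictionary around or shallow-copying its lists moves references, not the text.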
[['eleven politicians from 7 parties made comments in letter to a newspaper .',
  "said dpp alison saunders had ` damaged public confidence ' in justice .",
  'ms saunders ruled lord janner unfit to stand trial over child abuse claims .',
  'the cps has pursued at least 19 suspected paedophiles with dementia .'],
 ['an increasing number of surveys claim to reveal what makes us happiest .',
  'but are these generic lists really of any use to us ?',
  'janet street-porter makes her own list - of things making her unhappy !'],
 ["author of ` into the wild ' spoke to five rape victims in missoula , montana .",
  "` missoula : rape and the justice system in a college town ' was released april 21 .",
  "three of five victims profiled in the book sat down with abc 's nightline wednesday night .",
  'kelsey belnap , allison huguet and hillary mclaughlin said they had been raped by university of montana football '
  'players .',
  "huguet and mclaughlin 's attacker , beau donaldson , pleaded guilty to rape in 2012 and was sentenced to 10 years .",
  'belnap claimed four players gang-raped her in 2010 , but prosecutors never charged them citing lack of probable '
  'cause .',
  'mr krakauer wrote book after realizing close friend was a rape victim .'],
 ['tesco announced a record annual loss of £ 6.38 billion yesterday .',
  'drop in sales , one-off costs and pensions blamed for financial loss .',
  'supermarket giant now under pressure to close 200 stores nationwide .',
  'here , retail industry veterans , plus mail writers , identify what went wrong .'],
 ...,
 ['snp leader said alex salmond did not field questions over his family .',
  "said she was not ` moaning ' but also attacked criticism of women 's looks .",
  'she made the remarks in latest programme profiling the main party leaders .',
  'ms sturgeon also revealed her tv habits and recent image makeover .',
  'she said she relaxed by eating steak and chips on a saturday night .']]