Python: Tokenizing & encoding dataset uses too much RAM

python nlp pytorch

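I am trying to tokenize and encode data to feed to a neural network.

I only have 25 GB of RAM, and every time I run the code below my Google Colab session crashes. Any idea how to prevent this from happening? The error is: "Your session crashed after using all available RAM".

I thought tokenizing/encoding chunks of 50,000 sentences would work, but unfortunately it does not. The code works on a dataset of 1.3 million sentences. The current dataset has 5 million.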

max_q_len = 128
max_a_len = 64
trainq_list = train_q.tolist()
batch_size = 50000

def batch_encode(text, max_seq_len):
    for i in range(0, len(trainq_list), batch_size):
        encoded_sent = tokenizer.batch_encode_plus(
            text,
            max_length = max_seq_len,
            pad_to_max_length=True,
            truncation=True,
            return_token_type_ids=False
        )
    return encoded_sent

# tokenize and encode sequences in the training set
tokensq_train = batch_encode(trainq_list, max_q_len)

The tokenizer comes from HuggingFace:

tokenizer = BertTokenizerFast.from_pretrained('bert-base-multilingual-uncased')

You should use generators and pass the data to tokenizer.batch_encode_plus, no matter the size.

Conceptually, something like this:

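Training list

This one probably contains a list of sentences read from some file(s). If this is a single large file, you can read parts of the input lazily (preferably batch_size lines at a time) instead of loading everything at once.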
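A minimal sketch of such lazy reading, assuming a plain-text file with one sentence per line. The function name read_lines_in_batches, the UTF-8 encoding, and the default batch size are assumptions for illustration, not part of the original answer:

```python
from itertools import islice
from typing import Iterator, List


def read_lines_in_batches(path: str, batch_size: int = 50000) -> Iterator[List[str]]:
    """Lazily yield lists of at most `batch_size` lines from a single file."""
    with open(path, "r", encoding="utf-8") as f:
        while True:
            # islice pulls only the next `batch_size` lines from the file iterator,
            # so the whole file is never held in memory at once
            batch = [line.rstrip("\n") for line in islice(f, batch_size)]
            if not batch:
                break
            yield batch
```

Each yielded list can be handed directly to the tokenizer, so only one batch of raw text is in memory at any time.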

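Otherwise, open one file at a time (each much smaller than memory, because the data will be far larger after encoding with BERT), something like this: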

import pathlib


def read_in_chunks(directory: pathlib.Path):
    # Use "*.txt" or any other extension your file might have
    for file in directory.glob("*"):
        with open(file, "r") as f:
            yield f.readlines()

Encoding

The encoder should take this generator and yield back the encoded parts, something like this:

from transformers import BertTokenizerFast


# Generator should create lists useful for encoding
def batch_encode(generator, max_seq_len):
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-uncased")
    for text in generator:
        yield tokenizer.batch_encode_plus(
            text,
            max_length=max_seq_len,
            padding="max_length",  # replaces the deprecated pad_to_max_length=True
            truncation=True,
            return_token_type_ids=False,
        )


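Saving encoded files
As the encoded data will be too large to fit in RAM, you should save it to disk (or consume it somehow as it is generated).
Something along these lines: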
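import numpy as np


# Assumes each item is the dict-like output of batch_encode_plus;
# adjust to PyTorch tensors or anything else if needed
def save(encoding_generator):
    for i, encoded in enumerate(encoding_generator):
        # One .npz file per batch, named 0.npz, 1.npz, ...
        np.savez(
            str(i),
            input_ids=np.asarray(encoded["input_ids"]),
            attention_mask=np.asarray(encoded["attention_mask"]),
        )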

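Great, thank you! In my case the file_object is already a DataFrame (df) loaded into the notebook, and it does not seem to use more than 1 GB of memory. So I need to write a generator that takes this df instead of a file object?

@Exa Yes, take this df and yield parts of it (say 64 examples; the more the better, but keep the RAM constraints in mind), probably as a list.

OK, thanks! Do you think it makes sense to move tokenization and encoding into the training loop? So instead of having a separate function as above, include it in something like run_training().

Code is usually split into many small functions because that is easier to understand, so I don't think so.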
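For the in-memory case discussed in the comments, a generator over slices might look like the sketch below. The name iter_batches and the default of 64 are hypothetical; it assumes the sentences are available as a list, for example via train_q.tolist() as in the question:

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int = 64) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        # Only the current slice is copied, not the whole dataset
        yield list(items[start:start + batch_size])
```

Something like batch_encode(iter_batches(trainq_list), max_q_len) would then encode one small batch at a time, provided the results are saved or consumed as they are produced rather than collected in a list.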