Python 3.x Pytorch error "RuntimeError: index out of range: Tried to access index 512 out of table with 511 rows"

Tags: python-3.x, pytorch, vectorization, word-embedding, huggingface-transformers

I have some sentences that I vectorize with the sentence_vector() method of the biobert-embedding Python module. For some groups of sentences I have no problem, but for some others I get the following error message:

File "/home/nobunaga/.local/lib/python3.6/site-packages/biobert_embedding/embedding.py", line 133, in sentence_vector
  encoded_layers = self.eval_fwdprop_biobert(tokenized_text)
File "/home/nobunaga/.local/lib/python3.6/site-packages/biobert_embedding/embedding.py", line 82, in eval_fwdprop_biobert
  encoded_layers, _ = self.model(tokens_tensor, segments_tensors)
File "/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
  result = self.forward(*input, **kwargs)
File "/home/nobunaga/.local/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling.py", line 730, in forward
  embedding_output = self.embeddings(input_ids, token_type_ids)
File "/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
  result = self.forward(*input, **kwargs)
File "/home/nobunaga/.local/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling.py", line 268, in forward
  position_embeddings = self.position_embeddings(position_ids)
File "/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
  result = self.forward(*input, **kwargs)
File "/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 114, in forward
  self.norm_type, self.scale_grad_by_freq, self.sparse)
File "/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/functional.py", line 1467, in embedding
  return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: index out of range: Tried to access index 512 out of table with 511 rows. at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:237

I found that for some groups of sentences the problem was related to certain tags. But for others, even with the tags removed, the error message was still there.
(Unfortunately, I can't share the code for confidentiality reasons.)

Do you have any idea what the problem could be?

Thank you in advance.

EDIT: You are right, cronoik, it will be better with an example.

For example:

sentences = ["This is the first sentence.", "This is the second sentence.", "This is the third sentence."

biobert = BiobertEmbedding(model_path='./biobert_v1.1_pubmed_pytorch_model')

vectors = [biobert.sentence_vector(doc) for doc in sentences]

In my opinion, it is this last line of code that causes the error message.

Since the original BERT has a positional encoding of size 512 (0-511) and bioBERT is derived from BERT, it is no surprise to get an index error for 512. However, it is a bit strange that you can access 512 at all for some of the sentences you mentioned.
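
To see why, here is a minimal sketch in plain PyTorch (not the BERT model itself, just an embedding table of the same size as BERT's position embeddings): a table with 512 rows accepts indices 0-511, and index 512 is out of range.

import torch
import torch.nn as nn

# Sketch: a position-embedding table like BERT's, with 512 rows
# (valid positions 0-511) and hidden size 768.
position_embeddings = nn.Embedding(512, 768)

print(position_embeddings(torch.tensor([0, 1, 511])).shape)  # works: torch.Size([3, 768])

try:
    position_embeddings(torch.tensor([512]))   # position 512 does not exist
except (IndexError, RuntimeError) as err:      # older/newer PyTorch raise different types
    print(err)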

The problem is that the biobert-embedding module doesn't take care of the maximum sequence length of 512 (tokens, not words!). This is the relevant source. Please have a look at the following example to force the error you received:

from biobert_embedding.embedding import BiobertEmbedding
#This sentence has 385 words
sentence = "The near-ubiquity of ASCII was a great help, but failed to address international and linguistic concerns. The dollar-sign was not so useful in England, and the accented characters used in Spanish, French, German, and many other languages were entirely unavailable in ASCII (not to mention characters used in Greek, Russian, and most Eastern languages). Many individuals, companies, and countries defined extra characters as needed—often reassigning control characters, or using value in the range from 128 to 255. Using values above 128 conflicts with using the 8th bit as a checksum, but the checksum usage gradually died out. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data The near-ubiquity of ASCII was a great help, but failed to address international and linguistic concerns. The dollar-sign was not so useful in England, and the accented characters used in Spanish, French, German, and many other languages were entirely unavailable in ASCII (not to mention characters used in Greek, Russian, and most Eastern languages). Many individuals, companies, and countries defined extra characters as needed—often reassigning control"
longersentence = sentence + ' some'

biobert = BiobertEmbedding()
print('sentence has {} tokens'.format(len(biobert.process_text(sentence))))
#works
biobert.sentence_vector(sentence)
print('longersentence has {} tokens'.format(len(biobert.process_text(longersentence))))
#didn't work
biobert.sentence_vector(longersentence)
Output:

sentence has 512 tokens
longersentence has 513 tokens
#your error message....
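
In other words, anything that tokenizes to more than 512 tokens runs past the position table. If you want to check a text before embedding it, a small hypothetical helper (a sketch reusing the module's own process_text tokenizer and the variables from the example above) could look like this:

MAX_TOKENS = 512  # BERT's positional-encoding limit (positions 0-511)

def fits_bert_limit(text, biobert, max_tokens=MAX_TOKENS):
    # hypothetical helper: True if the tokenized text fits the position table
    return len(biobert.process_text(text)) <= max_tokens

print(fits_bert_limit(sentence, biobert))        # True  (512 tokens)
print(fits_bert_limit(longersentence, biobert))  # False (513 tokens)
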
What you should do is implement a sliding window approach to process these texts:

import torch
from biobert_embedding.embedding import BiobertEmbedding

maxtokens = 512
startOffset = 0
docStride = 200

sentence = "The near-ubiquity of ASCII was a great help, but failed to address international and linguistic concerns. The dollar-sign was not so useful in England, and the accented characters used in Spanish, French, German, and many other languages were entirely unavailable in ASCII (not to mention characters used in Greek, Russian, and most Eastern languages). Many individuals, companies, and countries defined extra characters as needed—often reassigning control characters, or using value in the range from 128 to 255. Using values above 128 conflicts with using the 8th bit as a checksum, but the checksum usage gradually died out. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data The near-ubiquity of ASCII was a great help, but failed to address international and linguistic concerns. The dollar-sign was not so useful in England, and the accented characters used in Spanish, French, German, and many other languages were entirely unavailable in ASCII (not to mention characters used in Greek, Russian, and most Eastern languages). Many individuals, companies, and countries defined extra characters as needed—often reassigning control"
longersentence = sentence + ' some'

sentences = [sentence, longersentence, 'small test sentence']
vectors = []
biobert = BiobertEmbedding()

#https://github.com/Overfitter/biobert_embedding/blob/b114e3456de76085a6cf881ff2de48ce868e6f4b/biobert_embedding/embedding.py#L127
def sentence_vector(tokenized_text, biobert):
    encoded_layers = biobert.eval_fwdprop_biobert(tokenized_text)

    # `encoded_layers` has shape [12 x 1 x n_tokens x 768]
    # `token_vecs` is a tensor with shape [n_tokens x 768]
    token_vecs = encoded_layers[11][0]

    # Calculate the average of all token vectors.
    sentence_embedding = torch.mean(token_vecs, dim=0)
    return sentence_embedding


for doc in sentences:
    #tokenize your text
    docTokens = biobert.process_text(doc)
    
    while startOffset < len(docTokens):
        print(startOffset)
        length = min(len(docTokens) - startOffset, maxtokens)

        #now we calculate the sentence_vector for the document slice
        vectors.append(sentence_vector(
                        docTokens[startOffset:startOffset+length]
                        , biobert)
                      )
        #stop when the whole document is processed (document has less than 512
        #or the last document slice was processed)
        if startOffset + length == len(docTokens):
            break
        startOffset += min(length, docStride)
    startOffset = 0
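
If you only need one vector per document and can afford to lose everything after the first 512 tokens, a simpler but lossy alternative (my assumption, not something the module offers) is to truncate the tokenized text before embedding it:

# Lossy alternative: keep only the first 512 tokens of each document.
truncated_vectors = [
    sentence_vector(biobert.process_text(doc)[:maxtokens], biobert)
    for doc in sentences
]

Note that the sliding-window loop above produces one vector per slice for long documents, so you still have to decide how to combine them (for example by averaging) if you want a single vector per document.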