Keras pad_sequences with string data type


I have a list of sentences and I want to pad them. But when I use the Keras pad_sequences function as follows:

from keras.preprocessing.sequence import pad_sequences
s = [["this", "is", "a", "book"], ["this", "is", "not"]]
g = pad_sequences(s, dtype='str', maxlen=10, value='_PAD_')

The result is:

array([['_', '_', '_', '_', '_', '_', 't', 'i', 'a', 'b'],
       ['_', '_', '_', '_', '_', '_', '_', 't', 'i', 'n']], dtype='<U1')

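The text should be converted to numeric values first. Keras provides the Tokenizer class and two methods, fit_on_texts and texts_to_sequences, for this purpose.
See the Keras documentation for details.
Tokenizer: vectorizes a text corpus by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or a vector whose coefficient for each token can be binary or based on word count.
fit_on_texts: creates a vocabulary index based on word frequency.
texts_to_sequences: converts each text in texts to a sequence of integers.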
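from keras.preprocessing import text, sequence

s = ["this", "is", "a", "book", "of my choice"]
# Limit the vocabulary by frequency (num_words) and lowercase the input
tokenizer = text.Tokenizer(num_words=100, lower=True)
# Build the word -> integer index from the corpus
tokenizer.fit_on_texts(s)
# Replace each word with its integer index
seq_token = tokenizer.texts_to_sequences(s)
# Pad every sequence on the left with 0 up to length 10
g = sequence.pad_sequences(seq_token, maxlen=10)
g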

Output:
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 2],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 3],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 4],
       [0, 0, 0, 0, 0, 0, 0, 5, 6, 7]], dtype=int32)
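
To make the two steps concrete without depending on Keras, here is a rough pure-Python sketch of the same idea: build a frequency-ranked word index, map each text to integers, and left-pad with 0. The helpers build_word_index and to_padded_sequences are hypothetical names made up for this illustration, and the sketch only splits on whitespace, so it does not reproduce the Tokenizer's punctuation filtering or its other options.

```python
from collections import Counter

def build_word_index(texts):
    """Hypothetical helper: rank words by frequency, starting at index 1.

    Index 0 is left free so it can serve as the padding value.
    """
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_padded_sequences(texts, word_index, maxlen):
    """Hypothetical helper: map words to ids, then pad on the left with 0."""
    rows = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        ids = ids[-maxlen:]  # if too long, keep only the last maxlen ids
        rows.append([0] * (maxlen - len(ids)) + ids)
    return rows

s = ["this", "is", "a", "book", "of my choice"]
word_index = build_word_index(s)
padded = to_padded_sequences(s, word_index, maxlen=10)

for row in padded:
    print(row)
```

For this small input the rows should match the array shown above, since every word occurs once and is therefore numbered in order of first appearance.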

Change dtype to object and it will do the job for you:
from keras.preprocessing.sequence import pad_sequences

s = [["this", "is", "a", "book"], ["this", "is", "not"]]
g = pad_sequences(s, dtype=object, maxlen=10, value='_PAD_')
print(g)

Output:
array([['_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', 'this',
        'is', 'a', 'book'],
       ['_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_',
        'this', 'is', 'not']], dtype=object)
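
As far as I can tell, the difference comes down to how NumPy stores strings. The question's output has dtype='<U1', a fixed-width unicode type with room for a single character per cell, so every token and the '_PAD_' value get cut to their first character. An object array holds references to whole Python strings instead. A minimal NumPy-only sketch of that behaviour (it does not call Keras, and the variable names are just for illustration):

```python
import numpy as np

tokens = ["this", "is", "_PAD_"]

# Fixed-width unicode, one character per cell: the dtype seen in the question
fixed = np.array(tokens, dtype="<U1")

# Object dtype: each cell references the full Python string
flexible = np.array(tokens, dtype=object)

print(fixed.tolist())     # ['t', 'i', '_']
print(flexible.tolist())  # ['this', 'is', '_PAD_']
```

Note that an object array is fine for inspecting or passing tokens along, but a model still needs numeric input, so the integer-encoding approach remains necessary before training.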
