What does tensorflow.keras.preprocessing.text.Tokenizer.texts_to_matrix do?

Please explain what this method does and what its result means. Here is the code in question:
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(oov_token="<OOV>")
text = 'The fool doth think he is wise, but the wise man knows himself to be a fool.'
sentences = [text]
print(sentences)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)
matrix = tokenizer.texts_to_matrix(sentences)
print(word_index)
print(sequences)
print(matrix)
---
['The fool doth think he is wise, but the wise man knows himself to be a fool.']
# word_index
{'<OOV>': 1, 'the': 2, 'fool': 3, 'wise': 4, 'doth': 5, 'think': 6, 'he': 7, 'is': 8, 'but': 9, 'man': 10, 'knows': 11, 'himself': 12, 'to': 13, 'be': 14, 'a': 15}
# sequences
[[2, 3, 5, 6, 7, 8, 4, 9, 2, 4, 10, 11, 12, 13, 14, 15, 3]]
# matrix
[[0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
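For readers without TensorFlow installed, the mapping from raw text to word_index and sequences can be sketched in plain Python. This is a simplified illustration, not Keras's actual implementation (the real Tokenizer also supports configurable filters, num_words limits, and more), but with these inputs it happens to reproduce the output shown above:

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    # Count word frequencies (lowercased, basic punctuation stripped), then
    # assign indices by descending frequency, starting after the OOV token.
    counts = Counter()
    for t in texts:
        counts.update(t.lower().replace(",", " ").replace(".", " ").split())
    word_index = {oov_token: 1}
    for i, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = i
    return word_index

def texts_to_sequences(texts, word_index, oov_token="<OOV>"):
    # Replace each word with its index; unknown words map to the OOV index.
    oov = word_index[oov_token]
    return [[word_index.get(w, oov)
             for w in t.lower().replace(",", " ").replace(".", " ").split()]
            for t in texts]

text = 'The fool doth think he is wise, but the wise man knows himself to be a fool.'
word_index = build_word_index([text])
print(word_index)
print(texts_to_sequences([text], word_index))
```

Ties in frequency are broken by first appearance (Python's sort is stable), which is why 'the', 'fool', and 'wise' get indices 2, 3, and 4.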
In binary mode (the default), the matrix indicates which words from the learned vocabulary appear in the input text. You trained your tokenizer on

['The fool doth think he is wise, but the wise man knows himself to be a fool.']

so when you transform that same text to a matrix, every word is marked with a 1 except <OOV>: since every word is known, the resulting vector has a 0 at position 1 (the <OOV> slot, see word_index). And since words are enumerated starting from 1, position 0 is always 0.
Some examples:

tokenizer.texts_to_matrix(['foo'])
# only OOV in this text
array([[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

tokenizer.texts_to_matrix(['he he'])
# a known word, twice (how often does not matter in binary mode)
array([[0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.]])

tokenizer.texts_to_matrix(['the fool'])
array([[0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
The other modes make this clearer:

- count: how many times each vocabulary word occurs in the text

tokenizer.texts_to_matrix(['He he the fool'], mode="count")
array([[0., 0., 1., 1., 0., 0., 0., 2., 0., 0., 0., 0., 0., 0., 0., 0.]])

- freq: the counts, normalized so each row sums to 1.0

tokenizer.texts_to_matrix(['he he the fool'], mode="freq")
array([[0., 0., 0.25, 0.25, 0., 0., 0., 0.5, 0., 0., 0., 0., 0., 0., 0., 0.]])

- tfidf: TF-IDF weighting

tokenizer.texts_to_matrix(['he he the fool'], mode="tfidf")
array([[0., 0., 0.84729786, 0.84729786, 0., 0., 0., 1.43459998, 0., 0., 0., 0., 0., 0., 0., 0.]])
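The binary/count/freq behavior described above can be sketched in a few lines of NumPy. This is a simplified reimplementation for illustration only (not Keras's actual code, which also implements the tfidf mode and text filtering), assuming a word_index like the one built earlier:

```python
import numpy as np

def texts_to_matrix(texts, word_index, mode="binary", oov_token="<OOV>"):
    # One row per text; column j corresponds to the word with index j in
    # word_index. Column 0 is unused because word indices start at 1.
    num_words = len(word_index) + 1
    matrix = np.zeros((len(texts), num_words))
    oov_index = word_index[oov_token]
    for row, t in enumerate(texts):
        for w in t.lower().replace(",", " ").replace(".", " ").split():
            # Unknown words fall back to the OOV column.
            matrix[row, word_index.get(w, oov_index)] += 1
        if mode == "binary":
            # Presence/absence only: clip counts to 0 or 1.
            matrix[row] = (matrix[row] > 0).astype(float)
        elif mode == "freq":
            # Normalize counts so the row sums to 1.0.
            total = matrix[row].sum()
            if total:
                matrix[row] /= total
        # mode == "count" keeps the raw counts.
    return matrix

word_index = {'<OOV>': 1, 'the': 2, 'fool': 3, 'wise': 4, 'doth': 5,
              'think': 6, 'he': 7, 'is': 8, 'but': 9, 'man': 10, 'knows': 11,
              'himself': 12, 'to': 13, 'be': 14, 'a': 15}
print(texts_to_matrix(['he he the fool'], word_index, mode="count"))
```

In count mode, 'he he the fool' yields 2 in the 'he' column (index 7) and 1 in the 'the' and 'fool' columns (indices 2 and 3), matching the example output above.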