What does tensorflow.keras.preprocessing.text.Tokenizer.texts_to_matrix do?

Please explain what this method does and what its result means. Here is the code in question:
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(oov_token="<OOV>")
text = 'The fool doth think he is wise, but the wise man knows himself to be a fool.'
sentences = [text]
print(sentences)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)
matrix = tokenizer.texts_to_matrix(sentences)
print(word_index)
print(sequences)
print(matrix)
---
['The fool doth think he is wise, but the wise man knows himself to be a fool.']
# word_index
{'<OOV>': 1, 'the': 2, 'fool': 3, 'wise': 4, 'doth': 5, 'think': 6, 'he': 7, 'is': 8, 'but': 9, 'man': 10, 'knows': 11, 'himself': 12, 'to': 13, 'be': 14, 'a': 15}
# sequences
[[2, 3, 5, 6, 7, 8, 4, 9, 2, 4, 10, 11, 12, 13, 14, 15, 3]]
# matrix
[[0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
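For readers without TensorFlow installed, the mapping from raw text to word_index and sequences can be sketched in plain Python. This is a simplified illustration, not Keras's actual implementation (the real Tokenizer also supports configurable filters, num_words limits, and more), but with these inputs it happens to reproduce the output shown above:

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    # Count word frequencies (lowercased, basic punctuation stripped), then
    # assign indices by descending frequency, starting after the OOV token.
    counts = Counter()
    for t in texts:
        counts.update(t.lower().replace(",", " ").replace(".", " ").split())
    word_index = {oov_token: 1}
    for i, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = i
    return word_index

def texts_to_sequences(texts, word_index, oov_token="<OOV>"):
    # Replace each word with its index; unknown words map to the OOV index.
    oov = word_index[oov_token]
    return [[word_index.get(w, oov)
             for w in t.lower().replace(",", " ").replace(".", " ").split()]
            for t in texts]

text = 'The fool doth think he is wise, but the wise man knows himself to be a fool.'
word_index = build_word_index([text])
print(word_index)
print(texts_to_sequences([text], word_index))
```

Ties in frequency are broken by first appearance (Python's sort is stable), which is why 'the', 'fool', and 'wise' get indices 2, 3, and 4.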
In binary mode (the default), the matrix indicates which words from the learned vocabulary appear in the input text. You trained your tokenizer on

['The fool doth think he is wise, but the wise man knows himself to be a fool.']

so when you transform that same text to a matrix, every word is marked with a 1 except <OOV>: since every word is known, the resulting vector has a 0 at position 1 (the <OOV> slot, see word_index). And since words are enumerated starting from 1, position 0 is always 0.
Some examples:

tokenizer.texts_to_matrix(['foo'])
# only OOV in this text
array([[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

tokenizer.texts_to_matrix(['he he'])
# a known word, twice (how often does not matter in binary mode)
array([[0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.]])

tokenizer.texts_to_matrix(['the fool'])
array([[0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
The other modes make this clearer:

- count: how many times each vocabulary word occurs in the text

tokenizer.texts_to_matrix(['He he the fool'], mode="count")
array([[0., 0., 1., 1., 0., 0., 0., 2., 0., 0., 0., 0., 0., 0., 0., 0.]])

- freq: the counts, normalized so each row sums to 1.0

tokenizer.texts_to_matrix(['he he the fool'], mode="freq")
array([[0., 0., 0.25, 0.25, 0., 0., 0., 0.5, 0., 0., 0., 0., 0., 0., 0., 0.]])

- tfidf: TF-IDF weighting

tokenizer.texts_to_matrix(['he he the fool'], mode="tfidf")
array([[0., 0., 0.84729786, 0.84729786, 0., 0., 0., 1.43459998, 0., 0., 0., 0., 0., 0., 0., 0.]])
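The binary/count/freq behavior described above can be sketched in a few lines of NumPy. This is a simplified reimplementation for illustration only (not Keras's actual code, which also implements the tfidf mode and text filtering), assuming a word_index like the one built earlier:

```python
import numpy as np

def texts_to_matrix(texts, word_index, mode="binary", oov_token="<OOV>"):
    # One row per text; column j corresponds to the word with index j in
    # word_index. Column 0 is unused because word indices start at 1.
    num_words = len(word_index) + 1
    matrix = np.zeros((len(texts), num_words))
    oov_index = word_index[oov_token]
    for row, t in enumerate(texts):
        for w in t.lower().replace(",", " ").replace(".", " ").split():
            # Unknown words fall back to the OOV column.
            matrix[row, word_index.get(w, oov_index)] += 1
        if mode == "binary":
            # Presence/absence only: clip counts to 0 or 1.
            matrix[row] = (matrix[row] > 0).astype(float)
        elif mode == "freq":
            # Normalize counts so the row sums to 1.0.
            total = matrix[row].sum()
            if total:
                matrix[row] /= total
        # mode == "count" keeps the raw counts.
    return matrix

word_index = {'<OOV>': 1, 'the': 2, 'fool': 3, 'wise': 4, 'doth': 5,
              'think': 6, 'he': 7, 'is': 8, 'but': 9, 'man': 10, 'knows': 11,
              'himself': 12, 'to': 13, 'be': 14, 'a': 15}
print(texts_to_matrix(['he he the fool'], word_index, mode="count"))
```

In count mode, 'he he the fool' yields 2 in the 'he' column (index 7) and 1 in the 'the' and 'fool' columns (indices 2 and 3), matching the example output above.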