Preprocessing text with one_hot in Keras


I came across this code while learning Keras online:

from keras.preprocessing.text import one_hot
from keras.preprocessing.text import text_to_word_sequence

text = 'One hot encoding in Keras'
tokens = text_to_word_sequence(text)
length = len(tokens)
one_hot(text, length)

This returns a list of integers like this:

[3,1,1,2,3]

I don't understand why unique words return repeated numbers. For example, 3 and 1 are repeated even though every word in the text is unique.

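The documentation for one_hot describes it as a wrapper around the hashing trick:

"This is a wrapper to the hashing_trick function using hash as the hashing function; unicity of word to index mapping non-guaranteed."

From the documentation for hashing_trick:

"Two or more words may be assigned to the same index, due to possible collisions by the hashing function. The probability of a collision is in relation to the dimension of the hashing space and the number of distinct objects."

Because hashing is used, different words can be hashed to the same index. The probability of a collision depends on the chosen vocabulary size n: the smaller n is relative to the number of distinct words, the more likely it is that two words share an index.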
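To see why collisions appear, here is a minimal sketch of the hashing trick in plain Python. It is not the Keras implementation: the function name hashing_trick_sketch is hypothetical, and md5 is swapped in for Python's built-in hash (the default used by one_hot) so that results are stable between runs. Like hashing_trick, it maps each word to an index between 1 and n - 1, so with n equal to the number of words there are fewer indices than words and at least one index has to repeat.

```python
import hashlib


def hashing_trick_sketch(words, n):
    """Map each word to an index in [1, n - 1] using a stable md5 hash.

    Illustrative stand-in for a hashing-trick encoder, not the Keras code.
    """
    indices = []
    for word in words:
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        # Index 0 is left unused, so only n - 1 distinct values are available.
        indices.append(int(digest, 16) % (n - 1) + 1)
    return indices


words = "one hot encoding in keras".split()
indices = hashing_trick_sketch(words, n=len(words))
print(indices)
# Five distinct words, four possible indices: a repeat is unavoidable.
print(len(set(indices)) < len(words))
```

The second print shows True for any hash function, because five words cannot fit into four indices without a collision.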
Jason Brownlee suggests using a vocabulary size about 25% larger than the number of words, to reduce the chance of hash collisions.
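How much a larger hash space helps can be estimated with a birthday-problem calculation. The sketch below assumes an ideal hash that spreads words uniformly and independently over the available indices, which is only an approximation of any real hash function. The function name collision_probability is hypothetical, and the n - 1 slot count follows the index range described for hashing_trick.

```python
import math


def collision_probability(num_words, num_slots):
    """Chance that at least two distinct words share a slot (uniform hash assumed)."""
    if num_words > num_slots:
        return 1.0  # more words than slots: a collision is certain
    p_all_unique = 1.0
    for i in range(num_words):
        p_all_unique *= (num_slots - i) / num_slots
    return 1.0 - p_all_unique


num_words = 5
for factor in (1.0, 1.25, 2.0, 10.0):
    n = math.ceil(num_words * factor)
    slots = n - 1  # indices fall in [1, n - 1]
    p = collision_probability(num_words, slots)
    print(f"n={n:3d} slots={slots:3d} P(collision)={p:.3f}")
```

Under this assumption the collision probability falls as n grows, but for a handful of words a 25% margin reduces collisions rather than ruling them out.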
Applying Jason Brownlee's suggestion to your case gives:
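from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.text import text_to_word_sequence
from tensorflow.random import set_random_seed  # TensorFlow 1.x name; TensorFlow 2.x uses tf.random.set_seed
import math

set_random_seed(1)
text = 'One hot encoding in Keras'
tokens = text_to_word_sequence(text)
length = len(tokens)
# Use a hash space 25% larger than the number of words
print(one_hot(text, math.ceil(length * 1.25)))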

This returns the integers:
[3,4,5,1,6]
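If every word must get its own index, an explicit vocabulary avoids hashing altogether. The following is a plain-Python sketch of that alternative, not part of one_hot or any Keras API; the function name build_vocabulary is hypothetical.

```python
def build_vocabulary(words):
    """Assign each distinct word the next unused index, starting at 1."""
    vocabulary = {}
    for word in words:
        if word not in vocabulary:
            vocabulary[word] = len(vocabulary) + 1
    return vocabulary


words = "one hot encoding in keras".split()
vocabulary = build_vocabulary(words)
encoded = [vocabulary[word] for word in words]
print(encoded)  # [1, 2, 3, 4, 5]
```

The trade-off is that the mapping has to be stored and reused for new text, whereas the hashing approach needs no stored vocabulary.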