Python: extracting token frequencies from a gensim model

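There are existing questions about retrieving vocabulary frequencies from a gensim word2vec model.

For some reason, the approach from their answers just gives me a counter that steps down from n (the vocabulary size) toward 0, with the most frequent tokens first.

For example: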

for idx, w in enumerate(model.vocab):
    print(idx, w, model.vocab[w].count)
gives:

0 </s> 111051
1 . 111050
2 , 111049
3 the 111048
4 of 111047
...
111050 tokiwa 2
111051 muzorewa 1

Why does it do this? Given a word, how do I extract its term frequency from the model?

Those answers are correct for reading the declared token counts out of a model that has them.
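As a rough illustration of reading a declared count, here is a minimal sketch of a lookup helper. The name token_count is hypothetical. It assumes that Gensim 4.x exposes counts through get_vecattr(word, "count") on the vectors object, while older releases keep them in a vocab dict whose entries have a .count attribute, reached through model.wv on a full model.

```python
def token_count(model, word):
    """Return the stored count for `word` (hypothetical helper, not a Gensim API)."""
    # Full Word2Vec models keep their vectors in .wv; a bare vectors
    # object (or a very old model) is used directly.
    kv = getattr(model, "wv", model)
    if hasattr(kv, "get_vecattr"):
        # Assumed Gensim 4.x style: per-key attributes live on the vectors.
        return kv.get_vecattr(word, "count")
    # Assumed older style: vocab maps each word to an object with .count.
    return kv.vocab[word].count
```

Whether the number returned is a real corpus frequency or a synthesized placeholder depends on how the model was created, which is what the rest of this answer is about.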

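But in some cases your model may have been initialized with only fake counts that descend by 1 per word. With Gensim, this is most likely when the model was loaded from a source where counts were either unavailable or unused.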

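In particular, if you created the model using load_word2vec_format(), that simple vectors-only format (whether binary or plain text) inherently contains no word counts. By convention, though, the words in such files are almost always sorted from most frequent to least frequent.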

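So when frequencies are absent, Gensim has chosen to synthesize fake counts: linearly descending int values, where the (first) most frequent word starts with a count equal to the number of unique words, and the (last) least frequent word has a count of 1.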
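The scheme just described can be sketched in a few lines. This is an illustration of the idea only, not Gensim's actual code, and the function name synthetic_counts is made up for the example.

```python
def synthetic_counts(words):
    """Assign linearly descending placeholder counts to a frequency-sorted word list."""
    n = len(words)
    # The first word gets n, the next n - 1, and so on down to 1 for the last.
    return {word: n - index for index, word in enumerate(words)}


if __name__ == "__main__":
    print(synthetic_counts(["</s>", ".", ",", "the"]))
```

The placeholder values keep the ranking of the words but say nothing about how much more often one word occurs than another.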

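(I'm not sure this is a good idea, but Gensim has been doing it for a while. It ensures that code relying on per-token counts won't break, and it preserves the original ordering, though obviously not the unknown original true proportions.)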

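In some cases, the original source of the file may have saved a separate .vocab file with the word frequencies alongside the word2vec_format vectors. (In Google's original word2vec.c code release, this is the file generated by the optional -save-vocab flag. In Gensim's .save_word2vec_format() method, the optional fvocab parameter can be used to generate this side file.)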

If so, you can supply that vocab frequency filename as the fvocab parameter when you call .load_word2vec_format(), and the resulting vector set will have true counts.
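To make the side file more concrete, here is a small sketch that parses one by hand. It assumes the file holds one "word count" pair per line separated by a space. The helper name read_vocab_counts, the file names in the comment, and the sample numbers are all invented for illustration. In practice you would let Gensim read the file through fvocab rather than parsing it yourself.

```python
def read_vocab_counts(lines):
    """Parse 'word count' lines (assumed vocab-file layout) into a dict."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last space so the count is always the final field.
        word, count = line.rsplit(" ", 1)
        counts[word] = int(count)
    return counts


# With Gensim the file is passed straight through instead, e.g. (hypothetical paths):
#   KeyedVectors.load_word2vec_format("vectors.bin", binary=True, fvocab="vectors.vocab")

if __name__ == "__main__":
    sample = ["the 1000\n", "of 600\n", "tokiwa 2\n"]  # made-up counts
    print(read_vocab_counts(sample))
```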

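If your word vectors were originally created in Gensim from a corpus that supplied actual frequencies, and they have always been saved and loaded with the Gensim native functions .save()/.load(), which use an extended form of Python pickling, then the original true count info will never have been lost.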

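If you have lost the original frequency data, but you know the data came from a real natural-language source and you want a more realistic (but still faked) set of frequencies, one option is to use a Zipfian distribution. (Real natural-language usage frequencies tend to roughly fit this "tall head, long tail" distribution.) A formula for creating more realistic dummy counts is available in a separate answer.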
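One possible way to build such Zipfian dummy counts is sketched below. This is not the formula from the answer mentioned above, which is not reproduced here. It is a plain Zipf's law with exponent 1, where the count at rank r is proportional to 1/r. The function name zipf_counts and the total_tokens parameter (a guess at the corpus size) are assumptions made for the example.

```python
def zipf_counts(n_words, total_tokens):
    """Return n_words dummy counts, most frequent first, following count ~ 1 / rank."""
    # Normalizing by the harmonic number makes the counts sum to roughly total_tokens.
    harmonic = sum(1.0 / rank for rank in range(1, n_words + 1))
    return [
        max(1, round(total_tokens / (rank * harmonic)))  # never drop below 1
        for rank in range(1, n_words + 1)
    ]


if __name__ == "__main__":
    print(zipf_counts(5, 1000))
```

The counts can then be assigned to the words in their stored order, on the assumption that the file kept the usual most-frequent-first sorting.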

Shouldn't you write model.wv.vocab?