
Python 3.x: gensim.similarities.docsim.Similarity returns empty results on query

Tags: python-3.x, nltk, jupyter-notebook, gensim

Everything seems to give correct results up until the last step, but my results array is always empty.

I am trying to compare 6 sets of notes following this tutorial:

So far I have:

# imports assumed by the snippet
import gensim
from nltk.tokenize import word_tokenize

# tokenize an array of all text
raw_docs = [Notes_0, Notes_1, Notes_2, Notes_3, Notes_4, Notes_5]
gen_docs = [[w.lower() for w in word_tokenize(text)]
            for text in raw_docs]

# create dictionary
dictionary_interactions = gensim.corpora.Dictionary(gen_docs)
print("Number of words in dictionary: ", len(dictionary_interactions))

# create a corpus (avoid shadowing gen_docs inside the comprehension)
corpus_interactions = [dictionary_interactions.doc2bow(doc) for doc in gen_docs]
len(corpus_interactions)

# convert to tf-idf model
tf_idf_interactions = gensim.models.TfidfModel(corpus_interactions)

# check for similarities between docs
sims_interactions = gensim.similarities.Similarity('C:/Users/JNproject',
                                                   tf_idf_interactions[corpus_interactions],
                                                   num_features=len(dictionary_interactions))

print(sims_interactions)
print(type(sims_interactions))
This gives the output:

Number of words in dictionary:  46364
Similarity index with 6 documents in 0 shards (stored under C:/Users/Jeremy Bice/JNprojects/Company/Interactions/sim_interactions)
<class 'gensim.similarities.docsim.Similarity'>
My output is:

['client', 'is']
[(335, 1), (757, 1)]
[]
array([ 0.,  0.,  0.,  0.,  0.,  0.], dtype=float32)

How can I get results here?

Depending on the content of your original documents, this may be correct behavior.

Although your query words appear in the original documents and in the dictionary, the code returns an empty tf_idf. tf_idf is computed as term_frequency * inverse_document_frequency, where the inverse_document_frequency is calculated as log(N/d), with N being your total number of documents and d the number of documents a particular term appears in.
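As a quick check of that formula (a sketch; gensim's default idf uses log base 2):

```python
import math

# idf(N, d) = log2(N / d): N documents in total, the term occurs in d of them
def idf(n_docs, doc_freq, base=2.0):
    return math.log(n_docs / doc_freq, base)

print(idf(6, 6))  # 0.0 -> a term in all 6 documents is dropped from tf-idf
print(idf(6, 4))  # ~0.585 -> a term in 4 of 6 documents keeps a small weight
```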

My guess is that your query words ['client', 'is'] appear in every one of your documents, so their inverse_document_frequency is 0 and the tf_idf list ends up empty. You can check this behavior with the documents below, which I took from the tutorial you mentioned and modified:

# original: commented out
# added arbitrary words 'now' and 'the' where missing, so they occur in each document

#raw_documents = ["I'm taking the show on the road.",
raw_documents = ["I'm taking the show on the road now.",
#                 "My socks are a force multiplier.",
                 "My socks are the force multiplier now.",
#                 "I am the barber who cuts everyone's hair who doesn't cut their own.",
                 "I am the barber who cuts everyone's hair who doesn't cut their own now.",
#                 "Legend has it that the mind is a mad monkey.",
                 "Legend has it that the mind is a mad monkey now.",
#                 "I make my own fun."]
                 "I make my own the fun now."]
If you query with

query_doc = [w.lower() for w in word_tokenize("the now")]
you get

['the', 'now']
[(3, 1), (8, 1)]
[]
[0. 0. 0. 0. 0.]

Thanks for your help. I think I understand what you mean, but just to clarify: you think the problem may be that the phrase "client is" occurs in every document and therefore produces values of 0? I'm almost certain there is at least one document in which none of the phrases occur. I also found a phrase that occurs in only 4 of the 6 documents, but that leads to a similar problem.

Note that you are not querying a phrase (words in a particular order) but a list of tokens whose order doesn't matter. So yes, if 'client' and 'is' occur in all 6 documents, the tfidf of that query will be 0. As a test, find a word that occurs in at least one document but not in all of them, and add that word to your query. See whether that gets you some results; it should.

I found a phrase ("requesting a refund") that occurs in only 4 documents and got the same result. I then made a test set of distinct documents and ran it, and the code worked fine. I guess the problem is not the code but the issue you just mentioned, so this may not be a useful tool for my case. Thanks a lot for your help!