
Python: get the top N most similar vectors from a gensim SparseMatrixSimilarity index for a specific query


I have looked through the gensim documentation, but I am stuck on the following problems:

- Once I have an m x j similarity matrix, where m is the number of documents and j is the total number of unique words, I do not know how to extract the N most similar documents.

- The long-term goal is to store and save the results in xlsx or csv format, but that is a separate question.

Here is the documentation example that uses Similarity:

from gensim.test.utils import common_corpus, common_dictionary, get_tmpfile
from gensim.similarities import Similarity

index_tmpfile = get_tmpfile("index")

query = [(1, 2), (6, 1), (7, 2)]  # a query document in bag-of-words format: (token_id, count)

index = Similarity(index_tmpfile, common_corpus, num_features=len(common_dictionary))  # build the index

similarities = index[query]  # get similarities between the query and all index documents
After that, how do I get the 10 most similar documents?

Expected output:

arr = [('This is the most similar', 0.99), ('This is one of the most similar', 0.98), ('This is another very similar doc', 0.98)]
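
For what it is worth, one straightforward way to get the top N hits from that similarities array is to sort the scores with numpy and map the winning indices back to the texts. A minimal sketch, assuming documents is a hypothetical list of the raw document strings in the same order as the indexed corpus:

import numpy as np

top_n = 10
# similarities is the 1-D array returned by index[query]:
# one cosine score per indexed document, in corpus order
best_idx = np.argsort(similarities)[::-1][:top_n]  # indices of the highest scores, descending

# documents (assumed here) holds the raw texts, aligned with the corpus order
arr = [(documents[i], float(similarities[i])) for i in best_idx]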

In my particular case, I have the following code using SparseMatrixSimilarity:

from gensim import corpora, similarities
from gensim.models import TfidfModel
from nltk.tokenize import regexp_tokenize

# type the query here
query = 'house is clean'
tokenized_query = regexp_tokenize(query, r"\w+")

# Create a corpora.Dictionary() object
dictionary = corpora.Dictionary()
# Pass each tokenized document to dictionary.doc2bow()
# (tokenized_lines is the corpus: a list of token lists, one per document)
BoW_corpus = [dictionary.doc2bow(doc, allow_update=True) for doc in tokenized_lines]
feature_cnt = len(dictionary.token2id)

# n = raw term frequency
# t = zero-corrected idf
# c = cosine normalization
tfidf = TfidfModel(BoW_corpus, smartirs='ntc')

query_vector = dictionary.doc2bow(tokenized_query)

index = similarities.SparseMatrixSimilarity(tfidf[BoW_corpus], num_features=feature_cnt)
# this way I get the similarity between the query and every indexed document
sim = index[tfidf[query_vector]]
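
From here, the same top-N question can also be approached through the num_best attribute that gensim's similarity indexes expose: when it is set, index[query] returns only the best (document index, score) pairs instead of a full score array. A minimal sketch, again assuming tokenized_lines is in index order, with a csv export toward the long-term goal (the file name is made up for illustration):

import csv

# With num_best set, index[...] returns a list of (doc_index, similarity)
# pairs sorted by score instead of a dense array of all scores
index.num_best = 10
top_hits = index[tfidf[query_vector]]

# Map each index back to (an approximation of) its text and save as csv
with open('top_similar.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['document', 'similarity'])
    for doc_idx, score in top_hits:
        writer.writerow([' '.join(tokenized_lines[doc_idx]), score])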