Python document similarity (Python, NLP, gensim)

python nlp

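For each document in a list of 10,000 documents, I am trying to find the related documents from that same set of 10,000. I am testing two approaches: gensim LSI and gensim similarity. Both give terrible results. How can I improve this?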

from gensim import corpora, models, similarities
from nltk.corpus import stopwords
import re

def cleanword(word):
    return re.sub(r'\W+', '', word).strip()

def create_corpus(documents):

    # remove common words and tokenize
    stoplist = stopwords.words('english')
    stoplist.append('')
    texts = [[cleanword(word) for word in document.lower().split() if cleanword(word) not in stoplist]
             for document in documents]

    # remove words that appear only once
    all_tokens = sum(texts, [])
    tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)

    texts = [[word for word in text if word not in tokens_once] for text in texts]

    dictionary = corpora.Dictionary(texts)
    corp = [dictionary.doc2bow(text) for text in texts]

def create_lsi(documents):

    corp = create_corpus(documents)
    # extract 400 LSI topics; use the default one-pass algorithm
    lsi = models.lsimodel.LsiModel(corpus=corp, id2word=dictionary, num_topics=400)
    # print the most contributing words (both positively and negatively) for each of the first ten topics
    lsi.print_topics(10)

def create_sim_index(documents):
    corp = create_corpus(documents)
    index = similarities.Similarity('/tmp/tst', corp, num_features=12)
    return index

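It looks like you are not using create_lsi() at all? You only print the LSI model you created and then forget about it.

And where does the number in num_features=12 come from? For bag-of-words vectors it should be num_features=len(dictionary); for LSI vectors it should be num_features=lsi.num_topics.

Add a TF-IDF transformation before LSI.

See the gensim tutorial, which goes through these steps in more detail and with commentary.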
You need to use a different machine learning algorithm, for example clustering (k-means) with cosine similarity.
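To make the cosine similarity part of that suggestion concrete, here is a small standard-library sketch. It is not gensim's implementation and it leaves out k-means entirely. The function names tfidf_vectors and cosine, the plain log(N/df) weighting, and the three toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return one sparse {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        vectors.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy documents, already tokenized.
docs = [
    "graph minors survey".split(),
    "graph of paths in trees".split(),
    "user interface management system".split(),
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Documents that share no terms get a similarity of exactly zero here, which is one reason a reduced space such as LSI is often layered on top of TF-IDF.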
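LSI is meant for large text datasets. Singular value decomposition is used to build a matrix of related terms in a reduced space. In the gensim package, you can get the semantically most similar terms by returning only the top n terms:
lsimodel.print_topics(10, num_words=5)
where 10 is the number of topics and 5 is the number of top terms shown per topic.
This way you can cut down on irrelevant terms.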
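First of all, you cannot expect too much from purely unsupervised statistical methods such as LSI or LDA. Try tf-idf, cosine similarity, a stronger stopword list, or other clustering methods (e.g. k-means).

No, the approach is fine. It is just the jumble of copy-pasted code that is causing the trouble here @alvas :)

@Radim can gensim be used together with Solr/ElasticSearch?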