Matching a set of words against a set of sentences in Python NLP

python nlp

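I have a use case where I want to match a list of words against a list of sentences and return the most relevant sentences.

I am working in Python. So far I have tried KMeans, where we cluster a set of documents and then predict which cluster a sentence falls into. In my case, though, I already have a list of words available.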

def getMostRelevantSentences():
    Sentences = ["This is the most beautiful place in the world.",
                 "This man has more skills to show in cricket than any other game.",
                 "Hi there! how was your ladakh trip last month?",
                 "Isn’t cricket supposed to be a team sport? I feel people should decide first whether cricket is a team game or an individual sport."]

    words = ["cricket", "sports", "team", "play", "match"]

    # TODO: this should return the 2nd and last items of Sentences, since the words list mostly matches them

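So, from the code above, I want to return the sentences that most closely match the provided words. I do not want to use supervised machine learning here. Any help would be appreciated.

In the end, I used a great library called gensim to compute the similarities: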

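import gensim
from nltk.tokenize import word_tokenize  # requires NLTK's tokenizer data to be downloaded

def getSimilarityScore(raw_documents, words):
    # Tokenize and lowercase every document.
    gen_docs = [[w.lower() for w in word_tokenize(text)]
                for text in raw_documents]
    # Map tokens to integer ids and build bag-of-words vectors.
    dictionary = gensim.corpora.Dictionary(gen_docs)
    corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]
    # Weight the vectors with TF-IDF and index them for similarity queries.
    # '/usr/workdir' is the prefix for the index shard files; it must be writable.
    tf_idf = gensim.models.TfidfModel(corpus)
    sims = gensim.similarities.Similarity('/usr/workdir', tf_idf[corpus],
                                          num_features=len(dictionary))

    # Treat the word list as a query document; words missing from the dictionary are ignored.
    query_doc_bow = dictionary.doc2bow(words)
    query_doc_tf_idf = tf_idf[query_doc_bow]

    # One similarity score per document, in input order.
    return sims[query_doc_tf_idf]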
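
For intuition about what the function above computes, here is a dependency-free sketch of TF-IDF cosine similarity. It is not the gensim implementation: the name `tfidf_cosine_scores` is hypothetical, the regex tokenizer is a crude stand-in for `word_tokenize`, and the weighting (raw term count times log2(N/df), then L2 normalization) is an assumption chosen to resemble gensim's defaults, so the exact numbers may differ.

```python
import math
import re
from collections import Counter

def tfidf_cosine_scores(documents, query_tokens):
    """Score each document against query_tokens (hypothetical helper, illustration only)."""
    # Crude tokenization (assumption): lowercase, keep runs of ASCII letters.
    docs = [re.findall(r"[a-z]+", text.lower()) for text in documents]
    n_docs = len(docs)

    # Document frequency: in how many documents does each token occur?
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log2(n_docs / count) for tok, count in df.items()}

    def vectorize(tokens):
        # Tokens outside the corpus vocabulary are dropped, as doc2bow does.
        counts = Counter(t for t in tokens if t in idf)
        vec = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    query_vec = vectorize(query_tokens)
    scores = []
    for doc in docs:
        doc_vec = vectorize(doc)
        # Cosine similarity of two unit vectors is their dot product.
        scores.append(sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items()))
    return scores
```

Matching is on exact tokens, so a query word such as "sports" does not match "sport" in a sentence; only query words that literally occur in the sentences contribute to the score.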

You can call getSimilarityScore like this:


Sentences = ["This is the most beautiful place in the world.",
            "This man has more skills to show in cricket than any other game.",
            "Hi there! how was your ladakh trip last month?",
            "Isn’t cricket supposed to be a team sport? I feel people should decide first whether cricket is a team game or an individual sport."]

words = ["cricket","sports","team","play","match"]

words_lower = [w.lower() for w in words]

getSimilarityScore(Sentences,words_lower)
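
The call above yields one score per sentence, in input order, rather than the sentences themselves. A minimal sketch of turning those scores into the most relevant sentences might look like this; `top_matches` and its parameter `k` are hypothetical names introduced for illustration, not part of gensim.

```python
def top_matches(sentences, scores, k=2):
    """Return up to k (score, sentence) pairs, best first (hypothetical helper)."""
    # Pair every sentence with its score and sort by score, highest first.
    ranked = sorted(zip(scores, sentences), key=lambda pair: pair[0], reverse=True)
    # Drop sentences that share no weighted term with the query (score 0).
    return [(float(score), sentence) for score, sentence in ranked[:k] if score > 0]

# Usage with the function above (assumes gensim and NLTK are installed):
# top_matches(Sentences, getSimilarityScore(Sentences, words_lower))
```

For the example data, this should return the two cricket sentences, which is what the question asked for. Raising `k` or replacing the `score > 0` filter with a threshold are the obvious knobs to adjust.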

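This question does not appear to include any attempt to solve the problem. Please edit the question to show what you have tried and the specific obstacle you ran into. For more information, please see.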