Python: Estimating "approximate" semantic similarity between sentences?

python nlp machine-learning

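For the past few hours I have been looking through the nlp tag, and I am fairly confident I did not miss anything, but if I did, please do point me to the relevant question.

In the meantime, I will describe what I am trying to do. A common notion I saw in many posts is that semantic similarity is difficult. For example, in one post, the accepted solution suggests the following: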

First of all, neither from the perspective of computational 
linguistics nor of theoretical linguistics is it clear what 
the term 'semantic similarity' means exactly. .... 
Consider these examples:

1. Pete and Rob have found a dog near the station.
2. Pete and Rob have never found a dog near the station.
3. Pete and Rob both like programming a lot.
4. Patricia found a dog near the station.
5. It was a dog who found Pete and Rob under the snow.

Which of the sentences 2-4 are similar to 1? 2 is the exact 
opposite of 1, still it is about Pete and Rob (not) finding a 
dog.

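My high-level requirement is to use k-means clustering to categorize text based on semantic similarity, so all I need to know is whether two texts are an approximate match. For instance, in the example above, I am fine with putting 1, 2, 4 and 5 into one category and 3 into another (of course, 3 would be backed up by some more similar sentences). Think of finding related articles, where the articles do not have to be 100% related.

I think I ultimately need to construct a vector representation of each sentence, something like its fingerprint, but exactly what this vector should contain is still an open question for me. Is it n-grams, something from WordNet, just the individual stemmed words, or something else altogether?

One thread did a fantastic job of enumerating all the related techniques, but unfortunately it stopped just when the post got to what I wanted. Any suggestions on the current state of the art in this area?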
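For concreteness, the simplest of the candidates listed above (individual words, here without stemming) can be sketched as a bag-of-words count vector compared by cosine similarity. This is only an illustrative baseline using the example sentences quoted earlier; the helper names `bag_of_words` and `cosine` are made up for this sketch and do not come from any library.

```python
import math
import re
from collections import Counter

SENTENCES = [
    "Pete and Rob have found a dog near the station.",
    "Pete and Rob have never found a dog near the station.",
    "Pete and Rob both like programming a lot.",
    "Patricia found a dog near the station.",
    "It was a dog who found Pete and Rob under the snow.",
]


def bag_of_words(sentence):
    """Lowercase the sentence and count its word tokens (no stemming)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


vectors = [bag_of_words(s) for s in SENTENCES]

# Compare every sentence against sentence 1.
similarity_to_first = [cosine(vectors[0], v) for v in vectors]
for number, score in enumerate(similarity_to_first, start=1):
    print(f"sentence 1 vs sentence {number}: {score:.2f}")
```

With these vectors, sentence 3 comes out as the least similar to sentence 1, while sentence 2 comes out as the most similar even though it states the opposite. That is the kind of approximate match described above, and it also shows why word overlap alone says little about meaning.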

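I suggest you try a topic modelling framework such as Latent Dirichlet Allocation (LDA). The idea there is that documents (in your case sentences, which might prove to be a problem) are generated from a set of latent (hidden) topics; LDA retrieves those topics and represents them as clusters of words.

A Python implementation of LDA is available as part of the free Gensim package. You could try applying it to your sentences and then running k-means on its output.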

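An SVD-based technique might also be useful; it is basically just another application of the singular value decomposition. There is a pretty nice C implementation of this approach, an oldie but a goodie, and Python bindings for it exist as well.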

Yes, I think LDA is very popular these days.

A Python implementation of SVD for sparse matrices is also available as scikits.learn.utils.extmath.fast_svd.
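As a rough illustration of the SVD route, the sketch below builds a sparse sentence-by-term count matrix and reduces it to two latent dimensions. It assumes a current scikit-learn install, where the package is imported as `sklearn` and, to my knowledge, the routine mentioned above is exposed as `sklearn.utils.extmath.randomized_svd`. The use of `CountVectorizer` and the choice of two components are assumptions made for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.utils.extmath import randomized_svd

sentences = [
    "Pete and Rob have found a dog near the station.",
    "Pete and Rob have never found a dog near the station.",
    "Pete and Rob both like programming a lot.",
    "Patricia found a dog near the station.",
    "It was a dog who found Pete and Rob under the snow.",
]

# Sparse matrix: one row per sentence, one column per vocabulary term.
term_counts = CountVectorizer().fit_transform(sentences).astype(float)

# Truncated SVD of the sparse matrix, keeping two latent dimensions.
U, singular_values, Vt = randomized_svd(term_counts, n_components=2, random_state=0)

# Low-dimensional "fingerprint" of each sentence in the latent space.
sentence_vectors = U * singular_values
print(sentence_vectors.shape)
```

The rows of `sentence_vectors` could then be passed to k-means in the same way as the LDA topic vectors.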