Python: matching phrases based on TF-IDF and cosine similarity

python machine-learning

I have a dataframe that looks like this:

question                                answer
Why did the chicken cross the road?     to get to the other side
Who are you?                            a chatbot
Hello, how are you?                     Hi
.
.
.

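What I want to do is train TF-IDF on this dataset. When a user enters a phrase, cosine similarity should be used to pick the question that best matches it. I can create TF-IDF values for the sentences in the training dataset as shown below, but how do I use them to find the cosine similarity score for a new phrase entered by the user?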

from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer()
x = v.fit_transform(intent_data["sentence"])

I think you need something like this:

from sklearn.metrics.pairwise import cosine_similarity
cosine_similarities = cosine_similarity(x, v.transform(['user input'])).flatten()
best_match_index = cosine_similarities.argmax()
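The snippet above relies on `x` and `v` from the question. A self-contained sketch of the same idea follows, assuming the three example questions stand in for the training column (the `questions` list and the query string are illustrative, not the asker's real data). It also shows a detail that may be useful: `TfidfVectorizer` L2-normalizes each row by default, so the plain dot product from `linear_kernel` should give the same scores as `cosine_similarity`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Illustrative stand-in for the "question" column of the dataframe
questions = [
    "Why did the chicken cross the road?",
    "Who are you?",
    "Hello, how are you?",
]

v = TfidfVectorizer()
x = v.fit_transform(questions)     # learn vocabulary and IDF weights once
q = v.transform(["hello, you"])    # reuse the fitted vocabulary for new input

# One score per stored question
cos = cosine_similarity(x, q).flatten()

# Rows are already unit length (norm="l2" is the default),
# so the dot product yields the same values
dot = linear_kernel(x, q).flatten()

best_match_index = int(cos.argmax())
print(questions[best_match_index], cos[best_match_index])
```

The important part is calling `transform` (not `fit_transform`) on the user input, so the new phrase is projected into the vector space learned from the stored questions.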

Try this:

Input:

    question                                answer
0   Why did the chicken cross the road?     to get to the other side
1   Who are you?                            a chatbot
2   Hello, how are you?                     Hi

Script:

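import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Input dataframe as shown above
data = pd.DataFrame({
    "question": ["Why did the chicken cross the road?", "Who are you?", "Hello, how are you?"],
    "answer": ["to get to the other side", "a chatbot", "Hi"],
})

v = TfidfVectorizer()
sentence_input = ["hello, you"]
# Fit on the stored questions, then transform the user input with the same vocabulary
similarity_index_list = cosine_similarity(v.fit_transform(data["question"]), v.transform(sentence_input)).flatten()
# argmax() returns a position, so use iloc; this also works if the index is not 0..n-1
output = data["answer"].iloc[similarity_index_list.argmax()]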

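Suggestion: use a prediction-based word embedding method (for example fastText or word2vec) so that the output vectors retain context; this gives more accurate results when sentences are ambiguous.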

Is there a way to get a value for how similar the input is to the best match?

Yes: take max() instead of argmax() to get the highest cosine similarity. Or simply use best_match_index to look up the matching text in the corpus.
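As a rough sketch of that comment, the helper below returns both the matched text and its score. The function name `best_match`, the `corpus` list, and the `min_score` cutoff of 0.3 are assumptions made for illustration; a sensible threshold would depend on the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative corpus; in the question this would be the dataframe column
corpus = [
    "Why did the chicken cross the road?",
    "Who are you?",
    "Hello, how are you?",
]

v = TfidfVectorizer()
x = v.fit_transform(corpus)

def best_match(user_input, min_score=0.3):
    """Return (matched text or None, similarity score) for one input phrase.

    min_score is a hypothetical cutoff below which no match is reported.
    """
    scores = cosine_similarity(x, v.transform([user_input])).flatten()
    idx = int(scores.argmax())
    score = float(scores[idx])  # same value as scores.max()
    if score < min_score:
        return None, score
    return corpus[idx], score

print(best_match("hello, you"))
print(best_match("zzz"))  # no known words, so every score is 0
```

Returning the score alongside the text lets the caller fall back to a default reply when the input shares no vocabulary with any stored question.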