Python: Matching phrases with TF-IDF and cosine similarity

I have a dataframe that looks like this:

question                                answer
Why did the chicken cross the road?     to get to the other side
Who are you?                            a chatbot
Hello, how are you?                     Hi
.
.
.  
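What I want to do is train TF-IDF on this dataset. When a user enters a phrase, the question that best matches it should be selected using cosine similarity. I can create TF-IDF values for the sentences in the training dataset this way, but how do I use them to get the cosine similarity score for a new phrase entered by the user?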

from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer()
x = v.fit_transform(intent_data["sentence"])

I think you need something like this:

from sklearn.metrics.pairwise import cosine_similarity
cosine_similarities = cosine_similarity(x, v.transform(['user input'])).flatten()
best_match_index = cosine_similarities.argmax()
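
To make the snippet above easier to try, here is a minimal end-to-end sketch. It assumes the vectorizer is fitted on the "question" column of a toy dataframe mirroring the table in the question; the helper name best_match and the sample input are made up for illustration and are not part of the original answer.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy data mirroring the table in the question (assumed column names).
intent_data = pd.DataFrame({
    "question": [
        "Why did the chicken cross the road?",
        "Who are you?",
        "Hello, how are you?",
    ],
    "answer": ["to get to the other side", "a chatbot", "Hi"],
})


def best_match(user_input, vectorizer, question_matrix):
    """Return the row index of the stored question closest to user_input."""
    # transform() reuses the vocabulary and IDF weights learned by fit_transform()
    user_vector = vectorizer.transform([user_input])
    scores = cosine_similarity(question_matrix, user_vector).flatten()
    return int(scores.argmax())


v = TfidfVectorizer()
x = v.fit_transform(intent_data["question"])

idx = best_match("why did the chicken cross", v, x)
print(intent_data.loc[idx, "question"])
print(intent_data.loc[idx, "answer"])
```

The key point is that the new phrase goes through transform(), not fit_transform(), so it lands in the same vector space as the stored questions.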

Try this:

Input:

   question                             answer
0  Why did the chicken cross the road?  to get to the other side
1  Who are you?                         a chatbot
2  Hello, how are you?                  Hi

#Script

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

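# data is the input DataFrame shown above
v = TfidfVectorizer()
sentence_input = ["hello, you"]
# Fit on the stored questions, then project the user input into the same vocabulary
question_vectors = v.fit_transform(data["question"])
similarity_index_list = cosine_similarity(question_vectors, v.transform(sentence_input)).flatten()
# Pick the answer whose question has the highest similarity score
output = data.loc[similarity_index_list.argmax(), "answer"]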
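
Suggestion: use a prediction-based word embedding method (e.g. fastText, word2vec) so that the output vectors preserve context; this gives more accurate results when sentences are ambiguous.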

Is there a way to get a value for how similar the input is to the best match?

Yes, just take max() instead of argmax() to get the highest cosine similarity. Or simply use best_match_index to look up the matching text in the corpus.
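
As a rough illustration of taking max() alongside argmax(), the sketch below returns both the matched answer and its similarity score. The function name answer_with_score and the min_score cutoff of 0.3 are assumptions added here (the thread does not mention a threshold); the cutoff only shows one way the score might be used to reject weak matches.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus mirroring the table in the question.
questions = [
    "Why did the chicken cross the road?",
    "Who are you?",
    "Hello, how are you?",
]
answers = ["to get to the other side", "a chatbot", "Hi"]

v = TfidfVectorizer()
x = v.fit_transform(questions)


def answer_with_score(user_input, min_score=0.3):
    """Return (answer, score); answer is None when the best score is below min_score.

    min_score is an arbitrary illustrative value, not a recommended setting.
    """
    scores = cosine_similarity(x, v.transform([user_input])).flatten()
    best_match_index = int(scores.argmax())  # position of the best match
    best_score = float(scores.max())         # how similar that match is
    if best_score < min_score:
        return None, best_score
    return answers[best_match_index], best_score


reply, score = answer_with_score("who are you")
miss, miss_score = answer_with_score("completely unrelated text")
print(reply, score)
print(miss, miss_score)
```

An input that shares no words with the corpus maps to an all-zero vector, so its similarity to every stored question is 0 and the cutoff rejects it.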