Python: matching phrases based on TF-IDF and cosine similarity
I have a DataFrame that looks like this:
question                              answer
Why did the chicken cross the road?   to get to the other side
Who are you?                          a chatbot
Hello, how are you?                   Hi
...
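What I want to do is train TF-IDF on this dataset. When the user enters a phrase, the question that best matches it should be selected using cosine similarity.
I can create TF-IDF values for the sentences in the training dataset this way, but how do I use them to find the cosine similarity score for a new phrase entered by the user?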
from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer()
x = v.fit_transform(intent_data["sentence"])
I think you need something like this:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarities = cosine_similarity(x, v.transform(['user input'])).flatten()
best_match_index = cosine_similarities.argmax()
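To make it clearer what the cosine_similarity call above computes, here is a minimal pure-Python sketch of the same idea. The toy vectors and the helper name cosine are made up for illustration and are not real TF-IDF output; treating an all-zero vector as similarity 0 is a convention chosen here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention used in this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)


# Hypothetical weight vectors over a 3-term vocabulary (one row per stored question)
corpus_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
]
# Hypothetical vector for the user's phrase, built over the same vocabulary
query_vector = [0.0, 2.0, 2.0]

# One score per stored question, then the position of the highest score
scores = [cosine(row, query_vector) for row in corpus_vectors]
best_match_index = max(range(len(scores)), key=scores.__getitem__)
```

The score depends only on the direction of the vectors, not their length, which is why the query scaled by 2 still matches the second row perfectly.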
Try this:
Input:
   question                              answer
0  Why did the chicken cross the road?   to get to the other side
1  Who are you?                          a chatbot
2  Hello, how are you?                   Hi
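# Script
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# data = input DataFrame as above
data = pd.DataFrame({
    "question": ["Why did the chicken cross the road?", "Who are you?", "Hello, how are you?"],
    "answer": ["to get to the other side", "a chatbot", "Hi"],
})

v = TfidfVectorizer()
sentence_input = ["hello, you"]
# Fit on the stored questions, then project the user input into the same vector space
similarity_index_list = cosine_similarity(v.fit_transform(data["question"]), v.transform(sentence_input)).flatten()
# argmax() returns a position, so index with iloc rather than loc
output = data["answer"].iloc[similarity_index_list.argmax()]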
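Suggestion: use a prediction-based word embedding method (for example fastText or word2vec) so that the output vectors retain context; this should give more accurate results when sentences are ambiguous.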
Is there a way to get a value for how similar the input is to the best match?
Yes, just take max() instead of argmax() to get the maximum cosine similarity. Or simply use best_match_index to look up the text in the corpus.
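As a rough sketch of the max() versus argmax() point, the snippet below returns both the best answer and its similarity score. The helper name best_answer and the min_score cutoff of 0.2 are assumptions made for illustration, not part of the answers above; the DataFrame is the three-row example from the question.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Example data from the question
data = pd.DataFrame({
    "question": [
        "Why did the chicken cross the road?",
        "Who are you?",
        "Hello, how are you?",
    ],
    "answer": ["to get to the other side", "a chatbot", "Hi"],
})

v = TfidfVectorizer()
x = v.fit_transform(data["question"])  # one TF-IDF row per stored question


def best_answer(user_input, min_score=0.2):
    """Return (answer, score); answer is None when nothing is similar enough.

    min_score is an arbitrary illustrative threshold, not a recommended value.
    """
    scores = cosine_similarity(x, v.transform([user_input])).flatten()
    best_match_index = int(scores.argmax())  # position of the best question
    best_score = float(scores.max())         # how similar that question is
    if best_score < min_score:
        return None, best_score
    return data["answer"].iloc[best_match_index], best_score


answer, score = best_answer("hello, you")
no_answer, no_score = best_answer("zzz qqq")  # no known words, so every score is 0
```

Returning the score alongside the answer lets the caller decide how to handle weak matches, for example by falling back to a default reply.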