Word2vec in a Python dataframe


I am trying to apply word2vec to check the similarity between two columns, row by row, in my dataset.

For example:

Sent1                                     Sent2
It is a sunny day                         Today the weather is good. It is warm outside
What people think about democracy         In ancient times, Greeks were the first to propose democracy  
I have never played tennis                I do not know who Roger Feder is 

To apply word2vec, I was thinking of something like the following:

import numpy as np

words1 = sentence1.split(' ')
words2 = sentence2.split(' ')
#The meaning of the sentence can be interpreted as the average of its words
sentence1_meaning = word2vec(words1[0])
count = 1
for w in words1[1:]:

    sentence1_meaning = np.add(sentence1_meaning, word2vec(w))
    count += 1
sentence1_meaning /= count

sentence2_meaning = word2vec(words2[0])
count = 1
for w in words2[1:]:
    sentence2_meaning = np.add(sentence2_meaning, word2vec(w))
    count += 1
sentence2_meaning /= count

#Similarity is the cosine between the vectors
similarity = np.dot(sentence1_meaning, sentence2_meaning)/(np.linalg.norm(sentence1_meaning)*np.linalg.norm(sentence2_meaning))
However, this only works for two standalone sentences, not for sentences stored in a dataframe.


Could you tell me what I need to do to apply word2vec on a pandas dataframe to check the similarity between Sent1 and Sent2? I would like the result in a new column.

I do not have a trained word2vec model at hand, so I will demonstrate the idea with a fake word2vec, combining word vectors into sentence vectors via tfidf weights.

Step 1. Prepare the data

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({"sentences": ["this is a sentence", "this is another sentence"]})

tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(df.sentences).todense()
vocab = tfidf.vocabulary_
vocab
{'this': 3, 'is': 1, 'sentence': 2, 'another': 0}
Step 2. Create a fake word2vec (the same size as our vocab)
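The code for this step is not reproduced above; a minimal sketch of a fake word2vec of the right shape could look like the following, where the 300-dimensional size, the random values and the seed are my own assumptions rather than the original answer's:

import numpy as np

np.random.seed(42)
# One random 300-dimensional vector per vocabulary entry; row i corresponds
# to the word whose tfidf vocabulary index is i.
word2vec = np.random.randn(len(vocab), 300)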

Step 3. Compute the sentence-vector column from word2vec:

sent2vec_matrix = np.dot(tfidf_matrix, word2vec) # word2vec here contains vectors in the same order as in vocab
df["sent2vec"] = sent2vec_matrix.tolist()
df

sentences   sent2vec
0   this is a sentence  [-2.098592110459085, 1.4292324332403232, -1.10...
1   this is another sentence    [-1.7879436822159966, 1.680865619703155, -2.00...
Step 4. Compute the similarity matrix

from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(df["sent2vec"].tolist())
similarity
array([[1.        , 0.76557098],
       [0.76557098, 1.        ]])
For your real word2vec to work, you need to adjust Step 2 slightly so that word2vec contains all the words from vocab in the same order (by vocabulary value, i.e. alphabetically).

For your case, it should be:

sorted_vocab = sorted([word for word,key in vocab.items()])
sorted_word2vec = []
for word in sorted_vocab:
    sorted_word2vec.append(word2vec[word])
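Presumably (this part is my addition, not spelled out in the answer) the sorted vectors are then stacked into a matrix so that Step 3 can be reused unchanged; word2vec_matrix is an illustrative name:

import numpy as np

# Rows follow the alphabetically sorted vocabulary, which is exactly the
# column order TfidfVectorizer uses, so the dot product lines up.
word2vec_matrix = np.array(sorted_word2vec)
sent2vec_matrix = np.dot(tfidf_matrix, word2vec_matrix)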

Keep the sentences in one column. Compute the word2vec sentence representation. Compute the [square] pairwise distance matrix.

Hi Sergey Bushmanov, thank you very much for your answer. One question: is what you have shown meant to replace my whole code, or is it something I should integrate into it? I am working with text from two different columns, and I want to compare the sentences in Sent1 and Sent2 row by row. As far as I understand (though I may be mistaken), your code compares two rows within the same column. Could you tell me whether I have misread what you showed me? Many thanks.

My suggestion is to change your approach and keep the sentences in a single column (i.e. change your code into mine). If you still insist on keeping the text in two df columns, you can still apply cosine similarity, but your approach looks too complicated to me. I would only do 3 things: define a sent2vec function, apply it to both columns, and apply cosine_similarity between the two columns. But honestly, I have never seen it done that way.

Thanks for the advice, Sergey. What I am trying to do is compare the similarity between texts, one in col1 and one in col2, not between rows of the same column. For example, you can think of it as an article's title and its body. I will go with your approach, but it is important to keep the comparison between the two columns rather than between rows, because they mean different things to me.
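A minimal sketch of that two-column variant, assuming a trained word2vec lookup that maps a word to its vector (e.g. a dict or a gensim KeyedVectors object); the sent2vec and cosine helper names and the sample rows are illustrative, not part of the original answer:

import numpy as np
import pandas as pd

def sent2vec(sentence, word2vec):
    # Average the vectors of the sentence's words; words missing from the
    # model are simply skipped.
    vectors = [word2vec[w] for w in sentence.split() if w in word2vec]
    return np.mean(vectors, axis=0)

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

df = pd.DataFrame({
    "Sent1": ["It is a sunny day"],
    "Sent2": ["Today the weather is good. It is warm outside"],
})

# Row-wise similarity between the two columns, stored in a new column
df["similarity"] = [
    cosine(sent2vec(s1, word2vec), sent2vec(s2, word2vec))
    for s1, s2 in zip(df["Sent1"], df["Sent2"])
]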