Python: compare string values in two pandas columns and find the vector cosine
Hi, I found some code here that converts two strings into vectors and then compares them to return the cosine of their similarity:
import math
import re
from collections import Counter

import pandas as pd

# Matches runs of word characters, so punctuation and spaces act as separators
WORD = re.compile(r'\w+')

def get_cosine(vec1, vec2):
    # Dot product over the words the two vectors share
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[x] * vec2[x] for x in intersection)

    # Product of the two vector lengths
    sum1 = sum(vec1[x] ** 2 for x in vec1.keys())
    sum2 = sum(vec2[x] ** 2 for x in vec2.keys())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    return float(numerator) / denominator

def text_to_vector(text):
    # Map each word to the number of times it occurs
    words = WORD.findall(text)
    return Counter(words)

text1 = 'string one'
text2 = 'string two'

vector1 = text_to_vector(text1)
vector2 = text_to_vector(text2)

print(get_cosine(vector1, vector2))
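I'm trying to apply this to a pandas DataFrame so that, instead of using two arbitrary strings, it iterates over every row, converts each row's string values into vectors, and returns a new column containing the results. The example above returns 0.5 because "string" matches "string" but "two" and "one" do not, meaning 1 of 2 words match (1/2 = 0.5). I have two columns, df['Address 1'] and df['Address 2'], each holding string address values. I want to compare them, get the cosine of their similarity, and return that value as a new column, df['Address Cosine'].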
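For reference, here is a worked check of the 0.5 figure, assuming the usual cosine definition over word-count vectors (which is what get_cosine computes), where a_w and b_w are the counts of word w in each string:

```latex
\cos(\theta) = \frac{\sum_{w} a_w b_w}{\sqrt{\sum_{w} a_w^2}\,\sqrt{\sum_{w} b_w^2}}
\qquad
\text{``string one'' vs. ``string two'':}\quad
\frac{1 \cdot 1}{\sqrt{1^2 + 1^2}\,\sqrt{1^2 + 1^2}} = \frac{1}{2} = 0.5
```

The result equals "shared words / total words" here only because both strings contain the same number of distinct words and each word occurs once. For strings of different lengths or with repeated words, the cosine will generally differ from that simple ratio.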
For example, if df['Address 1'] holds '685 EASY STREET', '122 FOURTH AVE', '9189 FIFTY-NINTH STREET' and df['Address 2'] holds '685 EASY STREET', '240 FOURTH AVE', '9189 THIRTY-EIGHTH STREET', then I would like df['Address Cosine'] to contain 1.0, 0.66, 0.5.
Any ideas?
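One possible approach, offered as a minimal sketch rather than a definitive answer: pair the two columns row by row with zip and build the new column from a list comprehension. The sample frame below is an assumption reconstructed from the question (the third address pair is an English rendering of the original values), and treating non-string cells such as NaN as empty text is also an assumption, not something the question specifies.

```python
import math
import re
from collections import Counter

import pandas as pd

WORD = re.compile(r"\w+")


def text_to_vector(text):
    # Assumption: non-string cells (e.g. NaN) are treated as empty text.
    if not isinstance(text, str):
        return Counter()
    return Counter(WORD.findall(text))


def get_cosine(vec1, vec2):
    # Dot product over shared words divided by the product of vector lengths.
    shared = set(vec1) & set(vec2)
    numerator = sum(vec1[w] * vec2[w] for w in shared)
    length1 = math.sqrt(sum(c * c for c in vec1.values()))
    length2 = math.sqrt(sum(c * c for c in vec2.values()))
    denominator = length1 * length2
    return numerator / denominator if denominator else 0.0


# Sample data modelled on the question; values are illustrative.
df = pd.DataFrame({
    "Address 1": ["685 EASY STREET", "122 FOURTH AVE", "9189 FIFTY-NINTH STREET"],
    "Address 2": ["685 EASY STREET", "240 FOURTH AVE", "9189 THIRTY-EIGHTH STREET"],
})

# Compare the two columns one row at a time and store the score in a new column.
df["Address Cosine"] = [
    get_cosine(text_to_vector(a), text_to_vector(b))
    for a, b in zip(df["Address 1"], df["Address 2"])
]

print(df)
```

With this sample data the new column comes out at roughly 1.0, 0.67 and 0.5; the values can be off in the last decimal places because of floating-point rounding, so round them (for example with df['Address Cosine'].round(2)) before comparing against exact numbers. A row-wise df.apply(..., axis=1) would work as well; the zip form is used here only because it avoids building a Series for every row.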