Python: compare string values of two columns in pandas and find the vector cosine


Hi, I found some code here that converts two strings into vectors and then compares them to return the cosine of their similarity.

import re, math
from collections import Counter
import sys
import pandas as pd

WORD = re.compile(r'\w+')

def get_cosine(vec1, vec2):
    # Dot product over the words the two vectors share
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])

    # Product of the two vectors' Euclidean norms
    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    # Avoid dividing by zero when either text contains no words
    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator

def text_to_vector(text):
    words = WORD.findall(text)
    return Counter(words)

text1 = 'string one'
text2 = 'string two'

vector1 = text_to_vector(text1)
vector2 = text_to_vector(text2)

print(get_cosine(vector1, vector2))
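I am trying to apply this to a pandas DataFrame so that, instead of comparing two arbitrary strings, it iterates over every row, converts each row's string values into vectors, and returns a new column containing the results. The example above returns 0.5 because 'string' matches 'string' but 'two' and 'one' do not, meaning 1/2 = 0.5 of the words match.

I have two columns, df['Address 1'] and df['Address 2'], each holding string address values. I want to compare them, get the cosine of their similarity, and return that value as a new column, df['Address cosine'].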
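
The 0.5 for the two sample strings can be checked by hand: it is the dot product of the two word-count vectors divided by the product of their lengths. It lines up with "half of the words match" here only because each string has exactly two distinct words. A minimal sketch of that arithmetic, assuming whitespace splitting in place of the regex (which gives the same tokens for these inputs):

```python
import math
from collections import Counter

# Word counts for the two example strings from the question.
# str.split() is used here for brevity instead of the WORD regex (an assumption).
vec1 = Counter("string one".split())
vec2 = Counter("string two".split())

# Dot product over the shared words: only "string" is shared, so 1 * 1 = 1.
dot = sum(vec1[t] * vec2[t] for t in vec1.keys() & vec2.keys())

# Euclidean norm of each vector: sqrt(1^2 + 1^2) = sqrt(2) for both.
norm1 = math.sqrt(sum(c * c for c in vec1.values()))
norm2 = math.sqrt(sum(c * c for c in vec2.values()))

# 1 / (sqrt(2) * sqrt(2)), which is 0.5 up to floating-point rounding.
cosine = dot / (norm1 * norm2)
print(round(cosine, 6))
```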

For example, if df['Address 1'] holds
'685 EASY STREET', '122 FOURTH AVE', '9189 FIFTY NINTH STREET'
and df['Address 2'] holds
'685 EASY STREET', '240 FOURTH AVE', '9189 THIRTY EIGHTH STREET'
then I would like
df['Address cosine']
to contain
'1.0', '0.66', '0.5'

Any ideas?
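
One possible approach, offered as a minimal sketch and not a definitive answer: keep the two helper functions and call them once per row with DataFrame.apply(..., axis=1). The sketch assumes both columns contain only strings (no missing values), and the sample rows are an assumed English rendering of the addresses in the question. The denominator is computed as a single square root of the product, a small variation on the code above that avoids a rounding artifact for identical strings.

```python
import math
import re
from collections import Counter

import pandas as pd

WORD = re.compile(r"\w+")


def text_to_vector(text):
    """Count word occurrences in a string."""
    return Counter(WORD.findall(text))


def get_cosine(vec1, vec2):
    """Cosine similarity between two word-count Counters."""
    shared = set(vec1) & set(vec2)
    numerator = sum(vec1[w] * vec2[w] for w in shared)
    # sqrt(a * b) instead of sqrt(a) * sqrt(b): same value mathematically.
    denominator = math.sqrt(
        sum(c * c for c in vec1.values()) * sum(c * c for c in vec2.values())
    )
    return numerator / denominator if denominator else 0.0


# Sample data (assumed English rendering of the question's example addresses).
df = pd.DataFrame({
    "Address 1": ["685 EASY STREET", "122 FOURTH AVE", "9189 FIFTY NINTH STREET"],
    "Address 2": ["685 EASY STREET", "240 FOURTH AVE", "9189 THIRTY EIGHTH STREET"],
})

# Compare the two columns row by row (axis=1) and store the result in a new column.
df["Address cosine"] = df.apply(
    lambda row: get_cosine(
        text_to_vector(row["Address 1"]),
        text_to_vector(row["Address 2"]),
    ),
    axis=1,
)

print(df["Address cosine"].round(2).tolist())
```

The second value is 2/3, so it rounds to 0.67; the 0.66 in the question looks like the same number truncated. If the columns can contain missing values, they would need to be filled or skipped before calling text_to_vector.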