Row-wise string matching between two columns of a DataFrame (Python 3.x, pandas, matrix, fuzzy)

Suppose I have a pandas DataFrame that looks like this:

ID    String1                         String2
1     The big black wolf              The small wolf
2     Close the door on way out       door the Close
3     where's the money               where is the money
4     123 further out                 out further

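Before doing any fuzzy string matching, I would like to cross-tabulate String1 against String2 within each row, similar to the approach in the question I linked.

My challenge is that the solution in that link only works when String1 and String2 contain the same number of words. Second, that solution looks at all rows of a column at once, whereas I want a strictly row-by-row comparison.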

The proposed solution should do a matrix-style comparison for each row. For row 1 it would look like this:

       string1     The  big  black  wolf  Maximum
       string2
       The          100  0    0      0     100
       small        0    0    0      0     0
       wolf         0    0    0      100   100

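Here the "matching average" is the sum of the "Maximum" column divided by the number of words in String2. For row 1 that is (100 + 0 + 100) / 3 ≈ 66.67.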
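The matrix described in the question can be sketched in plain Python for a single row. This is only an illustration under stated assumptions: the helper name `match_matrix` is hypothetical, words are split on whitespace, and a cell is 100 only on an exact, case-sensitive match.

```python
def match_matrix(s1, s2):
    """Exact word-match matrix of s2 (rows) against s1 (columns).

    Hypothetical helper, not part of pandas or any library.
    """
    words1 = s1.split()
    words2 = s2.split()
    # One row per String2 word, one column per String1 word:
    # 100 on an exact match, 0 otherwise.
    matrix = [[100 if w2 == w1 else 0 for w1 in words1] for w2 in words2]
    # "Maximum" column: best score each String2 word achieved.
    maxima = [max(row, default=0) for row in matrix]
    # Matching average: sum of maxima divided by the word count of String2.
    average = sum(maxima) / len(words2) if words2 else 0.0
    return matrix, maxima, average


matrix, maxima, average = match_matrix("The big black wolf", "The small wolf")
print(maxima)             # [100, 0, 100]
print(round(average, 2))  # 66.67
```

Because the function only ever sees one pair of strings, the two strings may have different word counts and no other row can influence the result.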

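You can first get dummy (indicator) columns from the two Series, then take the intersection of their columns, sum across each row, and divide by the row sums of the second Series' dummies: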

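# one indicator column per distinct word, split on spaces
a = df['String1'].str.get_dummies(' ')
b = df['String2'].str.get_dummies(' ')
# keep only the String2 word columns that also occur among the String1 columns
u = b[b.columns.intersection(a.columns)]
# share of each row's String2 words that were kept, as a percentage
df['Matching_Average'] = u.sum(1).div(b.sum(1)).mul(100).round(2)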
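Note that the column intersection above is taken over the whole frame, so a String2 word is kept if it occurs in String1 of any row. With the sample data this makes no difference, but on other data it could. A minimal sketch of a strictly row-wise variant follows; it swaps `get_dummies` for plain set membership, and the helper name `row_match_average` is an assumption made for illustration. Unlike the dummy approach, it counts a repeated String2 word once per occurrence.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "String1": ["The big black wolf", "Close the door on way out",
                "where's the money", "123 further out"],
    "String2": ["The small wolf", "door the Close",
                "where is the money", "out further"],
})


def row_match_average(s1, s2):
    """Percentage of words in s2 that also appear in s1 (hypothetical helper)."""
    words1 = set(s1.split())
    words2 = s2.split()
    if not words2:
        return 0.0
    hits = sum(w in words1 for w in words2)
    return round(100 * hits / len(words2), 2)


# Compare each row's String1 only with the same row's String2.
df["Matching_Average"] = [
    row_match_average(x, y) for x, y in zip(df["String1"], df["String2"])
]
print(df["Matching_Average"].tolist())  # [66.67, 100.0, 50.0, 100.0]
```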

Otherwise, if you are fine with a string-matching algorithm, you can use difflib:

from difflib import SequenceMatcher
[SequenceMatcher(None,x,y).ratio() for x,y in zip(df['String1'],df['String2'])]
#[0.625, 0.2564102564102564, 0.9142857142857143, 0.6153846153846154]
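The ratios above compare the two strings character by character as a whole. To get closer to the word matrix in the question, one option is to score every word pair with `SequenceMatcher` and then take the per-word maximum. The sketch below assumes whitespace tokenization and a 0 to 100 scale; `fuzzy_match_average` is a hypothetical name, and this is one possible reading of the question rather than the answerer's method.

```python
from difflib import SequenceMatcher


def fuzzy_match_average(s1, s2):
    """Average, over the words of s2, of the best fuzzy score against any word of s1."""
    words1 = s1.split()
    words2 = s2.split()
    if not words1 or not words2:
        return 0.0
    # For each String2 word, keep the highest similarity to any String1 word.
    maxima = [
        max(100 * SequenceMatcher(None, w2, w1).ratio() for w1 in words1)
        for w2 in words2
    ]
    # Same definition as in the question: sum of maxima over the String2 word count.
    return sum(maxima) / len(words2)


score = fuzzy_match_average("The big black wolf", "The small wolf")
print(round(score, 2))
```

Exact word matches still score 100, while near misses such as "where's" versus "where" get partial credit instead of 0.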

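Comments:

"where matching average is the sum of the 'maximum' column divided by the number of words in String1": did you mean String2 rather than String1?

That is correct @anky, I will edit now.

Thanks @anky. What if I want to fuzzy-match the strings using `from fuzzywuzzy import fuzz`?

Since you only want a row-wise comparison, after the import you can use `[fuzz.ratio(x, y) for x, y in zip(df['String1'], df['String2'])]`. You can also try `fuzz.partial_ratio`, depending on what you need.

It took me a while to work out, but I think your algorithm dummy-codes all rows of a and b before comparing. This shows in the result for ID=2: I expected 100, not 133. The score is higher because of the `the` in ID=3. I only want to dummy-code row by row before comparing. Does that make sense?

@user1783739 I have edited my answer (I had messed up part of the code where u is defined). Could you check now?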

print(df)

   ID                    String1             String2  Matching_Average
0   1         The big black wolf      The small wolf             66.67
1   2  Close the door on way out      door the Close            100.00
2   3          where's the money  where is the money             50.00
3   4            123 further out         out further            100.00
