Python: merging on multiple substrings

python pandas merge

I have the following DataFrames:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Name':['Jon','Alex','Jenny','Rick','Joe'], 'Color':['Red', 'Blue', 'Green', 'Black', 'Yellow'], 'Tel':['3745 569', '785 985', '635 565a', '987', np.nan]})
df2 = pd.DataFrame({'Phone':['987 856','985',np.nan, '569','459 56']})

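I want to:

Find the common substring values stored in the columns df1['Tel'] and df2['Phone'].

Left-merge onto df2, outputting Phone, df1['Tel'] and df1['Color'] for the rows that share a common substring.

Expected result:

I found and edited a piece of code, but it only works when there are no NaN values, and it cannot search when the key is a substring, as in my example: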

a = ['Tel', 'Phone']
b = [1 ,2]
rhs ={}

for x,y in zip(a, b):
    rhs[y] = (df1[x].apply(lambda x: df2[df2['Phone'].str.find(x).ge(0)]['colour']).bfill(axis=1).iloc[:, 0])

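So, if I understand correctly, you want to merge on a common substring. The code below does that, although not very elegantly. I kept it explicit in order to expose the potential pitfalls: it assumes that the match is made on the longest common substring. There may be shorter matches, and there may in fact be several matches with the same common length; this code does not handle those cases. The longest-common-substring function is taken from Rosetta Code; the reference is given in the code.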

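The result is almost the desired output, except for the last row: in '459 56', the substring ' 56' (with its leading space) matches ' 56' in the Tel value '3745 569'. That is technically correct, but probably only digit matches are wanted. In that case it would be better to clean the phone numbers before matching (one of the numbers ends in an 'a', which is why I opted for plain string matching).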

import pandas as pd
import numpy as np

# https://rosettacode.org/wiki/Longest_common_substring#Python
def longestCommon(s1, s2):
    len1, len2 = len(s1), len(s2)
    ir, jr = 0, -1
    for i1 in range(len1):
        i2 = s2.find(s1[i1])
        while i2 >= 0:
            j1, j2 = i1, i2
            while j1 < len1 and j2 < len2 and s2[j2] == s1[j1]:
                if j1-i1 >= jr-ir:
                    ir, jr = i1, j1
                j1 += 1; j2 += 1
            i2 = s2.find(s1[i1], i2+1)
    return len(s1[ir:jr+1])

df1 = pd.DataFrame({'Name':['Jon','Alex','Jenny','Rick','Joe'], 'Color':['Red', 'Blue', 'Green', 'Black', 'Yellow'], 
                    'Tel':['3745 569', '785 985', '635 565a', '987', np.nan]})
df2 = pd.DataFrame({'Phone':['987 856','985', np.nan, '569','459 56']})

# left merge df2 to df1 via longest matching substring Tel to Phone
mrglst = []
for phone in df2['Phone']:
    lgstr = 0
    lgtel = ''
    lgcol = ''
    for tidx, trow in df1.iterrows():
        if pd.notna(phone) and pd.notna(trow['Tel']):
            thisstrl = longestCommon(phone, trow['Tel'])
            if thisstrl > lgstr:
                lgstr = thisstrl
                lgtel, lgcol = trow['Tel'], trow['Color']
    mrglst.append([phone, lgtel, lgcol])
    
dfmrg = pd.DataFrame(mrglst, columns=['Phone', 'Tel', 'Color'])
print(dfmrg)

     Phone       Tel  Color
0  987 856       987  Black
1      985   785 985   Blue
2      NaN                 
3      569  3745 569    Red
4   459 56  3745 569    Red
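
The remark above about cleaning the phone numbers first can be sketched as follows. This is only one possible reading, not the original answer's method: each value is split into runs of digits, so spaces and the trailing 'a' never take part in a match, and a match is accepted only if the shared digit run is at least MIN_LEN characters long. MIN_LEN, digit_groups, lcs_len and score are hypothetical names introduced for this illustration, and the threshold of 3 is an assumption that should be tuned to the real data. The longest common substring is computed here with difflib from the standard library instead of the Rosetta Code function.

```python
import re
from difflib import SequenceMatcher

import numpy as np
import pandas as pd

MIN_LEN = 3  # assumption: shortest shared digit run accepted as a match


def digit_groups(value):
    """Split a phone string into its runs of digits; NaN yields no groups."""
    if pd.isna(value):
        return []
    return re.findall(r'\d+', str(value))


def lcs_len(a, b):
    """Length of the longest common substring of a and b (via difflib)."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def score(phone, tel):
    """Longest digit-only overlap between any digit group of phone and of tel."""
    return max((lcs_len(p, t)
                for p in digit_groups(phone)
                for t in digit_groups(tel)), default=0)


df1 = pd.DataFrame({'Name': ['Jon', 'Alex', 'Jenny', 'Rick', 'Joe'],
                    'Color': ['Red', 'Blue', 'Green', 'Black', 'Yellow'],
                    'Tel': ['3745 569', '785 985', '635 565a', '987', np.nan]})
df2 = pd.DataFrame({'Phone': ['987 856', '985', np.nan, '569', '459 56']})

rows = []
for phone in df2['Phone']:
    # Score this phone against every Tel value (NaN scores 0).
    scores = df1['Tel'].map(lambda tel: score(phone, tel))
    if scores.max() >= MIN_LEN:
        best = df1.loc[scores.idxmax()]  # first row with the top score
        rows.append([phone, best['Tel'], best['Color']])
    else:
        rows.append([phone, np.nan, np.nan])  # no sufficiently long match

dfclean = pd.DataFrame(rows, columns=['Phone', 'Tel', 'Color'])
print(dfclean)
```

With the sample data and the assumed threshold of 3, rows 0, 1 and 3 keep the same matches as before, while '459 56' is left unmatched because its longest shared digit run is only two characters long. Ties are still resolved by taking the first best row, as in the code above.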