Python: merging on multiple substrings

I have the following DataFrames:

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'Name':['Jon','Alex','Jenny','Rick','Joe'], 'Color':['Red', 'Blue', 'Green', 'Black', 'Yellow'], 'Tel':['3745 569', '785 985', '635 565a', '987', np.nan]})
df2 = pd.DataFrame({'Phone':['987 856','985',np.nan, '569','459 56']})

I want to:

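  • Find the common substring values stored in the columns df1['Tel'] and df2['Phone']
  • Left-merge onto df2, outputting Phone, df1['Tel'] and df1['Color'] for the common substring values
  • Expected result: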

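I found and adapted the snippet below, but it only works when there are no NaN values, and it cannot search when the key is a substring, as it is in my example: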

    a = ['Tel', 'Phone']
    b = [1 ,2]
    rhs ={}
    
    for x,y in zip(a, b):
        rhs[y] = (df1[x].apply(lambda x: df2[df2['Phone'].str.find(x).ge(0)]['colour']).bfill(axis=1).iloc[:, 0])
    

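If I understand correctly, you want to merge on a common substring. The code below does that, though not very elegantly. I kept it explicit to expose the potential pitfalls: it matches on the longest common substring, but shorter matches may also exist, and several rows may tie at the same common length; the code does not handle either case. The longest-common-substring function is taken from RosettaCode (link in the code).

The output is almost what was asked for, except for the last row: ' 56' in Phone matches ' 56' in Tel. That is a correct substring match, but probably only digits should count. In that case it is better to clean the phone numbers before matching. (One of the numbers ends in an 'a', which is why I went with plain string matching.)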
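Cleaning the numbers before matching could look like the sketch below. It is separate from the answer's own code, which follows it, and the helper names `digits_only` and `digit_groups` are hypothetical. Whether to compare the whole digit string or the individual digit groups is an assumption about what the asker wants; both variants are shown.

```python
import re

def digits_only(value):
    """Strip every non-digit character; pass non-strings (e.g. NaN) through."""
    if not isinstance(value, str):
        return value
    return re.sub(r'\D', '', value)

def digit_groups(value):
    """Split a phone string into its runs of consecutive digits."""
    if not isinstance(value, str):
        return []
    return re.findall(r'\d+', value)

print(digits_only('635 565a'))   # 635565
print(digit_groups('459 56'))    # ['459', '56']
```

With pandas, either helper could be applied column-wise, for example `df1['Tel'].map(digits_only)`, before running the matching loop. Note that removing the space joins the groups, so matches can then span what used to be a group boundary.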

    import pandas as pd
    import numpy as np
    
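    # Longest common substring, adapted from
    # https://rosettacode.org/wiki/Longest_common_substring#Python
    def longestCommon(s1, s2):
        """Return the length of the longest common substring of s1 and s2."""
        len1, len2 = len(s1), len(s2)
        ir, jr = 0, -1  # inclusive bounds in s1 of the best match so far (empty at start)
        for i1 in range(len1):
            # try every position in s2 where the character s1[i1] occurs
            i2 = s2.find(s1[i1])
            while i2 >= 0:
                j1, j2 = i1, i2
                # extend the match for as long as both strings agree
                while j1 < len1 and j2 < len2 and s2[j2] == s1[j1]:
                    if j1-i1 >= jr-ir:
                        ir, jr = i1, j1
                    j1 += 1; j2 += 1
                i2 = s2.find(s1[i1], i2+1)
        return jr - ir + 1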
    
    df1 = pd.DataFrame({'Name':['Jon','Alex','Jenny','Rick','Joe'], 'Color':['Red', 'Blue', 'Green', 'Black', 'Yellow'], 
                        'Tel':['3745 569', '785 985', '635 565a', '987', np.nan]})
    df2 = pd.DataFrame({'Phone':['987 856','985', np.nan, '569','459 56']})
    
    # Left-merge df1 onto df2: for each Phone, pick the Tel with the longest common substring
    mrglst = []
    for phone in df2['Phone']:
        lgstr = 0   # length of the best match found so far
        lgtel = ''
        lgcol = ''
        if pd.notna(phone):   # a missing Phone cannot match anything
            for tidx, trow in df1.iterrows():
                if pd.notna(trow['Tel']):
                    thisstrl = longestCommon(phone, trow['Tel'])
                    # strict '>' keeps the first Tel when several tie
                    if thisstrl > lgstr:
                        lgstr = thisstrl
                        lgtel, lgcol = trow['Tel'], trow['Color']
        mrglst.append([phone, lgtel, lgcol])
        
    dfmrg = pd.DataFrame(mrglst, columns=['Phone', 'Tel', 'Color'])
    print(dfmrg)

Output:

         Phone       Tel  Color
    0  987 856       987  Black
    1      985   785 985   Blue
    2      NaN                 
    3      569  3745 569    Red
    4   459 56  3745 569    Red
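
The answer notes that ties between equally long matches are not handled, and the last row above is such a case. A minimal sketch of how ties could be surfaced, using the standard library's `difflib.SequenceMatcher` in place of the RosettaCode function; `lcs_len` and `best_matches` are hypothetical helper names, not part of the answer:

```python
from difflib import SequenceMatcher

def lcs_len(a, b):
    """Length of the longest common substring of a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size

def best_matches(phone, tels):
    """Return (length, [every tel that shares the longest common substring])."""
    # Skip missing values (None or NaN), which are not strings
    scored = [(lcs_len(phone, t), t) for t in tels if isinstance(t, str)]
    best = max((size for size, _ in scored), default=0)
    if best == 0:
        return 0, []
    return best, [t for size, t in scored if size == best]

tels = ['3745 569', '785 985', '635 565a', '987', float('nan')]
print(best_matches('459 56', tels))   # (3, ['3745 569', '635 565a'])
print(best_matches('987 856', tels))  # (3, ['987'])
```

For '459 56' both '3745 569' and '635 565a' share the three-character substring ' 56'; the loop in the answer reports only the first because it compares with a strict `>`. How to break such ties (first match, all matches, or no match) is a choice the question does not specify.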