Python: merging on multiple substrings
I have the following DataFrames:
import numpy as np  # needed for np.nan below
import pandas as pd
df1 = pd.DataFrame({'Name':['Jon','Alex','Jenny','Rick','Joe'], 'Color':['Red', 'Blue', 'Green', 'Black', 'Yellow'], 'Tel':['3745 569', '785 985', '635 565a', '987', np.nan]})
df2 = pd.DataFrame({'Phone':['987 856','985',np.nan, '569','459 56']})
I want to:
a = ['Tel', 'Phone']
b = [1 ,2]
rhs ={}
for x,y in zip(a, b):
    rhs[y] = (df1[x].apply(lambda x: df2[df2['Phone'].str.find(x).ge(0)]['colour']).bfill(axis=1).iloc[:, 0])
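So, if I understand correctly, you want to merge on a common substring. The code below does that, although not very elegantly. I kept it explicit to show the potential pitfalls: it matches on the longest common substring, but there may be shorter matches, and there may be several matches with the same common length; this code does not handle those cases (only the first longest match is kept). The longest-common-substring function is taken from RosettaCode; the reference is given in the code.

The output is almost what was asked for, except for the last row: "_56" in Phone is matched to "_56" in Tel (the underscore presumably stands for the space). That is correct as a string match, but perhaps only digits should be matched. In that case it would be better to clean the phone numbers before matching (one of the numbers ends with an "a", which is why I went with plain string matching).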
import pandas as pd
import numpy as np

# https://rosettacode.org/wiki/Longest_common_substring#Python
def longestCommon(s1, s2):
    """Return the length of the longest common substring of s1 and s2."""
    len1, len2 = len(s1), len(s2)
    ir, jr = 0, -1  # bounds of the best match found so far in s1 (empty at start)
    for i1 in range(len1):
        i2 = s2.find(s1[i1])
        while i2 >= 0:
            j1, j2 = i1, i2
            # extend the match for as long as the characters agree
            while j1 < len1 and j2 < len2 and s2[j2] == s1[j1]:
                if j1-i1 >= jr-ir:
                    ir, jr = i1, j1
                j1 += 1; j2 += 1
            i2 = s2.find(s1[i1], i2+1)
    return len(s1[ir:jr+1])

df1 = pd.DataFrame({'Name':['Jon','Alex','Jenny','Rick','Joe'], 'Color':['Red', 'Blue', 'Green', 'Black', 'Yellow'],
                    'Tel':['3745 569', '785 985', '635 565a', '987', np.nan]})
df2 = pd.DataFrame({'Phone':['987 856','985', np.nan, '569','459 56']})

# left merge df2 to df1 via longest matching substring Tel to Phone
mrglst = []
for phone in df2['Phone']:
    lgstr = 0    # length of the longest match so far
    lgtel = ''   # Tel of the best match
    lgcol = ''   # Color of the best match
    for tidx, trow in df1.iterrows():
        # skip missing values on either side
        if str(phone) != 'nan' and str(trow['Tel']) != 'nan':
            thisstrl = longestCommon(phone, trow['Tel'])
            # strict ">" keeps the first of several equally long matches
            if thisstrl > lgstr:
                lgstr = thisstrl
                lgtel, lgcol = trow['Tel'], trow['Color']
    mrglst.append([phone, lgtel, lgcol])
dfmrg = pd.DataFrame(mrglst, columns=['Phone', 'Tel', 'Color'])
print(dfmrg)
     Phone       Tel  Color
0  987 856       987  Black
1      985   785 985   Blue
2      NaN
3      569  3745 569    Red
4   459 56  3745 569    Red
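The two caveats mentioned in the answer (matching on digits only, and several matches of the same length) can be made concrete with a small sketch. This is one possible approach, not the answer's own method: `digits_only`, `lcs_length`, `match_keep_ties` and the `MatchLen` column are hypothetical names introduced here, and the standard-library `difflib.SequenceMatcher` stands in for the RosettaCode `longestCommon` function.

```python
import re
from difflib import SequenceMatcher

import numpy as np
import pandas as pd


def digits_only(value):
    """Strip every non-digit character; return None for missing values."""
    if pd.isna(value):
        return None
    return re.sub(r"\D", "", str(value))


def lcs_length(s1, s2):
    """Length of the longest common substring (difflib replaces longestCommon here)."""
    matcher = SequenceMatcher(None, s1, s2, autojunk=False)
    return matcher.find_longest_match(0, len(s1), 0, len(s2)).size


def match_keep_ties(left, right):
    """For each right['Phone'], list every left row tied for the longest digit match."""
    rows = []
    for phone in right["Phone"]:
        phone_digits = digits_only(phone)
        best_len, best = 0, []
        for _, trow in left.iterrows():
            tel_digits = digits_only(trow["Tel"])
            if not phone_digits or not tel_digits:
                continue  # skip NaN values and strings without any digit
            n = lcs_length(phone_digits, tel_digits)
            if n > best_len:
                # strictly longer match: discard earlier candidates
                best_len, best = n, [(trow["Tel"], trow["Color"])]
            elif n == best_len and n > 0:
                # equally long match: keep it as a tie
                best.append((trow["Tel"], trow["Color"]))
        if not best:
            rows.append([phone, None, None, 0])
        for tel, color in best:
            rows.append([phone, tel, color, best_len])
    return pd.DataFrame(rows, columns=["Phone", "Tel", "Color", "MatchLen"])


# Same sample data as in the question.
df1 = pd.DataFrame({'Name': ['Jon', 'Alex', 'Jenny', 'Rick', 'Joe'],
                    'Color': ['Red', 'Blue', 'Green', 'Black', 'Yellow'],
                    'Tel': ['3745 569', '785 985', '635 565a', '987', np.nan]})
df2 = pd.DataFrame({'Phone': ['987 856', '985', np.nan, '569', '459 56']})

matches = match_keep_ties(df1, df2)
print(matches)
```

On the sample data, digit-only matching makes the ties visible: '459 56' shares only a two-digit run with three different Tel values, and '987 856' ties between '785 985' and '987'. A minimum match length or a stricter rule (for example, requiring that one number be fully contained in the other) may therefore be needed, depending on what counts as a match.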