Python: grouping a DataFrame based on a similarity measure between two strings
Tags: python, pandas, dataframe, pandas-groupby, similarity
I want to group a DataFrame by the "code" column, but only when the values in "name" are clearly different.
import pandas as pd
d = {'code': ['ABC', 'ABC', 'DB', 'DB', 'CDP'], 'name': ['abcde', 'abc de', 'defs', 'wokj', 'lkj']}
df = pd.DataFrame(data=d)
print(df)
  code    name
0  ABC   abcde
1  ABC  abc de
2   DB    defs
3   DB    wokj
4  CDP     lkj
A plain groupby on "code" looks like this:
df2 = df.groupby(['code']).agg(name = ('name', (' + '.join))).reset_index()
print(df2)
  code            name
0  ABC  abcde + abc de
1  CDP             lkj
2   DB     defs + wokj
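But the ABC rows should not be grouped. They should stay as separate rows, because their names are too similar according to a check like this: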
from difflib import SequenceMatcher
def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()
print(similar('abcde', 'abc de'))
print(similar('defs', 'wokj'))
0.9090909090909091
0.0
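To make the two numbers above less opaque: `SequenceMatcher.ratio()` is documented as 2 * M / T, where M is the number of matched characters and T is the combined length of both strings. The sketch below recomputes that by hand from the matching blocks; `ratio_by_hand` is a hypothetical helper name used only for illustration.

```python
from difflib import SequenceMatcher

def ratio_by_hand(a, b):
    """Recompute SequenceMatcher.ratio() as 2 * M / T."""
    blocks = SequenceMatcher(None, a, b).get_matching_blocks()
    matched = sum(block.size for block in blocks)  # M: matched characters
    total = len(a) + len(b)                        # T: combined length
    return 2.0 * matched / total if total else 1.0

# 'abc' and 'de' match (5 characters), combined length is 5 + 6 = 11
print(ratio_by_hand('abcde', 'abc de'))
# no characters in common
print(ratio_by_hand('defs', 'wokj'))
```

So 'abcde' and 'abc de' score 10/11, while 'defs' and 'wokj' score 0 because they share no characters at all.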
The final result I want is:
  code         name
0  ABC        abcde
1  ABC       abc de
2  CDP          lkj
3   DB  defs + wokj
How can I set a condition like this in groupby?

This may not be a great solution, but I hope it works for you. Some parts could be made more Pythonic.
import numpy as np
import pandas as pd
from difflib import SequenceMatcher

def similar(dfg):
    # A group with a single row has nothing to compare; return it unchanged.
    if len(dfg) < 2:
        return dfg[['code', 'name']]
    # Cross join the group with itself to get every pair of names.
    pairs = dfg.assign(a=1).merge(dfg[['name']].assign(a=1), on='a')
    pairs = pairs[pairs['name_x'] != pairs['name_y']].copy()
    # Sort each pair so (x, y) and (y, x) become identical, then drop the repeats.
    pairs[['name_x', 'name_y']] = np.sort(pairs[['name_x', 'name_y']].to_numpy(), axis=1)
    pairs = pairs.drop_duplicates(subset=['name_x', 'name_y'])
    rows, labels = [], []
    for label, row in pairs.iterrows():
        sim = SequenceMatcher(None, row['name_x'], row['name_y']).ratio()
        if sim > 0:
            # The names share some characters: keep them as two separate rows.
            names = [row['name_x'], row['name_y']]
        else:
            # Nothing in common: join the two names into one row.
            names = [row['name_x'] + ' + ' + row['name_y']]
        for name in names:
            rows.append({'code': row['code'], 'name': name})
            labels.append(label)
    return pd.DataFrame(rows, index=labels, columns=['code', 'name'])

d = {'code': ['ABC', 'ABC', 'ABC', 'DB', 'DB', 'CDP'], 'name': ['abcde', 'abc de', 'xyz', 'defs', 'wokj', 'lkj']}
df = pd.DataFrame(data=d)
print(df)
# Selecting the columns explicitly keeps 'code' available inside similar().
df2 = df.groupby('code')[['code', 'name']].apply(similar)
print(df2)
Input (the original data from the question):
  code    name
0  ABC   abcde
1  ABC  abc de
2   DB    defs
3   DB    wokj
4  CDP     lkj
Output:
       code         name
code
ABC  1  ABC       abc de
     1  ABC        abcde
CDP  4  CDP          lkj
DB   1   DB  defs + wokj
Input (with an extra ABC row, 'xyz', as in the code above):
  code    name
0  ABC   abcde
1  ABC  abc de
2  ABC     xyz
3   DB    defs
4   DB    wokj
5  CDP     lkj
Output:
       code          name
code
ABC  1  ABC        abc de
     1  ABC         abcde
     2  ABC   abcde + xyz
     5  ABC  abc de + xyz
CDP  5  CDP           lkj
DB   1   DB   defs + wokj
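The self-merge in `similar` builds every ordered pair of names and then sorts and deduplicates them. As a possible simplification (not part of the answer above), `itertools.combinations` yields each unordered pair exactly once. The helper name `pair_similarities` is made up for this sketch.

```python
from difflib import SequenceMatcher
from itertools import combinations

def pair_similarities(names):
    """Return (name_a, name_b, ratio) for every unordered pair of distinct names."""
    unique = sorted(set(names))  # drop exact duplicates, fix a stable order
    return [(a, b, SequenceMatcher(None, a, b).ratio())
            for a, b in combinations(unique, 2)]

for a, b, sim in pair_similarities(['abcde', 'abc de', 'xyz']):
    print(f'{a!r} vs {b!r}: {sim:.3f}')
```

For three names this gives three pairs, which are the same pairs the merge-based version ends up with after `drop_duplicates`.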
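For the exact result asked for in the question, one possible reading is an all-or-nothing rule per group: if any two names in a "code" group are too similar, leave the rows as they are, otherwise join them with " + ". The sketch below assumes that reading. The cutoff `THRESHOLD = 0.8` and the helper `join_if_distinct` are assumptions for illustration and do not come from the original thread.

```python
from difflib import SequenceMatcher
from itertools import combinations

import pandas as pd

THRESHOLD = 0.8  # hypothetical cutoff: ratios at or above this count as "too similar"

def join_if_distinct(names, threshold=THRESHOLD):
    """Return the names unchanged if any pair is too similar, else one joined name."""
    names = list(names)
    too_similar = any(
        SequenceMatcher(None, a, b).ratio() >= threshold
        for a, b in combinations(names, 2)
    )
    return names if too_similar else [' + '.join(names)]

d = {'code': ['ABC', 'ABC', 'DB', 'DB', 'CDP'],
     'name': ['abcde', 'abc de', 'defs', 'wokj', 'lkj']}
df = pd.DataFrame(data=d)

# Walk the groups explicitly and collect one output row per resulting name.
rows = []
for code, names in df.groupby('code')['name']:
    for name in join_if_distinct(names):
        rows.append({'code': code, 'name': name})

result = pd.DataFrame(rows, columns=['code', 'name'])
print(result)
```

With the question's data this keeps the two ABC rows separate and joins the DB rows into "defs + wokj". If a group can mix similar and dissimilar names (such as the 'xyz' case above), the rule would need to be refined, for example by clustering the names first.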