Python pandas: groupby and check whether one row's values are contained in another row's values
I want to group by customer and match the items of rows whose count is 1 against the items of rows whose count is greater than 1. If all items match, a probable merge id should be added in a new column. For example: for Customer1, the id=3 items are in id=2, so this is a match and the probable merge id is 1; similarly, for Customer2, id=7 has count 1 and its items are in the id=5 items, so it is a match and the probable merge id is 4.

My dataframe:
count  custmr     id  items
3      Customer1  1   Cabbage, beet, Okra, root
3      Customer1  2   Apple, Banana, Mango ,Pears, leafs
1      Customer1  3   Mango leafs
1      Customer1  4   tomato root
4      Customer2  5   grapes,leach,guava,pappaya
2      Customer2  6   blackberry,blueberry
1      Customer2  7   pappaya
Expected output:
count  custmr     id  items                               probable_merge_id
3      Customer1  1   Cabbage, beet, Okra, root
3      Customer1  2   Apple, Banana, Mango ,Pears, leafs
1      Customer1  3   Mango leafs                         2
1      Customer1  4   tomato root
4      Customer2  5   grapes,leach,guava,pappaya
2      Customer2  6   blackberry,blueberry
1      Customer2  7   pappaya                             4
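First create a cross join within each customer with merge, filter to the rows where count is 1, and convert the item strings to sets so they can be compared. Finally, create a Series to use with map: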
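# Self-join on custmr so every row is paired with every row of the same customer
df1 = df.merge(df, on='custmr')
# Keep only the pairs whose left-hand row has count == 1
df1 = df1[df1['count_x'] == 1].copy()
# Split the item strings on whitespace or commas and convert them to sets
df1['items_x'] = df1['items_x'].str.split(r'\s+|,\s*').apply(set)
df1['items_y'] = df1['items_y'].str.split(r'\s+|,\s*').apply(set)
# Keep the pairs where items_x is a proper subset of items_y
df1 = df1[df1['items_x'] < df1['items_y']]
print(df1)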
    count_x     custmr  id_x         items_x  count_y  id_y  \
9         1  Customer1     3  {Mango, leafs}        3     2
22        1  Customer2     7       {pappaya}        4     5

                                 items_y
9   {Mango, Pears, leafs, Apple, Banana}
22       {grapes, pappaya, leach, guava}
s = df1.set_index('id_x')['id_y']
print (s)
id_x
3    2
7    5
Name: id_y, dtype: int64
df['probable_merge_id'] = df['id'].map(s)
print (df)
   count     custmr  id                           items  probable_merge_id
0      3  Customer1   1          Cabbage,beet,Okra,root                NaN
1      3  Customer1   2  Apple,Banana,Mango,Pears,leafs                NaN
2      1  Customer1   3                     Mango leafs                2.0
3      1  Customer1   4                     tomato root                NaN
4      4  Customer2   5      grapes,leach,guava,pappaya                NaN
5      2  Customer2   6            blackberry,blueberry                NaN
6      1  Customer2   7                         pappaya                5.0
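
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea, built on the question's sample data. Several things in it are my own assumptions rather than part of the answer above: the helper names to_item_set and add_probable_merge_id are hypothetical, tokens are split on any run of commas or whitespace (so "Mango ,Pears" does not leave an empty token), the right-hand side is restricted to rows with count > 1 as the question describes, a non-strict subset test (<=) is used, only the first match per id is kept so that the lookup Series has a unique index for map, and the result is cast to the nullable Int64 dtype so the ids are not shown as floats. Matching is case-sensitive.

```python
import re

import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "count": [3, 3, 1, 1, 4, 2, 1],
    "custmr": ["Customer1"] * 4 + ["Customer2"] * 3,
    "id": [1, 2, 3, 4, 5, 6, 7],
    "items": [
        "Cabbage, beet, Okra, root",
        "Apple, Banana, Mango ,Pears, leafs",
        "Mango leafs",
        "tomato root",
        "grapes,leach,guava,pappaya",
        "blackberry,blueberry",
        "pappaya",
    ],
})


def to_item_set(text):
    """Split on runs of commas/whitespace and drop empty tokens (assumed tokenization)."""
    return {tok for tok in re.split(r"[,\s]+", text) if tok}


def add_probable_merge_id(frame):
    """Hypothetical helper: return a copy of frame with a probable_merge_id column."""
    # Pair every row with every row of the same customer.
    pairs = frame.merge(frame, on="custmr", suffixes=("_x", "_y"))
    # Left side: rows with count == 1; right side: rows with count > 1 (assumption).
    pairs = pairs[(pairs["count_x"] == 1) & (pairs["count_y"] > 1)]
    sets_x = pairs["items_x"].map(to_item_set)
    sets_y = pairs["items_y"].map(to_item_set)
    # Explicit element-wise subset test, aligned to the rows of pairs.
    is_subset = pd.Series(
        [x <= y for x, y in zip(sets_x, sets_y)], index=pairs.index, dtype=bool
    )
    matches = pairs[is_subset]
    # Keep the first match per id so the lookup Series has a unique index.
    lookup = matches.drop_duplicates("id_x").set_index("id_x")["id_y"]
    out = frame.copy()
    out["probable_merge_id"] = out["id"].map(lookup).astype("Int64")
    return out


result = add_probable_merge_id(df)
print(result)
```

On the sample data this assigns 2 to id 3 and 5 to id 7, and leaves the other rows missing. If a count-1 row can match several rows and you need all of them, you would have to aggregate the matches (for example into a list per id) instead of dropping duplicates.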
What code have you tried so far? Where are you stuck?