Python Pandas groupby: finding common strings
My dataframe:
                  Name  fav_fruit
0               justin      apple
1        bieber justin      apple
2   Kris Justin bieber      apple
3              Kim Lee     orange
4              lee kim     orange
5          mary barnet     orange
6          tom hawkins      pears
7       Sr Tom Hawkins      pears
8         Jose Hawkins      pears
9              Shanita  pineapple
10                 Joe  pineapple
import pandas as pd

df1 = pd.DataFrame({
    'Name': ['justin', 'bieber justin', 'Kris Justin bieber', 'Kim Lee', 'lee kim',
             'mary barnet', 'tom hawkins', 'Sr Tom Hawkins', 'Jose Hawkins',
             'Shanita', 'Joe'],
    'fav_fruit': ['apple', 'apple', 'apple', 'orange', 'orange', 'orange',
                  'pears', 'pears', 'pears', 'pineapple', 'pineapple'],
})
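After a groupby on the fav_fruit column, I want to count the number of common words in the Name column. So the count for apple is 2 (justin, bieber), for orange it is 2 (kim, lee), and for pineapple it is 0.
Expected output: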
                  Name  fav_fruit  count
0               justin      apple      2
1        bieber justin      apple      2
2   Kris Justin bieber      apple      2
3              Kim Lee     orange      2
4              lee kim     orange      2
5          mary barnet     orange      2
6          tom hawkins      pears      2
7       Sr Tom Hawkins      pears      2
8         Jose Hawkins      pears      2
9              Shanita  pineapple      0
10                 Joe  pineapple      0
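I think you need a custom function: first join the values of each group into one big string, convert it to lowercase and split it into words, then use Counter to keep only the words that occur more than once: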
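from collections import Counter

def f(x):
    # Join every name in the group, lowercase, and split into words
    a = ' '.join(x).lower().split()
    # Count the distinct words that occur more than once in the group
    return len([k for k, v in Counter(a).items() if v != 1])

# df1 is the DataFrame built in the question
df1['count'] = df1.groupby('fav_fruit')['Name'].transform(f)
print(df1)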
                  Name  fav_fruit  count
0               justin      apple      2
1        bieber justin      apple      2
2   Kris Justin bieber      apple      2
3              Kim Lee     orange      2
4              lee kim     orange      2
5          mary barnet     orange      2
6          tom hawkins      pears      2
7       Sr Tom Hawkins      pears      2
8         Jose Hawkins      pears      2
9              Shanita  pineapple      0
10                 Joe  pineapple      0
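To see what the custom function does inside a single group, its steps can be traced by hand on the three apple rows. This is only a minimal sketch: the names `names`, `words`, `counts` and `common` are illustrative and do not appear in the answer.

```python
from collections import Counter

# The three Name values of the 'apple' group from the question.
names = ['justin', 'bieber justin', 'Kris Justin bieber']

# Step 1: join into one string, lowercase it, and split into words.
words = ' '.join(names).lower().split()

# Step 2: count how often each word occurs across the whole group.
counts = Counter(words)

# Step 3: keep only the words that occur more than once.
common = [w for w, c in counts.items() if c != 1]

print(common)       # the repeated words
print(len(common))  # the value the function returns for this group
```

Note that the count is taken over the whole group, so a word repeated inside a single Name value would also be counted as common.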
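I tried something similar using sets, but it does not work when a word is not common to all rows. I am just evaluating your solution and will let you know whether it works for the whole dataset.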
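The difference the comment describes can be illustrated on the orange group. The set-based code below is an assumption about what such an attempt might look like (the comment does not show it); the variable names are hypothetical.

```python
from collections import Counter

# The three Name values of the 'orange' group from the question.
orange = ['Kim Lee', 'lee kim', 'mary barnet']

# Set intersection: a word must appear in EVERY row of the group.
word_sets = [set(name.lower().split()) for name in orange]
in_all_rows = set.intersection(*word_sets)

# Counter-based: a word only has to occur more than once in the group.
counts = Counter(' '.join(orange).lower().split())
repeated = {w for w, c in counts.items() if c > 1}

print(len(in_all_rows))  # 'mary barnet' shares no word, so nothing survives
print(len(repeated))     # 'kim' and 'lee' each occur twice
```

Because 'mary barnet' shares no word with the other two rows, the intersection is empty, while the Counter approach still finds kim and lee.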