Python: Removing low-frequency items from a DataFrame

python pandas dataframe filter

I'm playing around with a dataset. It consists of a user id, an artist name, and a play count. It looks something like this:

    user                                        artist                  plays
0   00000c289a1829a808ac09c00daf10bc3c4e223b    betty blowtorch         2137
1   00000c289a1829a808ac09c00daf10bc3c4e223b    die Ärzte               1099
2   00000c289a1829a808ac09c00daf10bc3c4e223b    melissa etheridge       897
3   00000c289a1829a808ac09c00daf10bc3c4e223b    elvenking               717
4   00000c289a1829a808ac09c00daf10bc3c4e223b    juliette & the licks    706

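Now, what I'd like to do is clean this data up a bit. Since many of the names are incorrect, I want to remove all artists that were played fewer than 50 times across all users.
I figured I should use groupby and try counting. But since I'm fairly new to pandas and my dataset is very large, I'd like to know the best practice for removing these items.
tl;dr:
What is the best way to remove the least-played artists?
PS (edit):
The desired output is a DataFrame with the same schema as the input, containing only artists whose plays (summed over all users) are no less than some number.
PS2: For example, I have the following dataset: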

import pandas as pd

df = pd.DataFrame({
    'user': list('aaabbbccc'),
    'artist': 3 * ['metallica', 'coldplay', 'dfj'],
    'plays': [100, 24, 3, 48, 135, 10, 62, 38, 2],
})
So we get this DataFrame:

    user    artist      plays
0   a       metallica   100
1   a       coldplay     24
2   a       dfj           3
3   b       metallica    48
4   b       coldplay    135
5   b       dfj          10
6   c       metallica    62
7   c       coldplay     38
8   c       dfj           2

Now 'dfj' has only been played 15 times in total. I want to remove 'dfj' and get back something like this:

    user    artist      plays
0   a       metallica   100
1   a       coldplay     24
3   b       metallica    48
4   b       coldplay    135
6   c       metallica    62
7   c       coldplay     38

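I believe you need groupby with transform('sum'), which gives a Series of aggregated values the same size as the original DataFrame: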

print(df.groupby('artist')['plays'].transform('sum'))
0    210
1    197
2     15
3    210
4    197
5     15
6    210
7    197
8     15
Name: plays, dtype: int64

df1 = df[df.groupby('artist')['plays'].transform('sum') > 50]
print(df1)
  user     artist  plays
0    a  metallica    100
1    a   coldplay     24
3    b  metallica     48
4    b   coldplay    135
6    c  metallica     62
7    c   coldplay     38
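
For comparison, here is a minimal sketch of two other ways to express the same per-artist filter on the question's small example: summing once and using isin, and GroupBy.filter. The name min_plays and the use of >= (to match "no less than some number" in the question) are assumptions for illustration, not part of the original answer. GroupBy.filter calls a Python function for each group, so on a dataset with many artists it will likely be slower than the transform or isin variants.

```python
import pandas as pd

# Small example from the question (user / artist / plays).
df = pd.DataFrame({
    'user': list('aaabbbccc'),
    'artist': 3 * ['metallica', 'coldplay', 'dfj'],
    'plays': [100, 24, 3, 48, 135, 10, 62, 38, 2],
})

min_plays = 50  # hypothetical threshold, chosen for illustration

# Variant 1: total plays per artist, then keep rows of qualifying artists.
totals = df.groupby('artist')['plays'].sum()
keep = totals[totals >= min_plays].index
by_isin = df[df['artist'].isin(keep)]

# Variant 2: GroupBy.filter drops every row of a group that fails the test.
by_filter = df.groupby('artist').filter(lambda g: g['plays'].sum() >= min_plays)

# Variant 3: the transform approach from the answer, with >= instead of >.
by_transform = df[df.groupby('artist')['plays'].transform('sum') >= min_plays]

print(by_isin)
```

All three variants keep the original index and column layout, so the result has the same schema as the input.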

The simplest attempt, based on my understanding of the post:

>>> df
    user                                        artist                  plays
0   00000c289a1829a808ac09c00daf10bc3c4e223b    betty blowtorch         2137
1   00000c289a1829a808ac09c00daf10bc3c4e223b    die Ärzte               1099
2   00000c289a1829a808ac09c00daf10bc3c4e223b    melissa etheridge       897
3   00000c289a1829a808ac09c00daf10bc3c4e223b    elvenking               717
4   00000c289a1829a808ac09c00daf10bc3c4e223b    juliette & the licks    706

Result:

>>> df[(df['plays'] > 897)]
    user                                        artist                  plays
0   00000c289a1829a808ac09c00daf10bc3c4e223b    betty blowtorch         2137
1   00000c289a1829a808ac09c00daf10bc3c4e223b    die Ärzte               1099
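
Note that a mask on the plays column selects individual rows, whereas the question asks about each artist's total over all users. A minimal sketch of the difference on the question's small example (the threshold of 50 comes from the question; the variable names row_wise and group_wise are made up for illustration):

```python
import pandas as pd

# Small example from the question (user / artist / plays).
df = pd.DataFrame({
    'user': list('aaabbbccc'),
    'artist': 3 * ['metallica', 'coldplay', 'dfj'],
    'plays': [100, 24, 3, 48, 135, 10, 62, 38, 2],
})

# Row-wise mask: keeps only single (user, artist) rows with more than 50 plays.
row_wise = df[df['plays'] > 50]

# Group-wise mask: keeps every row of an artist whose total exceeds 50.
group_wise = df[df.groupby('artist')['plays'].transform('sum') > 50]

print(row_wise.index.tolist())    # [0, 4, 6]
print(group_wise.index.tolist())  # [0, 1, 3, 4, 6, 7]
```

The row-wise version drops rows such as coldplay for user a (24 plays) even though coldplay has 197 plays overall.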

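Comments:

How much time do your groupby and count take? Can you create one?

@Experience It doesn't take much time on this dataset, maybe two minutes. I just want to know the best practice, since I'll be working with much bigger data later.

@AmirAghdam Based on your post, what result do you expect?

@jezrael How about now? :)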