Python: How do I summarize a DataFrame by group using value counts of multiple columns?

python pandas dataframe


If this is a duplicate, please point me to it. I found a few related questions, but none of them solved my problem.

I have a dummy DataFrame that looks like this:

   grp  Ax  Bx  Ay  By  A_match  B_match
0  foo   3   2   2   2    False     True
1  foo   2   1   1   0    False    False
2  foo   4   3   0   3    False     True
3  foo   4   3   1   4    False    False
4  foo   4   4   3   0    False    False
5  bar   3   0   3   0     True     True
6  bar   3   4   0   3    False    False
7  bar   1   2   1   2     True     True
8  bar   1   3   4   1    False    False
9  bar   1   1   0   3    False    False
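
For anyone who wants to reproduce this, the base columns of the frame above can be rebuilt roughly as follows. This is a sketch transcribed from the printed table; the two `_match` columns are left out here because they are derived from the other columns in a later step.

```python
import pandas as pd

# Base columns transcribed from the printed table (rows 0-4 are 'foo', 5-9 are 'bar').
df = pd.DataFrame({
    "grp": ["foo"] * 5 + ["bar"] * 5,
    "Ax": [3, 2, 4, 4, 4, 3, 3, 1, 1, 1],
    "Bx": [2, 1, 3, 3, 4, 0, 4, 2, 3, 1],
    "Ay": [2, 1, 0, 1, 3, 3, 0, 1, 4, 0],
    "By": [2, 0, 3, 4, 0, 0, 3, 2, 1, 3],
})

print(df)
```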

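My goal is to compare the `x` columns with the `y` columns (`Ax` against `Ay`, `Bx` against `By`) and summarize the results by `grp`, like so: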

           A_match       B_match      
           False  True   False True 
grp                                 
bar            3     2       3     2
foo            5     0       3     2

So I added the two `_match` columns as follows, which gives the `df` shown above:

df['A_match'] = df['Ax'].eq(df['Ay'])
df['B_match'] = df['Bx'].eq(df['By'])

From my understanding, I was hoping I could do something like this, but it doesn't work:

df.groupby('grp')[['A_match', 'B_match']].agg(pd.Series.value_counts)

# trunc'd Traceback:
# ... ValueError: no results ...
# ... During handling of the above exception, another exception occurred: ...
# ... ValueError: could not broadcast input array from shape (5,7) into shape (5)

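In my actual data I was able to work around this by forcing the `_match` columns to `pd.Categorical`, though not in a satisfying way. However, I've noticed that this only succeeds intermittently, and even with this dummy data I get exactly the same error as above when using `pd.Categorical`: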

df['A_match'] = pd.Categorical(df['Ax'].eq(df['Ay']).values, categories=[True, False])
df['B_match'] = pd.Categorical(df['Bx'].eq(df['By']).values, categories=[True, False])
df.groupby('grp')[['A_match', 'B_match']].agg(pd.Series.value_counts)

# ... ValueError: could not broadcast input array from shape (5,7) into shape (5)

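This makes no sense to me: where does the shape (5,7) come from? Last I checked, each `agg` call is passed something of shape `(5,)`. Even the way `agg` runs seems to differ from what I imagined, which is that it should run against a `Series`: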
>>> df.groupby('grp')[['A_match', 'B_match']].agg(lambda x: type(x))
                                 A_match                              B_match
grp                                                                          
bar  <class 'pandas.core.series.Series'>  <class 'pandas.core.series.Series'>
foo  <class 'pandas.core.series.Series'>  <class 'pandas.core.series.Series'>

# Good - it's Series, I should be able to call value_counts directly?

>>> df.groupby('grp')[['A_match', 'B_match']].agg(lambda x: x.value_counts())

# AttributeError: 'DataFrame' object has no attribute 'value_counts'  <-- ?!?!? Where did 'DataFrame' come from?

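Both approaches seem like contrived ways of achieving something that should be a fairly common use case. I guess my question is: am I overlooking something obvious?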

You can pass a dictionary to `agg`:
(df.groupby('grp').agg({'A_match':'value_counts',
                      'B_match':'value_counts'})
   .unstack(-1, fill_value=0)
)

Output:
      A_match       B_match      
      False  True   False  True 
bar     3.0   2.0       3     2
foo     5.0   NaN       3     2
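
One plausible reason the callable form in the question struggles is that `value_counts` is not a reducer: it returns a Series with one entry per distinct value, whereas a function passed to `agg` is generally expected to return a single scalar per group and column. The sketch below illustrates only that point. The two hand-written Series are hypothetical stand-ins for the `A_match` values of the `foo` and `bar` groups, and this does not trace the pandas internals behind the `(5,7)` message.

```python
import pandas as pd

# Hypothetical stand-ins for the A_match values of each group in the dummy frame.
foo_a = pd.Series([False, False, False, False, False], name="A_match")
bar_a = pd.Series([True, False, True, False, False], name="A_match")

foo_counts = foo_a.value_counts()
bar_counts = bar_a.value_counts()

# Each result is a Series whose length depends on how many distinct values
# occur in the group, so the two groups do not even return the same shape.
print(type(foo_counts).__name__, foo_counts.shape)
print(type(bar_counts).__name__, bar_counts.shape)
```

Because `foo` contains only `False`, its result has a single entry, which is also why the combined table ends up with a gap for `foo` / `True`.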

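This looks simple, nice! You saved me a sleepless night :) Good night to you too. Edit: oddly, `fill_value` doesn't seem to do anything here, but either `NaN` or `0` works for me :)

Ah, you need to do `df.groupby('grp').agg({'A_match': 'value_counts', 'B_match': 'value_counts'}).fillna(0).unstack(-1)` instead, because the `NaN` is not caused by the unstack.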
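
The point made in the comments can also be sketched without `fillna`: if each column is counted and unstacked on its own, `fill_value=0` does apply to the missing `foo` / `True` cell, and the counts stay integers. This is a variant of the approach above, not the answer's exact code. It assumes the dummy frame from the question (rebuilt here so the block runs on its own), and `cols` and `summary` are names chosen for illustration.

```python
import pandas as pd

# Dummy frame from the question, rebuilt from the printed table.
df = pd.DataFrame({
    "grp": ["foo"] * 5 + ["bar"] * 5,
    "Ax": [3, 2, 4, 4, 4, 3, 3, 1, 1, 1],
    "Bx": [2, 1, 3, 3, 4, 0, 4, 2, 3, 1],
    "Ay": [2, 1, 0, 1, 3, 3, 0, 1, 4, 0],
    "By": [2, 0, 3, 4, 0, 0, 3, 2, 1, 3],
})
df["A_match"] = df["Ax"].eq(df["Ay"])
df["B_match"] = df["Bx"].eq(df["By"])

cols = ["A_match", "B_match"]  # illustrative name: the columns to summarize

# Count each column per group separately, pivot the counted values into
# columns (filling absent combinations with 0), then place the per-column
# tables side by side under a top-level key named after the column.
summary = pd.concat(
    {
        col: df.groupby("grp")[col].value_counts().unstack(fill_value=0)
        for col in cols
    },
    axis=1,
)

print(summary)
```

Counting per column keeps the alignment step from introducing `NaN`, at the cost of one `groupby` per column.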