Python: How to sum or count groups of multiple columns in pandas

python pandas

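I am trying to group several sets of columns so that I can count or sum the rows of a DataFrame.

I have checked many questions, and the most similar one I found is this one; however, as far as I can tell, it would take many steps to reach my goal. I was also looking at this one.

For example, I have the following DataFrame: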

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,5,size=(5, 7)), columns=["grey2","red1","blue1","red2","red3","blue2","grey1"])

     grey2   red1 blue1 red2 red3 blue2 grey1
0       4      3    0      2    4   0   2
1       4      2    0      4    0   3   1
2       1      1    3      1    1   3   1
3       4      4    1      4    1   1   1
4       3      4    1      0    3   3   1

Here I want to group all the columns by color. For example, I expect:

If I sum the numbers:

blue  15
grey  22
red   34

If I count the values where x > 0, I would get:

  blue  7
  grey  10
  red   13

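This is what I have achieved so far. From here I would still have to sum the totals and then build a DataFrame from the results, which would be very time-consuming if I had 100 groups: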

pd.pivot_table(data=df, index=df.index, values=["red1","red2","red3"], aggfunc='sum', margins=True)
   red1  red2   red3
0    3     2    4
1    2     4    0
2    1     1    1
3    4     4    1
4    4     0    3
ALL  14   11    9

pd.pivot_table(data=df, index=df.index, values=["red1","red2","red3"], aggfunc='count', margins=True)

But this also counts the zeros:

     red1 red2  red3
   0    1   1   1
   1    1   1   1
   2    1   1   1
   3    1   1   1
   4    1   1   1
  All   5   5   5

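I don't know how to modify the function to get the result I want. I have already spent several hours on this and would appreciate any help.

Note: I only used colors in this example to keep it simple, but I could have many columns named col001 through col300, and so on. The groups could then be: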

blue = col131, col254, col005
red =  col023, col190, col053

and so on...

You can use pd.wide_to_long:

data= pd.wide_to_long(df.reset_index(), stubnames=['grey','red','blue'], 
                i='index',
                j='group',
                sep=''
               )

Output:

# data
             grey  red  blue
index group                 
0     1       2.0    3   0.0
      2       4.0    2   0.0
      3       NaN    4   NaN
1     1       1.0    2   0.0
      2       4.0    4   3.0
      3       NaN    0   NaN
2     1       1.0    1   3.0
      2       1.0    1   3.0
      3       NaN    1   NaN
3     1       1.0    4   1.0
      2       4.0    4   1.0
      3       NaN    1   NaN
4     1       1.0    4   1.0
      2       3.0    0   3.0
      3       NaN    3   NaN

And:
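The aggregation on this long frame can be sketched as below. This is a minimal sketch rather than the answer's own code: it uses a fixed copy of the example values from the question in place of the random data, and the names `totals` and `nonzero_counts` are chosen here for illustration.

```python
import pandas as pd

# Fixed copy of the example frame from the question (the original used random data).
df = pd.DataFrame({
    "grey2": [4, 4, 1, 4, 3],
    "red1":  [3, 2, 1, 4, 4],
    "blue1": [0, 0, 3, 1, 1],
    "red2":  [2, 4, 1, 4, 0],
    "red3":  [4, 0, 1, 1, 3],
    "blue2": [0, 3, 3, 1, 3],
    "grey1": [2, 1, 1, 1, 1],
})

data = pd.wide_to_long(df.reset_index(), stubnames=["grey", "red", "blue"],
                       i="index", j="group", sep="")

# Column-wise sum; the NaN padding (there is no grey3 or blue3) is skipped.
totals = data.sum()

# NaN > 0 evaluates to False, so padded cells are not counted.
nonzero_counts = data.gt(0).sum()

print(totals)
print(nonzero_counts)
```

Because each stub becomes one column, a plain column-wise reduction is enough and no further groupby is needed.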

Update

wide_to_long is just a convenient shortcut for melt and rename. So if you have a dictionary {cat: [col_list]}, you can solve the problem like this:

groups = {'blue' : ['col131', 'col254', 'col005'],
          'red' : ['col023', 'col190', 'col053']}

# create the inverse dictionary {column: group} for mapping
# (the lists are not hashable, so each column is unpacked individually)
inv_group = {col: grp for grp, cols in groups.items() for col in cols}

# reshape to long form: one row per (column, value) pair
data = df.melt()

# map the original columns to group
data['group'] = data['variable'].map(inv_group)

# from now on, it's similar to other answers
# sum
data.groupby('group')['value'].sum()

# count the values greater than zero
data['value'].gt(0).groupby(data['group']).sum()
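If reshaping a very wide frame is undesirable, the same {group: [columns]} dictionary can also be applied by selecting each group's columns directly. This is a different technique from the melt-based one above, shown only as a sketch; the column names and values below are hypothetical and made up for illustration.

```python
import pandas as pd

# Hypothetical data: names and values are invented for this sketch.
df = pd.DataFrame({
    "col131": [1, 0, 2],
    "col254": [0, 0, 3],
    "col005": [4, 1, 0],
    "col023": [0, 5, 0],
    "col190": [2, 2, 2],
    "col053": [0, 0, 1],
})

groups = {"blue": ["col131", "col254", "col005"],
          "red": ["col023", "col190", "col053"]}

# Sum every cell in each group's block of columns.
sums = pd.Series({name: df[cols].to_numpy().sum()
                  for name, cols in groups.items()})

# Count the cells greater than zero in each group's block of columns.
counts = pd.Series({name: (df[cols].to_numpy() > 0).sum()
                    for name, cols in groups.items()})

print(sums)
print(counts)
```

This loops once per group rather than once per column, so it should stay manageable for a hundred groups, and a misspelled column name raises a KeyError instead of being dropped silently.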

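The complication here is that you want to collapse across both rows and columns, which is generally hard to do at the same time. We can melt your wide format into a longer format, which reduces the problem to a single groupby: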

# Get rid of the numbers + reshape
df.columns = pd.Index(df.columns.str.rstrip('0123456789'), name='color')
df = df.melt()

df.groupby('color').sum()
#       value
#color       
#blue      15
#grey      22
#red       34

df.value.gt(0).groupby(df.color).sum()
#color
#blue     7.0
#grey    10.0
#red     13.0
#Name: value, dtype: float64
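Both results can also be computed in one pass with named aggregation, without overwriting df. This is a sketch under assumptions: it uses a fixed copy of the question's example values, and the names `long`, `summary`, `total` and `nonzero` are chosen here for illustration.

```python
import pandas as pd

# Fixed copy of the example frame from the question.
df = pd.DataFrame({
    "grey2": [4, 4, 1, 4, 3],
    "red1":  [3, 2, 1, 4, 4],
    "blue1": [0, 0, 3, 1, 1],
    "red2":  [2, 4, 1, 4, 0],
    "red3":  [4, 0, 1, 1, 3],
    "blue2": [0, 3, 3, 1, 3],
    "grey1": [2, 1, 1, 1, 1],
})

# Reshape into a new frame and derive the color by stripping trailing digits.
long = df.melt(var_name="column")
long["color"] = long["column"].str.rstrip("0123456789")

# One groupby that produces the sum and the count of positive values.
summary = long.groupby("color")["value"].agg(
    total="sum",
    nonzero=lambda s: int((s > 0).sum()),
)
print(summary)
```

Keeping the reshaped data in a separate variable leaves the original wide frame available for later steps.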

For names that are not so easy to group, we need a mapping somewhere, and the steps are very similar:

# Unnecessary in this case, but more general
d = {'grey1': 'color_1', 'grey2': 'color_1', 
     'red1': 'color_2', 'red2': 'color_2', 'red3': 'color_2',
     'blue1': 'color_3', 'blue2': 'color_3'}

df.columns = pd.Index(df.columns.map(d), name='color')
df = df.melt()
df.groupby('color').sum()

#         value
#color         
#color_1     22
#color_2     34
#color_3     15
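One thing to watch with a hand-written mapping: a column missing from the dictionary maps to NaN, and groupby drops NaN keys by default, so its values vanish from the totals without an error. The sketch below assumes a fixed copy of the question's example values and deliberately omits "blue2" from the mapping to show the effect.

```python
import pandas as pd

# Fixed copy of the example frame from the question.
df = pd.DataFrame({
    "grey2": [4, 4, 1, 4, 3],
    "red1":  [3, 2, 1, 4, 4],
    "blue1": [0, 0, 3, 1, 1],
    "red2":  [2, 4, 1, 4, 0],
    "red3":  [4, 0, 1, 1, 3],
    "blue2": [0, 3, 3, 1, 3],
    "grey1": [2, 1, 1, 1, 1],
})

# "blue2" is deliberately left out to illustrate the pitfall.
d = {"grey1": "color_1", "grey2": "color_1",
     "red1": "color_2", "red2": "color_2", "red3": "color_2",
     "blue1": "color_3"}

# Columns with no entry in the mapping.
unmapped = [c for c in df.columns if c not in d]
print(unmapped)

long = df.melt(var_name="column")
long["color"] = long["column"].map(d)  # unmapped columns become NaN

# NaN keys are dropped by default, so color_3 only contains blue1 here.
partial = long.groupby("color")["value"].sum()
print(partial)
```

Checking that `unmapped` is empty before aggregating is a cheap safeguard when the dictionary covers hundreds of columns.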

Use:

df.groupby(df.columns.str.replace(r'\d+', '', regex=True), axis=1).sum().sum()

Output:

blue    15
grey    22
red     34
dtype: int64

This works no matter how many digits the column names contain:

df = df.add_suffix('2222')
print(df)

   grey22222  red12222  blue12222  red22222  red32222  blue22222  grey12222
0          4         3          0         2         4          0          2
1          4         2          0         4         0          3          1
2          1         1          3         1         1          3          1
3          4         4          1         4         1          1          1
4          3         4          1         0         3          3          1

df.groupby(df.columns.str.replace(r'\d+', '', regex=True), axis=1).sum().sum()
blue    15
grey    22
red     34
dtype: int64
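On recent pandas releases, groupby(..., axis=1) is deprecated. An equivalent that avoids it is to transpose the frame and group the rows instead. The sketch below assumes a fixed copy of the question's example values; `keys`, `by_color` and `nonzero_by_color` are illustrative names.

```python
import pandas as pd

# Fixed copy of the example frame from the question.
df = pd.DataFrame({
    "grey2": [4, 4, 1, 4, 3],
    "red1":  [3, 2, 1, 4, 4],
    "blue1": [0, 0, 3, 1, 1],
    "red2":  [2, 4, 1, 4, 0],
    "red3":  [4, 0, 1, 1, 3],
    "blue2": [0, 3, 3, 1, 3],
    "grey1": [2, 1, 1, 1, 1],
})

# Group key for each column: its name with all digits removed.
keys = df.columns.str.replace(r"\d+", "", regex=True)

# After transposing, the original columns are rows, so a normal groupby works.
by_color = df.T.groupby(keys).sum().sum(axis=1)

# The same pattern on a boolean frame counts the values greater than zero.
nonzero_by_color = df.gt(0).T.groupby(keys).sum().sum(axis=1)

print(by_color)
print(nonzero_by_color)
```

The first sum collapses the columns within each color and the second collapses the rows, mirroring the .sum().sum() chain in the answer.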

For the general case, you can also do this:

colors = {'blue':['blue1','blue2'], 'red':['red1','red2','red3'], 'grey':['grey1','grey2']}
orig_columns = df.columns
df.columns = [key for col in df.columns for key in colors.keys() if col in colors[key]]
print(df.groupby(level=0,axis=1).sum().sum())
df.columns = orig_columns
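A variant that avoids temporarily renaming the columns is to reduce each column first and then group the resulting Series with a {column: group} dictionary. This is a sketch, assuming a fixed copy of the question's example values; `col_to_color` is a helper name introduced here.

```python
import pandas as pd

# Fixed copy of the example frame from the question.
df = pd.DataFrame({
    "grey2": [4, 4, 1, 4, 3],
    "red1":  [3, 2, 1, 4, 4],
    "blue1": [0, 0, 3, 1, 1],
    "red2":  [2, 4, 1, 4, 0],
    "red3":  [4, 0, 1, 1, 3],
    "blue2": [0, 3, 3, 1, 3],
    "grey1": [2, 1, 1, 1, 1],
})

colors = {"blue": ["blue1", "blue2"],
          "red": ["red1", "red2", "red3"],
          "grey": ["grey1", "grey2"]}

# Invert {color: [columns]} into {column: color}.
col_to_color = {col: color for color, cols in colors.items() for col in cols}

# df.sum() is indexed by column name; a dict passed to groupby maps each label to its group.
totals = df.sum().groupby(col_to_color).sum()
nonzero = df.gt(0).sum().groupby(col_to_color).sum()

print(totals)
print(nonzero)
```

Since df is never modified, there is no need to save and restore the original column names.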

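Wow, this looks very simple because I only used colors to name the columns. But if the columns had arbitrary names, maybe I could prepare a dictionary describing how they should be grouped. For example: blue = col001, col134, col567; red = col876, col324, col9876

@VMEscoli Since rstrip strips all the digits from the right, those columns would all still end up in one group, because every one of them becomes 'col'. But if you need groupings such as col001, col134, col56 and then col002, col007, col131, then clearly my option won't work. In that case you need to prepare a dictionary d = {'col001': 'label1', 'col134': 'label1', ...} and replace the first step with = ... df.columns.map(d). If the grouping really has no simple pattern, you will just have to write out the dictionary.

I had never heard of this function. I only used colors to name the columns, but if the columns have arbitrary names, can I use a dictionary in stubnames?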

df.groupby(df.columns.str.replace(r'\d+', '', regex=True), axis=1).sum().sum()