Python: group rows into multiple groups by a binary class column


I have a DataFrame that looks like this:

id class
A   1
B   1
C   0 
D   0
E   1
F   1

I want to split it into 3 groups: G1: A, B; G2: C, D; G3: E, F. Is there a way to loop over all rows and assign a new class to each id?

You can use … and …:

Performance tests:

These timings depend heavily on the size of df and on the number (and position) of 0s and 1s:

Test with len(df) = 10:

In [28]: %timeit jez(df)
The slowest run took 5.08 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 454 µs per loop

In [29]: %timeit eze(df)
The slowest run took 4.83 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 422 µs per loop

In [30]: %timeit sy2(df)
The slowest run took 4.57 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 1.46 ms per loop

Test with len(df) = 10000:

In [32]: %timeit jez(df)
The slowest run took 4.78 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 543 µs per loop

In [33]: %timeit eze(df)
1 loops, best of 3: 245 ms per loop

In [34]: %timeit sy2(df)
The slowest run took 4.11 times longer than the fastest. This could mean that an intermediate result is being cached 
100 loops, best of 3: 9.11 ms per loop
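For readers who want something concrete to time, here is a minimal sketch of one common vectorized way to label consecutive runs: compare each value with the previous one via `shift`, then take a cumulative sum of the change flags. The helper name `label_runs` is hypothetical and is not claimed to be the `jez`, `eze`, or `sy2` function measured above.

```python
import pandas as pd


def label_runs(df, key="class"):
    """Hypothetical helper: return a 0-based run id for each row of df[key]."""
    # True wherever the value differs from the previous row (always True for row 0,
    # because the shifted value there is NaN).
    changed = df[key].ne(df[key].shift())
    # The cumulative sum of the flags numbers the runs 1, 2, 3, ...; subtract 1 for 0-based ids.
    return changed.cumsum() - 1


df = pd.DataFrame({"id": list("ABCDEF"), "class": [1, 1, 0, 0, 1, 1]})
df["group"] = label_runs(df)
print(df)
```

Because it compares values rather than subtracting them, this sketch should also work for non-numeric class labels.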

Iterate over 'class' and start a new group every time the class differs from the previous one, for example:

The DataFrame:

import pandas as pd
df = pd.DataFrame()
df['id'] = ['a','b','c','d','e','f']
df['class'] = [1,1,0,0,1,1]

Iterate over 'class' to build the group index:

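# Read the classes by position so the loop does not depend on the index labels.
classes = df['class'].tolist()
group_index = [0]  # the first row always opens group 0
for i in range(1, len(classes)):
    if classes[i] == classes[i - 1]:
        # Same class as the previous row: stay in the current group.
        group_index.append(group_index[-1])
    else:
        # The class changed: start a new group.
        group_index.append(group_index[-1] + 1)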

Add the group index to the DataFrame:

df['group_index'] = group_index

The output should be:

    id  class   group_index
  0 a     1        0
  1 b     1        0
  2 c     0        1
  3 d     0        1
  4 e     1        2
  5 f     1        2
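Once a group index like the one above exists, the G1/G2/G3 lists asked for in the question can be read off with `groupby`. This is a small sketch: the `G1`-style labels and the `groups` dictionary are illustrative names, and the frame is rebuilt here with the `group_index` values from the output above so the snippet runs on its own.

```python
import pandas as pd

# Rebuild the frame with the group index shown in the output above.
df = pd.DataFrame({
    "id": ["a", "b", "c", "d", "e", "f"],
    "class": [1, 1, 0, 0, 1, 1],
    "group_index": [0, 0, 1, 1, 2, 2],
})

# Map each group to a label G1, G2, ... and collect the ids that belong to it.
groups = {f"G{k + 1}": g["id"].tolist() for k, g in df.groupby("group_index")}
print(groups)
```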

Here is a one-liner :P It uses the difference between adjacent rows and a cumulative sum to assign a group ID to each row.

>>> df = pd.DataFrame({'id': ['A','B','C','D','E','F'],
                       'class': [1, 1, 0, 0, 1, 1]},
                       columns=['id', 'class'])

>>> pd.concat([df, pd.Series(map(lambda x: 1 if abs(x) > 0 else 0,
df['class'].diff().fillna(0)), name='groupid').cumsum()], axis=1)

  id  class  groupid
0  A      1        0
1  B      1        0
2  C      0        1
3  D      0        1
4  E      1        2
5  F      1        2
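To see why the diff-plus-cumsum idea works, it can help to look at the intermediate values. The sketch below breaks the one-liner into steps; the variable names `step`, `changed`, and `groupid` are illustrative, and `ne(0)` stands in for the `abs(x) > 0` lambda.

```python
import pandas as pd

df = pd.DataFrame({"id": list("ABCDEF"), "class": [1, 1, 0, 0, 1, 1]})

# Difference to the previous row: NaN for the first row, non-zero where the class flips.
step = df["class"].diff()

# Treat the first row as "no change", then flag every non-zero difference.
changed = step.fillna(0).ne(0)

# Each True bumps the running total by one, so every run of equal classes shares an id.
groupid = changed.cumsum()

print(pd.DataFrame({"class": df["class"], "diff": step,
                    "changed": changed, "groupid": groupid}))
```

Note that `diff()` needs numeric classes; for string labels a comparison against the shifted column would be needed instead.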

Now you can call groupby() to get a GroupBy object:

>>> g = pd.concat([df, pd.Series(map(lambda x: 1 if abs(x) > 0 else 0,
df['class'].diff().fillna(0)), name='groupid').cumsum()], axis=1).groupby('groupid')

>>> for index, group_df in g:
        print(group_df)

  id  class  groupid
0  A      1        0
1  B      1        0
  id  class  groupid
2  C      0        1
3  D      0        1
  id  class  groupid
4  E      1        2
5  F      1        2

The full code follows:

import pandas as pd

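def groupby_binaryflag(df, key='class'):
    # Flag each row whose value in `key` differs from the previous row, then
    # take the cumulative sum so every run of equal values shares one group id.
    return pd.concat([df,
                      pd.Series(map(lambda x: 1
                                    if abs(x) > 0
                                    else 0, df[key].diff().fillna(0)),
                                name='groupid').cumsum()], axis=1).groupby('groupid')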

if __name__ == '__main__':
    df1 = pd.DataFrame({'id': ['A','B','C','D','E','F'],
                        'class': [1, 1, 0, 0, 1, 1]}, columns=['id', 'class'])

    df2 = pd.DataFrame({'id': ['A','B','C','D','E','F', 'G', 'H', 'I', 'J', 'K', 'L'],
                        'class': [1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1]}, columns=['id', 'class'])

    for df in [df1, df2]:
        for index, group_df in groupby_binaryflag(df):
            print(group_df)
        print("=====\n")

Output:

  id  class  groupid
0  A      1        0
1  B      1        0
  id  class  groupid
2  C      0        1
3  D      0        1
  id  class  groupid
4  E      1        2
5  F      1        2
=====

  id  class  groupid
0  A      1        0
1  B      1        0
  id  class  groupid
2  C      0        1
3  D      0        1
  id  class  groupid
4  E      1        2
5  F      1        2
  id  class  groupid
6  G      0        3
7  H      0        3
8  I      0        3
   id  class  groupid
9   J      1        4
10  K      1        4
11  L      1        4
=====

Could you post the desired output so that we can better understand your question?