
Python: pandas groupby and sum within groups


Suppose I have a DataFrame that looks like this:

    interview       longitude        latitude
1   A1                  34.2             90.2
2   A1                  54.2             23.5
3   A3                  32.1             21.5
4   A4                  54.3             93.1
5   A2                  45.1             29.5
6   A1                  NaN              NaN
7   A7                  NaN              NaN
8   A1                  NaN              NaN
9   A3                  23.1             38.2
10  A5                  -23.7            -98.4
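I would like to perform some kind of groupby that gives me the total number of present (non-NaN) values in each subgroup. The desired output for the data above would be: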

    interview         longitude         latitude       occurs 
1   A1                  2                2              4
2   A2                  1                1              1
3   A3                  2                2              2
4   A4                  1                1              1
5   A5                  1                1              1    
6   A7                  0                0              1
I tried the following command, using latitude, but it did not give the desired output:

df.groupby(by=['interview', 'latitude'])['interview'].count()

Thanks

groupby + sum

s1 = df[['longitude', 'latitude']].notna().groupby(df['interview']).sum()
s2 = df.groupby(df['interview']).size()  # note: size counts the NaN rows as well
pd.concat([s1, s2.to_frame('occurs')], axis=1)
Out[115]: 
           longitude  latitude  occurs
interview                             
A1               2.0       2.0       4
A2               1.0       1.0       1
A3               2.0       2.0       2
A4               1.0       1.0       1
A5               1.0       1.0       1
A7               0.0       0.0       1

Here are three different ways to do it:

import pandas as pd
import numpy as np

# Build the frame from a dict of lists so the numeric columns stay float;
# stacking strings and floats in one np.array would turn NaN into the string 'nan'
df = pd.DataFrame({
    'interview': ['A1', 'A1', 'A3', 'A4', 'A2', 'A1', 'A7', 'A1', 'A3', 'A5'],
    'longitude': [34.2, 54.2, 32.1, 54.3, 45.1, np.nan, np.nan, np.nan, 23.1, -23.7],
    'latitude': [90.2, 23.5, 21.5, 93.1, 29.5, np.nan, np.nan, np.nan, 38.2, -98.4],
})

# First way: add a helper column of ones, then count non-null values per group
df['occurs'] = 1
print(df.groupby('interview')[['longitude', 'latitude', 'occurs']].count().reset_index())

# Or: join the group sizes with the per-column non-null counts
gb = df.groupby('interview')
counts = gb.size().to_frame(name='occurs')
print(counts.join(gb[['longitude', 'latitude']].count()).reset_index())

# Second way: 'size' counts NaN rows too, so every column equals the group size
# (this does not match the desired output; shown to contrast with count)
print(counts
      .join(gb.agg({'longitude': 'size'}))
      .join(gb.agg({'latitude': 'size'}))
      .reset_index())

# Third way, for comparison: sum the non-null flags and take the group size
print(df.groupby('interview')
        .agg({'longitude': lambda x: x.notnull().sum(),
              'latitude': lambda x: x.notnull().sum(),
              'interview': 'size'})
        .rename(columns={'interview': 'occurs'}))

See the code here:

There is no need for agg; just select the columns on the groupby object. count returns the number of non-null values.

df.groupby('interview')[['interview','longitude','latitude']].count()


        interview   longitude   latitude
interview           
A1      4           2           2
A2      1           1           1
A3      2           2           2
A4      1           1           1
A5      1           1           1
A7      1           0           0
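If having a column that shares the index name `interview` is confusing, one alternative is named aggregation (available since pandas 0.25), which lets you name the output columns directly. This is only a sketch and not part of the answer above; the frame is rebuilt from the question's table and the name `summary` is mine. It also shows the difference the answer relies on: `count` skips NaN, while `size` counts every row.

```python
import numpy as np
import pandas as pd

# Rebuild the question's data (values copied from the table in the question)
df = pd.DataFrame({
    'interview': ['A1', 'A1', 'A3', 'A4', 'A2', 'A1', 'A7', 'A1', 'A3', 'A5'],
    'longitude': [34.2, 54.2, 32.1, 54.3, 45.1, np.nan, np.nan, np.nan, 23.1, -23.7],
    'latitude': [90.2, 23.5, 21.5, 93.1, 29.5, np.nan, np.nan, np.nan, 38.2, -98.4],
})

summary = df.groupby('interview').agg(
    longitude=('longitude', 'count'),  # non-null longitudes per interview
    latitude=('latitude', 'count'),    # non-null latitudes per interview
    occurs=('longitude', 'size'),      # all rows per interview, NaN included
)
print(summary)
```

For `A7`, whose only row has no coordinates, the two `count` columns come out as 0 while `occurs` is 1.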

I didn't know you could select the grouping column itself on the groupby. That's useful.