Python: fixing the length of groups in a pandas groupby

python python-3.x pandas

I have a grouped pandas DataFrame:

id    date    temperature
1  2011-9-12   12
   2011-9-18   12
   2011-9-19   12
2  2011-9-12   15
3  2011-9-12   15
   2011-9-16   15

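Here, each id has a different number of temperature records.

I want to fix them to a common length, say the average number of records per id (for example, 3). If an id has fewer records than that, I want to pad with zeros at the beginning.

I.e., my final DataFrame should be: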

id    temperature
1     12
      12
      12
2     0
      0
      15
3     0
      15
      15

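I need to set the number of records per id to some chosen number, which could also be the average number of records per id. How can I get that average?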

Just use stack and unstack:

df.groupby(level=0)['temperature'].\
      apply(list).\
         apply(pd.Series).iloc[:,:3].\
                 apply(lambda x : pd.Series(sorted(x,key=pd.notnull)),1).\
                   fillna(0).stack().reset_index(level=0)
Out[523]: 
   id     0
0   1  12.0
1   1  12.0
2   1  12.0
0   2   0.0
1   2   0.0
2   2  15.0
0   3   0.0
1   3  15.0
2   3  15.0
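
The snippet above hard-codes the target length as 3, while the question also asks how to get the average number of records per id. A minimal sketch of one way to compute it, assuming the frame is indexed by id and date as in the question. The sample frame is rebuilt here only so the example runs on its own, and rounding to the nearest integer is my assumption, not something the question specifies:

```python
import pandas as pd

# Sample data from the question, with id and date as a MultiIndex.
index = pd.MultiIndex.from_tuples(
    [(1, '2011-9-12'), (1, '2011-9-18'), (1, '2011-9-19'),
     (2, '2011-9-12'), (3, '2011-9-12'), (3, '2011-9-16')],
    names=['id', 'date'],
)
df = pd.DataFrame({'temperature': [12, 12, 12, 15, 15, 15]}, index=index)

sizes = df.groupby(level='id').size()  # number of records per id
n = int(round(sizes.mean()))           # assumption: round to the nearest integer
print(sizes.tolist(), n)
```

For this sample the group sizes are 3, 1 and 2, so the average is 2 rather than the 3 used in the question's example. The resulting n could then replace the literal 3 in the snippets here.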

A faster NumPy solution

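import numpy as np
import pandas as pd

n = 3  # target number of rows per id
s = df.groupby(level=0)['temperature'].apply(list)
# Keep at most n records per id; longer groups would otherwise raise a ValueError below
s1 = [l[:n] for l in s.tolist()]
arr = np.zeros((len(s1), n), int)
lens = [n - len(l) for l in s1]                 # number of leading zeros per id
mask = np.arange(n) >= np.array(lens)[:, None]  # True where real values go
arr[mask] = np.concatenate(s1)
pd.DataFrame({'id': s.index.repeat(n), 'temperature': arr.ravel()})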
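
To see how the boolean mask places the values, here is a small trace on the sample data. This is only an illustration with the per-id lists written out by hand (the name `groups` is mine), not part of the answer itself:

```python
import numpy as np

n = 3
groups = [[12, 12, 12], [15], [15, 15]]      # temperature lists for ids 1, 2, 3

pad = np.array([n - len(g) for g in groups])  # leading zeros needed per id
# Row i is False for the first pad[i] positions and True afterwards.
mask = np.arange(n) >= pad[:, None]

arr = np.zeros((len(groups), n), dtype=int)
# Boolean assignment fills the True positions in row-major order, so the
# flattened values end up right-aligned within each row.
arr[mask] = np.concatenate(groups)
print(mask)
print(arr)
```

The assignment only works when the number of True cells equals the number of values supplied, so every list must have at most n elements.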

While iterating over the groupby groups, we can use reindex with range(3). After that, we sort the values so that the NaN rows come first, and then use fillna to set them to 0:

df_new = pd.concat([
    d[['id', 'temperature']].reset_index(drop=True).reindex(range(3)).sort_values('id', na_position='first')
    for _, d in df.groupby('id')
], ignore_index=True)

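# Assign the results back instead of using inplace=True on a column;
# bfill() replaces the deprecated fillna(method='bfill')
df_new['id'] = df_new['id'].bfill()
df_new['temperature'] = df_new['temperature'].fillna(0)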

print(df_new)
    id  temperature
0  1.0         12.0
1  1.0         12.0
2  1.0         12.0
3  2.0          0.0
4  2.0          0.0
5  2.0         15.0
6  3.0          0.0
7  3.0         15.0
8  3.0         15.0
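
The per-group step can be followed on a single id. A minimal sketch, assuming a group with two records for id 3. The frame `d` is built by hand for illustration, and `kind='stable'` is my addition so that the original row order is kept among equal ids:

```python
import pandas as pd

# One group as it would come out of df.groupby('id'), after reset_index(drop=True).
d = pd.DataFrame({'id': [3, 3], 'temperature': [15, 15]})

padded = d.reindex(range(3))  # adds row 2, filled with NaN
# Move the NaN row to the top; a stable sort keeps the order of the real rows.
padded = padded.sort_values('id', na_position='first', kind='stable')
padded = padded.reset_index(drop=True)

padded['id'] = padded['id'].bfill()                    # copy the id into the padding row
padded['temperature'] = padded['temperature'].fillna(0)
print(padded)
```

Because reindex introduces NaN, both columns become float, which is why the output above shows 1.0 and 12.0 rather than integers.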

Note that you have id and date as the index, so first run:

df.reset_index(inplace=True)

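Thank you very much. How do I control the number of rows per id? Also, the order of the zeros is wrong.

@tstseby So the dates are not meaningful here?

@Wen_ben Yes. Since I am adding zero records, the date is not required; I just want to make sure every id has the same number of records.

@tstseby Added a fast NumPy solution. :-)

Thanks, but I get ValueError: NumPy boolean array indexing assignment cannot assign input values to the output values...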