Python: linearly filling values with groupby


I have this df:

import pandas as pd

df = pd.DataFrame({"Time": [pd.NaT, '2020-04-09 06:46:00', '2020-04-09 06:46:00', '2020-04-09 06:46:00', '2020-04-09 06:46:00', '2020-04-09 06:46:00', pd.NaT, '2020-04-09 06:50:16.268515', '2020-04-09 06:50:16.268515', '2020-04-09 06:50:16.268515', '2020-04-09 06:50:16.268515'],
                  "Power": [0, 0, 0, 0, 4200, 4200, 0, 4200, 4200, 4200, 5000],
                  "Total Energy": [5200, 5200, 5200, 5200, 5500, 5600, 5600, 5600, 5600, 5900, 6100],
                  "ID": ['-', 1, 1, 1, 1, 1, '-', 2, 2, 2, 2],
                  "Energy": [0, 0, 0, 0, 300, 400, 0, 0, 0, 300, 500]},
                  index=pd.date_range(start="2020-04-09 6:45", periods=11, freq='min'))

# On pandas >= 2.0, strings of mixed precision may need format='ISO8601' here.
df['Time'] = pd.to_datetime(df['Time'])
df['Power'] = pd.to_numeric(df['Power'], errors='coerce')
df['Total Energy'] = pd.to_numeric(df['Total Energy'], errors='coerce')
# The '-' placeholders in ID become NaN.
df['ID'] = pd.to_numeric(df['ID'], errors='coerce')
df['Energy'] = pd.to_numeric(df['Energy'], errors='coerce')

df

Output:

                                          Time  Power   Total Energy     ID Energy
2020-04-09 06:45:00                        NaT     0            5200    NaN      0
2020-04-09 06:46:00 2020-04-09 06:46:00.000000     0            5200    1.0      0
2020-04-09 06:47:00 2020-04-09 06:46:00.000000     0            5200    1.0      0
2020-04-09 06:48:00 2020-04-09 06:46:00.000000     0            5200    1.0      0
2020-04-09 06:49:00 2020-04-09 06:46:00.000000  4200            5500    1.0    300
2020-04-09 06:50:00 2020-04-09 06:46:00.000000  4200            5600    1.0    400
2020-04-09 06:51:00                        NaT     0            5600    NaN      0
2020-04-09 06:52:00 2020-04-09 06:50:16.268515  4200            5600    2.0      0
2020-04-09 06:53:00 2020-04-09 06:50:16.268515  4200            5600    2.0      0
2020-04-09 06:54:00 2020-04-09 06:50:16.268515  4200            5900    2.0    300
2020-04-09 06:55:00 2020-04-09 06:50:16.268515  5000            6100    2.0    500
I want to fill the df['Energy'] column linearly (starting from 0), grouped by the df['Time'] column.

Expected result:

                                          Time  Power   Total Energy     ID Energy
2020-04-09 06:45:00                        NaT     0            5200    NaN      0
2020-04-09 06:46:00 2020-04-09 06:46:00.000000     0            5200    1.0      0
2020-04-09 06:47:00 2020-04-09 06:46:00.000000     0            5200    1.0    100
2020-04-09 06:48:00 2020-04-09 06:46:00.000000     0            5200    1.0    200
2020-04-09 06:49:00 2020-04-09 06:46:00.000000  4200            5500    1.0    300
2020-04-09 06:50:00 2020-04-09 06:46:00.000000  4200            5600    1.0    400
2020-04-09 06:51:00                        NaT     0            5600    NaN      0
2020-04-09 06:52:00 2020-04-09 06:50:16.268515  4200            5600    2.0      0
2020-04-09 06:53:00 2020-04-09 06:50:16.268515  4200            5600    2.0    150
2020-04-09 06:54:00 2020-04-09 06:50:16.268515  4200            5900    2.0    300
2020-04-09 06:55:00 2020-04-09 06:50:16.268515  5000            6100    2.0    500

I tried this:

df['Energy'] = df.groupby('Time')['Energy'].apply(lambda x: x.interpolate())

but it doesn't work.

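The problem is not in your code but in the data and in how interpolate() is being used.

interpolate() fills NA values in a DataFrame or Series. In your DataFrame the Energy column holds 0s rather than NaN, and interpolate() leaves those untouched.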
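To make that concrete, here is a minimal sketch on two made-up Series (the names `with_zeros` and `with_nans` are only for illustration). The zeros pass through unchanged, while the NaN gaps are filled linearly between their neighbours:

```python
import numpy as np
import pandas as pd

# Zeros are ordinary values, so interpolate() has nothing to fill.
with_zeros = pd.Series([0.0, 0.0, 0.0, 300.0])
# NaN marks a gap, which interpolate() fills between the known points.
with_nans = pd.Series([0.0, np.nan, np.nan, 300.0])

print(with_zeros.interpolate().tolist())  # [0.0, 0.0, 0.0, 300.0]
print(with_nans.interpolate().tolist())   # [0.0, 100.0, 200.0, 300.0]
```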

I made a small change to your data to demonstrate. Note that the Energy column now has np.nan in the places that need to be interpolated.

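import numpy as np
import pandas as pd

df = pd.DataFrame({"Time": [pd.NaT, '2020-04-09 06:46:00', '2020-04-09 06:46:00', '2020-04-09 06:46:00', '2020-04-09 06:46:00', '2020-04-09 06:46:00', pd.NaT, '2020-04-09 06:50:16.268515', '2020-04-09 06:50:16.268515', '2020-04-09 06:50:16.268515', '2020-04-09 06:50:16.268515'],
                  "Power": [0, 0, 0, 0, 4200, 4200, 0, 4200, 4200, 4200, 5000],
                  "Total Energy": [5200, 5200, 5200, 5200, 5500, 5600, 5600, 5600, 5600, 5900, 6100],
                  "ID": ['-', 1, 1, 1, 1, 1, '-', 2, 2, 2, 2],
                  "Energy": [np.nan, 0, np.nan, np.nan, 300, 400, np.nan, 0, np.nan, 300, 500]},
                  index=pd.date_range(start="2020-04-09 6:45", periods=11, freq='min'))
# Apply the same pd.to_datetime / pd.to_numeric conversions as in the question here.

Now when you run this:

# group_keys=False keeps the original index so the result aligns with df
# (pandas >= 2.0 otherwise prepends the group key to the index).
# Rows whose Time is NaT are dropped by groupby and come back as NaN.
df['Energy'] = df.groupby('Time', group_keys=False)['Energy'].apply(lambda x: x.interpolate())
print(df)

you get the following: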

                                      Time  Power  Total Energy   ID  Energy
2020-04-09 06:45:00                        NaT      0          5200  NaN     NaN
2020-04-09 06:46:00 2020-04-09 06:46:00.000000      0          5200  1.0     0.0
2020-04-09 06:47:00 2020-04-09 06:46:00.000000      0          5200  1.0   100.0
2020-04-09 06:48:00 2020-04-09 06:46:00.000000      0          5200  1.0   200.0
2020-04-09 06:49:00 2020-04-09 06:46:00.000000   4200          5500  1.0   300.0
2020-04-09 06:50:00 2020-04-09 06:46:00.000000   4200          5600  1.0   400.0
2020-04-09 06:51:00                        NaT      0          5600  NaN     NaN
2020-04-09 06:52:00 2020-04-09 06:50:16.268515   4200          5600  2.0     0.0
2020-04-09 06:53:00 2020-04-09 06:50:16.268515   4200          5600  2.0   150.0
2020-04-09 06:54:00 2020-04-09 06:50:16.268515   4200          5900  2.0   300.0
2020-04-09 06:55:00 2020-04-09 06:50:16.268515   5000          6100  2.0   500.0

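I don't know where your data comes from or what you intend to do with it, so I haven't suggested how to restructure it. Depending on your goal, there are many ways to do this.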
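As one possible way, here is a sketch that derives the NaN gaps from the original zeros instead of editing the data by hand. It rests on two assumptions that are not confirmed by the question: a 0 in Energy is a placeholder unless it is the first row of its group, and rows sharing a Time value are contiguous. The names `run_id`, `is_first` and `placeholder` are made up for this example, and only the Time and Energy columns are reproduced.

```python
import pandas as pd

t1 = pd.Timestamp("2020-04-09 06:46:00")
t2 = pd.Timestamp("2020-04-09 06:50:16.268515")
df = pd.DataFrame(
    {
        "Time": [pd.NaT] + [t1] * 5 + [pd.NaT] + [t2] * 4,
        "Energy": [0, 0, 0, 0, 300, 400, 0, 0, 0, 300, 500],
    },
    index=pd.date_range("2020-04-09 06:45", periods=11, freq="min"),
)

# Label each contiguous run of identical Time values. NaT never compares
# equal to its neighbour, so every NaT row becomes a run of its own.
run_id = df["Time"].ne(df["Time"].shift()).cumsum()

# Assumption: a zero is a real value only on the first row of a run.
is_first = ~run_id.duplicated()
placeholder = df["Energy"].eq(0) & ~is_first

# Turn placeholder zeros into NaN, then interpolate within each run.
energy = df["Energy"].astype("float64").mask(placeholder)
df["Energy"] = energy.groupby(run_id).transform(lambda s: s.interpolate())

print(df)
```

Grouping on `run_id` rather than on Time itself avoids the rows with a NaT key being dropped by groupby, so the NaT rows keep their 0 instead of turning into NaN. If the same Time value can appear in non-adjacent rows, this run-based key would split it into separate groups, so it would need to be replaced by a different grouping.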
