Python: linearly fill values backward using groupby
I have this df:
import pandas as pd

df = pd.DataFrame({"Time": [pd.NaT, '2020-04-09 06:46:00', '2020-04-09 06:46:00', '2020-04-09 06:46:00', '2020-04-09 06:46:00', '2020-04-09 06:46:00', pd.NaT, '2020-04-09 06:50:16.268515', '2020-04-09 06:50:16.268515', '2020-04-09 06:50:16.268515', '2020-04-09 06:50:16.268515'],
                   "Power": [0, 0, 0, 0, 4200, 4200, 0, 4200, 4200, 4200, 5000],
                   "Total Energy": [5200, 5200, 5200, 5200, 5500, 5600, 5600, 5600, 5600, 5900, 6100],
                   "ID": ['-', 1, 1, 1, 1, 1, '-', 2, 2, 2, 2],
                   "Energy": [0, 0, 0, 0, 300, 400, 0, 0, 0, 300, 500]},
                  index=pd.date_range(start="2020-04-09 6:45", periods=11, freq='min'))
# The strings mix second and microsecond precision; pandas >= 2.0 may need format='ISO8601' here.
df['Time'] = pd.to_datetime(df['Time'])
df['Power'] = pd.to_numeric(df['Power'], errors='coerce')
df['Total Energy'] = pd.to_numeric(df['Total Energy'], errors='coerce')
df['ID'] = pd.to_numeric(df['ID'], errors='coerce')  # '-' becomes NaN
df['Energy'] = pd.to_numeric(df['Energy'], errors='coerce')
df
Output:
                                          Time  Power  Total Energy   ID  Energy
2020-04-09 06:45:00                        NaT      0          5200  NaN       0
2020-04-09 06:46:00 2020-04-09 06:46:00.000000      0          5200  1.0       0
2020-04-09 06:47:00 2020-04-09 06:46:00.000000      0          5200  1.0       0
2020-04-09 06:48:00 2020-04-09 06:46:00.000000      0          5200  1.0       0
2020-04-09 06:49:00 2020-04-09 06:46:00.000000   4200          5500  1.0     300
2020-04-09 06:50:00 2020-04-09 06:46:00.000000   4200          5600  1.0     400
2020-04-09 06:51:00                        NaT      0          5600  NaN       0
2020-04-09 06:52:00 2020-04-09 06:50:16.268515   4200          5600  2.0       0
2020-04-09 06:53:00 2020-04-09 06:50:16.268515   4200          5600  2.0       0
2020-04-09 06:54:00 2020-04-09 06:50:16.268515   4200          5900  2.0     300
2020-04-09 06:55:00 2020-04-09 06:50:16.268515   5000          6100  2.0     500
I want to fill df['Energy'] linearly (starting from 0), grouped by the df['Time'] column.
Expected result:
                                          Time  Power  Total Energy   ID  Energy
2020-04-09 06:45:00                        NaT      0          5200  NaN       0
2020-04-09 06:46:00 2020-04-09 06:46:00.000000      0          5200  1.0       0
2020-04-09 06:47:00 2020-04-09 06:46:00.000000      0          5200  1.0     100
2020-04-09 06:48:00 2020-04-09 06:46:00.000000      0          5200  1.0     200
2020-04-09 06:49:00 2020-04-09 06:46:00.000000   4200          5500  1.0     300
2020-04-09 06:50:00 2020-04-09 06:46:00.000000   4200          5600  1.0     400
2020-04-09 06:51:00                        NaT      0          5600  NaN       0
2020-04-09 06:52:00 2020-04-09 06:50:16.268515   4200          5600  2.0       0
2020-04-09 06:53:00 2020-04-09 06:50:16.268515   4200          5600  2.0     150
2020-04-09 06:54:00 2020-04-09 06:50:16.268515   4200          5900  2.0     300
2020-04-09 06:55:00 2020-04-09 06:50:16.268515   5000          6100  2.0     500
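I tried this:
df['Energy'] = df.groupby('Time')['Energy'].apply(lambda x: x.interpolate())
but it doesn't work.

The problem is not in your code but in the data and in how interpolation is being used.
interpolate() fills NA values in a DataFrame or Series. In your DataFrame, however, the Energy series contains 0s rather than NaN, and interpolation leaves those untouched.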
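To make that concrete, here is a minimal sketch on two toy Series (the names with_zeros and with_gaps are made up for illustration and are not from your data). It only shows that interpolate() treats 0 as a real value and NaN as a gap:

```python
import numpy as np
import pandas as pd

# Zeros are ordinary values, so interpolate() has nothing to fill.
with_zeros = pd.Series([0.0, 0.0, 0.0, 300.0, 400.0])
print(with_zeros.interpolate().tolist())  # [0.0, 0.0, 0.0, 300.0, 400.0]

# NaN marks a gap; the default linear method fills it between the known neighbours.
with_gaps = pd.Series([0.0, np.nan, np.nan, 300.0, 400.0])
print(with_gaps.interpolate().tolist())  # [0.0, 100.0, 200.0, 300.0, 400.0]
```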
I made a small change to your data to demonstrate. Note that the Energy series now has np.nan in the places that need to be interpolated.
import numpy as np
import pandas as pd

df = pd.DataFrame({"Time": [pd.NaT, '2020-04-09 06:46:00', '2020-04-09 06:46:00', '2020-04-09 06:46:00', '2020-04-09 06:46:00', '2020-04-09 06:46:00', pd.NaT, '2020-04-09 06:50:16.268515', '2020-04-09 06:50:16.268515', '2020-04-09 06:50:16.268515', '2020-04-09 06:50:16.268515'],
                   "Power": [0, 0, 0, 0, 4200, 4200, 0, 4200, 4200, 4200, 5000],
                   "Total Energy": [5200, 5200, 5200, 5200, 5500, 5600, 5600, 5600, 5600, 5900, 6100],
                   "ID": ['-', 1, 1, 1, 1, 1, '-', 2, 2, 2, 2],
                   "Energy": [np.nan, 0, np.nan, np.nan, 300, 400, np.nan, 0, np.nan, 300, 500]},
                  index=pd.date_range(start="2020-04-09 6:45", periods=11, freq='min'))
# Then apply the same pd.to_datetime / pd.to_numeric conversions as in the question.
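Now when you run this
# group_keys=False keeps the original index so the result aligns with df on recent pandas versions
df['Energy'] = df.groupby('Time', group_keys=False)['Energy'].apply(lambda x: x.interpolate())
print(df)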
you will get the following:
                                          Time  Power  Total Energy   ID  Energy
2020-04-09 06:45:00                        NaT      0          5200  NaN     NaN
2020-04-09 06:46:00 2020-04-09 06:46:00.000000      0          5200  1.0     0.0
2020-04-09 06:47:00 2020-04-09 06:46:00.000000      0          5200  1.0   100.0
2020-04-09 06:48:00 2020-04-09 06:46:00.000000      0          5200  1.0   200.0
2020-04-09 06:49:00 2020-04-09 06:46:00.000000   4200          5500  1.0   300.0
2020-04-09 06:50:00 2020-04-09 06:46:00.000000   4200          5600  1.0   400.0
2020-04-09 06:51:00                        NaT      0          5600  NaN     NaN
2020-04-09 06:52:00 2020-04-09 06:50:16.268515   4200          5600  2.0     0.0
2020-04-09 06:53:00 2020-04-09 06:50:16.268515   4200          5600  2.0   150.0
2020-04-09 06:54:00 2020-04-09 06:50:16.268515   4200          5900  2.0   300.0
2020-04-09 06:55:00 2020-04-09 06:50:16.268515   5000          6100  2.0   500.0
I don't know where your data comes from or what you intend to do with it, so I have not suggested how to change the data structure any further. Depending on your goal, there are many ways to do this.
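As one possible way to do this while keeping the original zeros in the data, the sketch below marks every 0 after the first row of each Time group as a gap and then interpolates within the group. That rule is an assumption that happens to fit the sample data; it is not something the question states. The helper names (has_group, first_in_group, gap, energy, filled) are made up for illustration, and only the Time and Energy columns are rebuilt here to keep it short.

```python
import numpy as np
import pandas as pd

t1 = pd.Timestamp("2020-04-09 06:46:00")
t2 = pd.Timestamp("2020-04-09 06:50:16.268515")
df = pd.DataFrame(
    {
        "Time": [pd.NaT] + [t1] * 5 + [pd.NaT] + [t2] * 4,
        "Energy": [0, 0, 0, 0, 300, 400, 0, 0, 0, 300, 500],
    },
    index=pd.date_range(start="2020-04-09 06:45", periods=11, freq="min"),
)

# Assumption: the first row of each Time group is a genuine 0,
# and any later 0 in the same group is a value that still has to be filled.
has_group = df["Time"].notna()
first_in_group = ~df["Time"].duplicated()
gap = has_group & ~first_in_group & (df["Energy"] == 0)

# Turn the gaps into NaN so that interpolate() has something to fill.
energy = df["Energy"].astype(float).mask(gap)

# Interpolate within each Time group; rows whose Time is NaT are left out of the grouping.
filled = (
    energy[has_group]
    .groupby(df.loc[has_group, "Time"])
    .transform(lambda s: s.interpolate())
)

# Rows without a group keep their original value.
df["Energy"] = filled.reindex(df.index).fillna(energy)
print(df)
```

On the sample data this gives 0, 100, 200, 300, 400 for the first group and 0, 150, 300, 500 for the second, with the NaT rows left at 0. If a real 0 can occur in the middle of a group, the gap rule would need to change.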