Python: reindexing time series data
I have a similar problem to an earlier question for which no solution was provided. I have an Excel file containing multiple rows and columns of weather data. Although the sample does not show it, data is missing at some time intervals. I want to reindex the time column at 5-minute intervals so that I can interpolate the missing values.
This is what I tried:
import pandas as pd
ts = pd.read_excel(r'E:\DATA\AP.xlsx')
ts['Time'] = pd.to_datetime(ts['Time'])
ts.set_index('Time', inplace=True)
dt = pd.date_range("2018-04-01 00:00:00", "2018-05-01 00:00:00", freq='5min', name='T')
idx = pd.DatetimeIndex(dt)
ts.reindex(idx)
I just want my index at a 5-minute frequency so that I can interpolate the NaN values later. Expected output:
Date Time Temp Hum Dewpnt WindSpd
04/01/18 12:05 a 30.6 49 18.7 2.7
04/01/18 12:10 a NaN 51 19.3 1.3
04/01/18 12:15 a NaN NaN NaN NaN
04/01/18 12:20 a 30.7 NaN 19.1 2.2
04/01/18 12:25 a NaN NaN NaN NaN
04/01/18 12:30 a 30.7 51 19.4 2.2
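For reference, the reindex attempt in the question can be made to work if the result of reindex is assigned, since reindex returns a new DataFrame rather than changing ts in place. The following is a minimal sketch, assuming made-up readings and only two of the columns from the expected output; the variable name ts5 is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up readings; the 00:15 and 00:25 rows are missing on purpose.
times = pd.to_datetime([
    "2018-04-01 00:05", "2018-04-01 00:10",
    "2018-04-01 00:20", "2018-04-01 00:30",
])
ts = pd.DataFrame(
    {"Temp": [30.6, np.nan, 30.7, 30.7], "Hum": [49.0, 51.0, np.nan, 51.0]},
    index=pd.DatetimeIndex(times, name="Time"),
)

# date_range already returns a DatetimeIndex, so no extra wrapping is needed.
idx = pd.date_range("2018-04-01 00:05", "2018-04-01 00:30", freq="5min", name="Time")

# reindex returns a new DataFrame, so keep the result.
ts5 = ts.reindex(idx)
print(ts5)
```

Note that reindex only keeps rows whose timestamps match the new index exactly, so a reading logged at, say, 00:07 would be dropped rather than moved to a 5-minute slot.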
You could try the following, for example:
import pandas as pd
ts = pd.read_excel(r'E:\DATA\AP.xlsx')
ts['Time'] = pd.to_datetime(ts['Time'])
ts.set_index('Time', inplace=True)
# '5min' is the current spelling of the older '5T' alias
ts = ts.resample('5min').mean()
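To see what the resample call does without the Excel file, here is a small sketch on made-up data (the values and the name resampled are assumptions for illustration). Empty 5-minute bins come back as NaN rows, which is the shape needed for later interpolation.

```python
import numpy as np
import pandas as pd

# Made-up readings with gaps at 00:15 and 00:25.
ts = pd.DataFrame(
    {"Temp": [30.6, np.nan, 30.7, 30.7], "Hum": [49.0, 51.0, np.nan, 51.0]},
    index=pd.to_datetime([
        "2018-04-01 00:05", "2018-04-01 00:10",
        "2018-04-01 00:20", "2018-04-01 00:30",
    ]),
)

# Each 5-minute bin is averaged; bins with no readings become NaN.
resampled = ts.resample("5min").mean()
print(resampled)
```

Because mean() averages whatever falls into a bin, non-numeric columns (such as a text Date column) may need to be dropped or excluded first.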
Set the time column as the index, make sure it is of datetime type, and then try:
ts.asfreq('5min')
Use a forward fill to pull the previous values forward.
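The two steps can be sketched as follows on made-up data. The names gaps and filled are hypothetical; asfreq and its method parameter are part of pandas.

```python
import pandas as pd

# Made-up temperature readings with missing 5-minute slots.
ts = pd.DataFrame(
    {"Temp": [30.6, 30.7, 30.9]},
    index=pd.to_datetime([
        "2018-04-01 00:05", "2018-04-01 00:20", "2018-04-01 00:30",
    ]),
)

# Insert the missing slots; the new rows are NaN.
gaps = ts.asfreq("5min")

# Same grid, but each new row copies the previous available row.
filled = ts.asfreq("5min", method="ffill")

print(gaps)
print(filled)
```

If the goal is to interpolate later, the plain asfreq result is probably the one to keep, since a forward fill replaces the gaps with repeated values.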
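I would take the approach of creating a blank table and filling it with the data from your source. In this example, three observations are read in as NaN, and the rows for 1:15 and 1:25 are missing.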
import pandas as pd
import numpy as np
rawpd = pd.read_excel('raw.xlsx')
print(rawpd)
        Date      Time  Col1  Col2
0 2018-04-01  01:00:00   1.0  10.0
1 2018-04-01  01:05:00   2.0   NaN
2 2018-04-01  01:10:00   NaN  10.0
3 2018-04-01  01:20:00   NaN  10.0
4 2018-04-01  01:30:00   5.0  10.0
Now create a DataFrame, targpd, with the desired structure:
time5min = pd.date_range(start='2018/04/1 01:00',periods=7,freq='5min')
targpd = pd.DataFrame(np.nan,index = time5min,columns=['Col1','Col2'])
print(targpd)
                     Col1  Col2
2018-04-01 01:00:00   NaN   NaN
2018-04-01 01:05:00   NaN   NaN
2018-04-01 01:10:00   NaN   NaN
2018-04-01 01:15:00   NaN   NaN
2018-04-01 01:20:00   NaN   NaN
2018-04-01 01:25:00   NaN   NaN
2018-04-01 01:30:00   NaN   NaN
The trick now is to update targpd with the data you were sent in rawpd. For that to work, the Date and Time columns in rawpd must be combined and turned into an index.
print(rawpd.Date, rawpd.Time)
0   2018-04-01
1   2018-04-01
2   2018-04-01
3   2018-04-01
4   2018-04-01
Name: Date, dtype: datetime64[ns]
0    01:00:00
1    01:05:00
2    01:10:00
3    01:20:00
4    01:30:00
Name: Time, dtype: object
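You can see the catch in the output above: the Date data has been converted to datetime, but the Time data is still a plain object column. The lambda function below combines the two into a proper index.
from datetime import datetime  # pd.datetime is no longer available in current pandas
# Assumes each Time cell is a datetime.time value, as read from Excel
rawidx = rawpd.apply(lambda r: datetime.combine(r['Date'], r['Time']), axis=1)
print(rawidx)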
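As an alternative to the row-by-row lambda, the two columns can also be combined in one vectorized step. This is only a sketch under assumptions: the small rawpd frame below is made up, and the Time cells are assumed to be datetime.time values (or 'HH:MM:SS' strings), not '12:05 AM' style text.

```python
import datetime as dt
import pandas as pd

# Made-up stand-in for the frame read from Excel.
rawpd = pd.DataFrame({
    "Date": pd.to_datetime(["2018-04-01", "2018-04-01", "2018-04-01"]),
    "Time": [dt.time(1, 0), dt.time(1, 5), dt.time(1, 20)],
    "Col1": [1.0, 2.0, 5.0],
})

# str(time) gives 'HH:MM:SS', which to_timedelta can parse;
# adding the result to the date yields full timestamps.
rawidx = rawpd["Date"] + pd.to_timedelta(rawpd["Time"].astype(str))
print(rawidx)
```

The resulting Series can be used as the index in the same way as the one produced by the lambda.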
This can then be applied as the index of the rawpd data:
rawpd2 = pd.DataFrame(rawpd[['Col1', 'Col2']].values, index=rawidx, columns=['Col1', 'Col2'])
rawpd2 = rawpd2.sort_index()
print(rawpd2)
Once this is in place, the update command gives you what you want:
targpd.update(rawpd2,overwrite=True)
print(targpd)
                     Col1  Col2
2018-04-01 01:00:00   1.0  10.0
2018-04-01 01:05:00   2.0   NaN
2018-04-01 01:10:00   NaN  10.0
2018-04-01 01:15:00   NaN   NaN
2018-04-01 01:20:00   NaN  10.0
2018-04-01 01:25:00   NaN   NaN
2018-04-01 01:30:00   5.0  10.0
You now have a table that is ready for interpolation.
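The interpolation step itself is not shown in the answer. One way it might look, assuming linear interpolation weighted by time is acceptable for the data, is sketched below with the same Col1 and Col2 values as the final table; the name filled is hypothetical.

```python
import numpy as np
import pandas as pd

# Rebuild the final 5-minute table from the example above.
time5min = pd.date_range(start="2018-04-01 01:00", periods=7, freq="5min")
targpd = pd.DataFrame(
    {
        "Col1": [1.0, 2.0, np.nan, np.nan, np.nan, np.nan, 5.0],
        "Col2": [10.0, np.nan, 10.0, np.nan, 10.0, np.nan, 10.0],
    },
    index=time5min,
)

# method="time" interpolates linearly using the spacing of the DatetimeIndex.
filled = targpd.interpolate(method="time")
print(filled)
```

For long outages, interpolating across the whole gap may not be desirable; the limit parameter of interpolate can cap how many consecutive NaN values are filled.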
Here is another approach:
df['Time'] = pd.to_datetime(df['Time'])
df = df.set_index(['Time']).resample('5min').last().reset_index()
df['Time'] = df['Time'].dt.time
df
Output:
Time Date Temp Hum Dewpnt WindSpd
0 00:05:00 4/1/2018 30.6 49.0 18.7 2.7
1 00:10:00 4/1/2018 NaN 51.0 19.3 1.3
2 00:15:00 NaN NaN NaN NaN NaN
3 00:20:00 4/1/2018 30.7 NaN 19.1 2.2
4 00:25:00 NaN NaN NaN NaN NaN
5 00:30:00 4/1/2018 30.7 51.0 19.4 2.2
6 00:35:00 NaN NaN NaN NaN NaN
7 00:40:00 4/1/2018 30.9 51.0 19.6 0.9
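If you have to resample times that span several dates, you can use the code below. However, you will have to split the Date and Time columns apart again afterwards. Note that the format string reads 4/1/2018 day-first, as 4 January 2018; for month-first dates use '%m/%d/%Y%I:%M %p' instead.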
df1['DateTime'] = df1['Date']+df1['Time']
df1['DateTime'] = pd.to_datetime(df1['DateTime'],format='%d/%m/%Y%I:%M %p')
df1 = df1.set_index(['DateTime']).resample('5min').last().reset_index()
df1
Output:
DateTime Date Time Temp Hum Dewpnt WindSpd
0 2018-01-04 00:05:00 4/1/2018 12:05 AM 30.6 49.0 18.7 2.7
1 2018-01-04 00:10:00 4/1/2018 12:10 AM NaN 51.0 19.3 1.3
2 2018-01-04 00:15:00 NaN NaN NaN NaN NaN NaN
3 2018-01-04 00:20:00 4/1/2018 12:20 AM 30.7 NaN 19.1 2.2
4 2018-01-04 00:25:00 NaN NaN NaN NaN NaN NaN
5 2018-01-04 00:30:00 4/1/2018 12:30 AM 30.7 51.0 19.4 2.2
6 2018-01-04 00:35:00 NaN NaN NaN NaN NaN NaN
7 2018-01-04 00:40:00 4/1/2018 12:40 AM 30.9 51.0 19.6 0.9
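Splitting the combined column back into Date and Time is not shown in the answer. A minimal sketch, assuming the resampled frame has a DateTime column as in the output above (the three rows here are made up), could be:

```python
import pandas as pd

# Made-up stand-in for the resampled frame; the last row is an inserted gap row.
df1 = pd.DataFrame({
    "DateTime": pd.to_datetime([
        "2018-01-04 00:05", "2018-01-04 00:10", "2018-01-04 00:15",
    ]),
    "Temp": [30.6, 30.8, None],
})

# Rebuild both columns from DateTime so the gap rows get a Date and Time too.
df1["Date"] = df1["DateTime"].dt.date
df1["Time"] = df1["DateTime"].dt.time
print(df1)
```

Deriving both columns from DateTime, rather than keeping the original text columns, avoids the NaN entries that the inserted rows otherwise show under Date and Time.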
I got it working. Thanks, everyone, for your time. Here is the working code:
import pandas as pd
df = pd.read_excel(r'E:\DATA\AP.xlsx', sheet_name='Sheet1', parse_dates=[['Date', 'Time']])
df = df.set_index(['Date_Time']).resample('5min').last().reset_index()
print(df)