Python 3.x: single record to multiple records in pandas


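I'm new to Python and have to implement the following logic. I know how to write it as a SQL query, but I need to know how to do it in pandas.

I get output like this from a query: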

startdatetime,endatetime,value
2019-03-26 23:00:00.000,2019-03-27 01:00:00.000,37.86
2019-03-27 01:00:00.000,2019-03-27 03:00:00.000,37.91
2019-03-27 03:00:00.000,2019-03-27 05:00:00.000,34.54
I need to split each datetime range into 15-minute intervals, keeping the same value. For example, the first row becomes:

startdatetime,endatetime,value
2019-03-26 23:00:00.000,2019-03-26 23:15:00.000,37.86
2019-03-26 23:15:00.000,2019-03-26 23:30:00.000,37.86
2019-03-26 23:30:00.000,2019-03-26 23:45:00.000,37.86
2019-03-26 23:45:00.000,2019-03-27 00:00:00.000,37.86
2019-03-27 00:00:00.000,2019-03-27 00:15:00.000,37.86
2019-03-27 00:15:00.000,2019-03-27 00:30:00.000,37.86
2019-03-27 00:30:00.000,2019-03-27 00:45:00.000,37.86
2019-03-27 00:45:00.000,2019-03-27 01:00:00.000,37.86
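Use Index.repeat with the datetime difference converted to a number of 15-minute steps, then add 15-minute timedeltas, built with GroupBy.cumcount and to_timedelta, to startdatetime. For endatetime, just shift the values and fill the last NaN of each group with the original value: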

import pandas as pd

df['startdatetime'] = pd.to_datetime(df['startdatetime'])
df['endatetime'] = pd.to_datetime(df['endatetime'])

# number of 15-minute slices per row; repeat needs integers
v = ((df['endatetime'] - df['startdatetime']).dt.total_seconds() // (60 * 15)).astype(int)
df = df.loc[df.index.repeat(v)]
df['startdatetime'] += pd.to_timedelta(df.groupby(level=0).cumcount(), unit='s') * 15 * 60
# next start within the same source row; the last slice keeps the original end
df['endatetime'] = df.groupby(level=0)['startdatetime'].shift(-1).fillna(df['endatetime'])
df = df.reset_index(drop=True)
print(df)
         startdatetime          endatetime  value
0  2019-03-26 23:00:00 2019-03-26 23:15:00  37.86
1  2019-03-26 23:15:00 2019-03-26 23:30:00  37.86
2  2019-03-26 23:30:00 2019-03-26 23:45:00  37.86
3  2019-03-26 23:45:00 2019-03-27 00:00:00  37.86
4  2019-03-27 00:00:00 2019-03-27 00:15:00  37.86
5  2019-03-27 00:15:00 2019-03-27 00:30:00  37.86
6  2019-03-27 00:30:00 2019-03-27 00:45:00  37.86
7  2019-03-27 00:45:00 2019-03-27 01:00:00  37.86
8  2019-03-27 01:00:00 2019-03-27 01:15:00  37.91
9  2019-03-27 01:15:00 2019-03-27 01:30:00  37.91
10 2019-03-27 01:30:00 2019-03-27 01:45:00  37.91
11 2019-03-27 01:45:00 2019-03-27 02:00:00  37.91
12 2019-03-27 02:00:00 2019-03-27 02:15:00  37.91
13 2019-03-27 02:15:00 2019-03-27 02:30:00  37.91
14 2019-03-27 02:30:00 2019-03-27 02:45:00  37.91
15 2019-03-27 02:45:00 2019-03-27 03:00:00  37.91
16 2019-03-27 03:00:00 2019-03-27 03:15:00  34.54
17 2019-03-27 03:15:00 2019-03-27 03:30:00  34.54
18 2019-03-27 03:30:00 2019-03-27 03:45:00  34.54
19 2019-03-27 03:45:00 2019-03-27 04:00:00  34.54
20 2019-03-27 04:00:00 2019-03-27 04:15:00  34.54
21 2019-03-27 04:15:00 2019-03-27 04:30:00  34.54
22 2019-03-27 04:30:00 2019-03-27 04:45:00  34.54
23 2019-03-27 04:45:00 2019-03-27 05:00:00  34.54

There are many ways to do this; here is my take.

First, let's recreate your data:

import pandas as pd
df = pd.DataFrame([
    ('2019-03-26 23:00:00.000','2019-03-27 01:00:00.000',37.86),
    ('2019-03-27 01:00:00.000','2019-03-27 03:00:00.000',37.91),
    ('2019-03-27 03:00:00.000','2019-03-27 05:00:00.000',34.54)
], columns=['startdatetime','enddatetime','value'])
df['startdatetime'] = pd.to_datetime(df['startdatetime'])
df['enddatetime'] = pd.to_datetime(df['enddatetime'])
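Now, intuitively, I would follow one of these two approaches:

  • Apply syntax: we expand each row into its own group of rows. This feels very intuitive to me, but it is usually not very fast.
  • Join syntax: we create the time intervals and join the values onto them. This is closer to the SQL style. I have added code for this one below.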
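The Apply approach is only described above, so here is a minimal sketch of how it might look. It assumes the column names from the recreated data (startdatetime, enddatetime, value) and that every range is a whole multiple of 15 minutes; split_row is a hypothetical helper name, not a pandas function.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ("2019-03-26 23:00:00.000", "2019-03-27 01:00:00.000", 37.86),
        ("2019-03-27 01:00:00.000", "2019-03-27 03:00:00.000", 37.91),
        ("2019-03-27 03:00:00.000", "2019-03-27 05:00:00.000", 34.54),
    ],
    columns=["startdatetime", "enddatetime", "value"],
)
df["startdatetime"] = pd.to_datetime(df["startdatetime"])
df["enddatetime"] = pd.to_datetime(df["enddatetime"])

step = pd.Timedelta(minutes=15)


def split_row(row):
    # Hypothetical helper: list the 15-minute slice starts inside one source row.
    # The row's end stamp is left out because it does not start a slice.
    return list(pd.date_range(row["startdatetime"], row["enddatetime"] - step, freq=step))


# One list of slice starts per row, then one output row per list element
out = (
    df.assign(startdatetime=df.apply(split_row, axis=1))
    .explode("startdatetime")
    .reset_index(drop=True)
)
out["startdatetime"] = pd.to_datetime(out["startdatetime"])
out["enddatetime"] = out["startdatetime"] + step
print(out)
```

The row-wise apply is easy to read but runs Python code once per source row, which is why it tends to be slower than the join on large inputs.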
Join

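We create the range and join with the flexible merge_asof. It is a non-strict merge that allows joining on ranges. It works well for your example; if your real data differs, you may need some adjustments.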

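# 15-minute slice starts; the last stamp is the final end time, so drop it
starts = pd.date_range(start=df.startdatetime.min(), end=df.enddatetime.max(), freq='15min')[:-1]
df_range = pd.DataFrame({'startdatetime': starts})
# backward as-of join: each slice takes the latest source row starting at or before it
result = pd.merge_asof(df_range, df, on='startdatetime')
result['enddatetime'] = result['startdatetime'] + pd.Timedelta(minutes=15)  # end of each slice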
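One adjustment that real data may need: if the source intervals have gaps, a backward as-of join carries the previous value into the gap. The sketch below uses made-up data with a two-hour hole and drops every slice that falls outside the interval it was matched to. The names src, grid and slice_start are mine and purely illustrative.

```python
import pandas as pd

# Hypothetical data with a gap: nothing covers 01:00 to 03:00
src = pd.DataFrame(
    {
        "startdatetime": pd.to_datetime(["2019-03-26 23:00", "2019-03-27 03:00"]),
        "enddatetime": pd.to_datetime(["2019-03-27 01:00", "2019-03-27 05:00"]),
        "value": [37.86, 34.54],
    }
)

# Candidate 15-minute slice starts, cast so both join keys share one datetime dtype
stamps = pd.date_range(src["startdatetime"].min(), src["enddatetime"].max(), freq="15min")
grid = pd.DataFrame({"slice_start": stamps.astype(src["startdatetime"].dtype)})

# Backward as-of join: each slice picks the latest source row that starts at or before it
joined = pd.merge_asof(grid, src, left_on="slice_start", right_on="startdatetime")

# Keep only slices that really lie inside the matched source interval
inside = joined[joined["slice_start"] < joined["enddatetime"]]

result = pd.DataFrame(
    {
        "startdatetime": inside["slice_start"],
        "enddatetime": inside["slice_start"] + pd.Timedelta(minutes=15),
        "value": inside["value"],
    }
).reset_index(drop=True)
print(result)
```

The same filter also removes the trailing slice that starts exactly at the final end time, so no separate trimming of the grid is needed here.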

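This looks like time series data. That means there will be problems in the source data sooner or later, and relying on the source data being error-free eventually becomes a problem for real-world systems.

So resampling is a reasonable way to process this data and be prepared for the inevitable jitter.

It also gives you the opportunity to step in and act on the data at each stage.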

import pandas as pd
from io import StringIO

csvdata = StringIO("""startdatetime,endatetime,value
2019-03-26 23:00:00.000,2019-03-27 01:00:00.000,37.86
2019-03-27 01:00:00.000,2019-03-27 03:00:00.000,37.91
2019-03-27 03:00:00.000,2019-03-27 05:00:00.000,34.54""")

df = pd.read_csv(csvdata, sep=",", index_col="startdatetime", parse_dates=True)

# flexibility to statistically pick resampled values should the index
# not be on a fifteen minute boundary
# note: the resampled index only spans the first to the last startdatetime,
# so the final source interval contributes a single row
df = df.resample('15min').last()
df = df.reset_index()

# now that the DataFrame has a fifteen minute freq index, use it to make the end interval
df['endatetime'] = df['startdatetime'] + pd.Timedelta(minutes=15)

# flexibility to fill missing values
df['value'] = df['value'].ffill()

# results
print(df)
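Resampling on the start stamps alone stops at the last startdatetime, so the final source interval is not split into slices. One possible way around this, offered as a sketch rather than as part of the answer above, is to add a sentinel stamp at the final end time before resampling and drop it afterwards. The names series, sentinel, grid and out are illustrative.

```python
import pandas as pd
from io import StringIO

csvdata = StringIO("""startdatetime,endatetime,value
2019-03-26 23:00:00.000,2019-03-27 01:00:00.000,37.86
2019-03-27 01:00:00.000,2019-03-27 03:00:00.000,37.91
2019-03-27 03:00:00.000,2019-03-27 05:00:00.000,34.54""")
df = pd.read_csv(csvdata, parse_dates=["startdatetime", "endatetime"])

# Values indexed by interval start, plus one empty sentinel at the final end time
series = df.set_index("startdatetime")["value"]
sentinel = pd.Series([float("nan")], index=[df["endatetime"].max()])
series = pd.concat([series, sentinel])

# The sentinel stretches the 15-minute grid across the last interval;
# forward fill the values, then drop the sentinel row itself
grid = series.resample("15min").last().ffill().iloc[:-1]

out = pd.DataFrame(
    {
        "startdatetime": grid.index,
        "endatetime": grid.index + pd.Timedelta(minutes=15),
        "value": grid.to_numpy(),
    }
)
print(out)
```

With the sample data this yields one row per 15-minute slice from 23:00 up to 05:00, matching the layout requested in the question.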


Did the answer below help? :) Yes, it certainly did :)