Python/Pandas: Extracting intervals from a large DataFrame


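I have two DataFrames:

- df: 20 million rows of continuous time series data with a datetime index
- df_seq: 20,000 rows with two timestamps each

I want to use the second DataFrame to extract all sequences from the first one, that is, all rows of df that lie between the two timestamps of each row of df_seq. Each sequence then needs to be transposed into 990 columns, and all sequences have to be combined into a new DataFrame.

So the new DataFrame has one row with 990 columns for each sequence (the case row gets added later).

Right now my code looks like this: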

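import pandas as pd

sequences = pd.DataFrame()

# For each interval: slice df by timestamp, keep the first 990 rows, turn them into one row.
# Note: DataFrame.append was removed in pandas 2.0, so this loop only runs on older versions.
for row in df_seq.itertuples(index=True, name='Pandas'):
    sequences = sequences.append(df.loc[row.date:row.end_date].reset_index(drop=True)[:990].transpose())

sequences = sequences.reset_index(drop=True)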
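This code works, but it is very slow: execution takes 20-25 minutes.

Is there a way to rewrite this using vectorized operations? Or is there another way to improve the performance of this code?

Here is one approach. The large DataFrame is "df" and the intervals are called "intervals":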

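import pandas as pd

# Sample data: one row per second
inx = pd.date_range(start="2020-01-01", freq="1s", periods=1000)
df = pd.DataFrame(range(len(inx)), index=inx)
df.index.name = "timestamp"

intervals = pd.DataFrame([("2020-01-01 00:00:12","2020-01-01 00:00:18"),
                   ("2020-01-01 00:01:20","2020-01-01 00:02:03")],
                  columns=["start_time", "end_time"])

intervals.start_time = pd.to_datetime(intervals.start_time)
intervals.end_time = pd.to_datetime(intervals.end_time)
intervals

# Latest start_time at or before each timestamp (backward search is the default)
t = pd.merge_asof(df.reset_index(), intervals[["start_time"]], left_on="timestamp", right_on="start_time")
# Earliest end_time at or after each timestamp
t = pd.merge_asof(t, intervals[["end_time"]], left_on="timestamp", right_on="end_time", direction="forward")

# Drops rows before the first start and after the last end (their match is NaT).
# Caveat: start and end are matched independently, so rows in a gap between two intervals also pass.
t = t[(t.timestamp >= t.start_time) & (t.timestamp <= t.end_time)]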
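One thing worth checking on the sample data: a timestamp that falls in a gap between two intervals still has an earlier start and a later end, so matching the two columns independently lets it through the filter. The sketch below compares that with a variant that carries each interval's own end_time along in a single merge_asof. It assumes the intervals do not overlap; the helper names `rows_independent` and `rows_paired` and the column name `value` are made up for illustration.

```python
import pandas as pd

# Toy data (assumed for illustration): one row per second, values 0..999.
inx = pd.date_range(start="2020-01-01", freq="1s", periods=1000)
df = pd.DataFrame({"value": range(len(inx))}, index=inx)
df.index.name = "timestamp"

# Two intervals: seconds 12..18 and seconds 80..123 (00:01:20 to 00:02:03).
# Taken from the index itself so both sides share the same datetime dtype.
intervals = pd.DataFrame({"start_time": inx[[12, 80]], "end_time": inx[[18, 123]]})


def rows_independent(df, intervals):
    """Hypothetical helper: match start and end separately, then filter."""
    t = pd.merge_asof(df.reset_index(), intervals[["start_time"]],
                      left_on="timestamp", right_on="start_time")
    t = pd.merge_asof(t, intervals[["end_time"]],
                      left_on="timestamp", right_on="end_time", direction="forward")
    return t[(t.timestamp >= t.start_time) & (t.timestamp <= t.end_time)]


def rows_paired(df, intervals):
    """Hypothetical helper: rows of df inside one of the non-overlapping intervals."""
    # merge_asof needs both sides sorted by their key.
    left = df.reset_index().sort_values("timestamp")
    right = intervals.sort_values("start_time")
    # Attach the latest interval that started at or before each timestamp,
    # together with that interval's own end_time.
    t = pd.merge_asof(left, right, left_on="timestamp", right_on="start_time")
    # Rows before the first start get NaT, which compares as False and drops out.
    return t[t["timestamp"] <= t["end_time"]]


independent = rows_independent(df, intervals)
paired = rows_paired(df, intervals)
print(len(independent), len(paired))
```

On this sample the independent version keeps every row from second 12 through second 123, while the paired version keeps only the rows inside the two intervals. If every interval is followed immediately by the next one, or the sequences are cut to a fixed length afterwards, the difference may not show up in the final result.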

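After the steps from the answer above, I added a groupby and an unstack, and the result is exactly the df I need.

Execution time is about 30 seconds.

The full code looks like this: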

import pandas as pd

# Assumes df exposes its timestamps as a sorted "timestamp" column (use df.reset_index() if they are in the index)
# and that df_seq is sorted by "date" and "end_date", as merge_asof requires.
sequences = pd.merge_asof(df, df_seq[["date"]], left_on="timestamp", right_on="date")
sequences = pd.merge_asof(sequences, df_seq[["end_date"]], left_on="timestamp", right_on="end_date", direction="forward")
# Keep rows that have a start before them and an end after them (this also keeps rows in gaps between intervals)
sequences = sequences[(sequences.timestamp >= sequences.date) & (sequences.timestamp <= sequences.end_date)]

# One row per interval, one column per position.
# .iloc[:, :990] keeps exactly 990 columns (the label slice .loc[:, :990] is inclusive and keeps 991).
sequences = sequences.groupby('date')['feature_1'].apply(lambda df_temp: df_temp.reset_index(drop=True)).unstack().iloc[:, :990]
sequences = sequences.reset_index(drop=True)
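The reshaping step (one row per sequence, a fixed number of columns) can be tried in isolation on a few rows. The sketch below is not the code above: it swaps the groupby/apply/unstack chain for `groupby().cumcount()` plus `pivot`, which avoids calling a Python function per group. The column names `date` and `feature_1` come from the code above; the toy values, the helper name `to_wide` and its `width` parameter are assumptions for illustration.

```python
import pandas as pd

# Toy long-format data (assumed): three sequences of different length, keyed by "date".
long_df = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01"] * 3 + ["2020-01-02"] * 5 + ["2020-01-03"] * 2),
    "feature_1": [1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0, 14.0, 20.0, 21.0],
})


def to_wide(long_df, width):
    """Hypothetical helper: one row per sequence, at most `width` columns."""
    # Position of each row inside its own sequence: 0, 1, 2, ...
    pos = long_df.groupby("date").cumcount()
    # Drop everything beyond the first `width` values of each sequence.
    kept = long_df.assign(pos=pos)[pos < width]
    # Spread the positions into columns; shorter sequences are padded with NaN.
    wide = kept.pivot(index="date", columns="pos", values="feature_1")
    return wide.reset_index(drop=True)


wide = to_wide(long_df, width=4)
print(wide)
```

For the data in the question, `width` would be 990. Truncating before the pivot also keeps the intermediate frame from growing as wide as the longest sequence.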

How about adding a sample DataFrame for us?