Python - vectorized date difference over millions of rows in two dataframes
I have two dataframes:
Date Variable
2013-04-01 05:00:00 S
2013-04-01 05:00:00 A
2013-04-01 05:10:00 S
2013-04-01 05:20:00 A
2013-04-01 05:25:00 S
2013-04-01 05:35:00 S
and:
My goal is to count, for each date in the second dataframe, the number of dates in the first dataframe that fall within the 20 minutes before and the 20 minutes after it. So I would need to iterate over all the dates in the second dataframe and, for each one, count how many dates in the first dataframe lie in those two windows. In addition, I want to count occurrences of variable A or S; in other words, the Nr_var_20_bef column is the number of dates in the 20 minutes before that carry the same variable. So the output would look like:
Date Variable Nr_20_bef Nr_20_aft Nr_var_20_bef Nr_var_20_after
2013-04-01 04:50:00 A 0 3 0 1
2013-04-01 05:00:00 A 2 4 1 2
2013-04-01 05:05:00 S 2 3 1 2
2013-04-01 05:15:00 S 3 3 2 2
2013-04-01 05:35:00 S 3 1 2 1
2013-04-01 05:40:00 S 3 0 2 0
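As a cross-check on the table above, one fully vectorized way to get the plain before/after counts is `np.searchsorted` on the sorted timestamps of the first frame. This is only a sketch: it assumes both windows are closed on both ends (which is what the expected output implies), and the frame/column names just mirror the sample data.

```python
import numpy as np
import pandas as pd

# Illustrative frames mirroring the question's sample data
df1 = pd.DataFrame({
    'Date': pd.to_datetime(['2013-04-01 05:00:00', '2013-04-01 05:00:00',
                            '2013-04-01 05:10:00', '2013-04-01 05:20:00',
                            '2013-04-01 05:25:00', '2013-04-01 05:35:00']),
    'Variable': ['S', 'A', 'S', 'A', 'S', 'S']})
df2 = pd.DataFrame({
    'Date': pd.to_datetime(['2013-04-01 04:50:00', '2013-04-01 05:00:00',
                            '2013-04-01 05:05:00', '2013-04-01 05:15:00',
                            '2013-04-01 05:35:00', '2013-04-01 05:40:00']),
    'Variable': ['A', 'A', 'S', 'S', 'S', 'S']})

delta = np.timedelta64(20, 'm')
ts = np.sort(df1['Date'].values)   # searchsorted needs sorted timestamps
d = df2['Date'].values

# Events in the closed window [t - 20min, t]
df2['Nr_20_bef'] = (np.searchsorted(ts, d, side='right')
                    - np.searchsorted(ts, d - delta, side='left'))
# Events in the closed window [t, t + 20min]
df2['Nr_20_aft'] = (np.searchsorted(ts, d + delta, side='right')
                    - np.searchsorted(ts, d, side='left'))
```

The Nr_var_* columns could then be obtained by repeating the two searchsorted calls once per variable value, restricting `ts` to that variable's timestamps.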
My main problem is that both dataframes have over a million rows, which means I can't use a for loop or pandas apply, as they are far too slow for dataframes this large. Many thanks in advance.

This is a tricky one! I can offer you a partial solution that will hopefully be enough to get you started. You should look into the pandas
rolling
method, which can take advantage of a DatetimeIndex. Note that, as far as I know, rolling windows can only look at past periods, not future ones. This solution counts, over a merged set of the foo
and bar
times, how many instances of the bar
column occurred in the previous 20 minutes, which I believe is what you're asking for:
import pandas as pd
import numpy as np
# Attempting to generate some similar data
np.random.seed(0)
rng = pd.date_range('4/1/2013', periods=1000, freq='5T', name='Date')
df = pd.DataFrame({'Variable': np.random.choice(['S', 'A'], 1000)}, index=rng)
df1 = df.sample(frac=0.5)
df2 = df.sample(frac=0.5)
merged = df1.merge(df2, how='outer', left_index=True, right_index=True, suffixes=['_foo', '_bar'])
# pandas can't count objects, but can count bools
m = merged.notnull()
# Rolling functions can't count "after", only "before" or "center"
merged['Nr_20_bef'] = m.Variable_bar.rolling('20T').sum()
print(merged.head(10))
#                     Variable_foo Variable_bar  Nr_20_bef
# Date
# 2013-04-01 00:05:00 A NaN 0.0
# 2013-04-01 00:10:00 A NaN 0.0
# 2013-04-01 00:15:00 NaN S 1.0
# 2013-04-01 00:20:00 A A 2.0
# 2013-04-01 00:25:00 A NaN 2.0
# 2013-04-01 00:40:00 NaN A 1.0
# 2013-04-01 00:45:00 A A 2.0
# 2013-04-01 00:50:00 NaN A 3.0
# 2013-04-01 01:05:00 NaN A 2.0
# 2013-04-01 01:10:00 S S 2.0
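Since the rolling window only looks backward, the "after" counts need a workaround. One option is to reverse the time axis so that "future" becomes "past", roll, then un-reverse the result. This is a sketch with illustrative data, and note the boundary differs from the trailing window: it counts events in [t, t + 20min), including the row's own timestamp.

```python
import pandas as pd

# Illustrative event times; 1.0 marks one event per timestamp
idx = pd.to_datetime(['2013-04-01 05:00', '2013-04-01 05:10',
                      '2013-04-01 05:25', '2013-04-01 05:35'])
s = pd.Series(1.0, index=idx)

# Remap t -> t_max - t so time runs forward again (rolling requires a
# monotonically increasing index), then roll and un-reverse the result
rev = pd.Series(s.values[::-1], index=pd.Timestamp(0) + (idx[-1] - idx[::-1]))
nr_aft = rev.rolling('20min').sum().values[::-1]
```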
Generating the Nr_20_bef
column is very fast: about 1 second for 10 million rows on my two-year-old laptop. If, for example, you only want to count the 'S' characters, you can do m = merged == 'S'
instead.
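For instance, restricting the count to 'S' rows only would look like this (a sketch with a small illustrative frame; the comparison produces booleans, and NaN compared to 'S' is simply False, so missing rows drop out of the count):

```python
import pandas as pd

# Small illustrative frame in the shape of the merged result above
idx = pd.date_range('2013-04-01 05:00', periods=4, freq='5min')
merged = pd.DataFrame({'Variable_bar': ['S', None, 'S', 'A']}, index=idx)

# Booleans instead of notnull(): count only the 'S' occurrences
m_s = merged['Variable_bar'] == 'S'
merged['Nr_S_20_bef'] = m_s.rolling('20min').sum()
```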