Python: how to resample/rebin rather than groupby intervals in pandas


I have a df with StartDate and EndDate columns:

df.loc[:,['StartDate','EndDate']].head()
Out[92]: 
                    StartDate                    EndDate
0 2016-05-19 14:19:14.820002 2016-05-19 14:19:17.899999
1 2016-05-19 14:19:32.119999 2016-05-19 14:19:37.020002
I'd like to get a df2 at some arbitrary frequency where each bin contains the amount of time in that bin that is covered by the (StartDate, EndDate) intervals, e.g.

Of course,

groupby(StartDate.date.dt)['Duration']

where 'Duration' is 'EndDate' - 'StartDate', doesn't work.

import numpy as np
import pandas as pd

df = pd.DataFrame({'StartDate':['2016-05-19 14:19:14.820002','2016-05-19 14:19:32.119999', '2016-05-19 14:19:17.899999'],
                   'EndDate':['2016-05-19 14:19:17.899999', '2016-05-19 14:19:37.020002', '2016-05-19 14:19:18.5']})

# Put StartDate and EndDate in one column, with +1 for every start and -1 for every end
df2 = pd.melt(df, var_name='type', value_name='date')
df2['date'] = pd.to_datetime(df2['date'])
df2['sign'] = np.where(df2['type']=='StartDate', 1, -1)

# Index containing every whole second plus the event timestamps themselves
min_date = df2['date'].min().to_period('1s').to_timestamp()
index = pd.date_range(min_date, df2['date'].max(), freq='1s').union(df2['date'])

# Aggregate signs at identical timestamps (avoids "cannot reindex from a duplicate axis"),
# keeping only the numeric sign column
df2 = df2.groupby('date')[['sign']].sum()
df2 = df2.reindex(index)

# weight = number of intervals open at each point in time
df2['weight'] = df2['sign'].fillna(0).cumsum()

# Time elapsed until the next row, counted only while an interval is open
df2['duration'] = 0.0
df2.iloc[:-1, df2.columns.get_loc('duration')] = (df2.index[1:] - df2.index[:-1]).total_seconds()
df2['duration'] = df2['duration'] * df2['weight']

# Bin into 1-second buckets
df2 = df2.resample('1s').sum()

print(df2)
yields

                     sign  weight  duration
2016-05-19 14:19:14   1.0     1.0  0.179998
2016-05-19 14:19:15   0.0     1.0  1.000000
2016-05-19 14:19:16   0.0     1.0  1.000000
2016-05-19 14:19:17   0.0     3.0  1.000000
2016-05-19 14:19:18  -1.0     1.0  0.500000
2016-05-19 14:19:19   0.0     0.0  0.000000
2016-05-19 14:19:20   0.0     0.0  0.000000
2016-05-19 14:19:21   0.0     0.0  0.000000
2016-05-19 14:19:22   0.0     0.0  0.000000
2016-05-19 14:19:23   0.0     0.0  0.000000
2016-05-19 14:19:24   0.0     0.0  0.000000
2016-05-19 14:19:25   0.0     0.0  0.000000
2016-05-19 14:19:26   0.0     0.0  0.000000
2016-05-19 14:19:27   0.0     0.0  0.000000
2016-05-19 14:19:28   0.0     0.0  0.000000
2016-05-19 14:19:29   0.0     0.0  0.000000
2016-05-19 14:19:30   0.0     0.0  0.000000
2016-05-19 14:19:31   0.0     0.0  0.000000
2016-05-19 14:19:32   1.0     1.0  0.880001
2016-05-19 14:19:33   0.0     1.0  1.000000
2016-05-19 14:19:34   0.0     1.0  1.000000
2016-05-19 14:19:35   0.0     1.0  1.000000
2016-05-19 14:19:36   0.0     1.0  1.000000
2016-05-19 14:19:37  -1.0     1.0  0.020002
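
As a quick sanity check (my own sketch, not part of the original answer): since the intervals don't overlap here, the duration column should sum to the total length of the input intervals, which can be computed directly:

```python
import pandas as pd

df = pd.DataFrame({'StartDate': ['2016-05-19 14:19:14.820002', '2016-05-19 14:19:32.119999', '2016-05-19 14:19:17.899999'],
                   'EndDate':   ['2016-05-19 14:19:17.899999', '2016-05-19 14:19:37.020002', '2016-05-19 14:19:18.5']})

# Total covered time across all (StartDate, EndDate) intervals
total = (pd.to_datetime(df['EndDate']) - pd.to_datetime(df['StartDate'])).dt.total_seconds().sum()
print(round(total, 6))  # 8.580001, matching the sum of the duration column above
```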

The main idea is to put StartDate and EndDate in one column, assigning +1 to each start date and -1 to each end date:

df2 = pd.melt(df, var_name='type', value_name='date')
df2['date'] = pd.to_datetime(df2['date'])
df2['sign'] = np.where(df2['type']=='StartDate', 1, -1)
#         type                       date  sign
# 0  StartDate 2016-05-19 14:19:14.820002     1
# 1  StartDate 2016-05-19 14:19:32.119999     1
# 2    EndDate 2016-05-19 14:19:17.899999    -1
# 3    EndDate 2016-05-19 14:19:37.020002    -1
Now set date as the index and reindex the dataframe to include all timestamps at a 1-second frequency:

min_date = df2['date'].min().to_period('1s').to_timestamp()
index = pd.date_range(min_date, df2['date'].max(), freq='1s').union(df2['date'])
df2 = df2.set_index('date')
df2 = df2.reindex(index)

#                                  type  sign
# 2016-05-19 14:19:14.000000        NaN   NaN
# 2016-05-19 14:19:14.820002  StartDate   1.0
# 2016-05-19 14:19:15.000000        NaN   NaN
# 2016-05-19 14:19:16.000000        NaN   NaN
# 2016-05-19 14:19:17.000000        NaN   NaN
# 2016-05-19 14:19:17.899999    EndDate  -1.0
# 2016-05-19 14:19:18.000000        NaN   NaN
# ...
In the sign column, fill the NaN values with 0 and compute the cumulative sum:

df2['weight'] = df2['sign'].fillna(0).cumsum()
#                                  type  sign  weight
# 2016-05-19 14:19:14.000000        NaN   NaN     0.0
# 2016-05-19 14:19:14.820002  StartDate   1.0     1.0
# 2016-05-19 14:19:15.000000        NaN   NaN     1.0
# 2016-05-19 14:19:16.000000        NaN   NaN     1.0
# 2016-05-19 14:19:17.000000        NaN   NaN     1.0
# 2016-05-19 14:19:17.899999    EndDate  -1.0     0.0
# 2016-05-19 14:19:18.000000        NaN   NaN     0.0
# ...
Compute the duration between consecutive rows:

df2['duration'] = 0.0
df2.iloc[:-1, df2.columns.get_loc('duration')] = (df2.index[1:] - df2.index[:-1]).total_seconds()
df2['duration'] = df2['duration'] * df2['weight']

#                                  type  sign  weight  duration
# 2016-05-19 14:19:14.000000        NaN   NaN     0.0  0.000000
# 2016-05-19 14:19:14.820002  StartDate   1.0     1.0  0.179998
# 2016-05-19 14:19:15.000000        NaN   NaN     1.0  1.000000
# 2016-05-19 14:19:16.000000        NaN   NaN     1.0  1.000000
# 2016-05-19 14:19:17.000000        NaN   NaN     1.0  0.899999
# 2016-05-19 14:19:17.899999    EndDate  -1.0     0.0  0.000000
# 2016-05-19 14:19:18.000000        NaN   NaN     0.0  0.000000
Finally, resample the dataframe to a 1-second frequency:

df2 = df2.resample('1s').sum()
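
The steps above can be collected into one helper (my own sketch of the answer's approach; `coverage_per_bin` is a hypothetical name, and it selects only the sign column before summing so the non-numeric type column doesn't get in the way):

```python
import numpy as np
import pandas as pd

def coverage_per_bin(df, freq='1s'):
    """Seconds of (StartDate, EndDate) coverage per `freq` bin (hypothetical helper)."""
    long = pd.melt(df, var_name='type', value_name='date')
    long['date'] = pd.to_datetime(long['date'])
    long['sign'] = np.where(long['type'] == 'StartDate', 1, -1)
    # Grid of bin edges plus the event timestamps themselves
    grid = pd.date_range(long['date'].min().floor(freq), long['date'].max(), freq=freq)
    index = grid.union(long['date'])
    # Aggregate signs at identical timestamps so reindex never sees duplicates
    out = long.groupby('date')[['sign']].sum().reindex(index)
    out['weight'] = out['sign'].fillna(0).cumsum()   # intervals open at each point
    out['duration'] = 0.0
    out.iloc[:-1, out.columns.get_loc('duration')] = (out.index[1:] - out.index[:-1]).total_seconds()
    out['duration'] *= out['weight']                 # count time only while covered
    return out.resample(freq).sum()

df = pd.DataFrame({'StartDate': ['2016-05-19 14:19:14.820002', '2016-05-19 14:19:32.119999'],
                   'EndDate':   ['2016-05-19 14:19:17.899999', '2016-05-19 14:19:37.020002']})
print(coverage_per_bin(df)['duration'].sum())  # total covered seconds
```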


I learned this trick from there.

I was getting ValueError: cannot reindex from a duplicate axis. This error can occur when identical StartDate or EndDate values are present. I've modified the code above to handle this case, using

df2 = df2.groupby('date').sum()

to aggregate the signs when dates coincide, before the final

df2 = df2.resample('1s').sum()
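
To see why the groupby step is needed (my own minimal illustration): reindexing on an index with duplicate timestamps raises the ValueError, while aggregating the signs first makes the index unique:

```python
import pandas as pd

# Two events share the same timestamp, e.g. one interval's EndDate equals another's StartDate
s = pd.DataFrame({'date': pd.to_datetime(['2016-05-19 14:19:14', '2016-05-19 14:19:14', '2016-05-19 14:19:16']),
                  'sign': [1, 1, -1]})
grid = pd.date_range(s['date'].min(), s['date'].max(), freq='1s')

try:
    s.set_index('date').reindex(grid)  # duplicate labels -> ValueError
except ValueError as e:
    print('ValueError:', e)

agg = s.groupby('date').sum()          # signs at equal timestamps are summed
print(agg.reindex(grid))               # now reindexing works
```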