Python: grouped time difference when a condition is met


I'm working with structured log data, structured as follows (pastebin snippet with mock data for easy tinkering):

import pandas as pd

df = pd.read_csv("https://pastebin.com/raw/qrqTMrGa")
print(df)
     id        date  info_a_cnt  info_b_cnt  has_err
0   123  2020-01-01         123          32        0
1   123  2020-01-02           2          43        0
2   123  2020-01-03          43           4        1
3   123  2020-01-04          43           4        0
4   123  2020-01-05          43           4        0
5   123  2020-01-06          43           4        0
6   123  2020-01-07          43           4        1
7   123  2020-01-08          43           4        0
8   232  2020-01-04          56           4        0
9   232  2020-01-05          97           1        0
10  232  2020-01-06          23          74        0
11  232  2020-01-07          91          85        1
12  232  2020-01-08          91          85        0
13  232  2020-01-09          91          85        0
14  232  2020-01-10          91          85        1
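In case the pastebin link ever goes stale, an equivalent frame can be built inline; this is just a sketch with the values copied from the table above:

```python
import pandas as pd

# inline copy of the mock data, in case the pastebin link goes stale
df = pd.DataFrame({
    "id": [123] * 8 + [232] * 7,
    "date": pd.to_datetime(
        [f"2020-01-{d:02d}" for d in range(1, 9)]     # id 123: Jan 1-8
        + [f"2020-01-{d:02d}" for d in range(4, 11)]  # id 232: Jan 4-10
    ),
    "info_a_cnt": [123, 2, 43, 43, 43, 43, 43, 43, 56, 97, 23, 91, 91, 91, 91],
    "info_b_cnt": [32, 43, 4, 4, 4, 4, 4, 4, 4, 1, 74, 85, 85, 85, 85],
    "has_err": [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1],
})
print(df.head())
```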
The variables are pretty straightforward:

  • id: the id of the machine being observed
  • date: date of the observation
  • info_a_cnt: count of information events of a specific type
  • info_b_cnt: same as above, for a different event type
  • has_err: whether the machine logged any errors
Now, I would like to group the dataframe by id and create a variable storing the number of days left before an error event occurs. The desired dataframe should look like this:

     id        date  info_a_cnt  info_b_cnt  has_err  days_to_err
0   123  2020-01-01         123          32        0            2
1   123  2020-01-02           2          43        0            1
2   123  2020-01-03          43           4        1            0
3   123  2020-01-04          43           4        0            3
4   123  2020-01-05          43           4        0            2
5   123  2020-01-06          43           4        0            1
6   123  2020-01-07          43           4        1            0
7   232  2020-01-04          56           4        0            3
8   232  2020-01-05          97           1        0            2
9   232  2020-01-06          23          74        0            1
10  232  2020-01-07          91          85        1            0
11  232  2020-01-08          91          85        0            2
12  232  2020-01-09          91          85        0            1
13  232  2020-01-10          91          85        1            0
I'm having trouble figuring out the right implementation with the appropriate grouping functions.

EDIT: All the answers below work really well when dealing with dates at daily granularity. I'm wondering how to adapt the @jezrael solution below to a dataframe containing timestamps (logs will be batched in 15-minute intervals):

I'm wondering how to tweak @jezrael's answer so as to land on the following:

     id                date  info_a_cnt  info_b_cnt  has_err  mins_to_err
0   123  2020-01-01 12:00:00         123          32        0           30
1   123  2020-01-01 12:15:00           2          43        0           15
2   123  2020-01-01 12:30:00          43           4        1            0
3   123  2020-01-01 12:45:00          43           4        0           45
4   123  2020-01-01 13:00:00          43           4        0           30
5   123  2020-01-01 13:15:00          43           4        0           15
6   123  2020-01-01 13:30:00          43           4        1            0
7   123  2020-01-01 13:45:00          43           4        0           60
8   232  2020-01-04 17:00:00          56           4        0           45
9   232  2020-01-05 17:15:00          97           1        0           30
10  232  2020-01-06 17:30:00          23          74        0           15
11  232  2020-01-07 17:45:00          91          85        1            0
12  232  2020-01-08 18:00:00          91          85        0           30
13  232  2020-01-09 18:15:00          91          85        0           15
14  232  2020-01-10 18:30:00          91          85        1            0
Group by column id together with a helper Series and use GroupBy.cumcount with ascending=False. The helper is built from the back, hence the added iloc[::-1] indexing:

EDIT: For a cumulative sum of date differences, use a custom lambda function with GroupBy.transform:

EDIT1: Use Series.dt.total_seconds together with division by 60:

# some data sample cleaning (normalize all rows to the same day)
df = pd.read_csv("https://pastebin.com/raw/YZukAhBz", parse_dates=['date'])
df['date'] = df['date'].apply(lambda x: x.replace(month=1, day=1))
print(df)

df['days_to_err'] = (df.groupby(['id', df['has_err'].iloc[::-1].cumsum()])['date']
                       .transform(lambda x: x.diff().dt.total_seconds().div(60).cumsum())
                       .fillna(0)
                       .to_numpy()[::-1])
print(df)


     id                date  info_a_cnt  info_b_cnt  has_err  days_to_err
0   123 2020-01-01 12:00:00         123          32        0         30.0
1   123 2020-01-01 12:15:00           2          43        0         15.0
2   123 2020-01-01 12:30:00          43           4        1          0.0
3   123 2020-01-01 12:45:00          43           4        0         45.0
4   123 2020-01-01 13:00:00          43           4        0         30.0
5   123 2020-01-01 13:15:00          43           4        0         15.0
6   123 2020-01-01 13:30:00          43           4        1          0.0
7   123 2020-01-01 13:45:00          43           4        0          0.0
8   232 2020-01-01 17:00:00          56           4        0         45.0
9   232 2020-01-01 17:15:00          97           1        0         30.0
10  232 2020-01-01 17:30:00          23          74        0         15.0
11  232 2020-01-01 17:45:00          91          85        1          0.0
12  232 2020-01-01 18:00:00          91          85        0         30.0
13  232 2020-01-01 18:15:00          91          85        0         15.0
14  232 2020-01-01 18:30:00          91          85        1          0.0
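As a standalone sanity check of the minutes conversion (toy timestamps, not the pastebin data): Series.diff() yields timedeltas, and total_seconds() divided by 60 turns them into minutes:

```python
import pandas as pd

ts = pd.Series(pd.to_datetime([
    "2020-01-01 12:00:00",
    "2020-01-01 12:15:00",
    "2020-01-01 12:30:00",
]))

# consecutive differences as minutes (first diff is NaN)
mins = ts.diff().dt.total_seconds().div(60)

# cumulative minutes elapsed since the first row
print(mins.cumsum().fillna(0).tolist())  # [0.0, 15.0, 30.0]
```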
Comments:

  • That was indeed an error, good catch, just edited the table.
  • This works like a charm, but it uses row counts. Is there any way to use the actual time difference (via the date column)? I'm wondering how to adapt it when dealing with more granular timestamps.
  • The [::-1] twice is unnecessary here @ansev
  • @ansev ya, in your solution the second [::-1] is handled by pandas, because the index order does not match; nothing is unsafe that way, so I think it is better to add it explicitly.
  • @jezrael OK, no problem. Happy coding!

Use:
# count errors from the end so every run of rows ending in an error shares a label
g = df['has_err'].iloc[::-1].cumsum().iloc[::-1]
df['days_to_err'] = df.groupby(['id', g])['has_err'].cumcount(ascending=False)
print(df)
     id        date  info_a_cnt  info_b_cnt  has_err  days_to_err
0   123  2020-01-01         123          32        0            2
1   123  2020-01-02           2          43        0            1
2   123  2020-01-03          43           4        1            0
3   123  2020-01-04          43           4        0            3
4   123  2020-01-05          43           4        0            2
5   123  2020-01-06          43           4        0            1
6   123  2020-01-07          43           4        1            0
7   123  2020-01-08          43           4        0            0
8   232  2020-01-04          56           4        0            3
9   232  2020-01-05          97           1        0            2
10  232  2020-01-06          23          74        0            1
11  232  2020-01-07          91          85        1            0
12  232  2020-01-08          91          85        0            2
13  232  2020-01-09          91          85        0            1
14  232  2020-01-10          91          85        1            0
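To see why the helper series works, here is the reversed-cumsum trick on a toy has_err column for a single machine (a minimal sketch, variable names hypothetical):

```python
import pandas as pd

# toy error flags for one machine
has_err = pd.Series([0, 0, 1, 0, 0, 0, 1])

# reversed cumulative sum: all rows leading up to the same error share a label
labels = has_err.iloc[::-1].cumsum().iloc[::-1]
print(labels.tolist())  # [2, 2, 2, 1, 1, 1, 1]

# counting backwards inside each label gives "rows until the next error"
print(has_err.groupby(labels).cumcount(ascending=False).tolist())
# -> [2, 1, 0, 3, 2, 1, 0]
```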
# cumulative day differences within each group instead of row counts
df['days_to_err'] = (df.groupby(['id', df['has_err'].iloc[::-1].cumsum()])['date']
                       .transform(lambda x: x.diff().dt.days.cumsum())
                       .fillna(0)
                       .to_numpy()[::-1])
print(df)
     id       date  info_a_cnt  info_b_cnt  has_err  days_to_err
0   123 2020-01-01         123          32        0          2.0
1   123 2020-01-02           2          43        0          1.0
2   123 2020-01-03          43           4        1          0.0
3   123 2020-01-04          43           4        0          3.0
4   123 2020-01-05          43           4        0          2.0
5   123 2020-01-06          43           4        0          1.0
6   123 2020-01-07          43           4        1          0.0
7   123 2020-01-08          43           4        0          0.0
8   232 2020-01-04          56           4        0          3.0
9   232 2020-01-05          97           1        0          2.0
10  232 2020-01-06          23          74        0          1.0
11  232 2020-01-07          91          85        1          0.0
12  232 2020-01-08          91          85        0          2.0
13  232 2020-01-09          91          85        0          1.0
14  232 2020-01-10          91          85        1          0.0
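For daily data, the lambda's diff().dt.days step produces whole-day gaps; a quick standalone check with made-up dates:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-04"]))

gaps = dates.diff().dt.days              # NaN, 1.0, 2.0
print(gaps.cumsum().fillna(0).tolist())  # [0.0, 1.0, 3.0]
```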
# reverse the frame, then count rows until the next error within each id
df2 = df[::-1]
df['days_to_err'] = df2.groupby(['id', df2['has_err'].eq(1).cumsum()]).cumcount()

     id        date  info_a_cnt  info_b_cnt  has_err  days_to_err
0   123  2020-01-01         123          32        0            2
1   123  2020-01-02           2          43        0            1
2   123  2020-01-03          43           4        1            0
3   123  2020-01-04          43           4        0            3
4   123  2020-01-05          43           4        0            2
5   123  2020-01-06          43           4        0            1
6   123  2020-01-07          43           4        1            0
7   123  2020-01-08          43           4        0            0
8   232  2020-01-04          56           4        0            3
9   232  2020-01-05          97           1        0            2
10  232  2020-01-06          23          74        0            1
11  232  2020-01-07          91          85        1            0
12  232  2020-01-08          91          85        0            2
13  232  2020-01-09          91          85        0            1
14  232  2020-01-10          91          85        1            0
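The comment thread above touches on why assigning a result computed on df[::-1] back into df works at all: pandas aligns Series assignment on the index, not on position. A small demonstration of that alignment (toy frame, not the log data):

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30]})

rev = df["a"][::-1]  # a Series with index order 2, 1, 0
df["b"] = rev        # assignment aligns on the index, not on position
print(df["b"].tolist())  # [10, 20, 30] - values land back in their own rows
```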