Python: grouped time difference when a condition is met


I'm working with structured log data, structured as follows (pastebin snippet with mock data for easy tinkering):

import pandas as pd

df = pd.read_csv("https://pastebin.com/raw/qrqTMrGa")
print(df)
     id        date  info_a_cnt  info_b_cnt  has_err
0   123  2020-01-01         123          32        0
1   123  2020-01-02           2          43        0
2   123  2020-01-03          43           4        1
3   123  2020-01-04          43           4        0
4   123  2020-01-05          43           4        0
5   123  2020-01-06          43           4        0
6   123  2020-01-07          43           4        1
7   123  2020-01-08          43           4        0
8   232  2020-01-04          56           4        0
9   232  2020-01-05          97           1        0
10  232  2020-01-06          23          74        0
11  232  2020-01-07          91          85        1
12  232  2020-01-08          91          85        0
13  232  2020-01-09          91          85        0
14  232  2020-01-10          91          85        1
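In case the pastebin link ever goes stale, an equivalent frame can be built inline; this is just a sketch with the values copied from the table above:

```python
import pandas as pd

# inline copy of the mock data, in case the pastebin link goes stale
df = pd.DataFrame({
    "id": [123] * 8 + [232] * 7,
    "date": pd.to_datetime(
        [f"2020-01-{d:02d}" for d in range(1, 9)]     # id 123: Jan 1-8
        + [f"2020-01-{d:02d}" for d in range(4, 11)]  # id 232: Jan 4-10
    ),
    "info_a_cnt": [123, 2, 43, 43, 43, 43, 43, 43, 56, 97, 23, 91, 91, 91, 91],
    "info_b_cnt": [32, 43, 4, 4, 4, 4, 4, 4, 4, 1, 74, 85, 85, 85, 85],
    "has_err": [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1],
})
print(df.head())
```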
The variables are pretty straightforward:

  • id: the id of the machine being observed
  • date: date of the observation
  • info_a_cnt: count of information events of a specific type
  • info_b_cnt: same as above, for a different event type
  • has_err: whether the machine logged any errors
Now, I would like to group the dataframe by id and create a variable storing the number of days left before an error event occurs. The desired dataframe should look like this:

     id        date  info_a_cnt  info_b_cnt  has_err  days_to_err
0   123  2020-01-01         123          32        0            2
1   123  2020-01-02           2          43        0            1
2   123  2020-01-03          43           4        1            0
3   123  2020-01-04          43           4        0            3
4   123  2020-01-05          43           4        0            2
5   123  2020-01-06          43           4        0            1
6   123  2020-01-07          43           4        1            0
7   232  2020-01-04          56           4        0            3
8   232  2020-01-05          97           1        0            2
9   232  2020-01-06          23          74        0            1
10  232  2020-01-07          91          85        1            0
11  232  2020-01-08          91          85        0            2
12  232  2020-01-09          91          85        0            1
13  232  2020-01-10          91          85        1            0
I'm having trouble figuring out the right implementation with the appropriate grouping functions.

EDIT: All the answers below work really well when dealing with dates at daily granularity. I'm wondering how to adapt the @jezrael solution below to a dataframe containing timestamps (logs will be batched in 15-minute intervals):

I'm wondering how to tweak @jezrael's answer so as to land on the following:

     id                date  info_a_cnt  info_b_cnt  has_err  mins_to_err
0   123  2020-01-01 12:00:00         123          32        0           30
1   123  2020-01-01 12:15:00           2          43        0           15
2   123  2020-01-01 12:30:00          43           4        1            0
3   123  2020-01-01 12:45:00          43           4        0           45
4   123  2020-01-01 13:00:00          43           4        0           30
5   123  2020-01-01 13:15:00          43           4        0           15
6   123  2020-01-01 13:30:00          43           4        1            0
7   123  2020-01-01 13:45:00          43           4        0           60
8   232  2020-01-04 17:00:00          56           4        0           45
9   232  2020-01-05 17:15:00          97           1        0           30
10  232  2020-01-06 17:30:00          23          74        0           15
11  232  2020-01-07 17:45:00          91          85        1            0
12  232  2020-01-08 18:00:00          91          85        0           30
13  232  2020-01-09 18:15:00          91          85        0           15
14  232  2020-01-10 18:30:00          91          85        1            0
Group by column id together with a helper Series and use GroupBy.cumcount with ascending=False. The helper is built from the back, hence the added iloc[::-1] indexing:

EDIT: For a cumulative sum of date differences, use a custom lambda function with GroupBy.transform:

EDIT1: Use Series.dt.total_seconds together with division by 60:

# some data sample cleaning (normalize all rows to the same day)
df = pd.read_csv("https://pastebin.com/raw/YZukAhBz", parse_dates=['date'])
df['date'] = df['date'].apply(lambda x: x.replace(month=1, day=1))
print(df)

df['days_to_err'] = (df.groupby(['id', df['has_err'].iloc[::-1].cumsum()])['date']
                       .transform(lambda x: x.diff().dt.total_seconds().div(60).cumsum())
                       .fillna(0)
                       .to_numpy()[::-1])
print(df)


     id                date  info_a_cnt  info_b_cnt  has_err  days_to_err
0   123 2020-01-01 12:00:00         123          32        0         30.0
1   123 2020-01-01 12:15:00           2          43        0         15.0
2   123 2020-01-01 12:30:00          43           4        1          0.0
3   123 2020-01-01 12:45:00          43           4        0         45.0
4   123 2020-01-01 13:00:00          43           4        0         30.0
5   123 2020-01-01 13:15:00          43           4        0         15.0
6   123 2020-01-01 13:30:00          43           4        1          0.0
7   123 2020-01-01 13:45:00          43           4        0          0.0
8   232 2020-01-01 17:00:00          56           4        0         45.0
9   232 2020-01-01 17:15:00          97           1        0         30.0
10  232 2020-01-01 17:30:00          23          74        0         15.0
11  232 2020-01-01 17:45:00          91          85        1          0.0
12  232 2020-01-01 18:00:00          91          85        0         30.0
13  232 2020-01-01 18:15:00          91          85        0         15.0
14  232 2020-01-01 18:30:00          91          85        1          0.0
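As a standalone sanity check of the minutes conversion (toy timestamps, not the pastebin data): Series.diff() yields timedeltas, and total_seconds() divided by 60 turns them into minutes:

```python
import pandas as pd

ts = pd.Series(pd.to_datetime([
    "2020-01-01 12:00:00",
    "2020-01-01 12:15:00",
    "2020-01-01 12:30:00",
]))

# consecutive differences as minutes (first diff is NaN)
mins = ts.diff().dt.total_seconds().div(60)

# cumulative minutes elapsed since the first row
print(mins.cumsum().fillna(0).tolist())  # [0.0, 15.0, 30.0]
```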
Comments:

  • That was indeed an error, good catch, just edited the table.
  • This works like a charm, but it uses row counts. Is there any way to use the actual time difference (via the date column)? I'm wondering how to adapt it when dealing with more granular timestamps.
  • The [::-1] twice is unnecessary here @ansev
  • @ansev ya, in your solution the second [::-1] is handled by pandas, because the index order does not match; nothing is unsafe that way, so I think it is better to add it explicitly.
  • @jezrael OK, no problem. Happy coding!

Use:
# count errors from the end so every run of rows ending in an error shares a label
g = df['has_err'].iloc[::-1].cumsum().iloc[::-1]
df['days_to_err'] = df.groupby(['id', g])['has_err'].cumcount(ascending=False)
print(df)
     id        date  info_a_cnt  info_b_cnt  has_err  days_to_err
0   123  2020-01-01         123          32        0            2
1   123  2020-01-02           2          43        0            1
2   123  2020-01-03          43           4        1            0
3   123  2020-01-04          43           4        0            3
4   123  2020-01-05          43           4        0            2
5   123  2020-01-06          43           4        0            1
6   123  2020-01-07          43           4        1            0
7   123  2020-01-08          43           4        0            0
8   232  2020-01-04          56           4        0            3
9   232  2020-01-05          97           1        0            2
10  232  2020-01-06          23          74        0            1
11  232  2020-01-07          91          85        1            0
12  232  2020-01-08          91          85        0            2
13  232  2020-01-09          91          85        0            1
14  232  2020-01-10          91          85        1            0
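To see why the helper series works, here is the reversed-cumsum trick on a toy has_err column for a single machine (a minimal sketch, variable names hypothetical):

```python
import pandas as pd

# toy error flags for one machine
has_err = pd.Series([0, 0, 1, 0, 0, 0, 1])

# reversed cumulative sum: all rows leading up to the same error share a label
labels = has_err.iloc[::-1].cumsum().iloc[::-1]
print(labels.tolist())  # [2, 2, 2, 1, 1, 1, 1]

# counting backwards inside each label gives "rows until the next error"
print(has_err.groupby(labels).cumcount(ascending=False).tolist())
# -> [2, 1, 0, 3, 2, 1, 0]
```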
# cumulative day differences within each group instead of row counts
df['days_to_err'] = (df.groupby(['id', df['has_err'].iloc[::-1].cumsum()])['date']
                       .transform(lambda x: x.diff().dt.days.cumsum())
                       .fillna(0)
                       .to_numpy()[::-1])
print(df)
     id       date  info_a_cnt  info_b_cnt  has_err  days_to_err
0   123 2020-01-01         123          32        0          2.0
1   123 2020-01-02           2          43        0          1.0
2   123 2020-01-03          43           4        1          0.0
3   123 2020-01-04          43           4        0          3.0
4   123 2020-01-05          43           4        0          2.0
5   123 2020-01-06          43           4        0          1.0
6   123 2020-01-07          43           4        1          0.0
7   123 2020-01-08          43           4        0          0.0
8   232 2020-01-04          56           4        0          3.0
9   232 2020-01-05          97           1        0          2.0
10  232 2020-01-06          23          74        0          1.0
11  232 2020-01-07          91          85        1          0.0
12  232 2020-01-08          91          85        0          2.0
13  232 2020-01-09          91          85        0          1.0
14  232 2020-01-10          91          85        1          0.0
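For daily data, the lambda's diff().dt.days step produces whole-day gaps; a quick standalone check with made-up dates:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-04"]))

gaps = dates.diff().dt.days              # NaN, 1.0, 2.0
print(gaps.cumsum().fillna(0).tolist())  # [0.0, 1.0, 3.0]
```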
# reverse the frame, then count rows until the next error within each id
df2 = df[::-1]
df['days_to_err'] = df2.groupby(['id', df2['has_err'].eq(1).cumsum()]).cumcount()

     id        date  info_a_cnt  info_b_cnt  has_err  days_to_err
0   123  2020-01-01         123          32        0            2
1   123  2020-01-02           2          43        0            1
2   123  2020-01-03          43           4        1            0
3   123  2020-01-04          43           4        0            3
4   123  2020-01-05          43           4        0            2
5   123  2020-01-06          43           4        0            1
6   123  2020-01-07          43           4        1            0
7   123  2020-01-08          43           4        0            0
8   232  2020-01-04          56           4        0            3
9   232  2020-01-05          97           1        0            2
10  232  2020-01-06          23          74        0            1
11  232  2020-01-07          91          85        1            0
12  232  2020-01-08          91          85        0            2
13  232  2020-01-09          91          85        0            1
14  232  2020-01-10          91          85        1            0
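The comment thread above touches on why assigning a result computed on df[::-1] back into df works at all: pandas aligns Series assignment on the index, not on position. A small demonstration of that alignment (toy frame, not the log data):

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30]})

rev = df["a"][::-1]  # a Series with index order 2, 1, 0
df["b"] = rev        # assignment aligns on the index, not on position
print(df["b"].tolist())  # [10, 20, 30] - values land back in their own rows
```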