Python: date index of the latest non-null value in a grouped rolling date window

Tags: python, pandas, date, group-by, rolling-computation

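I'm trying to get, per group, the latest date on which the value is not null within a rolling time window. It works fine without grouping, but grouping seems to change everything.

Here is a reproducible example: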

import pandas as pd
from datetime import datetime as dt
import numpy as np

df = pd.DataFrame({})

df["date"] = [dt(2020, 10, i+1) for i in range(10)]
df["group"] = ["a" if int(i/3) == (i/3) else "b" for i in range(10)]
df["value"] = [i if int(i/2) == (i/2) else np.nan for i in range(10)]
The DataFrame:

        date group  value
0 2020-10-01     a    0.0
1 2020-10-02     b    NaN
2 2020-10-03     b    2.0
3 2020-10-04     a    NaN
4 2020-10-05     b    4.0
5 2020-10-06     b    NaN
6 2020-10-07     a    6.0
7 2020-10-08     b    NaN
8 2020-10-09     b    8.0
9 2020-10-10     a    NaN
Desired output:

        date group  value  output
0 2020-10-01     a    0.0  2020-10-01
1 2020-10-02     b    NaN  NaT
2 2020-10-03     b    2.0  2020-10-03
3 2020-10-04     a    NaN  2020-10-01
4 2020-10-05     b    4.0  2020-10-05
5 2020-10-06     b    NaN  2020-10-05
6 2020-10-07     a    6.0  2020-10-07
7 2020-10-08     b    NaN  2020-10-05
8 2020-10-09     b    8.0  2020-10-09
9 2020-10-10     a    NaN  2020-10-07
My attempt:

df = df.set_index("date").sort_index(ascending = True)

def latest_non_null_value_index(x):
        y = x[np.isnan(x) == False]
        print(y.index)
        if len(y) > 0:
            return y.index[-1]
        else:
            return np.nan

latest_index = df\
        .groupby(["group"])\
        .rolling("35D")\
        ["value"]\
        .apply(lambda x: latest_non_null_value_index(x).timestamp())\
        .reset_index()
  
def to_datetime_from_timestamp(x):
  if pd.isnull(x) == False:
      return dt.fromtimestamp(x)
  else:
      return pd.NaT
           
latest_index["value"] = latest_index["value"]\
    .apply(to_datetime_from_timestamp)
What I get:

  group       date               value
0     a 2020-10-01 2020-10-01 02:00:00
1     a 2020-10-04 2020-10-01 02:00:00
2     a 2020-10-07 2020-10-03 02:00:00
3     a 2020-10-10 2020-10-03 02:00:00
4     b 2020-10-02                 NaT
5     b 2020-10-03 2020-10-06 02:00:00
6     b 2020-10-05 2020-10-07 02:00:00
7     b 2020-10-06 2020-10-07 02:00:00
8     b 2020-10-08 2020-10-07 02:00:00
9     b 2020-10-09 2020-10-10 02:00:00
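Any idea what I'm missing?

EDIT: I don't seem to have this problem when I fetch the latest value instead, so it really does have to do with the index.

EDIT2: If I could somehow apply a function to two columns, I could pass the date as a second column and get a workaround.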

You can forward-fill the missing values within each group, using fillna with method="ffill" (or the equivalent ffill):

import pandas as pd
from datetime import datetime as dt
import numpy as np

df = pd.DataFrame({})

df["date"] = [dt(2020, 10, i+1) for i in range(10)]
df["group"] = ["a" if int(i/3) == (i/3) else "b" for i in range(10)]
df["value"] = [i if int(i/2) == (i/2) else np.nan for i in range(10)]

df = df.sort_values("date")  # make sure the rows are in date order

date = df["date"].copy()
date[df.value.isna()] = pd.NaT  # keep the date only where value is present
latest_index = date.groupby(df.group).ffill()  # same as fillna(method="ffill"), which recent pandas versions deprecate
This does not take the rolling time window into account, but you can drop the values that fall outside the window like this:

latest_index[(df.date - latest_index).dt.days > 35] = pd.NaT
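Put together, the forward-fill plus window check can be sketched end to end as below. This is a minimal sketch on the question's sample data, not the answerer's exact code: the names last_seen and window are illustrative, and the strict "<" comparison is an assumption chosen to mimic a "35D" rolling window that excludes its left edge.

```python
import numpy as np
import pandas as pd

# Sample data from the question: group "a" on rows 0, 3, 6, 9; NaN on odd rows.
df = pd.DataFrame({
    "date": pd.date_range("2020-10-01", periods=10, freq="D"),
    "group": ["a" if i % 3 == 0 else "b" for i in range(10)],
    "value": [float(i) if i % 2 == 0 else np.nan for i in range(10)],
})

# Keep the date only where value is present, then carry it forward per group.
last_seen = df["date"].where(df["value"].notna()).groupby(df["group"]).ffill()

# Blank out dates that are too old for the look-back window (hypothetical name).
window = pd.Timedelta("35D")
df["output"] = last_seen.where(df["date"] - last_seen < window)

print(df)
```

On this data every carried-forward date is well inside 35 days, so only the first row of group b (which has no earlier non-null value) ends up as NaT.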
But that isn't very tidy, so you could instead try a max aggregation over the rolling window, like this:

df = df.set_index("date", drop=False)
df = df.sort_index()

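# rolling would not aggregate datetimes directly, so convert to numbers
# (nanoseconds since the epoch, as float so NaN can be stored) and back afterwards
date = pd.to_numeric(df["date"]).astype("float64")
date[df.value.isna()] = np.nan  # ignore rows where value is missing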
latest_index = date.groupby(df.group).rolling("35D").max()
latest_index = pd.to_datetime(latest_index)
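The rolling-max idea can also be sketched as a self-contained script that writes the result back onto the original rows. The details here are assumptions for illustration rather than the answerer's code: the names epoch, stamp and latest are hypothetical, the dates are converted by subtracting an explicit epoch instead of calling pd.to_numeric, and the group level is dropped so the result aligns with the date index (which only works because the dates are unique in this sample).

```python
import numpy as np
import pandas as pd

# Sample data from the question, indexed by date so rolling("35D") can be used.
df = pd.DataFrame(
    {
        "group": ["a" if i % 3 == 0 else "b" for i in range(10)],
        "value": [float(i) if i % 2 == 0 else np.nan for i in range(10)],
    },
    index=pd.date_range("2020-10-01", periods=10, freq="D", name="date"),
)

# Dates as float nanoseconds since the epoch, NaN where value is missing.
epoch = pd.Timestamp("1970-01-01")
stamp = (df.index.to_series() - epoch) / pd.Timedelta("1ns")
stamp = stamp.where(df["value"].notna())

# Per-group rolling max over 35 days; the result is indexed by (group, date).
latest = stamp.groupby(df["group"]).rolling("35D").max()

# Back to datetimes, then drop the group level to align with df's date index.
latest = pd.to_datetime(latest, unit="ns").droplevel("group")
df["output"] = latest

print(df)
```

Because the largest timestamp inside each window is also the most recent one, the max aggregation picks out the latest date with a non-null value.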

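Comments:

This works for an unbounded time window, but not for a rolling window. Let me try to illustrate it with a better example. Still useful, though.

Sorry, I overlooked the rolling part. So in your example, it seems you don't want to look back more than 35 days for a non-missing value. Is that right? If so, you can check whether the gap between the filled value and the current date exceeds 35 days: latest_index[(df["date"] - latest_index).dt.days > 35] = pd.NaT

Yes, that works well :) I'm still curious about what is going on with those indices...

pandas' default behaviour is to move the groupby columns into the index. Normally, if you don't want that, you can set as_index=False, but I don't think that works here because you are also using rolling. The easiest way around it is to run latest_index.reset_index()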
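To see the index behaviour described in that last comment, here is a small sketch on made-up data (the frame and the names rolled and flat are illustrative, not from the thread). It shows that a grouped rolling aggregation comes back indexed by the group key plus the original date index, and that reset_index() turns both levels into ordinary columns.

```python
import pandas as pd

# Tiny illustrative frame with a date index and two groups.
df = pd.DataFrame(
    {"group": ["a", "b", "a", "b"], "value": [1.0, 2.0, 3.0, 4.0]},
    index=pd.date_range("2020-10-01", periods=4, freq="D", name="date"),
)

# The group key is prepended to the date index, giving a two-level MultiIndex.
rolled = df.groupby("group")["value"].rolling("35D").max()
print(list(rolled.index.names))

# reset_index() moves both index levels back into regular columns.
flat = rolled.reset_index()
print(flat)
```

Note that the rows come back ordered by group rather than by date, which is why the result has to be realigned (or merged on group and date) before comparing it with the original frame.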