Python: time-dependent moving average on a pandas.groupby object


Given a DataFrame of the following format:

import pandas as pd

toy = pd.DataFrame({
    'id': [1, 2, 3,
           1, 2, 3,
           1, 2, 3],
    'date': ['2015-05-13', '2015-05-13', '2015-05-13',
             '2016-02-12', '2016-02-12', '2016-02-12',
             '2018-07-23', '2018-07-23', '2018-07-23'],
    'my_metric': [395, 634, 165,
                  144, 305, 293,
                  23, 395, 242]
})
# Make sure 'date' has datetime format
toy['date'] = pd.to_datetime(toy['date'])

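The `my_metric` column contains some (random) metric for which I want to compute a time-dependent moving average, conditional on the `id` column and within a time interval that I specify myself. I will call this interval the "lookback time"; it could be 5 minutes or 2 years. To determine which observations are included in the lookback calculation, we use the `date` column (which could be the index if you prefer).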

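To my frustration, I have found that this is not easy to do with pandas built-ins, because the calculation has to be conditional on `id` and, at the same time, restricted to observations that fall within the lookback time (checked using the `date` column). The output DataFrame should therefore consist of one row per `id`-`date` combination, with the `my_metric` column now holding the average of all observations within the lookback time (e.g. 2 years, including today's date).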

For clarity, I have included a figure showing the desired output format when using a 2-year lookback time (apologies for the oversized figure):
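The figure is not reproduced here, so the desired output can be pinned down with a brute-force reference instead. This is only a sketch of the specification, not the efficient solution being asked for. It assumes the 2-year lookback is expressed as 730 days and that the window is open on the left and closed on the right; `lookback` and `lookback_mean` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "date": ["2015-05-13"] * 3 + ["2016-02-12"] * 3 + ["2018-07-23"] * 3,
    "my_metric": [395, 634, 165, 144, 305, 293, 23, 395, 242],
})
toy["date"] = pd.to_datetime(toy["date"])

# ASSUMPTION: "2 years" is taken to mean 730 days.
lookback = pd.Timedelta(days=730)

def lookback_mean(row):
    """Mean of my_metric over rows with the same id whose date lies
    in the interval (row date - lookback, row date]."""
    in_window = (
        (toy["id"] == row["id"])
        & (toy["date"] > row["date"] - lookback)
        & (toy["date"] <= row["date"])
    )
    return toy.loc[in_window, "my_metric"].mean()

# One row per id-date combination, with my_metric replaced by the windowed mean.
expected = toy.assign(my_metric=toy.apply(lookback_mean, axis=1))
print(expected)
```

For `id` 1, the 2016-02-12 row averages 395 and 144, while the 2018-07-23 row stands alone because the 2016 observation is more than 730 days old. The row-wise `apply` scans the whole frame for every row, which is exactly the kind of loop the question wants to avoid.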

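I do have a solution, but it does not use any specific built-in functions and is probably suboptimal (a combination of a list comprehension and a single for loop). The solution I am looking for would not use a for loop and would therefore be more scalable, efficient, and fast.

Thanks, everyone!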

Compute the lookback time (current year minus 2 years):

Now filter the DataFrame by the lookback time and compute the rolling average:

In [1722]: toy['new_metric'] = ((toy.my_metric + toy[toy.date > lookback_time].groupby('id')['my_metric'].shift(1))/2).fillna(toy.my_metric)

In [1674]: toy.sort_values('id')
Out[1674]: 
        date  id  my_metric  new_metric
0 2015-05-13   1        395       395.0
3 2016-02-12   1        144       144.0
6 2018-07-23   1         23        83.5
1 2015-05-13   2        634       634.0
4 2016-02-12   2        305       305.0
7 2018-07-23   2        395       350.0
2 2015-05-13   3        165       165.0
5 2016-02-12   3        293       293.0
8 2018-07-23   3        242       267.5
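The definition of `lookback_time` is not shown above. As a hedged sketch, the snippet below uses a hypothetical cutoff of 2016-01-01 (an assumption, chosen only because it falls between the first two dates in the data) and splits the one-liner into steps so the logic is easier to follow.

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "date": ["2015-05-13"] * 3 + ["2016-02-12"] * 3 + ["2018-07-23"] * 3,
    "my_metric": [395, 634, 165, 144, 305, 293, 23, 395, 242],
})
toy["date"] = pd.to_datetime(toy["date"])

# ASSUMPTION: hypothetical cutoff; the answer's own definition is not shown.
lookback_time = pd.Timestamp("2016-01-01")

# Keep only rows newer than the cutoff, then fetch each id's previous value.
recent = toy[toy["date"] > lookback_time]
previous = recent.groupby("id")["my_metric"].shift(1)

# Average the current and previous value; rows without a previous value
# (or outside the cutoff) fall back to their own my_metric.
toy["new_metric"] = ((toy["my_metric"] + previous) / 2).fillna(toy["my_metric"])
print(toy.sort_values("id"))
```

Written this way, it is visible that the formula only ever averages the current observation with exactly one predecessor, and that the cutoff is a single fixed date rather than a window relative to each row.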

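So, after some tinkering, I found an answer that generalizes adequately. I used a slightly different "toy" DataFrame (slightly more relevant to my case). For completeness, here is the data:

Now consider the following code: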

# Define a custom function which groups by time (using the index)
def rolling_average(x, dt):
    xt = x.sort_index().groupby(lambda x: x.time()).rolling(window=dt).mean()
    xt.index = xt.index.droplevel(0)
    return xt

dt='730D' # rolling average window: 730 days = 2 years

# Group by the 'id' column
g = toy.groupby('id')

# Apply the custom function
df = g.apply(rolling_average, dt=dt)

# Massage the data to appropriate format
df.index = df.index.droplevel(0)
df = df.reset_index().drop_duplicates(keep='last', subset=['id', 'date'])

The result is as expected.
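For comparison, here is a minimal sketch of the same idea applied to the `toy` frame from the question, using a time-based window on a grouped column directly. It is not the code from this answer: it assumes a pandas version in which `groupby(...).rolling(...)` accepts an offset string such as `'730D'`, and it moves `date` into a sorted index first so the window has a monotonic datetime index to work with.

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "date": ["2015-05-13"] * 3 + ["2016-02-12"] * 3 + ["2018-07-23"] * 3,
    "my_metric": [395, 634, 165, 144, 305, 293, 23, 395, 242],
})
toy["date"] = pd.to_datetime(toy["date"])

rolled = (
    toy.set_index("date")          # time-based windows need a datetime index
       .sort_index()               # ... that is sorted
       .groupby("id")["my_metric"]
       .rolling("730D")            # 730 days, i.e. roughly 2 years
       .mean()
       .reset_index()              # back to columns: id, date, my_metric
)
print(rolled)
```

The result has one row per `id` and `date`, ordered by `id`, with `my_metric` holding the mean over the preceding 730 days including the current row.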

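Where is the 2-year lookback time?

Thanks for your effort, @Mayank, but as @jezrael pointed out, this does not generalize at all. If there is more than one observation within the lookback time (in addition to the current one), the formula breaks down completely. I think the challenge is that we do not know in advance how many rows fall within the lookback time. I have looked at the pandas window functions, but for some reason they do not work properly on grouped DataFrames.

@Magnus Give me some time and I will share an updated answer that includes the lookback time.

@Magnus Please check my updated answer. Also, I set `lookback_time` to the past 2 years, which excludes the rows dated […] from the DataFrame. My output is based on that.

To be honest, I am not too happy with the performance of this solution. If several entries share the same `id` and `date` (but have different `my_metric` values), the DataFrame gets one extra row per id-date "replicate" containing an intermediate result, which is why `drop_duplicates` is applied in the last line.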
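The duplicate-row behaviour described in the last comment can be illustrated with a small sketch. The `dupes` frame is hypothetical data, not from the thread, and the snippet uses a plain grouped time-based rolling window rather than the answer's custom function.

```python
import pandas as pd

# Hypothetical data: two observations share the same id and date.
dupes = pd.DataFrame({
    "id": [1, 1, 1],
    "date": pd.to_datetime(["2015-05-13", "2016-02-12", "2016-02-12"]),
    "my_metric": [100, 200, 300],
})

rolled = (
    dupes.set_index("date")        # dates are already in ascending order
         .groupby("id")["my_metric"]
         .rolling("730D")
         .mean()
         .reset_index()
)

# Each window ends at its own row, so the first of the two 2016-02-12 rows
# holds an intermediate mean; keeping the last row per (id, date) retains
# the mean over all observations up to and including that date.
deduped = rolled.drop_duplicates(subset=["id", "date"], keep="last")
print(rolled)
print(deduped)
```

Here `rolled` still has three rows, one per input observation, while `deduped` keeps a single row for each `id`-`date` pair.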
