Python: calculating 7-day retention in pandas
I have a DataFrame with two columns, date and id. For each date, I want to count the ids that reappear on a later date within 7 days. If I were doing this in Postgres, it would look like:
SELECT df1.date, COUNT(DISTINCT df1.id)
FROM df df1 INNER JOIN df df2
ON df1.id = df2.id AND
df2.date BETWEEN df1.date + 1 AND df1.date + 7
GROUP BY df1.date;
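What I'm struggling with is how to translate this statement into pandas in a fast, idiomatic way.
I have already tried 1-day retention by simply creating a lagged column and merging the original with the lagged DataFrame. That certainly works. For 7-day retention, however, I would need to create 7 DataFrames and merge them all together. As far as I'm concerned, that is not reasonable. (Especially since I would also like to know the 30-day numbers.)
(I should also point out that my research turned up a merge behavior that does not work on my installation (pandas 0.14.0); it fails with the error message
TypeError: Argument 'values' has incorrect type (expected numpy.ndarray, got Series)
So there appears to be some advanced merge/join behavior that I clearly don't know how to activate.)

If I understand correctly, I think you can do this with groupby/apply. It's a bit tricky. I assume you have data like the following: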
>>> df
date id y
0 2012-01-01 1 0.1
1 2012-01-03 1 0.3
2 2012-01-09 1 0.4
3 2012-01-12 1 0.0
4 2012-01-14 1 0.2
5 2012-01-16 1 0.4
6 2012-01-01 2 0.2
7 2012-01-02 2 0.1
8 2012-01-03 2 0.4
9 2012-01-04 2 0.6
10 2012-01-09 2 0.7
11 2012-01-10 2 0.4
I would create a rolling count within each 'id' group that counts how many times the id appears in the next 7 days (including the current day):
def count_forward7(g):
    # Add a column to the dataframe so I can set date as the index
    g['foo'] = 1
    # New dataframe with daily frequency, so 7 rows = 7 days
    # If there are no gaps in the dates you don't need to do this
    x = g.set_index('date').resample('D')
    # Use Andy Hayden's method for a forward-looking rolling window:
    # reverse the series, then reverse the answer back
    fsum = pd.rolling_sum(x[::-1], window=7, min_periods=0)[::-1]
    return pd.DataFrame(fsum[fsum.index.isin(g.date)].values, index=g.index)
>>> df['f7'] = df.groupby('id')[['date']].apply(count_forward7)
>>> df
date id y f7
0 2012-01-01 1 0.1 2
1 2012-01-03 1 0.3 2
2 2012-01-09 1 0.4 3
3 2012-01-12 1 0.0 3
4 2012-01-14 1 0.2 2
5 2012-01-16 1 0.4 1
6 2012-01-01 2 0.2 4
7 2012-01-02 2 0.1 3
8 2012-01-03 2 0.4 3
9 2012-01-04 2 0.6 3
10 2012-01-09 2 0.7 2
11 2012-01-10 2 0.4 1
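The function above is written against pandas 0.14-era calls; pd.rolling_sum has since been removed and resample('D') now needs an explicit aggregation. A minimal sketch of the same reverse, rolling sum, reverse idea for recent pandas versions might look like the following. The helper name count_forward, its window parameter, and the rebuilt sample frame are illustrative assumptions, not part of the original answer, and the date column is assumed to be datetime64 with no repeated dates per id.

```python
import pandas as pd

# Sample data mirroring the frame shown above (the y column is omitted)
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2012-01-01", "2012-01-03", "2012-01-09", "2012-01-12",
        "2012-01-14", "2012-01-16", "2012-01-01", "2012-01-02",
        "2012-01-03", "2012-01-04", "2012-01-09", "2012-01-10",
    ]),
    "id": [1] * 6 + [2] * 6,
})

def count_forward(dates, window=7):
    """Hypothetical helper: appearances in the `window` days starting on each date."""
    idx = pd.DatetimeIndex(dates)
    # One row per calendar day: 1 if the id appears that day, otherwise 0
    daily = pd.Series(1, index=idx).resample("D").sum()
    # Forward-looking window: reverse, take a trailing rolling sum, reverse back
    fsum = daily[::-1].rolling(window=window, min_periods=1).sum()[::-1]
    # Look up the original dates and restore the original row labels
    return pd.Series(fsum.reindex(idx).to_numpy(), index=dates.index)

# group_keys=False keeps the original row labels so the result aligns with df
f7 = df.groupby("id", group_keys=False)["date"].apply(count_forward)
df["f7"] = f7.astype(int)
print(df)
```

Passing window=30 through apply (for example with a lambda) would give the 30-day variant mentioned in the question.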
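Now, if you want to "count, for each date, the ids on that date that reappear on a later date within 7 days", just count, for each date, the rows where f7 > 1: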
>>> df['bool_f77'] = df['f7'] > 1
>>> df.groupby('date')['bool_f77'].sum()
2012-01-01 2
2012-01-02 1
2012-01-03 2
2012-01-04 1
2012-01-09 2
2012-01-10 0
2012-01-12 1
2012-01-14 1
2012-01-16 0
Or something like the following:
>>> df.query('f7 > 1').groupby('date')['date'].count()
date
2012-01-01 2
2012-01-02 1
2012-01-03 2
2012-01-04 1
2012-01-09 2
2012-01-12 1
2012-01-14 1
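For comparison, the SQL self-join from the question can also be translated fairly literally with merge. This is only a sketch: the function name retention_counts and its days parameter are made up for illustration, the date column is assumed to be datetime64, and the self-merge builds every pair of rows per id, so it may use a lot of memory when ids have many rows.

```python
import pandas as pd

# Sample data mirroring the frame shown above (the y column is omitted)
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2012-01-01", "2012-01-03", "2012-01-09", "2012-01-12",
        "2012-01-14", "2012-01-16", "2012-01-01", "2012-01-02",
        "2012-01-03", "2012-01-04", "2012-01-09", "2012-01-10",
    ]),
    "id": [1] * 6 + [2] * 6,
})

def retention_counts(frame, days=7):
    """Hypothetical helper: per date, distinct ids seen again 1..days days later."""
    # Self-join on id: every (date, date_later) pair for the same id
    pairs = frame.merge(frame, on="id", suffixes=("", "_later"))
    # Gap in whole days between the two appearances
    gap = (pairs["date_later"] - pairs["date"]).dt.days
    # Same condition as BETWEEN date + 1 AND date + days (inclusive on both ends)
    retained = pairs[gap.between(1, days)]
    # COUNT(DISTINCT id) ... GROUP BY date
    return retained.groupby("date")["id"].nunique()

print(retention_counts(df, days=7))
```

Unlike the f7 column, which covers the current day and the following 6 days, this window runs from 1 to 7 days after each date, as in the SQL. On the sample data both give the same counts, and dates with no returning ids are simply absent from the result. Calling retention_counts(df, days=30) gives the 30-day figure.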