Python: number of unique users grouped by year and week

python python-3.x pandas

I have a report_date column that I want to aggregate by year-week, counting the number of unique users in each week.

import pandas as pd
from io import StringIO

datastring = StringIO("""\
report_date  user_id
2015-12-01         1
2015-12-01         2
2015-12-01         2
2015-12-02         2
2015-12-02         3
2016-01-01         1
""")

df = pd.read_table(datastring, sep=r'\s\s+', engine='python')
df['report_date'] = pd.to_datetime(df['report_date'])

My desired output:

2015-48    3
2016-00    1

I have come up with a solution (posted below), but it is relatively slow on a larger dataset (>1MM rows). I am curious whether there is a better way to do this.

(df.assign(report_week=lambda x: x.report_date.dt.strftime('%Y-%W'))
  .groupby('report_week')
  .user_id
  .nunique()
)

Edit: In the end I adapted @EdChum's suggestion to get rid of cases like "2016-53" for a report date of 2016-01-01, by grouping on the week number modulo 53:

df.groupby([df.report_date.dt.year, df.report_date.dt.week.mod(53)]).user_id.nunique()
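A note for newer pandas versions: Series.dt.week was deprecated in pandas 1.1 and removed in pandas 2.0. Below is a minimal sketch of the same modulo-53 grouping, assuming pandas 1.1 or later and using dt.isocalendar().week instead. The sample frame is rebuilt inline, and the names iso_week and weekly_users are illustrative only, not part of the original post.

```python
import pandas as pd

# Same sample data as in the question, built directly as a DataFrame.
df = pd.DataFrame({
    "report_date": pd.to_datetime(
        ["2015-12-01", "2015-12-01", "2015-12-01",
         "2015-12-02", "2015-12-02", "2016-01-01"]),
    "user_id": [1, 2, 2, 2, 3, 1],
})

# ISO week number per row; cast to a plain integer dtype before the modulo.
iso_week = df["report_date"].dt.isocalendar().week.astype("int64")

# Group on (calendar year, ISO week mod 53), so week 53 is folded into week 0.
weekly_users = (
    df.groupby([df["report_date"].dt.year.rename("year"),
                (iso_week % 53).rename("week")])["user_id"]
      .nunique()
)
print(weekly_users)
```

With the sample data this yields 3 unique users for (2015, 49) and 1 for (2016, 0). Note that the week numbers are ISO weeks, so they can differ by one from the '%W' numbering used in the strftime version.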

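Since your column is already datetime, there is no need to convert it to a string and group on the string. We can group on the datetime components and then just call nunique: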

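You could try

df.groupby([df['report_date'].dt.year, df['report_date'].dt.week])['user_id'].nunique()

They seem to be the same speed on this dataset. I will try it on my large dataset.

Yes, it is 20x faster on my large dataset [300 ms vs. 6 s]. Please post your answer below.

How is that possible? Can you post data and code to show it?

I edited the dataset in the question by modifying the last row. Your code prints the following for me:

2016 53 1

It is as if dt.week used "retail" weeks.

I think this has to do with the date 2016-01-01 falling in week 53, because it overflows from the previous year; the date 2016-01-04 gives 1 as the week value. Maybe the second element in the groupby should be:

df['report_date'].dt.week.mod(53)

I don't know what you should do in this case, because in a sense the result is technically correct.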

In [108]:
df.groupby([df['report_date'].dt.year, df['report_date'].dt.week])['user_id'].nunique()

Out[108]:
report_date  report_date
2015         49             3
2016         53             1
Name: user_id, dtype: int64
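
The week 53 result discussed in the comments follows from ISO 8601 week numbering, where the first days of January can still belong to the last week of the previous ISO year. The sketch below shows this with the standard library and then, as one possible alternative to the modulo-53 workaround, groups on the ISO year together with the ISO week so the two keys stay consistent. It assumes pandas 1.1 or later; the names iso and weekly_users_iso are illustrative and not from the original answer.

```python
import datetime as dt
import pandas as pd

# ISO calendar tuples are (ISO year, ISO week, ISO weekday).
iso_jan1 = tuple(dt.date(2016, 1, 1).isocalendar())  # (2015, 53, 5)
iso_jan4 = tuple(dt.date(2016, 1, 4).isocalendar())  # (2016, 1, 1)
print(iso_jan1, iso_jan4)

# Same sample data as in the question.
df = pd.DataFrame({
    "report_date": pd.to_datetime(
        ["2015-12-01", "2015-12-01", "2015-12-01",
         "2015-12-02", "2015-12-02", "2016-01-01"]),
    "user_id": [1, 2, 2, 2, 3, 1],
})

# isocalendar() returns a frame with 'year', 'week' and 'day' columns.
iso = df["report_date"].dt.isocalendar().astype("int64")

# Grouping on ISO year and ISO week keeps 2016-01-01 in (2015, 53)
# instead of producing the mixed key (2016, 53).
weekly_users_iso = df.groupby([iso["year"], iso["week"]])["user_id"].nunique()
print(weekly_users_iso)
```

Whether (2015, 53) or a "week 0 of 2016" bucket is the right answer depends on how the report defines a week, so this is a design choice rather than a correction of the answer above.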