Python: get the total count of events for different groups in pandas


I have a DataFrame structured like this:

country product     date_install    date_purchase   user_id
BR      yearly      2020-11-01      2020-11-01      10660236
CA      monthly     2020-11-01      2020-11-01      10649441
US      yearly      2020-11-01      trialed         10660272
IT      monthly     2020-11-01      2020-11-01      10657634
AE      monthly     2020-11-01      2020-11-01      10661442
IT      monthly     2020-11-01      trialed         10657634
AE      monthly     2020-11-01      trialed         10661442

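I am trying to get the purchase/install ratio per country, product, and date, as well as the actual number of installs and purchases. date_install is the install date. date_purchase holds the purchase date and shows that a purchase happened; a value of trialed in date_purchase means that the user with that user_id made no purchase.
The desired output should look like this: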
country product     date_install        installs    purchases       ratio
US      daily       2021-02-05          100         20              0.2
US      monthly     2021-02-05          100         50              0.5
US      yearly      2021-02-05          100         50              0.5             
US      trialed     2021-02-05          100         0               0    
# the next day
US      daily       2021-02-06          500         50              0.1
US      monthly     2021-02-06          500         100             0.2
US      yearly      2021-02-06          500         250             0.5             
US      trialed     2021-02-06          500         0               0    
# the rest of the countries & the rest of the days

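installs would be the total count of date_install for that day, country, and product, and purchases would be the total count of date_purchase events for each day, country, and product.
The idea is that, for a given country and day, people have installed an app; some of them have purchased a product and others have not. Those who purchased have a date in date_purchase, while those who did not have the value trialed there. The total number of users who installed the app is the count of date_install per country, product, and install date.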
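To make these definitions concrete, here is a minimal sketch that rebuilds the sample rows from the question by hand and counts installs and purchases per group. The purchased column is a hypothetical helper introduced for illustration; it is not part of the original data.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    'country': ['BR', 'CA', 'US', 'IT', 'AE', 'IT', 'AE'],
    'product': ['yearly', 'monthly', 'yearly', 'monthly',
                'monthly', 'monthly', 'monthly'],
    'date_install': ['2020-11-01'] * 7,
    'date_purchase': ['2020-11-01', '2020-11-01', 'trialed', '2020-11-01',
                      '2020-11-01', 'trialed', 'trialed'],
    'user_id': [10660236, 10649441, 10660272, 10657634,
                10661442, 10657634, 10661442],
})

keys = ['country', 'product', 'date_install']

# Hypothetical helper column: True where a purchase happened, False for 'trialed'.
df['purchased'] = df['date_purchase'].ne('trialed')

# Installs: every row in the group counts.
installs = df.groupby(keys).size()

# Purchases: only rows with a real purchase date count.
purchases = df.groupby(keys)['purchased'].sum()

print(installs)
print(purchases)
```

For IT/monthly, for example, the two rows give two installs and one purchase.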
What I tried:
exp = (df.groupby(['country','product','date_install']).count()
         .sort_values('date_install',ascending=False).reset_index())

exp.groupby(['country','product','date_install'])['date_purchase'].sum().reset_index()
exp['total_installs'] = exp.groupby(['country','product','date_install'])['date_purchase'].sum().reset_index()

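But I get an error:
ValueError: Wrong number of items passed 4, placement implies 1
I don't think the way I am trying to achieve this is right. What is the best way/logic to get the desired result?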
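One likely reading of this error, sketched on a tiny made-up frame: reset_index() turns the grouped result into a DataFrame with four columns (the three group keys plus date_purchase), and a four-column DataFrame cannot be stored under the single column name total_installs. The data below is invented for illustration, and the exact error text differs between pandas versions.

```python
import pandas as pd

# Tiny made-up frame with the same column names as in the question.
df = pd.DataFrame({
    'country': ['US', 'US', 'CA'],
    'product': ['yearly', 'monthly', 'monthly'],
    'date_install': ['2020-11-01'] * 3,
    'date_purchase': ['2020-11-01', 'trialed', '2020-11-01'],
    'user_id': [1, 2, 3],
})

keys = ['country', 'product', 'date_install']
exp = df.groupby(keys).count().reset_index()

# The right-hand side is a DataFrame: 3 key columns + 1 value column.
rhs = exp.groupby(keys)['date_purchase'].sum().reset_index()
print(rhs.shape[1])

try:
    exp['total_installs'] = rhs  # one column name, four columns of data
    failed = False
except ValueError as err:
    failed = True
    print('ValueError:', err)

# transform returns one value per row of exp, so it fits a single column.
exp['total_installs'] = exp.groupby(keys)['date_purchase'].transform('sum')
```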
Update
After using @jezrael's answer:
df['date_purchase'] = df['date_purchase'].replace('trialed', np.nan)

exp = (df.groupby(['country','product','date_install']).agg(installs = ('date_purchase','size'), purchases = ('date_purchase','count')))
exp['ratio'] = exp['purchases'].div(exp['installs'])
exp = exp.reset_index()

This returns:
country     product         date_install        installs    purchases   ratio
US          catalog30US     2020-11-18          1           1           1.0
US          trialed         2020-11-18          4924        0           0.0
US          renders.100     2020-11-18          2           2           1.0
US          renders.20      2020-11-18          3           3           1.0
US          monthly         2020-11-18          37          37          1.0
US          yearly          2020-11-18          6           6           1.0
US          textures        2020-11-18          1           1           1.0

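This is not correct, because installs in each row should be the total number of installs for the given country and date_install.
In the output above, the installs value for a country and day needs to be the sum of all installs for that country and day. In this case every installs value needs to be 1+4924+2+3+37+6+1, which is the real install count for the given country and date; only then does the ratio make sense. Right now installs == purchases, which is not the case. I am trying to answer: for a given date and country, how many people installed, how many purchased each product, and what is the ratio?
I need it to be: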
country     product         date_install        installs    purchases   ratio
US          catalog30US     2020-11-18          4974        1           1 / 4974
US          trialed         2020-11-18          4974        0           0.0
US          renders.100     2020-11-18          4974        2           2 / 4974
US          renders.20      2020-11-18          4974        3           3 / 4974
US          monthly         2020-11-18          4974        37          37 / 4974
US          yearly          2020-11-18          4974        6           6 / 4974
US          textures        2020-11-18          4974        1           1 / 4974
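The target numbers follow from plain arithmetic on the output posted above. The sketch below retypes those per-product counts by hand (one country, one install date) and recomputes the total and the ratios; the variable names are assumptions made for illustration.

```python
import pandas as pd

# Per-product counts copied from the output returned above (US, 2020-11-18).
returned = pd.DataFrame({
    'product': ['catalog30US', 'trialed', 'renders.100', 'renders.20',
                'monthly', 'yearly', 'textures'],
    'installs': [1, 4924, 2, 3, 37, 6, 1],
    'purchases': [1, 0, 2, 3, 37, 6, 1],
})

# All rows share one country and date, so the real install count is the column sum.
total_installs = returned['installs'].sum()
print(total_installs)  # 4974

# Every row gets the same total; the ratio is purchases divided by that total.
wanted = returned.assign(installs=total_installs,
                         ratio=returned['purchases'] / total_installs)
print(wanted)
```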

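I think you need to aggregate with size (a count that includes missing values) and count (a count that excludes missing values), then sum installs per country and install date, and finally divide the columns: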
df['date_purchase'] = df['date_purchase'].replace('trialed', np.nan)

exp = (df.groupby(['country','product','date_install'])
         .agg(installs = ('date_purchase','size'), purchases = ('date_purchase','count')))

#sum per country and install date
exp['installs'] = exp.groupby(['country','date_install'])['installs'].transform('sum')
exp['ratio'] = exp['purchases'].div(exp['installs'])

exp = exp.reset_index()
print (exp)
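As a quick check, the same steps can be run on the seven sample rows from the question. This is a minimal sketch with the DataFrame rebuilt by hand, so the result only reflects those rows.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    'country': ['BR', 'CA', 'US', 'IT', 'AE', 'IT', 'AE'],
    'product': ['yearly', 'monthly', 'yearly', 'monthly',
                'monthly', 'monthly', 'monthly'],
    'date_install': ['2020-11-01'] * 7,
    'date_purchase': ['2020-11-01', '2020-11-01', 'trialed', '2020-11-01',
                      '2020-11-01', 'trialed', 'trialed'],
    'user_id': [10660236, 10649441, 10660272, 10657634,
                10661442, 10657634, 10661442],
})

# 'trialed' becomes NaN so that count skips it while size still counts the row.
df['date_purchase'] = df['date_purchase'].replace('trialed', np.nan)

exp = (df.groupby(['country', 'product', 'date_install'])
         .agg(installs=('date_purchase', 'size'),
              purchases=('date_purchase', 'count')))

# Replace per-product installs with the total per country and install date.
exp['installs'] = exp.groupby(['country', 'date_install'])['installs'].transform('sum')
exp['ratio'] = exp['purchases'].div(exp['installs'])

exp = exp.reset_index()
print(exp)
```

In these sample rows each country has only one product, so the transform step leaves installs unchanged. It only makes a difference when a country has several products on the same install date, as in the US output shown in the question.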

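I get an error: 'purchases'
@JonasPalačionis oops, typo.
I updated my question, do you know what the problem might be? This is not correct, because installs in each row should be the total number of installs for the given country and date_install.
Sorry, I don't understand.
In the output I added to the question, the installs value for a country and day needs to be the sum of all installs for that country and day. In this case every installs value needs to be 1+4924+2+3+37+6+1, which is the real install count for the given country and date; only then does the ratio make sense. Right now installs = purchases, which is not correct. I am trying to answer: for a given date and country, how many people installed, how many purchased each product, and what is the ratio?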