Python pivot table of new and old customers
I want to build a pivot table that shows how many new and returning ("old") customers made a purchase in each month. My data (df) looks like this:

   customer  purchase_id  payment_status  price  currency  payment_date
1  Andy      6            REPAID          100    GBP       2020-04-16
2  Randy     10           IN_PROGRESS     10000  SEK       2020-04-17

Expected output:

     new_customers  old_customers
Jan  1              3
Feb  5              2

I am stuck at:

df['year'] = df['payment_date'].dt.year
df['month'] = df['payment_date'].dt.month
# number of purchases per customer
df2 = pd.DataFrame(df.groupby("customer", sort=False)["purchase_id"].count())
df2 = df2.reset_index()
df2.columns = ['customer', 'number_of_purchase']
# more than one purchase means a returning customer
df2['repeat_customer'] = np.where(df2['number_of_purchase'] > 1, 'old_customers', 'new_customers')

I don't know how to integrate df2 with df, where a customer can appear in df multiple times, each time with a different purchase_id, price and payment_date.

The last part of the code could look like this:

df.groupby(["year","month", "repeat_customers"])["repeat_customers"].count()

But feel free to change my code; the output matters more.

- Your sample data doesn't have enough variety, so I generated a random dataset that matches its structure.
- Working with the start of each month is much simpler, so convert the payment dates to month starts.
- "New customer" needs a definition; I inferred one from your code.
- With that definition, the count can be computed directly.
- Finally, reshape the result into the layout you need.
- You may want to format the month in the output DataFrame.
import numpy as np
import pandas as pd

# random sample data with the same structure as the question's df
d = pd.date_range("01-Jan-2020", periods=10, freq="W")
c = ['tenetur', 'quae', 'rem', 'maxime', 'sunt']
df = pd.DataFrame({"customer": np.random.choice(c, len(d)),
                   "purchase_id": np.random.randint(1, 10, len(d)),
                   "payment_status": np.random.choice(["REPAID", "IN_PROGRESS"], len(d)),
                   "price": np.random.randint(100, 10000, len(d)),
                   "currency": np.random.choice(["GBP", "SEK"], len(d)),
                   "payment_date": d})

# only interested in the month start
df2 = (
    df.assign(
        ms=df.payment_date - pd.to_timedelta(df.payment_date.dt.day - 1, unit="D"),
        # first month in which each customer made a purchase (rows are in date order)
        fms=lambda dfa: dfa.groupby("customer")["ms"].transform("first"),
        # if the purchase month equals the customer's first purchase month, the customer is new
        new_customer=lambda dfa: np.where(dfa.ms == dfa.fms, "new_customers", "old_customers"),
    )
    # with the prep it's a simple count
    .groupby(["ms", "new_customer"])["customer"].count()
    # format the results
    .to_frame().unstack(1).fillna(0).droplevel(0, axis=1).rename_axis("", axis=1)
)
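
As a variation on the same idea, the sketch below counts distinct customers rather than purchase rows and formats the month labels, which is closer to the Jan/Feb layout in the expected output. It is only an illustration under assumptions: the helper name `count_new_vs_returning`, its `month_format` parameter and the sample rows are made up here, and "new" is assumed to mean "this is the month of the customer's first purchase".

```python
import numpy as np
import pandas as pd


def count_new_vs_returning(df, month_format="%b %Y"):
    """Count distinct new and returning customers per month (hypothetical helper)."""
    # Month start of every purchase, e.g. 2020-04-16 -> 2020-04-01
    month = df["payment_date"].dt.to_period("M").dt.to_timestamp()
    # Earliest purchase month of each customer, repeated on each of their rows
    first_month = month.groupby(df["customer"]).transform("min")
    # Assumed definition: "new" in the first purchase month, "old" afterwards
    kind = np.where(month == first_month, "new_customers", "old_customers")

    out = (
        df.assign(month=month, kind=kind)
        .groupby(["month", "kind"])["customer"]
        .nunique()  # distinct customers, so repeat purchases in a month count once
        .unstack("kind", fill_value=0)
        .reindex(columns=["new_customers", "old_customers"], fill_value=0)
    )
    # Turn the month-start timestamps into readable labels
    out.index = out.index.strftime(month_format)
    out.columns.name = None
    return out


# Made-up sample rows, only to exercise the helper
sample = pd.DataFrame({
    "customer": ["Andy", "Randy", "Andy", "Andy", "Cindy", "Randy", "Dana"],
    "purchase_id": [1, 2, 3, 4, 5, 6, 7],
    "payment_date": pd.to_datetime([
        "2020-01-05", "2020-01-20", "2020-01-25",
        "2020-02-03", "2020-02-10", "2020-02-15", "2020-03-01",
    ]),
})

result = count_new_vs_returning(sample)
print(result)
```

Using `nunique` instead of `count` is a design choice: a customer who buys twice in the same month is counted once. If purchases rather than customers should be counted, swapping in `count` gives the same behaviour as the code above.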