Python: creating DataFrame groups from cyclical data
I have some pricing data that looks like this:
import pandas as pd

df = pd.DataFrame([['A', '1', '2015-02-01', 20.00, 20.00, 5],
                   ['A', '1', '2015-02-06', 16.00, 20.00, 8],
                   ['A', '1', '2015-02-14', 14.00, 20.00, 34],
                   ['A', '1', '2015-03-20', 20.00, 20.00, 5],
                   ['A', '1', '2015-03-25', 15.00, 20.00, 15],
                   ['A', '2', '2015-02-01', 75.99, 100.00, 22],
                   ['A', '2', '2015-02-23', 100.00, 100.00, 30],
                   ['A', '2', '2015-03-25', 65.00, 100.00, 64],
                   ['B', '3', '2015-04-01', 45.00, 45.00, 15],
                   ['B', '3', '2015-04-16', 40.00, 45.00, 2],
                   ['B', '3', '2015-04-18', 45.00, 45.00, 30],
                   ['B', '4', '2015-07-25', 5.00, 10.00, 55]],
                  columns=['dept', 'sku', 'date', 'price', 'orig_price', 'days_at_price'])
# parse the date strings so they can be compared and offset later
df['date'] = pd.to_datetime(df['date'])
print(df)
   dept sku       date   price  orig_price  days_at_price
0     A   1 2015-02-01   20.00       20.00              5
1     A   1 2015-02-06   16.00       20.00              8
2     A   1 2015-02-14   14.00       20.00             34
3     A   1 2015-03-20   20.00       20.00              5
4     A   1 2015-03-25   15.00       20.00             15
5     A   2 2015-02-01   75.99      100.00             22
6     A   2 2015-02-23  100.00      100.00             30
7     A   2 2015-03-25   65.00      100.00             64
8     B   3 2015-04-01   45.00       45.00             15
9     B   3 2015-04-16   40.00       45.00              2
10    B   3 2015-04-18   45.00       45.00             30
11    B   4 2015-07-25    5.00       10.00             55
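I want to describe pricing cycles. A cycle is the period in which a sku goes from its original price to one or more promotional prices and then back to the original price. A cycle must start at the original price. Cycles in which the price never changes are fine, and so are cycles in which the price drops and never returns. However, an initial price below the original price does not count as a cycle. For the df above, the result I would like is: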
  dept sku  cycle  orig_price_days  promo_days
0    A   1      1                5          42
1    A   1      2                5          15
2    A   2      1               30          64
3    B   3      1               15           2
4    B   3      2               30           0
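I have tried groupby and sum, but I can't work out how to define a cycle and total the rows accordingly. Any help would be greatly appreciated.

Try using loc rather than groupby: you want to slice each SKU over a period of time, not aggregate groups. A for loop, used in moderation, also helps here and is not especially un-pandas-like, at least if, like me, you think of it as looping over slices of the unique values.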
from datetime import timedelta

df['cycle'] = -1  # create a column for the cycle
skus = df.sku.unique()  # get unique skus for iteration
for sku in skus:
    # Get the start date for each cycle for this sku
    # NOTE that we define cycles as beginning
    # when the price equals the original price
    # This avoids the mentioned issue that a cycle should not start
    # if initial is less than original.
    cycle_start_dates = df.loc[(df.sku == sku) &
                               (df.price == df.orig_price),
                               'date'].tolist()
    # append a terminal date
    cycle_start_dates.append(df.date.max() + timedelta(1))
    # Assign the cycle values
    for i in range(len(cycle_start_dates) - 1):
        df.loc[(df.sku == sku) &
               (cycle_start_dates[i] <= df.date) &
               (df.date < cycle_start_dates[i+1]), 'cycle'] = i + 1
Once you have the cycle column, the aggregation becomes relatively simple. This multiple aggregation:
# named aggregation; the older dict-renaming form of agg() no longer works in pandas 1.0+
df.groupby(['dept', 'sku', 'cycle'])['days_at_price']\
  .agg(promo_days=lambda x: x.iloc[1:].sum(),
       orig_price_days=lambda x: x.iloc[:1].sum())\
  .reset_index()
gives you the desired result:
  dept sku  cycle  promo_days  orig_price_days
0    A   1      1          42                5
1    A   1      2          15                5
2    A   2     -1           0               22
3    A   2      1          64               30
4    B   3      1           2               15
5    B   3      2           0               30
6    B   4     -1           0               55
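Note that this contains extra rows with cycle -1 for the pre-cycle periods, where the first known price is below the original price.

I am very close to producing the desired final result: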
import numpy as np

# add a column to track whether price is above/below/equal to orig
df['reg'] = np.sign(df.price - df.orig_price)
# remove row where first known price for sku is promo
df_gp = df.groupby(['dept', 'sku'])
df = df[~((df_gp.cumcount() == 0) & (df.reg == -1))].copy()
# enumerate all the individual pricing cycles
df['cycle'] = (df.reg == 0).cumsum()
# group/aggregate to get days at orig vs. promo pricing
# (named aggregation; the older dict-renaming form of agg() no longer works in pandas 1.0+)
cycles = df.groupby(['dept', 'sku', 'cycle'])['days_at_price'].agg(
    reg_days=lambda x: x.iloc[:1].sum(),
    promo_days=lambda x: x.iloc[1:].sum())
print(cycles.reset_index())
  dept sku  cycle  reg_days  promo_days
0    A   1      1         5          42
1    A   1      2         5          15
2    A   2      3        30          64
3    B   3      4        15           2
4    B   3      5        30           0
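The only part I can't quite figure out is how to restart the cycle number for each sku before the groupby.

As specified in your example, date, price, orig_price and days_at_price are all strings. I'm not sure that is what you intended. sku is a string too, but that doesn't seem to be a problem.

Thanks @pml, that was unintentional. Fixed now.

Thanks @pml, chunking by SKU with loc indexing makes sense. I should have mentioned that my full data set has several million rows. I am applying this to it now, but getting through the nested loops takes a long time. Is there a faster way to find the cycle start dates?

Look at the ffill method: assign a start value to each cycle, e.g. 1, 2, 3 for each interval, then fill that column downward with the assigned cycle values. Alternatively, you could split the dataframe by location and parallelize the process. Since each location slice is independent, that should be straightforward too. Or you could combine df.sku == sku & cycle_start_dates[i] ...

Thanks @pml for the tips and tricks. I used some of them to get close to my answer, which I will post.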
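One way to restart the cycle number for each sku without a Python loop is to take the cumulative sum of the "price equals original price" flag within each (dept, sku) group rather than over the whole frame. The following is only a sketch under stated assumptions, not the method from the thread: it assumes rows are in chronological order within each sku, and the names at_orig and result are illustrative. Rows that precede the first original-price row of a sku get cycle 0 and are dropped.

```python
import pandas as pd

rows = [
    ['A', '1', '2015-02-01', 20.00, 20.00, 5],
    ['A', '1', '2015-02-06', 16.00, 20.00, 8],
    ['A', '1', '2015-02-14', 14.00, 20.00, 34],
    ['A', '1', '2015-03-20', 20.00, 20.00, 5],
    ['A', '1', '2015-03-25', 15.00, 20.00, 15],
    ['A', '2', '2015-02-01', 75.99, 100.00, 22],
    ['A', '2', '2015-02-23', 100.00, 100.00, 30],
    ['A', '2', '2015-03-25', 65.00, 100.00, 64],
    ['B', '3', '2015-04-01', 45.00, 45.00, 15],
    ['B', '3', '2015-04-16', 40.00, 45.00, 2],
    ['B', '3', '2015-04-18', 45.00, 45.00, 30],
    ['B', '4', '2015-07-25', 5.00, 10.00, 55],
]
df = pd.DataFrame(rows, columns=['dept', 'sku', 'date', 'price',
                                 'orig_price', 'days_at_price'])
df['date'] = pd.to_datetime(df['date'])

# Assumption: cycles are read off in date order within each sku.
df = df.sort_values(['dept', 'sku', 'date']).reset_index(drop=True)

# True on rows where the sku sits at its original price (a cycle start).
at_orig = df['price'].eq(df['orig_price'])

# Running count of cycle starts, restarted for every (dept, sku) group.
# Rows before the first cycle start of a sku keep the value 0.
df['cycle'] = at_orig.astype(int).groupby([df['dept'], df['sku']]).cumsum()

# Split the day counts into original-price days and promo days per row.
df['orig_price_days'] = df['days_at_price'].where(at_orig, 0)
df['promo_days'] = df['days_at_price'].where(~at_orig, 0)

# Drop the pre-cycle rows (cycle 0) and total the days per cycle.
result = (df[df['cycle'] > 0]
          .groupby(['dept', 'sku', 'cycle'], as_index=False)
          [['orig_price_days', 'promo_days']]
          .sum())
print(result)
```

On the sample data this yields the five rows of the desired result, with cycle numbered 1, 2 for sku 1, 1 for sku 2, and 1, 2 for sku 3. Because everything is expressed as column operations and a single groupby, it avoids the nested loops that were slow on several million rows.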