Python: creating DataFrame groups from cyclical data


I have some pricing data that looks like this:

import pandas as pd

# Dates are quoted so that they are read as dates, not as integer arithmetic
df = pd.DataFrame([['A', '1', '2015-02-01', 20.00, 20.00, 5],
                   ['A', '1', '2015-02-06', 16.00, 20.00, 8],
                   ['A', '1', '2015-02-14', 14.00, 20.00, 34],
                   ['A', '1', '2015-03-20', 20.00, 20.00, 5],
                   ['A', '1', '2015-03-25', 15.00, 20.00, 15],
                   ['A', '2', '2015-02-01', 75.99, 100.00, 22],
                   ['A', '2', '2015-02-23', 100.00, 100.00, 30],
                   ['A', '2', '2015-03-25', 65.00, 100.00, 64],
                   ['B', '3', '2015-04-01', 45.00, 45.00, 15],
                   ['B', '3', '2015-04-16', 40.00, 45.00, 2],
                   ['B', '3', '2015-04-18', 45.00, 45.00, 30],
                   ['B', '4', '2015-07-25', 5.00, 10.00, 55]],
                  columns=['dept', 'sku', 'date', 'price', 'orig_price', 'days_at_price'])
df['date'] = pd.to_datetime(df['date'])  # real datetimes allow date comparisons
print(df)

   dept sku        date   price  orig_price  days_at_price
0     A   1  2015-02-01   20.00       20.00              5
1     A   1  2015-02-06   16.00       20.00              8
2     A   1  2015-02-14   14.00       20.00             34
3     A   1  2015-03-20   20.00       20.00              5
4     A   1  2015-03-25   15.00       20.00             15
5     A   2  2015-02-01   75.99      100.00             22
6     A   2  2015-02-23  100.00      100.00             30
7     A   2  2015-03-25   65.00      100.00             64
8     B   3  2015-04-01   45.00       45.00             15
9     B   3  2015-04-16   40.00       45.00              2
10    B   3  2015-04-18   45.00       45.00             30
11    B   4  2015-07-25    5.00       10.00             55

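I want to describe pricing cycles. A cycle is the period in which a sku goes from its original price to a promotional price (or several promotional prices) and back to the original price. A cycle must start at the original price. It is fine to include cycles where the price never changes, as well as cycles where the price is reduced and never returns. But an initial price below the original price is not considered a cycle. For the df above, the result I would like is: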

 dept sku cycle orig_price_days promo_days
0   A   1     1               5         42
1   A   1     2               5         15
2   A   2     1              30         64
3   B   3     1              15          2
4   B   3     2              30          0

I have used groupby and sum, but can't quite figure out how to define a cycle and total the rows accordingly. Any help would be greatly appreciated.

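Try using loc instead of groupby - you want chunks of skus over time periods, not aggregated groups. A for-loop, used in moderation, also helps here and is not particularly un-pandas-like. At least not if, like me, you consider looping over unique array slices to be pandas-like.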

import pandas as pd

df['cycle'] = -1  # create a column for the cycle
skus = df.sku.unique()  # get unique skus for iteration

for sku in skus:
    # Get the start date for each cycle for this sku
    # NOTE that we define cycles as beginning
    #   when the price equals the original price
    # This avoids the mentioned issue that a cycle should not start
    #   if initial is less than original.
    cycle_start_dates = df.loc[(df.sku == sku) &
                               (df.price == df.orig_price),
                               'date'].tolist()

    # append a terminal date, one day past the last date in the data
    # (assumes df.date holds datetime values, e.g. via pd.to_datetime)
    cycle_start_dates.append(df.date.max() + pd.Timedelta(days=1))

    # Assign the cycle values
    for i in range(len(cycle_start_dates) - 1):
        df.loc[(df.sku == sku) &
               (cycle_start_dates[i] <= df.date) &
               (df.date < cycle_start_dates[i+1]), 'cycle'] = i+1

Once you have the cycle column, aggregation becomes relatively simple. This multiple aggregation:

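# Named aggregation; the dict-of-names form was removed in pandas 1.0
# The first row of each cycle is at the original price, the rest are promo rows
df.groupby(['dept', 'sku', 'cycle'])['days_at_price']\
  .agg(promo_days=lambda x: x.iloc[1:].sum(),
       orig_price_days=lambda x: x.iloc[:1].sum())\
  .reset_index()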

will give you the desired result:

  dept sku  cycle  promo_days  orig_price_days
0    A   1      1          42                5
1    A   1      2          15                5
2    A   2     -1           0               22
3    A   2      1          64               30
4    B   3      1           2               15
5    B   3      2           0               30
6    B   4     -1           0               55

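Note that this has extra rows with cycle -1 for the pre-cycle periods, where the first known price is below the original price.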
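If those pre-cycle rows are not wanted, one option is to filter them out after aggregating. The sketch below is only an illustration: `result` is a hypothetical name for the aggregated frame, rebuilt here by hand from the output shown above so the snippet runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for the aggregated result shown above.
result = pd.DataFrame({
    'dept': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'sku': ['1', '1', '2', '2', '3', '3', '4'],
    'cycle': [1, 2, -1, 1, 1, 2, -1],
    'promo_days': [42, 15, 0, 64, 2, 0, 0],
    'orig_price_days': [5, 5, 22, 30, 15, 30, 55],
})

# Keep only real cycles; -1 marks rows that come before
# the first original-price date of a sku.
cycles_only = result[result['cycle'] != -1].reset_index(drop=True)
print(cycles_only)
```

This leaves the five rows from the desired output in the question.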

I've gotten very close to producing the desired end result:

import numpy as np

# add a column to track whether price is above/below/equal to orig
df['reg'] = np.sign(df.price - df.orig_price)

# remove row where first known price for sku is promo
# (.copy() avoids a SettingWithCopyWarning on the assignment below)
df_gp = df.groupby(['dept', 'sku'])
df = df[~((df_gp.cumcount() == 0) & (df.reg == -1))].copy()

# enumerate all the individual pricing cycles
df['cycle'] = (df.reg == 0).cumsum()

# group/aggregate to get days at orig vs. promo pricing
# (named aggregation; the dict-of-names form was removed in pandas 1.0)
cycles = df.groupby(['dept', 'sku', 'cycle'])['days_at_price'].agg(
    reg_days=lambda x: x.iloc[:1].sum(),
    promo_days=lambda x: x.iloc[1:].sum())

print(cycles.reset_index())

  dept  sku  cycle  reg_days  promo_days
0    A    1      1         5          42
1    A    1      2         5          15
2    A    2      3        30          64
3    B    3      4        15           2
4    B    3      5        30           0

The only part I couldn't quite figure out is how to restart the cycle number for each sku before the groupby.
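One possible way to restart the numbering is to take the cumulative sum within each (dept, sku) group instead of over the whole frame. This is a sketch, not part of the original answer. It assumes the rows are already sorted by sku and date, as in the sample data, and it omits the date column because it is not needed for this step.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['A', '1', 20.00, 20.00, 5], ['A', '1', 16.00, 20.00, 8],
     ['A', '1', 14.00, 20.00, 34], ['A', '1', 20.00, 20.00, 5],
     ['A', '1', 15.00, 20.00, 15], ['A', '2', 75.99, 100.00, 22],
     ['A', '2', 100.00, 100.00, 30], ['A', '2', 65.00, 100.00, 64],
     ['B', '3', 45.00, 45.00, 15], ['B', '3', 40.00, 45.00, 2],
     ['B', '3', 45.00, 45.00, 30], ['B', '4', 5.00, 10.00, 55]],
    columns=['dept', 'sku', 'price', 'orig_price', 'days_at_price'])

# -1 = promo price, 0 = original price
df['reg'] = np.sign(df['price'] - df['orig_price'])

# Drop rows where the first known price for a sku is a promo price.
first_is_promo = (df.groupby(['dept', 'sku']).cumcount() == 0) & (df['reg'] == -1)
df = df[~first_is_promo].copy()

# Count original-price rows cumulatively, restarting for each (dept, sku).
at_orig = (df['reg'] == 0).astype(int)
df['cycle'] = at_orig.groupby([df['dept'], df['sku']]).cumsum()

cycles = (df.groupby(['dept', 'sku', 'cycle'])['days_at_price']
            .agg(reg_days=lambda x: x.iloc[:1].sum(),
                 promo_days=lambda x: x.iloc[1:].sum())
            .reset_index())
print(cycles)
```

With the sample data the cycle column now reads 1, 2, 1, 1, 2 rather than 1 through 5. Only a single leading promo row per sku is dropped, so if a sku started with several promo rows in a row, the remaining ones would end up in cycle 0.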

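As specified in your example, date, price, orig_price and days_at_price are all strings. I'm not sure that's what you wanted... sku is too, but that seems to be less of a problem.

Thanks @pml, that was unintentional. Fixed now.

Thanks @pml - chunking the SKUs with loc indexing makes sense. I should have mentioned that my full set has several million rows. I'm applying this to it now, but it takes a long time to get through the nested loops. Is there a faster way to do the cycle start dates?

Look at the ffill method - assign start values to the cycles, e.g. 1, 2, 3 for each interval, then fill the column downward with the assigned cycle values. Alternatively, you could split the dataframe by loc slice and parallelize the process. Since each loc slice is independent, that should be straightforward too. Or you could combine the df.sku == sku & cycle_start_dates[i]

Thanks @pml for the tips/tricks. I used some of them to get close to my answer, which I'll post.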
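The ffill suggestion in the comments could be sketched roughly as follows. This is one interpretation of the tip, not code from the thread. It uses a reduced sample with only the columns needed, and it assumes the rows are sorted by sku and then by date. The names `is_start` and `start_number` are made up for the illustration.

```python
import pandas as pd

# Small illustrative sample, assumed sorted by sku and then by date.
df = pd.DataFrame({
    'sku':        ['1', '1', '1', '1', '1', '2', '2', '2'],
    'price':      [20.0, 16.0, 14.0, 20.0, 15.0, 75.99, 100.0, 65.0],
    'orig_price': [20.0, 20.0, 20.0, 20.0, 20.0, 100.0, 100.0, 100.0],
})

# A cycle starts on every row where the price equals the original price.
is_start = df['price'] == df['orig_price']

# Number the start rows 1, 2, 3, ... within each sku; other rows become NaN.
start_number = is_start.astype(int).groupby(df['sku']).cumsum()
df['cycle'] = start_number.where(is_start)

# Fill each start value down to the following promo rows of the same sku.
df['cycle'] = df.groupby('sku')['cycle'].ffill()
print(df)
```

No Python-level loop over skus is involved, which may help on a frame with millions of rows. Rows before the first original-price row of a sku keep NaN in the cycle column, which plays the same role as the -1 marker in the loop version.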
