Python: get a 'sales window' for each product category in pandas?

My dataframe holds sales details for many products over several years. Plotted, it looks like this:

I am trying to find the sales window for each product.

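What I have tried so far: the approach I came up with is to take the minimum, median and maximum date values for each six-month interval of every year, and to declare (minimum to median) the worst selling period and (median to maximum) the best selling window for that product. The code I use right now works on six-month intervals, but I would also like to get the result over a full year. Whichever approach works best: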

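import pandas as pd


def dater(date):
    # Map a date to a coarse label such as 'Mar Start', 'Mar Mid' or 'Mar End'
    print(date)
    if isinstance(date, float):  # empty cells are NaN, which is a float
        return '-'
    months = ['','Jan', 'Feb', 'Mar', 'Apr', 'May','Jun', 'Jul', 'Aug','Sep', 'Oct', 'Nov', 'Dec']
    # day // 10 is 0 for days 1-9, 1 for 10-19, 2 for 20-29 and 3 for 30-31
    period = ['Start', 'Mid', 'End','End']
    return months[date.month]+' '+period[date.day//10]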


def grpRes(grp):
    return pd.Series([grp.Date.min(), grp.Date.max(), grp.Amount.mean()],
        index=['start', 'end', 'value'])


best_windows = pd.DataFrame(columns = df.select_dtypes(exclude='object').columns)
for col in df.select_dtypes(exclude='object').columns:
    for year in ['2017', '2018', '2019', '2020']:
        print(f'For year {year} and category {col}')
        temp = df.loc[year,col][df[col]>=df[col].quantile(0.7)]
        print('temp created')
        if len(temp)>0:
            du = temp.reset_index().rename(columns = {'order_start_date': 'Date', col:'Amount'})
            # Start a new chunk whenever consecutive high-sales dates are more than 20 days apart
            res = du.groupby(du.Date.diff().dt.days.fillna(1, downcast='infer')
                .gt(20).cumsum()).apply(grpRes)
            res.index.name = 'chunk'
            for row in res.iterrows():
                print(row)
                best_windows.loc[year+' Window: '+str(row[0]+1)+' start',col] = row[1].start.date().strftime('%d-%m-%Y')

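I then define the windows from the values across all years, as a start range and an end range for each window. This seems like a terrible way to do it. Still, it gives me date ranges for the different years, like this: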

    2017 Window: 1 end  2017 Window: 1 start    2017 Window: 2 end  2017 Window: 2 start    2018 Window: 1 end  2018 Window: 1 start    2018 Window: 2 end  2018 Window: 2 start    2018 Window: 3 end  2018 Window: 3 start    2019 Window: 1 end  2019 Window: 1 start    2019 Window: 2 end  2019 Window: 2 start    2019 Window: 3 end  2019 Window: 3 start    2020 Window: 1 end  2020 Window: 1 start    2020 Window: 2 end  2020 Window: 2 start    2020 Window: 3 end  2020 Window: 3 start    2020 Window: 4 end  2020 Window: 4 start
B                                           31-12-2019  08-11-2019                  09-01-2020  01-01-2020  31-07-2020  11-02-2020              
D   12-06-2017  13-05-2017  14-10-2017  16-08-2017  13-06-2018  24-05-2018  20-08-2018  11-07-2018  03-11-2018  27-09-2018  10-11-2019  22-10-2019  31-12-2019  28-12-2019          31-07-2020  01-01-2020                      
H                   06-04-2018  23-03-2018  09-08-2018  27-06-2018  16-11-2018  02-11-2018  25-05-2019  21-04-2019  15-08-2019  12-07-2019  31-12-2019  30-10-2019  31-07-2020  01-01-2020                      
J   12-02-2017  15-01-2017  31-12-2017  25-10-2017  11-02-2018  01-01-2018  31-12-2018  12-10-2018          24-02-2019  01-01-2019  31-12-2019  10-10-2019          04-02-2020  01-01-2020                      
L                   08-11-2018  03-11-2018  31-12-2018  06-12-2018          07-03-2019  01-01-2019  01-05-2019  24-04-2019  31-12-2019  02-09-2019  06-03-2020  01-01-2020  19-04-2020  10-04-2020  14-05-2020  10-05-2020  31-07-2020  26-07-2020
LO  31-12-2017  06-09-2017          03-01-2018  01-01-2018  31-12-2018  23-09-2018          10-02-2019  01-01-2019  31-12-2019  25-09-2019          11-02-2020  01-01-2020                      
M   11-09-2017  15-01-2017          15-10-2018  03-07-2018                  02-05-2019  22-04-2019  24-11-2019  18-11-2019          13-05-2020  28-03-2020  23-07-2020  21-06-2020              
P   03-05-2017  21-01-2017  19-10-2017  11-08-2017  23-04-2018  31-01-2018  10-10-2018  02-08-2018          23-04-2019  23-02-2019  06-10-2019  04-09-2019          04-04-2020  29-02-2020                      
S   26-07-2017  24-03-2017          01-07-2018  25-03-2018                  01-05-2019  18-04-2019  10-08-2019  23-05-2019          31-07-2020  01-04-2020                      
SH  12-08-2017  07-05-2017          11-08-2018  05-05-2018                  10-08-2019  01-05-2019                  31-07-2020  29-04-2020                      
SK                                          31-12-2019  12-12-2019                  01-01-2020  01-01-2020  31-07-2020  24-05-2020              
SKO 26-09-2017  01-05-2017          19-09-2018  03-05-2018                  25-07-2019  09-07-2019                  31-07-2020  04-05-2020                      
SL  10-06-2017  24-05-2017          06-05-2018  06-05-2018  16-07-2018  31-05-2018          01-08-2019  12-03-2019                  31-07-2020  16-02-2020                      
U                                           17-05-2019  18-04-2019  24-06-2019  10-06-2019          01-06-2020  27-03-2020  31-07-2020  25-06-2020              
V   13-02-2017  15-01-2017  31-12-2017  14-09-2017  05-03-2018  01-01-2018  31-12-2018  25-09-2018          19-02-2019  01-01-2019  31-12-2019  22-10-2019          22-01-2020  01-01-2020

Now I can use the dater function I wrote to convert these into a month and a position within that month:

best_windows = best_windows.transpose().applymap(dater)

But this gives me a solution that spans the whole year rather than a single sales window.

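Ideally, what I want to achieve is the best-selling and worst-selling window of each product per year, so that I can say: at this time of every year, this product is popular (for example, product A sells best from the end of March to mid-June). The windows would be loosely defined by the peaks and troughs of the % sales curves shown in the chart, ideally with the transition periods as well, so that I get a better intuition for each product's sales window.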

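Data sample: my data looks like the following. Note that the values are percentages of total sales represented by each category. Suppose total sales are $10, of which product A sold $5, B $3 and C $2. Then the % values are A = 50%, B = 30%, C = 20%. Of course, this only works when there is more than one product. I have tried to include a full year of data, because that captures the seasonality in my data better; it cannot be detected in a smaller sample.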
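To make the percentage definition above concrete, here is a minimal sketch with made-up numbers. The `sales` frame and its values are hypothetical and are not the asker's data; the first row reproduces the $5 / $3 / $2 example.

```python
import pandas as pd

# Hypothetical sales amounts in dollars: one row per day, one column per product.
sales = pd.DataFrame(
    {"A": [5.0, 2.0], "B": [3.0, 2.0], "C": [2.0, 4.0]},
    index=pd.to_datetime(["2020-01-01", "2020-01-02"]),
)

# Divide each row by its row total to get every product's share of that day's sales.
share = sales.div(sales.sum(axis=1), axis=0) * 100
print(share.round(1))
```

Each row of `share` sums to 100, so a rise in one product's curve necessarily comes with a fall in the others.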

Link:

How about something like this:

import numpy as np
import pandas as pd

# using sin to generate seasonal data
period = 365 * 4
dates = pd.date_range('2016-01-01', periods=period)

np.random.seed(42)
pure = np.sin(np.linspace(6, 30, period))
noise = np.random.normal(0, 1, period)
signal = pure + 20 + noise

df = pd.DataFrame({'date': dates, 'signal': signal}).set_index('date')
df['smoothed'] = df['signal'].rolling(30).mean()

# get best/worst selling months
# rolling max/min method
threshold = 0.97
window = 320
df['best'] = df['smoothed'].where( df['smoothed'] > df['smoothed'].rolling(window).max() * threshold, other=np.nan)
df['worst'] = df['smoothed'].where( df['smoothed'] < df['smoothed'].rolling(window).min() / threshold, other=np.nan)
df.iloc[365:, 1:].plot(figsize=(14,10))
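The `best` and `worst` columns above are NaN everywhere except inside a window, so they can be collapsed into explicit start and end dates. The helper below is a sketch of one way to do that; `mask_to_windows` is a hypothetical name and is not part of the answer's code or of pandas.

```python
import numpy as np
import pandas as pd


def mask_to_windows(flagged: pd.Series) -> pd.DataFrame:
    """Collapse a date-indexed series with NaN gaps into (start, end) runs."""
    is_on = flagged.notna()
    # A new run starts wherever the flag changes; cumsum gives each run its own label.
    run_id = (is_on != is_on.shift()).cumsum()
    dates = flagged.index.to_series()
    # Keep only the flagged rows and take the first and last date of every run.
    runs = dates[is_on].groupby(run_id[is_on]).agg(start="min", end="max")
    return runs.reset_index(drop=True)


# Tiny hand-made example: two runs of non-NaN values.
idx = pd.date_range("2021-01-01", periods=10)
best = pd.Series([np.nan, 1, 1, 1, np.nan, np.nan, 2, 2, np.nan, np.nan], index=idx)
windows = mask_to_windows(best)
print(windows)
```

With the answer's dataframe, `mask_to_windows(df['best'])` would give one row per best-selling window, which can then be filtered by length or formatted as month labels.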

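The rolling max/min part isn't perfect, but it is necessary if the annual max/min varies significantly from year to year. With this method you also have to ignore the first year of data.
The next method gets around these issues by pulling the annual max/min separately first: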
# annual max/min method
threshold = 0.97
# Note: as written, these are the max/min of the whole smoothed series, broadcast to every row
df['max'], df['min'] = df['smoothed'].max(), df['smoothed'].min()
df['best'] = df['smoothed'].where( df['smoothed'] > df['max'] * threshold, other=np.nan)
df['worst'] = df['smoothed'].where( df['smoothed'] < df['min'] / threshold, other=np.nan)
df.iloc[365:, 1:-2].plot(figsize=(14,10))
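The snippet above takes a single max and min over the whole series. If the intent is a separate max/min for each calendar year, one possible variant uses `groupby(...).transform(...)`. This is a sketch under that assumption, not the answer's own code; the column names `year_max` and `year_min` are made up here, and the synthetic data is rebuilt so the block runs on its own.

```python
import numpy as np
import pandas as pd

# Same synthetic seasonal series as in the answer above.
period = 365 * 4
dates = pd.date_range("2016-01-01", periods=period)
np.random.seed(42)
signal = np.sin(np.linspace(6, 30, period)) + 20 + np.random.normal(0, 1, period)
df = pd.DataFrame({"signal": signal}, index=dates)
df["smoothed"] = df["signal"].rolling(30).mean()

threshold = 0.97
# transform() broadcasts each calendar year's max/min back onto that year's rows.
by_year = df.groupby(df.index.year)["smoothed"]
df["year_max"] = by_year.transform("max")
df["year_min"] = by_year.transform("min")

# Keep the smoothed value only where it is close to that year's extreme.
df["best"] = df["smoothed"].where(df["smoothed"] > df["year_max"] * threshold)
df["worst"] = df["smoothed"].where(df["smoothed"] < df["year_min"] / threshold)
```

One caveat of splitting on calendar years is that a season straddling 31 December is compared against two different yearly extremes.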

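I think the first thing to consider is whether you want a static model or a self-updating one.
My suggestion is to use a static model over all the data accumulated so far to get the good and bad selling windows for each product, and to use those as the recommendation for the next year, after which you can update your recommendation again.
Next, you need to decide what counts as good and what counts as bad. It could be something like this: the top 20% of values are good and the bottom 20% are bad. Let's call this the threshold percentile T.
Now to the main part. Your assumption is that every year there are fixed windows in which a product's share of sales is high (above the top T threshold) or low (below the bottom T threshold).
So first we need the average for each day of the year (instead of averaging you could also fit a regression model, which would smooth things out and make your predictions more robust).
Then, wherever the averaged/predicted sales curve crosses the T percentile we start an interval, and we stop it when the curve crosses back.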
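import numpy as np


def get_thresh_crossing_intervals(arr):
    # Sign changes of the threshold-shifted series: +2 is an upward crossing, -2 a downward one
    crossings = np.diff(np.sign(arr))
    # You might also want to wrap arrays to cover spans around end of year
    starts = np.where(crossings == 2)[0]
    ends = np.where(crossings == -2)[0]
    # Drop any downward crossing that precedes the first upward one, so the pairs line up
    if len(starts):
        ends = ends[ends > starts[0]]
    starts = starts[:len(ends)]
    return list(zip(starts, ends))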


def post_process_intervals(intervals):
    return [(p, q) for p, q in intervals if q-p>=7]


def get_col_intervals(df, col, top_thresh=0.2, bot_thresh=0.2):
    # Get quantile based thresholds
    top_qnt = df[col].quantile(1 - top_thresh)
    bot_qnt = df[col].quantile(bot_thresh)
    
    # Make threshold as zero line
    top_df = df[col] - top_qnt 
    bot_df = df[col] - bot_qnt
    
    # Get top crossings and intervals
    top_intervals = get_thresh_crossing_intervals(top_df)
    # Negate so that "below the bottom threshold" becomes the positive side
    bot_intervals = get_thresh_crossing_intervals(-bot_df)
    
    # Some post processings (e.g. only keep intervals with more than a week)
    top_intervals = post_process_intervals(top_intervals)
    bot_intervals = post_process_intervals(bot_intervals)
    
    return {'top_intervals': top_intervals, 'bot_intervals': bot_intervals}

# dfg is not defined in this answer; it is assumed to be the dataframe of
# per-day-of-year averages described above, with one column per product
product_intervals = {}
for col in ["A", "B"]:
    product_intervals[col] = get_col_intervals(dfg, col)


product_intervals

Also, we only keep intervals longer than a certain length; anything shorter is dropped.
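The code above uses a frame called `dfg` without showing how it is built. A minimal sketch of the day-of-year averaging step, assuming a daily `DatetimeIndex` and one column per product, could look like this. The data here is synthetic and `day_of_year_label` is a hypothetical helper added only for readability.

```python
import numpy as np
import pandas as pd

# Hypothetical daily % share data for two products over three non-leap years.
dates = pd.date_range("2017-01-01", "2019-12-31", freq="D")
rng = np.random.default_rng(0)
phase = 2 * np.pi * dates.dayofyear.to_numpy() / 365.25
df = pd.DataFrame(
    {
        "A": 50 + 10 * np.sin(phase) + rng.normal(0, 1, len(dates)),
        "B": 50 - 10 * np.sin(phase) + rng.normal(0, 1, len(dates)),
    },
    index=dates,
)

# Average every calendar day across years; the index becomes the day of year (1..365).
dfg = df.groupby(df.index.dayofyear).mean()


def day_of_year_label(position: int, reference_year: int = 2019) -> str:
    """Turn a 0-based row position in dfg into a label such as '21 Mar'."""
    day = int(dfg.index[position])
    date = pd.Timestamp(year=reference_year, month=1, day=1) + pd.Timedelta(days=day - 1)
    return date.strftime("%d %b")


print(dfg.head())
print(day_of_year_label(79))
```

The intervals returned by `get_col_intervals(dfg, col)` are pairs of row positions, so passing each end through a helper like `day_of_year_label` turns them into readable windows. Data that includes leap years would add a day 366 averaged over fewer years, which may need separate handling.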