Comparing sales month by month in Python?


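I have a large dataset of around 2 GB, and I need to compare the sales of each food id month by month. If the sales difference of a food id from one month to the next is 1000, that month should be flagged.

df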

  Food_Id    Month_yr       Qty   Sales
0     1       November_18     5    1920
1     2       November_18     6    2850
2     2       November_18     8    3852
3     1       November_18     6    1920
4     2       November_18     7    2650
5     1       November_18     2    3952
6     1       November_18     3    1320
7     2       November_18     8    2650
8     1       November_18     9    3152
9     1       December_18     5    1920
10    2       December_18     6    2150
11    2       December_18     8    3852
13    1       December_18     6    4920
14    2       December_18     6    3690
15    2       December_18     2    8952
16    1       December_18     7    7340
17    1       December_18     4    3650
18    2       December_18     9    8152
19    1       January_19      5    1920
20    2       January_19      6    8150
21    2       January_19      8    3852
22    1       January_19      1    3920
23    2       January_19      3    2690
24    2       January_19      2    8952
25    1       January_19      2    7340
26    1       January_19      4    5630
27    2       January_19      7    6152

Since the dataset is large, please also explain how to handle this amount of data.

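If necessary, first convert the Month_yr column to datetimes and sort by it with sort_values.

Then aggregate with sum, get the difference within each Food_Id group with groupby(...).diff(), and finally set the Flag column with np.select.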
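Why the datetime conversion matters can be seen in a small sketch. The three-element `months` series below is a hypothetical miniature of the Month_yr column, and the `%B` directive assumes English month names in the active locale.

```python
import pandas as pd

# Hypothetical miniature of the Month_yr column, deliberately out of order.
months = pd.Series(['January_19', 'November_18', 'December_18'])

# Plain string sorting is alphabetical, which is not chronological.
alphabetical = sorted(months)
print(alphabetical)

# '%B_%y' means full month name, underscore, two-digit year.
dates = pd.to_datetime(months, format='%B_%y')

# Reorder the original labels by their parsed dates.
chronological = months.loc[dates.sort_values().index].tolist()
print(chronological)
```

Sorting the raw strings would put December_18 before November_18, so a month-over-month diff would compare the wrong pairs of months.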


Have you tried anything? Please share.

Expected output
   Food_Id    Month_yr      Sales   diff_frm_lst_month  Flag
0    1        November_18   12264      Null             Null
1    2        November_18   12002      Null             Null
2    1        December_18   17830      5566            more than 1000
3    2        December_18   26794      14792           more than 1000
4    1        January_18    18800      970             less than 1000
5    2        January_18    29796      3002            more than 1000
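import numpy as np
import pandas as pd

# Parse Month_yr (e.g. 'November_18') so that the months sort chronologically
df['date'] = pd.to_datetime(df['Month_yr'], format='%B_%y')
df = df.sort_values(['date'])
# Total sales per food id and month; sort=False keeps the chronological order
df = df.groupby(['Food_Id','Month_yr'], sort=False, as_index=False)['Sales'].sum()
# Change in sales from the previous month, computed within each food id
df['diff_frm_lst_month'] = df.groupby('Food_Id')['Sales'].diff()

masks = [df['diff_frm_lst_month'] > 1000, df['diff_frm_lst_month'] < 1000]
vals = ['more than 1000','less than 1000']
# default=None leaves rows matching neither mask unflagged (the first month,
# or a difference of exactly 1000); recent NumPy versions can raise a
# TypeError when np.nan is used as the default alongside string choices
df['Flag'] = np.select(masks, vals, default=None)

print (df)
   Food_Id     Month_yr  Sales  diff_frm_lst_month            Flag
0        1  November_18  12264                 NaN            None
1        2  November_18  12002                 NaN            None
2        1  December_18  17830              5566.0  more than 1000
3        2  December_18  26796             14794.0  more than 1000
4        1   January_19  18810               980.0  less than 1000
5        2   January_19  29796              3000.0  more than 1000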
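
On the question of data size, one possible approach (not part of the answer above) is to read the file in chunks and aggregate each chunk before combining. The sketch below assumes the data lives in a CSV file with the same column names. The in-memory `csv_data` buffer and the tiny `chunksize=2` are stand-ins for a real file path and a much larger chunk size.

```python
import io
import pandas as pd

# Stand-in for the large file; in practice this would be a path on disk.
csv_data = io.StringIO(
    "Food_Id,Month_yr,Qty,Sales\n"
    "1,November_18,5,1920\n"
    "2,November_18,6,2850\n"
    "1,November_18,6,1920\n"
    "1,December_18,5,1920\n"
    "2,December_18,6,2150\n"
    "1,December_18,6,4920\n"
)

partials = []
# chunksize makes read_csv yield DataFrames of at most that many rows;
# usecols skips columns that the comparison does not need.
reader = pd.read_csv(csv_data, usecols=['Food_Id', 'Month_yr', 'Sales'], chunksize=2)
for chunk in reader:
    partials.append(chunk.groupby(['Food_Id', 'Month_yr'])['Sales'].sum())

# A group can be split across chunks, so the partial sums are summed again.
totals = (
    pd.concat(partials)
    .groupby(level=['Food_Id', 'Month_yr'])
    .sum()
    .reset_index()
)

# The aggregated frame is small, so the remaining steps run in memory.
totals['date'] = pd.to_datetime(totals['Month_yr'], format='%B_%y')
totals = totals.sort_values(['Food_Id', 'date'])
totals['diff_frm_lst_month'] = totals.groupby('Food_Id')['Sales'].diff()
print(totals)
```

Only one chunk plus the running partial sums is held in memory at a time, and the final table has one row per food id and month regardless of how large the raw file is.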