Compare sales month by month in Python?
Tags: python, python-3.x, pandas, python-2.7

I have a large dataset of around 2 GB. What I need to do is compare the sales for each food id month by month. If the sales for a food id differ by 1000 from one month to the next, flag that month.

df:
Food_Id Month_yr Qty Sales
0 1 November_18 5 1920
1 2 November_18 6 2850
2 2 November_18 8 3852
3 1 November_18 6 1920
4 2 November_18 7 2650
5 1 November_18 2 3952
6 1 November_18 3 1320
7 2 November_18 8 2650
8 1 November_18 9 3152
9 1 December_18 5 1920
10 2 December_18 6 2150
11 2 December_18 8 3852
13 1 December_18 6 4920
14 2 December_18 6 3690
15 2 December_18 2 8952
16 1 December_18 7 7340
17 1 December_18 4 3650
18 2 December_18 9 8152
19 1 January_19 5 1920
20 2 January_19 6 8150
21 2 January_19 8 3852
22 1 January_19 1 3920
23 2 January_19 3 2690
24 2 January_19 2 8952
25 1 January_19 2 7340
26 1 January_19 4 5630
27 2 January_19 7 6152
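Since the data is large, please also explain how to handle a large amount of data.

Answer: If necessary, first convert the Month_yr column to datetimes and sort by it with sort_values. Then aggregate with sum, get the difference per group with groupby(...).diff(), and finally set the Flag column with np.select. The complete code and its output are listed at the end.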
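As a self-contained sketch of the steps just described (parse the month, sort, aggregate, take the per-group difference, flag), the following runs on a small made-up subset of the question's rows. The sample values, the name `monthly`, and the choice to set the flag with boolean indexing instead of `np.select` are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

# Small illustrative sample (one row per food id and month, values taken from the question).
df = pd.DataFrame({
    "Food_Id":  [1, 2, 1, 2, 1, 2],
    "Month_yr": ["November_18", "November_18", "December_18",
                 "December_18", "January_19", "January_19"],
    "Sales":    [1920, 2850, 4920, 2150, 5630, 8150],
})

# 1. Parse labels such as 'November_18' into real dates so months sort chronologically.
df["date"] = pd.to_datetime(df["Month_yr"], format="%B_%y")

# 2. Total sales per food id and month, ordered by id and then by date.
monthly = (
    df.groupby(["Food_Id", "date", "Month_yr"], as_index=False)["Sales"]
      .sum()
      .sort_values(["Food_Id", "date"], ignore_index=True)
)

# 3. Change versus the previous month within each food id (NaN for the first month).
monthly["diff_frm_lst_month"] = monthly.groupby("Food_Id")["Sales"].diff()

# 4. Flag the change; rows without a previous month keep a missing flag.
diff = monthly["diff_frm_lst_month"]
monthly["Flag"] = None
monthly.loc[diff > 1000, "Flag"] = "more than 1000"
monthly.loc[diff < 1000, "Flag"] = "less than 1000"

print(monthly)
```

Note that a difference of exactly 1000 matches neither condition and stays unflagged; change one comparison to `>=` or `<=` if that case should be covered.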
Have you tried anything? If so, please share it.
Expected output (from the question):
Food_Id Month_yr Sales diff_frm_lst_month Flag
0 1 November_18 12264 Null Null
1 2 November_18 12002 Null Null
2 1 December_18 17830 5566 more than 1000
3 2 December_18 26796 14794 more than 1000
4 1 January_19 18810 980 less than 1000
5 2 January_19 29796 3000 more than 1000
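import numpy as np
import pandas as pd

# Parse labels such as 'November_18' into dates and sort chronologically
df['date'] = pd.to_datetime(df['Month_yr'], format='%B_%y')
df = df.sort_values(['date'])
# Total sales per food id and month; sort=False keeps the chronological order
df = df.groupby(['Food_Id','Month_yr'], sort=False, as_index=False)['Sales'].sum()
# Difference from the previous month within each food id (NaN for the first month)
df['diff_frm_lst_month'] = df.groupby('Food_Id')['Sales'].diff()
# Label each difference; rows matching neither mask get the default.
# Newer NumPy versions may reject a float default with string choices; default=None is a possible alternative.
masks = [df['diff_frm_lst_month'] > 1000, df['diff_frm_lst_month'] < 1000]
vals = ['more than 1000','less than 1000']
df['Flag'] = np.select(masks, vals, np.nan)
print(df)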
Food_Id Month_yr Sales diff_frm_lst_month Flag
0 1 November_18 12264 NaN nan
1 2 November_18 12002 NaN nan
2 1 December_18 17830 5566.0 more than 1000
3 2 December_18 26796 14794.0 more than 1000
4 1 January_19 18810 980.0 less than 1000
5 2 January_19 29796 3000.0 more than 1000
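On the question of data size: the answer above does not address it, so here is one possible approach, sketched under the assumption that the 2 GB dataset is a CSV file (the question does not say how it is stored). The idea is to read the file in chunks with `pd.read_csv(..., chunksize=...)`, aggregate each chunk, and then combine the partial sums. The in-memory `csv_data` stands in for a real file path, and the chunk size of 4 is chosen only to make the example split a month across two chunks.

```python
import io

import pandas as pd

# Stand-in for the large file; with real data, pass a file path instead.
# The CSV layout is an assumption made for this sketch.
csv_data = io.StringIO(
    "Food_Id,Month_yr,Qty,Sales\n"
    "1,November_18,5,1920\n"
    "2,November_18,6,2850\n"
    "1,November_18,6,1920\n"
    "1,December_18,5,1920\n"
    "2,December_18,6,2150\n"
    "1,December_18,6,4920\n"
)

partials = []
# Read only the needed columns, a few rows at a time
# (for real data a much larger chunksize, e.g. 1_000_000, is more sensible).
for chunk in pd.read_csv(csv_data,
                         usecols=["Food_Id", "Month_yr", "Sales"],
                         chunksize=4):
    partials.append(chunk.groupby(["Food_Id", "Month_yr"])["Sales"].sum())

# A month can span several chunks, so sum the partial sums once more.
monthly = (
    pd.concat(partials)
      .groupby(level=["Food_Id", "Month_yr"])
      .sum()
      .reset_index()
)

print(monthly)
```

The combined `monthly` frame has one row per food id and month, so it is small regardless of the input size, and the datetime conversion, `diff` and flagging steps can then be applied to it in memory.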