Python pandas: adding missing dates to a DataFrame while keeping column/index values?
I have a pandas DataFrame that contains the date, customer, item, and then the dollar value of the purchase:
date customer product amt
1/1/2017 tim apple 3
1/1/2017 jim melon 2
1/1/2017 tom apple 5
1/1/2017 tom melon 4
1/4/2017 tim melon 3
1/4/2017 jim apple 2
1/4/2017 tom melon 1
1/4/2017 tom orange 4
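I'm just trying to look at performance, but first I want to fill in every date between the minimum and maximum date, and do so for every customer and every product.
Like this: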
date customer product amt
1/1/2017 tim apple 3
1/1/2017 tim melon 0
1/1/2017 tim orange 0
1/1/2017 jim melon 2
1/1/2017 jim apple 0
1/1/2017 jim orange 0
1/1/2017 tom apple 5
1/1/2017 tom melon 4
1/1/2017 tom orange 0
1/2/2017 tim apple 0
1/2/2017 tim melon 0
1/2/2017 tim orange 0
1/2/2017 jim melon 0
1/2/2017 jim apple 0
1/2/2017 jim orange 0
1/2/2017 tom apple 0
1/2/2017 tom melon 0
1/2/2017 tom orange 0
1/3/2017 tim apple 0
1/3/2017 tim melon 0
1/3/2017 tim orange 0
1/3/2017 jim melon 0
1/3/2017 jim apple 0
1/3/2017 jim orange 0
1/3/2017 tom apple 0
1/3/2017 tom melon 0
1/3/2017 tom orange 0
1/4/2017 tim melon 3
1/4/2017 tim apple 0
1/4/2017 tim orange 0
1/4/2017 jim apple 2
1/4/2017 jim melon 0
1/4/2017 jim orange 0
1/4/2017 tom melon 1
1/4/2017 tom orange 4
1/4/2017 tom apple 0
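I know I can reindex based on the max and min dates, but that would also set my customer and product values to 0. Is there another way to do this? Am I missing a step? Thanks for any help.
You can do it this way (this assumes df['date'] has already been converted to datetime):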
In [63]: dates = pd.date_range(df['date'].min(), df['date'].max())
In [64]: idx = pd.MultiIndex.from_product((dates,
    ...:                                   df['customer'].unique(),
    ...:                                   df['product'].unique()))
In [72]: (df.set_index(['date','customer','product'])
    ...:    .reindex(idx, fill_value=0)
    ...:    .reset_index()
    ...:    # inplace was removed from set_axis in pandas 2.0; a new frame is returned
    ...:    .set_axis(df.columns, axis=1))
Out[72]:
date customer product amt
0 2017-01-01 tim apple 3
1 2017-01-01 tim melon 0
2 2017-01-01 tim orange 0
3 2017-01-01 jim apple 0
4 2017-01-01 jim melon 2
5 2017-01-01 jim orange 0
6 2017-01-01 tom apple 5
7 2017-01-01 tom melon 4
8 2017-01-01 tom orange 0
9 2017-01-02 tim apple 0
.. ... ... ... ...
26 2017-01-03 tom orange 0
27 2017-01-04 tim apple 0
28 2017-01-04 tim melon 3
29 2017-01-04 tim orange 0
30 2017-01-04 jim apple 2
31 2017-01-04 jim melon 0
32 2017-01-04 jim orange 0
33 2017-01-04 tom apple 0
34 2017-01-04 tom melon 1
35 2017-01-04 tom orange 4
[36 rows x 4 columns]
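For anyone who wants to run the reindex idea end to end, here is a minimal self-contained sketch. It rebuilds the sample data from the question and, as a small variation on the answer above, names the MultiIndex levels so that reset_index() restores the column names without a set_axis call. The variable names (keys, full_index, filled) are my own and not part of the original answer.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "date": ["1/1/2017"] * 4 + ["1/4/2017"] * 4,
    "customer": ["tim", "jim", "tom", "tom", "tim", "jim", "tom", "tom"],
    "product": ["apple", "melon", "apple", "melon",
                "melon", "apple", "melon", "orange"],
    "amt": [3, 2, 5, 4, 3, 2, 1, 4],
})
# The dates must be real datetimes so they match the date_range labels
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")

keys = ["date", "customer", "product"]

# Every (date, customer, product) combination between the min and max date
full_index = pd.MultiIndex.from_product(
    [
        pd.date_range(df["date"].min(), df["date"].max(), freq="D"),
        df["customer"].unique(),
        df["product"].unique(),
    ],
    names=keys,
)

# Existing rows keep their amt; combinations that were missing get 0
filled = df.set_index(keys).reindex(full_index, fill_value=0).reset_index()
print(filled)
```

Note that reindex needs the (date, customer, product) keys to be unique in df, which holds for the sample data.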
Note that this requires several rounds of stack and unstack:
df.set_index(['date','customer','product']).amt.unstack(-3).\
    reindex(columns=pd.date_range(df['date'].min(),
                                  df['date'].max()), fill_value=0).\
    stack(dropna=False).unstack().stack(dropna=False).\
    unstack('customer').stack(dropna=False).reset_index().\
    fillna(0).sort_values(['level_1','customer','product'])
Out[314]:
product level_1 customer 0
0 apple 2017-01-01 jim 0.0
12 melon 2017-01-01 jim 2.0
24 orange 2017-01-01 jim 0.0
1 apple 2017-01-01 tim 3.0
13 melon 2017-01-01 tim 0.0
25 orange 2017-01-01 tim 0.0
2 apple 2017-01-01 tom 5.0
14 melon 2017-01-01 tom 4.0
26 orange 2017-01-01 tom 0.0
3 apple 2017-01-02 jim 0.0
15 melon 2017-01-02 jim 0.0
27 orange 2017-01-02 jim 0.0
4 apple 2017-01-02 tim 0.0
16 melon 2017-01-02 tim 0.0
28 orange 2017-01-02 tim 0.0
5 apple 2017-01-02 tom 0.0
17 melon 2017-01-02 tom 0.0
29 orange 2017-01-02 tom 0.0
6 apple 2017-01-03 jim 0.0
18 melon 2017-01-03 jim 0.0
30 orange 2017-01-03 jim 0.0
7 apple 2017-01-03 tim 0.0
19 melon 2017-01-03 tim 0.0
31 orange 2017-01-03 tim 0.0
8 apple 2017-01-03 tom 0.0
20 melon 2017-01-03 tom 0.0
32 orange 2017-01-03 tom 0.0
9 apple 2017-01-04 jim 2.0
21 melon 2017-01-04 jim 0.0
33 orange 2017-01-04 jim 0.0
10 apple 2017-01-04 tim 0.0
22 melon 2017-01-04 tim 3.0
34 orange 2017-01-04 tim 0.0
11 apple 2017-01-04 tom 0.0
23 melon 2017-01-04 tom 1.0
35 orange 2017-01-04 tom 4.0
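To see what the first unstack/reindex step of that chain does on its own, here is a small sketch using the question's data. It is not the full solution: it only fills in the missing dates, and the remaining customer/product combinations still need the later stack/unstack rounds. Passing fill_value=0 to unstack is my own variation, and the names s, wide and all_days are assumed for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-01"] * 4 + ["2017-01-04"] * 4),
    "customer": ["tim", "jim", "tom", "tom", "tim", "jim", "tom", "tom"],
    "product": ["apple", "melon", "apple", "melon",
                "melon", "apple", "melon", "orange"],
    "amt": [3, 2, 5, 4, 3, 2, 1, 4],
})

s = df.set_index(["date", "customer", "product"])["amt"]

# Move the dates into the columns; cells with no purchase become 0, not NaN
wide = s.unstack("date", fill_value=0)

# Add the calendar days that never appeared in the data
all_days = pd.date_range(df["date"].min(), df["date"].max(), freq="D")
wide = wide.reindex(columns=all_days, fill_value=0)
print(wide)
```

Because no NaN is introduced at this step, amt stays an integer here instead of turning into the floats shown in the output above.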
Perhaps because of my SQL mindset, consider a left-join merge on an expanded helper DataFrame:
helper_df_list = [pd.DataFrame({'date': pd.date_range(df['date'].min(), df['date'].max()),
                                'customer': c, 'product': p})
                  for c in df['customer'].unique()
                  for p in df['product'].unique()]
helper_df = pd.concat(helper_df_list, ignore_index=True)
final_df = pd.merge(helper_df, df, on=['date', 'customer', 'product'], how='left')\
             .fillna(0).sort_values(['date', 'customer']).reset_index(drop=True)
Output
print(final_df)
# customer date product amt
# 0 jim 2017-01-01 apple 0.0
# 1 jim 2017-01-01 melon 2.0
# 2 jim 2017-01-01 orange 0.0
# 3 tim 2017-01-01 apple 3.0
# 4 tim 2017-01-01 melon 0.0
# 5 tim 2017-01-01 orange 0.0
# 6 tom 2017-01-01 apple 5.0
# 7 tom 2017-01-01 melon 4.0
# 8 tom 2017-01-01 orange 0.0
# 9 jim 2017-01-02 apple 0.0
# 10 jim 2017-01-02 melon 0.0
# 11 jim 2017-01-02 orange 0.0
# 12 tim 2017-01-02 apple 0.0
# 13 tim 2017-01-02 melon 0.0
# 14 tim 2017-01-02 orange 0.0
# 15 tom 2017-01-02 apple 0.0
# 16 tom 2017-01-02 melon 0.0
# 17 tom 2017-01-02 orange 0.0
# 18 jim 2017-01-03 apple 0.0
# 19 jim 2017-01-03 melon 0.0
# 20 jim 2017-01-03 orange 0.0
# 21 tim 2017-01-03 apple 0.0
# 22 tim 2017-01-03 melon 0.0
# 23 tim 2017-01-03 orange 0.0
# 24 tom 2017-01-03 apple 0.0
# 25 tom 2017-01-03 melon 0.0
# 26 tom 2017-01-03 orange 0.0
# 27 jim 2017-01-04 apple 2.0
# 28 jim 2017-01-04 melon 0.0
# 29 jim 2017-01-04 orange 0.0
# 30 tim 2017-01-04 apple 0.0
# 31 tim 2017-01-04 melon 3.0
# 32 tim 2017-01-04 orange 0.0
# 33 tom 2017-01-04 apple 0.0
# 34 tom 2017-01-04 melon 1.0
# 35 tom 2017-01-04 orange 4.0
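In the same SQL spirit, the helper frame can also be built as a cross join. The sketch below assumes pandas 1.2 or later, where merge accepts how="cross". It also casts amt back to integers after the left join. The names dates, customers, products and helper are mine.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-01"] * 4 + ["2017-01-04"] * 4),
    "customer": ["tim", "jim", "tom", "tom", "tim", "jim", "tom", "tom"],
    "product": ["apple", "melon", "apple", "melon",
                "melon", "apple", "melon", "orange"],
    "amt": [3, 2, 5, 4, 3, 2, 1, 4],
})

# One single-column frame per key
dates = pd.DataFrame(
    {"date": pd.date_range(df["date"].min(), df["date"].max(), freq="D")}
)
customers = pd.DataFrame({"customer": df["customer"].unique()})
products = pd.DataFrame({"product": df["product"].unique()})

# Cross joins produce every (date, customer, product) combination
helper = dates.merge(customers, how="cross").merge(products, how="cross")

# Left join keeps every helper row; unmatched rows get NaN in amt
final = helper.merge(df, on=["date", "customer", "product"], how="left")
final["amt"] = final["amt"].fillna(0).astype(int)
print(final)
```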
Let's use product from itertools, pd.date_range, and merge:
from itertools import product
import pandas as pd

daterange = pd.date_range(df['date'].min(), df['date'].max(), freq='D')
# Every (date, customer, product) combination as a helper frame
d1 = pd.DataFrame(list(product(daterange,
                               df['customer'].unique(),
                               df['product'].unique())),
                  columns=['date', 'customer', 'product'])
d1.merge(df, on=['date', 'customer', 'product'], how='left').fillna(0)
Output:
date customer product amt
0 2017-01-01 tim apple 3.0
1 2017-01-01 tim melon 0.0
2 2017-01-01 tim orange 0.0
3 2017-01-01 jim apple 0.0
4 2017-01-01 jim melon 2.0
5 2017-01-01 jim orange 0.0
6 2017-01-01 tom apple 5.0
7 2017-01-01 tom melon 4.0
8 2017-01-01 tom orange 0.0
9 2017-01-02 tim apple 0.0
10 2017-01-02 tim melon 0.0
11 2017-01-02 tim orange 0.0
12 2017-01-02 jim apple 0.0
13 2017-01-02 jim melon 0.0
14 2017-01-02 jim orange 0.0
15 2017-01-02 tom apple 0.0
16 2017-01-02 tom melon 0.0
17 2017-01-02 tom orange 0.0
18 2017-01-03 tim apple 0.0
19 2017-01-03 tim melon 0.0
20 2017-01-03 tim orange 0.0
21 2017-01-03 jim apple 0.0
22 2017-01-03 jim melon 0.0
23 2017-01-03 jim orange 0.0
24 2017-01-03 tom apple 0.0
25 2017-01-03 tom melon 0.0
26 2017-01-03 tom orange 0.0
27 2017-01-04 tim apple 0.0
28 2017-01-04 tim melon 3.0
29 2017-01-04 tim orange 0.0
30 2017-01-04 jim apple 2.0
31 2017-01-04 jim melon 0.0
32 2017-01-04 jim orange 0.0
33 2017-01-04 tom apple 0.0
34 2017-01-04 tom melon 1.0
35 2017-01-04 tom orange 4.0
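One caveat worth sketching: if the same (date, customer, product) key can occur more than once, a left merge returns one row per matching purchase, and the reindex approach needs unique keys. Assuming one total per key is what you want, you could aggregate first. The duplicated row in this example is made up purely for illustration.

```python
import pandas as pd

# Hypothetical data: tim bought apples twice on the same day
df = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-01", "2017-01-01", "2017-01-03"]),
    "customer": ["tim", "tim", "jim"],
    "product": ["apple", "apple", "melon"],
    "amt": [3, 4, 2],
})
keys = ["date", "customer", "product"]

# Collapse duplicate keys into one total per (date, customer, product)
totals = df.groupby(keys, as_index=False)["amt"].sum()

# Full grid of combinations as an ordinary DataFrame
grid = pd.MultiIndex.from_product(
    [
        pd.date_range(df["date"].min(), df["date"].max(), freq="D"),
        df["customer"].unique(),
        df["product"].unique(),
    ],
    names=keys,
).to_frame(index=False)

# Left join onto the grid; combinations with no purchase get 0
filled = grid.merge(totals, on=keys, how="left").fillna({"amt": 0})
print(filled)
```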
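You want to use the ffill method from the answer
But that doesn't fill in the dates? And it doesn't keep the customer and product columns
@DickThompson What kind of error did you get when you ran it?
@DickThompson Sorry, that was a typo; change it to \ and you will get the result
OK. What would I do if I had another classifier column, like payment type? Just add it to the index?
@DickThompson Yes, you may need another round of stack and unstack
This is what I did: df.set_index(['date','customer','product','type']).amt.unstack(-4).\reindex(columns=pd.date_range(df['date'].min(),df['date'].max()),fill_value=0).\stack(dropna=False).unstack().stack(dropna=False).\unstack('customer').stack(dropna=False).reset_index().\unstack('product').stack(dropna=False).reset_index().\fillna(0).sort_values(['level_1','customer','product','type']), but I got a stacking error. Is there anything else I need to add?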
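On the extra column from the last comment: a likely cause of the error is that unstack('product') is called after reset_index(), at which point product is an ordinary column and no longer an index level. One way to sidestep the extra stack/unstack round is the MultiIndex.from_product plus reindex approach, which extends to any number of key columns. This is only a sketch. The type values ("cash", "card") are hypothetical, and it again assumes the key combinations are unique.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-01", "2017-01-01", "2017-01-04"]),
    "customer": ["tim", "jim", "tim"],
    "product": ["apple", "melon", "melon"],
    "type": ["cash", "card", "cash"],  # hypothetical payment types
    "amt": [3, 2, 3],
})

keys = ["date", "customer", "product", "type"]

# Daily dates first, then the unique values of every other key column
levels = [pd.date_range(df["date"].min(), df["date"].max(), freq="D")]
levels += [df[k].unique() for k in keys[1:]]

full_index = pd.MultiIndex.from_product(levels, names=keys)

# Missing combinations are added with amt = 0
filled = df.set_index(keys).reindex(full_index, fill_value=0).reset_index()
print(filled)
```

Adding another classifier column then only means adding its name to keys.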