Pandas: reindex to fill missing dates, or is there a better way to fill them?

python python-3.x pandas


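My data is absentee records from a factory. On some days there are no absences, so neither data nor a date is recorded for that day. Where this gets hairy compared with the other examples I have seen is that any given day can have several absences for various reasons, so the ratio of dates to records in the data is not always 1:1.

The result I am hoping for looks like this: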

(index)    Shift        Description     Instances (SUM)
01-01-14   2nd Baker    Discipline      0
01-01-14   2nd Baker    Vacation        0
01-01-14   1st Cooks    Discipline      0
01-01-14   1st Cooks    Vacation        0
01-02-14   2nd Baker    Discipline      4
01-02-14   2nd Baker    Vacation        3
01-02-14   1st Cooks    Discipline      3
01-02-14   1st Cooks    Vacation        3

And so on. The idea is that every shift and description has a value for every day in the period (in this example, 1 January 2014 to 31 December 2014).

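I have read several examples. In the closest I have come to making this work, uncommenting the line `ts = ts.reindex(idx, fill_value='NaN')` raises an error. I have tried at least 10 other ways of doing this, so I am not 100% sure this is the right path, but it seems to have brought me closest to any kind of progress.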
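The question does not quote the error message. One plausible cause, given that several records share the same date, is that `reindex` refuses to work on an index with duplicate labels. The sketch below reproduces that situation on a made-up three-row frame; the data and the name `error` are illustrative assumptions, not taken from the question.

```python
import pandas as pd

# Hypothetical miniature of the data: two records share the date 2014-01-02.
ts = pd.DataFrame(
    {"Instances": [1, 2, 2]},
    index=pd.to_datetime(["2014-01-02", "2014-01-02", "2014-04-08"]),
)
idx = pd.date_range("2014-01-01", "2014-12-31")

error = None
try:
    ts.reindex(idx, fill_value=0)
except ValueError as exc:
    # pandas cannot tell which of the duplicated rows a label refers to.
    error = exc

print(type(error).__name__, error)
```

If this is the cause, the dates have to be made unique first (for example by aggregating), or a join has to be used instead of `reindex`.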

Here is some sample data:

Description Unexcused   Instances   Date        Shift
Discipline  FALSE              1    Jan 2 2014  2nd Baker
Vacation    TRUE               2    Jan 2 2014  1st Cooks
Discipline  FALSE              3    Jan 2 2014  2nd Baker
Vacation    TRUE               1    Jan 2 2014  1st Cooks
Discipline  FALSE              2    Apr 8 2014  2nd Baker
Vacation    TRUE               3    Apr 8 2014  1st Cooks
Discipline  FALSE              1    Jun 1 2014  2nd Baker
Vacation    TRUE               2    Jun 1 2014  1st Cooks
Discipline  FALSE              3    Jun 1 2014  2nd Baker
Vacation    TRUE               1    Jun 1 2014  1st Cooks
Vacation    TRUE               2    Jul 5 2014  1st Cooks
Discipline  FALSE              3    Jul 5 2014  2nd Baker
Vacation    TRUE               2    Dec 3 2014  1st Cooks

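Thanks in advance for your help. I am a newbie and have made little progress in 2 days. I really appreciate how people here not only answer questions but, above all, explain the approach to solving them. Newbies like me are very grateful for the wisdom everyone shares.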

I think you are just having trouble with the datetime handling. This approach works for me:

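# Assumes `ts` was loaded with 'Date' as an ordinary column of strings such as 'Jan 2 2014'.
ts.set_index(['Date'], inplace=True)
ts.index = pd.to_datetime(ts.index, format='%b %d %Y')
# One (empty) row per calendar day of 2014.
d2 = pd.DataFrame(index=pd.date_range('2014-01-01', '2014-12-31'))

# A right join keeps every calendar day; days without records show up as NaN rows.
print(ts.join(d2, how='right'))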
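To see what this join produces without the CSV file, here is a minimal self-contained sketch on a few rows modelled on the sample data. The names `records`, `calendar` and `joined`, and the shortened five-day range, are assumptions made for illustration.

```python
import pandas as pd

# A few rows shaped like the sample data in the question.
records = pd.DataFrame({
    "Description": ["Discipline", "Vacation", "Discipline"],
    "Instances": [1, 2, 2],
    "Date": ["Jan 2 2014", "Jan 2 2014", "Apr 8 2014"],
    "Shift": ["2nd Baker", "1st Cooks", "2nd Baker"],
})

ts = records.set_index("Date")
ts.index = pd.to_datetime(ts.index, format="%b %d %Y")

# Empty frame with one row per day of a short period.
calendar = pd.DataFrame(index=pd.date_range("2014-01-01", "2014-01-05"))

# Right join: every calendar day survives, and both 2 January records are kept.
joined = ts.join(calendar, how="right")
print(joined)
```

Days without records come back as rows of NaN. Note that this adds one empty row per missing day, not one row per shift and description, so it only covers part of the output requested in the question.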

Actually, you were already very close to what you want (assuming I understood your desired output correctly). See my additions to your code:

import pandas as pd

# Column 3 of the CSV holds the dates; parse it and use it as the index.
ts = pd.read_csv('Absentee_Data_2.csv', encoding='utf-8', parse_dates=[3], index_col=3, dayfirst=True, sep=",")

# Every calendar day in the period of interest.
idx = pd.date_range('01.01.2009', '12.31.2017')

ts.index = pd.DatetimeIndex(ts.index)
#ts = ts.reindex(idx, fill_value='NaN')
# Left-join the records onto the full calendar: days without records become all-NaN rows.
df = pd.DataFrame(index=idx)
df1 = df.join(ts, how='left')
df2 = df1.copy()
df3 = df1.copy()
df4 = df1.copy()
# Fill the NaN rows of each copy with one Shift/Description combination and 0 instances.
dict1 = {'Description': 'Discipline', 'Instances': 0, 'Shift': '1st Cooks'}
df1 = df1.fillna(dict1)
dict1["Description"] = "Vacation"
df2 = df2.fillna(dict1)
dict1["Shift"] = "2nd Baker"
df3 = df3.fillna(dict1)
dict1["Description"] = "Discipline"
df4 = df4.fillna(dict1)
# Stack the copies, drop the real records that now appear four times, then drop "Unexcused".
df_with_duplicates = pd.concat([df1, df2, df3, df4])
final_res = (
    df_with_duplicates.reset_index()
    .drop_duplicates(subset=["index"] + list(dict1.keys()))
    .set_index("index")
    .drop("Unexcused", axis=1)
)

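Basically, what has been added is:

Make 4 copies of the almost empty DataFrame created from `ts` (`df1`).
`fillna(dict1)` fills all the NaN values in each column with a static value.
Concatenate the 4 DataFrames. Some duplicates still have to be removed, because the original values from the CSV are now repeated 4 times.
Drop the duplicates. The index is needed to keep the added values apart, hence `reset_index` followed by `set_index("index")`.
Finally, drop the `Unexcused` column.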
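As an alternative to the copy-and-fill steps above, the layout requested in the question (one summed row per day, shift and description) can also be sketched with a `groupby` followed by a `reindex` over the full product of dates, shifts and descriptions. This is not the method used in either answer; the frame `records`, the names `totals`, `full_index` and `result`, and the three-day range are assumptions made for illustration.

```python
import pandas as pd

# Rows shaped like the sample data, all on 2 January 2014.
records = pd.DataFrame({
    "Description": ["Discipline", "Vacation", "Discipline", "Vacation"],
    "Instances": [1, 2, 3, 1],
    "Date": pd.to_datetime(["2014-01-02"] * 4),
    "Shift": ["2nd Baker", "1st Cooks", "2nd Baker", "1st Cooks"],
})

# Sum instances so that each (date, shift, description) appears once.
totals = records.groupby(["Date", "Shift", "Description"])["Instances"].sum()

# Every day x shift x description combination in the period.
full_index = pd.MultiIndex.from_product(
    [
        pd.date_range("2014-01-01", "2014-01-03"),
        records["Shift"].unique(),
        records["Description"].unique(),
    ],
    names=["Date", "Shift", "Description"],
)

# The grouped index is unique, so reindex works; missing combinations become 0.
result = totals.reindex(full_index, fill_value=0).reset_index()
print(result)
```

Because the sums are computed before reindexing, the index has no duplicate labels, and `fill_value=0` supplies the zero rows directly.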

Finally, some output:

In [5]: final_res["2013-01-2"]
Out[5]: 
           Description  Instances      Shift
index                                       
2013-01-02  Discipline        0.0  1st Cooks
2013-01-02    Vacation        0.0  1st Cooks
2013-01-02    Vacation        0.0  2nd Baker
2013-01-02  Discipline        0.0  2nd Baker

In [6]: final_res["2014-01-2"]
Out[6]: 
           Description  Instances       Shift
index                                        
2014-01-02  Discipline        1.0   2nd Baker
2014-01-02    Vacation        2.0   1st Cooks
2014-01-02  Discipline        3.0   2nd Baker
2014-01-02    Vacation        1.0   1st Cooks

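I tried this solution, but I keep getting the following error: "TypeError: can only concatenate list (not "dict_keys") to list" on this line of code: `final_res = (df_with_duplicates.reset_index().drop_duplicates(subset=["index"] + dict1.keys()).set_index("index").drop("Unexcused", axis=1))`. Any suggestions? Thank you, and thanks for the explanation too :)

@SDS My bad, I made a small typo: you need to convert the keys of `dict1` to a list, so it should be `subset=["index"] + list(dict1.keys())`. I have edited my post.

@SDS If you think an answer has been provided, please mark it as accepted. It helps focus attention on unanswered questions. If the answer did not help, could you give some feedback on what is missing?

Both answers work, but this one was easier for me to understand and to adapt to my real data. I did need to do some further manipulation and thinking, but this is ultimately the answer I used.

Same for me! I have a "smeidum" DataFrame with 900,000 rows and wanted to add the missing dates before pivoting. Thanks.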
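The `TypeError` discussed in these comments can be reproduced without pandas. In Python 3, `dict.keys()` returns a view object rather than a list, so it cannot be concatenated to a list with `+`. A minimal sketch, where the variable `message` is only there to capture the error text:

```python
dict1 = {"Description": "Discipline", "Instances": 0, "Shift": "1st Cooks"}

message = None
try:
    # Fails in Python 3: list + dict_keys is not defined.
    subset = ["index"] + dict1.keys()
except TypeError as exc:
    message = str(exc)

# Converting the view to a list first makes the concatenation valid.
subset = ["index"] + list(dict1.keys())
print(message)
print(subset)
```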
