Python: How to fill missing values in a DataFrame using another DataFrame

python pandas

My df looks like this:

sprint   sprint_created
------   -----------
S100     2020-01-01    
S101     2020-01-10
NULL     2020-01-20
NULL     2020-01-31
S101     2020-01-10
...

In the df above, you can see that some sprint values are NULL.

I have another DataFrame, df2, that holds the date range of each sprint:

sprint   sprint_start   sprint_end
------   -----------    ----------
S100     2020-01-01     2020-01-09    
S101     2020-01-10     2020-01-19  
S102     2020-01-20     2020-01-29  
S103     2020-01-30     2020-02-09  
S104     2020-02-10     2020-02-19  
...

How can I look up each row's sprint in df2 and fill in the NULL values in df?

Note that df and df2 have different shapes.

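I am assuming the duplicated sprint rows in df (the first DataFrame) can be dropped. If not, please advise otherwise. Based on my comparison of the two DataFrames you provided, I used merge_asof with a one-day tolerance. Let me know if that does not fit.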

df.assign(sprint=pd.merge_asof(df.drop_duplicates(keep='first'), df2, left_on="sprint_created", right_on="sprint_start", tolerance=pd.Timedelta("1D"))['sprint_y']).dropna()

If your frame legitimately contains multiple rows for the same sprint, as noted in the comments above, try this instead:

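# df2 is the sprint date-range frame from the question; both date columns must be datetime and sorted
g = df.assign(sprint=pd.merge_asof(df.drop_duplicates(keep='first'), df2, left_on="sprint_created", right_on="sprint_start", tolerance=pd.Timedelta("1D"))['sprint_y'])
# rows dropped as duplicates are still NaN here; fill them from the row with the same sprint_created
g.loc[g.sprint.isna(), 'sprint'] = g.groupby('sprint_created').sprint.ffill()
print(g)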



  sprint sprint_created
0   S100     2020-01-01
1   S101     2020-01-10
2   S102     2020-01-20
3   S103     2020-01-31
4   S101     2020-01-10
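
The one-day tolerance above fits the sample data, but a row created in the middle of a sprint would fall outside it. As a hedged variation on the same merge_asof idea, the sketch below drops the tolerance and instead checks the match against sprint_end. The sample frames are copied from the question; the names left, right, sprint_lookup and filled are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question; NULL is represented as None.
df = pd.DataFrame({
    "sprint": ["S100", "S101", None, None, "S101"],
    "sprint_created": pd.to_datetime(
        ["2020-01-01", "2020-01-10", "2020-01-20", "2020-01-31", "2020-01-10"]),
})
df2 = pd.DataFrame({
    "sprint": ["S100", "S101", "S102", "S103", "S104"],
    "sprint_start": pd.to_datetime(
        ["2020-01-01", "2020-01-10", "2020-01-20", "2020-01-30", "2020-02-10"]),
    "sprint_end": pd.to_datetime(
        ["2020-01-09", "2020-01-19", "2020-01-29", "2020-02-09", "2020-02-19"]),
})

# merge_asof needs both keys sorted; keep the original index to restore row order later.
left = df.reset_index().sort_values("sprint_created")
right = df2.sort_values("sprint_start").rename(columns={"sprint": "sprint_lookup"})

# The default direction="backward" picks the latest sprint_start <= sprint_created.
matched = pd.merge_asof(left, right, left_on="sprint_created", right_on="sprint_start")

# Discard matches whose creation date falls after the matched sprint's end date.
in_range = matched["sprint_created"] <= matched["sprint_end"]
matched["sprint_lookup"] = matched["sprint_lookup"].where(in_range)

# Fill only the missing values, then restore the original row order.
matched["sprint"] = matched["sprint"].fillna(matched["sprint_lookup"])
filled = (matched.set_index("index").sort_index()
          .rename_axis(None)[["sprint", "sprint_created"]])
print(filled)
```

Because existing sprint values are only filled where missing, duplicated rows do not need to be dropped first.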

One approach is to melt and resample your df2, then build a dictionary that maps each date back to df1:
#make sure columns are in datetime format
df1['sprint_created'] = pd.to_datetime(df1['sprint_created'])
df2['sprint_start'] = pd.to_datetime(df2['sprint_start'])
df2['sprint_end'] = pd.to_datetime(df2['sprint_end'])

#melt dataframe of the two date columns and resample by group
new = (df2.melt(id_vars='sprint').drop('variable', axis=1).set_index('value')
          .groupby('sprint', group_keys=False).resample('D').ffill().reset_index())

#create dictionary of date and the sprint and map back to df1
dct = dict(zip(new['value'], new['sprint']))
df1['sprint'] = df1['sprint_created'].map(dct)
#or df1['sprint'] = df1['sprint'].fillna(df1['sprint_created'].map(dct))
df1
Out[1]: 
  sprint sprint_created
0   S100     2020-01-01
1   S101     2020-01-10
2   S102     2020-01-20
3   S103     2020-01-31
4   S101     2020-01-10

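Are the two DataFrames guaranteed to use the same row-based index, i.e. does row 5 in df always correspond to row 5 in df2? Or do you need to match them on the sprint_created and sprint_start columns? (They look the same here, but they may differ.) Pandas has good documentation on joining and merging.

As far as I can see, the sprint column has multiple repeated values in the first table. If the index values do not match between the two tables, what key would you use to identify and join the rows?

In df2, yes, because it provides the date range for each sprint. But df can be in random order.

df will have multiple rows with the same sprint id; the key in df would be another column, project_id. Thanks for pointing that out, I have updated the question.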