Python: How to fill missing values in a DataFrame using another DataFrame
My df looks like this:
sprint sprint_created
------ -----------
S100 2020-01-01
S101 2020-01-10
NULL 2020-01-20
NULL 2020-01-31
S101 2020-01-10
...
In the df above, you can see that some sprint values are NULL.
I have another DataFrame, df2, that holds the date range of each sprint:
sprint sprint_start sprint_end
------ ----------- ----------
S100 2020-01-01 2020-01-09
S101 2020-01-10 2020-01-19
S102 2020-01-20 2020-01-29
S103 2020-01-30 2020-02-09
S104 2020-02-10 2020-02-19
...
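How can I map this data and fill in the NULL values in df by comparing against the date ranges in df2?
Please note that df and df2 have different shapes.

I am assuming the repeated sprint rows in df are true duplicates, so all but the first occurrence can be dropped. If that is not the case, please say so. Going by the two DataFrames you provided, I used merge_asof with a tolerance of one day. Let me know if that does not fit.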
# df1 here is the sprint date-range frame (called df2 in the question)
df.assign(sprint=pd.merge_asof(df.drop_duplicates(keep='first'), df1, left_on="sprint_created", right_on="sprint_start", tolerance=pd.Timedelta("1D"))['sprint_y']).dropna()
If your frame legitimately contains several rows for the same sprint, as noted in the comments, try this instead:
g = df.assign(sprint=pd.merge_asof(
    df.drop_duplicates(keep='first'), df1,
    left_on="sprint_created", right_on="sprint_start",
    tolerance=pd.Timedelta("1D"))['sprint_y'])
g.loc[g.sprint.isna(), 'sprint']=g.groupby('sprint_created').sprint.ffill()
print(g)
sprint sprint_created
0 S100 2020-01-01
1 S101 2020-01-10
2 S102 2020-01-20
3 S103 2020-01-31
4 S101 2020-01-10
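
The one-day tolerance above only matches creation dates that fall on, or one day after, a sprint start, and the positional assignment relies on the duplicate row coming last. As a minimal sketch of a variant that avoids both, the code below drops the tolerance, checks each match against sprint_end, and aligns the result back through the original row labels. The sample frames are rebuilt from the question; the names left, right, matched, lookup and filled are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question (NULL represented as None).
df = pd.DataFrame({
    "sprint": ["S100", "S101", None, None, "S101"],
    "sprint_created": pd.to_datetime(
        ["2020-01-01", "2020-01-10", "2020-01-20", "2020-01-31", "2020-01-10"]
    ),
})
df2 = pd.DataFrame({
    "sprint": ["S100", "S101", "S102", "S103", "S104"],
    "sprint_start": pd.to_datetime(
        ["2020-01-01", "2020-01-10", "2020-01-20", "2020-01-30", "2020-02-10"]
    ),
    "sprint_end": pd.to_datetime(
        ["2020-01-09", "2020-01-19", "2020-01-29", "2020-02-09", "2020-02-19"]
    ),
})

# merge_asof needs both sides sorted by their keys. reset_index() keeps the
# original row labels in an "index" column so we can align back afterwards.
left = (
    df[["sprint_created"]]
    .reset_index()
    .sort_values("sprint_created", kind="stable")
)
right = df2.sort_values("sprint_start")

# For each creation date, pick the latest sprint starting on or before it.
matched = pd.merge_asof(
    left, right, left_on="sprint_created", right_on="sprint_start"
)

# Keep only matches whose date does not fall after that sprint's end.
valid = matched["sprint_created"] <= matched["sprint_end"]
lookup = matched.loc[valid].set_index("index")["sprint"]

# Fill only the missing values; existing sprint labels are left untouched.
filled = df.assign(sprint=df["sprint"].fillna(lookup))
print(filled)
```

On the sample data this should reproduce the output shown above, including the repeated S101 row, without dropping duplicates first.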
One approach is to melt and resample your df2, then build a dictionary that maps each date back onto df1:
import pandas as pd

# make sure the date columns are in datetime format
df1['sprint_created'] = pd.to_datetime(df1['sprint_created'])
df2['sprint_start'] = pd.to_datetime(df2['sprint_start'])
df2['sprint_end'] = pd.to_datetime(df2['sprint_end'])
# melt the two date columns into one, then resample daily within each sprint
new = (df2.melt(id_vars='sprint').drop('variable', axis=1).set_index('value')
.groupby('sprint', group_keys=False).resample('D').ffill().reset_index())
# build a date -> sprint dictionary and map it back onto df1
dct = dict(zip(new['value'], new['sprint']))
df1['sprint'] = df1['sprint_created'].map(dct)
# or, to fill only the missing values:
# df1['sprint'] = df1['sprint'].fillna(df1['sprint_created'].map(dct))
df1
Out[1]:
sprint sprint_created
0 S100 2020-01-01
1 S101 2020-01-10
2 S102 2020-01-20
3 S103 2020-01-31
4 S101 2020-01-10
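
The same date-to-sprint lookup can also be expressed without expanding every day into a dictionary. The following is a hedged sketch that uses pd.IntervalIndex instead of melt and resample; it assumes the sprint ranges in df2 do not overlap, and the names intervals, pos, found and lookup are illustrative rather than taken from the answer.

```python
import pandas as pd

# Sample data copied from the question (NULL represented as None).
df1 = pd.DataFrame({
    "sprint": ["S100", "S101", None, None, "S101"],
    "sprint_created": pd.to_datetime(
        ["2020-01-01", "2020-01-10", "2020-01-20", "2020-01-31", "2020-01-10"]
    ),
})
df2 = pd.DataFrame({
    "sprint": ["S100", "S101", "S102", "S103", "S104"],
    "sprint_start": pd.to_datetime(
        ["2020-01-01", "2020-01-10", "2020-01-20", "2020-01-30", "2020-02-10"]
    ),
    "sprint_end": pd.to_datetime(
        ["2020-01-09", "2020-01-19", "2020-01-29", "2020-02-09", "2020-02-19"]
    ),
})

# One closed interval [sprint_start, sprint_end] per sprint.
intervals = pd.IntervalIndex.from_arrays(
    df2["sprint_start"], df2["sprint_end"], closed="both"
)

# Position of the interval containing each date, or -1 when none contains it.
# This lookup assumes the intervals do not overlap.
pos = intervals.get_indexer(df1["sprint_created"])

# Build a lookup Series only for the rows that landed inside some sprint.
found = pos >= 0
lookup = pd.Series(
    df2["sprint"].to_numpy()[pos[found]], index=df1.index[found]
)

# Fill only the missing sprint values.
df1["sprint"] = df1["sprint"].fillna(lookup)
print(df1)
```

Dates that fall outside every sprint range stay missing, which makes gaps in df2 easy to spot afterwards.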
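Are the two DataFrames guaranteed to share the same row-based index? That is, does row 5 in df always correspond to row 5 in df2? Or do you need to match them on the sprint_created and sprint_start columns? (They look the same here, but they may differ.) Pandas has good documentation on joining and merging.
As far as I can see, the sprint column in the first table contains several repeated values. If the index values do not match between the two tables, which key will you use to identify and join the rows?
In df2, yes, because it provides the values for the sprint date ranges. But df can be in random order.
df will have multiple rows with the same sprint id, and the key in df will be another column, project_id. Thanks for pointing that out, I have updated the question.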