Python: merging two DataFrames on overlapping segments
I have two DataFrames and need to match rows when two columns (holding start and end coordinates) overlap without crossing each other's boundaries.
For example:
df_1 = pd.DataFrame(data={'start': [0, 10, 23, 35], 'end': [5, 17, 28, 41], 'some_data_1': ['AA', 'BB', 'CC', 'DD']})
df_2 = pd.DataFrame(data={'start': [0, 12, 23, 55], 'end': [5, 17, 25, 62], 'some_data_2': ['AA_AA', 'BB_BB', 'CC_CC', 'DD_DD']})
The desired output is:
df_1_2:
start_1  end_1  start_2  end_2  some_data_1  some_data_2
      0      5        0      5           AA        AA_AA
     10     17       12     17           BB        BB_BB
     23     28       23     25           CC        CC_CC
     35     41      NaN    NaN           DD          NaN
    NaN    NaN       55     62          NaN        DD_DD
Is there an elegant way to check whether one segment (given by end - start) overlaps another and, if it does, to merge the DataFrames on that condition?
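Thanks.
Create a condition that tests whether rows of the two frames overlap, build a key column from that condition, and then merge with how='outer'.
What I observed from the data is this: if the segment length (end - start) in df_1 is greater than or equal to the one in df_2, the df_2 values (start, end, some_data_2) are attached to the row; otherwise the row is left as is. The computation depends on this premise; if it is wrong, do let me know.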
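import numpy as np
import pandas as pd

# segment length of each row (called 'overlap' in this answer)
df_1['overlap'] = df_1.end - df_1.start
df_2['overlap'] = df_2.end - df_2.start

# compared row by row, by position: is the df_1 segment at least as long as the df_2 one?
cond1 = df_1.overlap.ge(df_2.overlap)

# rows meeting the condition share a key; the others get distinct placeholders
df_1['key'] = np.where(cond1, df_2.some_data_2, 'n1')
df_2['key'] = np.where(cond1, df_2.some_data_2, 'n')

(pd
 .merge(df_1, df_2,
        how='outer',
        on='key',
        suffixes=('_1', '_2'))
 .drop(['key', 'overlap_1', 'overlap_2'],
       axis=1)
)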
   start_1  end_1 some_data_1  start_2  end_2 some_data_2
0      0.0    5.0          AA      0.0    5.0       AA_AA
1     10.0   17.0          BB     12.0   17.0       BB_BB
2     23.0   28.0          CC     23.0   25.0       CC_CC
3     35.0   41.0          DD      NaN    NaN         NaN
4      NaN    NaN         NaN     55.0   62.0       DD_DD
How did you go about it? Reading your solution will guide the responses.
The important part is how you create df_1.overlap. Something tells me that is not the definition of overlap :-)
Oh, okay. Then I misunderstood his post. I'll wait for him to comment, and if I got it wrong I'll take the answer down.
Well, on second reading, this may well be what the OP is asking for. However, merging on a data column is dangerous because the data may contain duplicates.
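The comments raise two concerns: comparing segment lengths is not really an overlap test, and keying the merge on a data column breaks when values repeat. One possible alternative, offered only as a sketch, is to compare the coordinates directly. It assumes that "overlap without crossing boundaries" means one segment lies entirely inside the other, which fits the example but is not confirmed by the question. The function name merge_on_containment and the helper columns id_1 and id_2 are made up for illustration, and how='cross' needs pandas 1.2 or later.

```python
import pandas as pd

df_1 = pd.DataFrame({'start': [0, 10, 23, 35], 'end': [5, 17, 28, 41],
                     'some_data_1': ['AA', 'BB', 'CC', 'DD']})
df_2 = pd.DataFrame({'start': [0, 12, 23, 55], 'end': [5, 17, 25, 62],
                     'some_data_2': ['AA_AA', 'BB_BB', 'CC_CC', 'DD_DD']})


def merge_on_containment(left, right):
    """Pair rows where one segment lies fully inside the other (hypothetical helper)."""
    # Suffix the coordinate columns and tag each row with a positional id.
    a = left.rename(columns={'start': 'start_1', 'end': 'end_1'}).reset_index(drop=True)
    b = right.rename(columns={'start': 'start_2', 'end': 'end_2'}).reset_index(drop=True)
    a['id_1'] = a.index
    b['id_2'] = b.index

    # Every left row paired with every right row (quadratic in size).
    pairs = a.merge(b, how='cross')

    # Assumed matching rule: containment in either direction.
    b_in_a = (pairs['start_1'] <= pairs['start_2']) & (pairs['end_2'] <= pairs['end_1'])
    a_in_b = (pairs['start_2'] <= pairs['start_1']) & (pairs['end_1'] <= pairs['end_2'])
    matched = pairs[b_in_a | a_in_b]

    # Rows that found no partner are appended, with NaN on the other side.
    only_a = a[~a['id_1'].isin(matched['id_1'])]
    only_b = b[~b['id_2'].isin(matched['id_2'])]
    out = pd.concat([matched, only_a, only_b], ignore_index=True)
    return out.drop(columns=['id_1', 'id_2'])


result = merge_on_containment(df_1, df_2)
print(result)
```

For plain overlap instead of containment, the mask could be replaced with (start_1 <= end_2) & (start_2 <= end_1). The cross join builds len(df_1) * len(df_2) rows, so this is only reasonable for small frames; for large inputs an interval-based structure such as pd.IntervalIndex may be a better fit.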