Python: How do I join two DataFrames where two column values fall within two specific ranges?
Tags: python, pandas, dataframe, join, intervals
I have two DataFrames:
print(df1)
Name df1 RT [min] Molecular Weight RT [min]+0.2 RT [min]-0.2 Molecular Weight + 0.2 Molecular Weight - 0.2
0 unknow compound 1 7.590 194.04212 7.790 7.390 194.24212 193.84212
1 unknow compound 2 7.510 194.15000 7.710 7.310 194.35000 193.95000
2 unknow compound 3 7.410 194.04209 7.610 7.210 194.24209 193.84209
3 unknow compound 4 7.434 342.11615 7.634 7.234 342.31615 341.91615
4 unknow compound 5 0.756 176.03128 0.956 0.556 176.23128 175.83128
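and df2, which has the columns Name df2, Molecular Weight and RT [min].
I want to merge rows from df2 onto rows of df1 when two conditions are both met: df2's RT [min] lies between df1's RT [min]-0.2 and RT [min]+0.2, and df2's Molecular Weight lies between df1's Molecular Weight - 0.2 and Molecular Weight + 0.2. The expected result: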
print(df3)
Name df1 RT [min]+0.2 RT [min]-0.2 Molecular Weight + 0.2 Molecular Weight - 0.2 Name df2 Molecular Weight RT [min]
0 unknow compound 1 7.790 7.390 194.24212 193.84212 β-D-Glucopyranuronic acid 194.0422 7.483
1 unknow compound 1 7.790 7.390 194.24212 193.84212 α,α-Trehalose 194.1000 7.350
2 unknow compound 2 7.710 7.310 194.35000 193.95000 β-D-Glucopyranuronic acid 194.0422 7.483
3 unknow compound 3 8.310 7.910 206.30000 205.90000 Threonylserine 206.0897 8.258
4 unknow compound 4 7.634 7.234 342.31615 341.91615 NaN NaN NaN
5 unknow compound 5 0.956 0.556 176.23128 175.83128 NaN NaN NaN
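The first row of df2 meets both conditions for unknow compound 1 and unknow compound 2 in df1, so it appears twice in df3.
The second row of df2 meets both conditions only for unknow compound 1.
The third row of df2 meets both conditions only for unknow compound 3.
None of the other rows meet the conditions for any row of df1.
I tried to do this based on the first answer: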
import pandas as pd
df_1 = pd.read_excel (r'D:\CD SandBox\df1.xlsx')
df_2 = pd.read_excel (r'D:\CD SandBox\df2.xlsx')
df2.index = pd.IntervalIndex.from_arrays(df2['RT [min]-0.2'],df2['RT [min]+0.2'],closed='both')
df2['RT [min]'] = df2['RT [min]'].apply( lambda x : df2.iloc[df1.index.get_loc(x)])
But I don't know how to handle the second line of code, and I get this error:
df2['RT [min]'] = df2['RT [min]'].apply( lambda x : df2.iloc[df1.index.get_loc(x)])
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "C:\Users\BCDD\Anaconda3\envs\PTSD\lib\site-packages\pandas\core\series.py", line 4213, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas\_libs\lib.pyx", line 2403, in pandas._libs.lib.map_infer
File "<input>", line 1, in <lambda>
File "C:\Users\BCDD\Anaconda3\envs\PTSD\lib\site-packages\pandas\core\indexes\interval.py", line 730, in get_loc
raise KeyError(key)
KeyError: 8.258
I also tried merge_asof, but it gave wrong matches for the table.
Any ideas or hints would be much appreciated.

Option 1
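If you use pandas 1.2.0 or later, you can create the Cartesian product of the two DataFrames and then check the conditions. Also, since you don't need RT [min] and Molecular Weight from df1, I assume you have already dropped them: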
# Cartesian product: every df1 row paired with every df2 row
df3 = df1.merge(df2, how='cross', suffixes=[None, None])
# check if 'Molecular Weight' is in the interval
mask1 = (df3['Molecular Weight'].ge(df3['Molecular Weight - 0.2'])
         & df3['Molecular Weight'].le(df3['Molecular Weight + 0.2']))
# check if 'RT [min]' is in the interval
mask2 = (df3['RT [min]'].ge(df3['RT [min]-0.2'])
         & df3['RT [min]'].le(df3['RT [min]+0.2']))
# keep only the pairs that satisfy both conditions
df3 = df3[mask1 & mask2].reset_index(drop=True)
Output:
df3
Name df1 RT [min]+0.2 RT [min]-0.2 ... Name df2 Molecular Weight RT [min]
0 unknow compound 1 7.79 7.39 ... β-D-Glucopyranuronic acid 194.0422 7.483
1 unknow compound 2 7.71 7.31 ... β-D-Glucopyranuronic acid 194.0422 7.483
2 unknow compound 2 7.71 7.31 ... α,α-Trehalose 194.1000 7.350
3 unknow compound 3 7.61 7.21 ... β-D-Glucopyranuronic acid 194.0422 7.483
4 unknow compound 3 7.61 7.21 ... α,α-Trehalose 194.1000 7.350
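The expected df3 in the question also keeps unknow compound 4 and 5 with NaN in the df2 columns, which the filter above drops. A minimal sketch of one way to get them back, assuming the sample rows from the question (only the three df2 rows visible there) and a left merge of the matches onto df1; the names `pairs`, `in_mw`, `in_rt` and `matches` are illustrative, not from the original answer:

```python
import pandas as pd

# Sample rows copied from the question; df1 already has its own
# 'RT [min]' and 'Molecular Weight' columns dropped, as the answer assumes.
df1 = pd.DataFrame({
    "Name df1": [f"unknow compound {i}" for i in range(1, 6)],
    "RT [min]+0.2": [7.790, 7.710, 7.610, 7.634, 0.956],
    "RT [min]-0.2": [7.390, 7.310, 7.210, 7.234, 0.556],
    "Molecular Weight + 0.2": [194.24212, 194.35000, 194.24209, 342.31615, 176.23128],
    "Molecular Weight - 0.2": [193.84212, 193.95000, 193.84209, 341.91615, 175.83128],
})
# Only the df2 rows that are visible in the question's expected output.
df2 = pd.DataFrame({
    "Name df2": ["β-D-Glucopyranuronic acid", "α,α-Trehalose", "Threonylserine"],
    "Molecular Weight": [194.0422, 194.1000, 206.0897],
    "RT [min]": [7.483, 7.350, 8.258],
})

# Every df1 row paired with every df2 row (needs pandas >= 1.2.0).
pairs = df1.merge(df2, how="cross")

# Both range checks, inclusive at both ends.
in_mw = pairs["Molecular Weight"].between(
    pairs["Molecular Weight - 0.2"], pairs["Molecular Weight + 0.2"])
in_rt = pairs["RT [min]"].between(
    pairs["RT [min]-0.2"], pairs["RT [min]+0.2"])
matches = pairs[in_mw & in_rt]

# Left-merge the matches back onto df1 so that compounds without
# any match survive with NaN in the df2 columns.
df3 = df1.merge(matches, how="left", on=list(df1.columns))
print(df3)
```

The left merge uses all df1 columns as keys, which is safe here because `matches` carries those values unchanged from df1.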
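Option 2
Since your data is fairly large, you may want to use a generator so that the whole resulting DataFrame is not loaded into memory at once. Again, I assume you have dropped RT [min] and Molecular Weight from df1: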
import numpy as np
from itertools import product

def df_iter(df1, df2):
    # Compare every df1 row with every df2 row, one pair at a time
    for row1, row2 in product(df1.values, df2.values):
        # RT [min]-0.2 <= RT [min] <= RT [min]+0.2
        if row1[2] <= row2[2] <= row1[1]:
            # Molecular Weight - 0.2 <= Molecular Weight <= Molecular Weight + 0.2
            if row1[4] <= row2[1] <= row1[3]:
                yield np.concatenate((row1, row2))

df3_rows = df_iter(df1, df2)
Output:
['unknow compound 1' 7.79 7.39 194.24212 193.84212 'β-D-Glucopyranuronic acid' 194.0422 7.483]
['unknow compound 2' 7.71 7.31 194.35 193.95 'β-D-Glucopyranuronic acid' 194.0422 7.483]
['unknow compound 2' 7.71 7.31 194.35 193.95 'α,α-Trehalose' 194.1 7.35]
['unknow compound 3' 7.61 7.21 194.24209 193.84209 'β-D-Glucopyranuronic acid' 194.0422 7.483]
['unknow compound 3' 7.61 7.21 194.24209 193.84209 'α,α-Trehalose' 194.1 7.35]
Or create the DataFrame:
df3 = pd.DataFrame(data = list(df3_rows),
columns = np.concatenate((df1.columns, df2.columns)))
This yields the same DataFrame as in Option 1.
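With roughly 10000 rows in each DataFrame, the pairwise Python loop has to visit about 10^8 pairs. A possible variation, not part of the original answer, is to keep the generator idea but compare one chunk of df1 against all of df2 at once with NumPy broadcasting. The helper name `range_join` and the `chunk_size` parameter are hypothetical, and the sample rows are the ones from the question:

```python
import numpy as np
import pandas as pd

def range_join(df1, df2, chunk_size=1000):
    """Hypothetical helper: yield one DataFrame of matching pairs per df1 chunk."""
    rt = df2["RT [min]"].to_numpy()
    mw = df2["Molecular Weight"].to_numpy()
    for start in range(0, len(df1), chunk_size):
        chunk = df1.iloc[start:start + chunk_size]
        # Shape (n, 1) bounds against shape (m,) values broadcast
        # to an (n, m) boolean grid of df1-row / df2-row pairs.
        rt_lo = chunk["RT [min]-0.2"].to_numpy()[:, None]
        rt_hi = chunk["RT [min]+0.2"].to_numpy()[:, None]
        mw_lo = chunk["Molecular Weight - 0.2"].to_numpy()[:, None]
        mw_hi = chunk["Molecular Weight + 0.2"].to_numpy()[:, None]
        hit = (rt_lo <= rt) & (rt <= rt_hi) & (mw_lo <= mw) & (mw <= mw_hi)
        i, j = np.nonzero(hit)  # positions of matching pairs
        if len(i) == 0:
            continue  # nothing in this chunk matched
        left = chunk.iloc[i].reset_index(drop=True)
        right = df2.iloc[j].reset_index(drop=True)
        yield pd.concat([left, right], axis=1)

# Sample rows from the question (df1 without 'RT [min]' / 'Molecular Weight').
df1 = pd.DataFrame({
    "Name df1": [f"unknow compound {i}" for i in range(1, 6)],
    "RT [min]+0.2": [7.790, 7.710, 7.610, 7.634, 0.956],
    "RT [min]-0.2": [7.390, 7.310, 7.210, 7.234, 0.556],
    "Molecular Weight + 0.2": [194.24212, 194.35000, 194.24209, 342.31615, 176.23128],
    "Molecular Weight - 0.2": [193.84212, 193.95000, 193.84209, 341.91615, 175.83128],
})
df2 = pd.DataFrame({
    "Name df2": ["β-D-Glucopyranuronic acid", "α,α-Trehalose", "Threonylserine"],
    "Molecular Weight": [194.0422, 194.1000, 206.0897],
    "RT [min]": [7.483, 7.350, 8.258],
})

# A tiny chunk_size only to exercise the chunking on five rows.
parts = list(range_join(df1, df2, chunk_size=2))
# pd.concat raises on an empty list, so guard the no-match case.
df3 = (pd.concat(parts, ignore_index=True) if parts
       else pd.DataFrame(columns=[*df1.columns, *df2.columns]))
print(df3)
```

Each chunk only materialises a boolean grid of chunk_size × len(df2) entries, so memory stays bounded while the comparisons run in NumPy rather than in a Python loop. On the sample rows this should give the same five matches as Option 1.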
Note 1: Mind the positional indices used in the conditions inside df_iter; they work for the column order in my case.
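Note 2: I'm fairly sure your data does not match the example df3.

Check out merge_asof.
@Dani Mesejo, I tried merge_asof and added it to the question. It gives the wrong output. I think this is because the values in the two DataFrames have up to 5 digits after the decimal point, and the merge is based on greater than / less than, which is too coarse.
Thanks! The problem is that my data (the number of rows) is much larger than the 5-row DataFrame I posted, so doing a Cartesian product may be costly in terms of time.
@TaL You have to do an "implicit Cartesian product"; I mean, you have to compare all df1 rows with all df2 rows. How big are the DataFrames?
The DataFrames have about 10000 rows each.
@TaL OK, that is a problem.
@TaL See the edit, it might help.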
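On the merge_asof suggestion in the comments: one likely reason it cannot reproduce the expected table is that merge_asof matches on a single key and keeps at most one (nearest) df2 row per df1 row, whereas the question wants every df2 row that falls inside both ranges. A minimal sketch using two RT values from the question; the names `left`, `right` and `nearest` are illustrative only:

```python
import pandas as pd

# One df1 row (unknow compound 2) and two df2 rows, both within 0.2 of its RT.
left = pd.DataFrame({"Name df1": ["unknow compound 2"], "RT [min]": [7.510]})
right = pd.DataFrame({
    "Name df2": ["α,α-Trehalose", "β-D-Glucopyranuronic acid"],
    "RT [min]": [7.350, 7.483],  # sorted by the key, as merge_asof requires
})

# merge_asof behaves like a left join on the nearest key:
# only the closest df2 row is attached to each df1 row.
nearest = pd.merge_asof(left, right, on="RT [min]",
                        direction="nearest", tolerance=0.2)
print(nearest)
```

Both df2 rows lie within the tolerance, yet only the closer one is returned, so a range join with several matches per row needs a different approach such as the ones in the answer.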