Python: compare two different DataFrames based on multiple row conditions

python pandas dataframe merge

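I have two DataFrames that contain different information about the same patients. I need to use DataFrame 1 to filter DataFrame 2, so that df_2 keeps a patient value in a given row only if df_1 has an integer value for the same chromosome, strand, elementloc and patient. If df_1 has a NaN value, I want to put NaN in the same position in df_2. NaN values that are already present in df_2 should stay NaN.

So for df_1 and df_2 like: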

import numpy as np
import pandas as pd

# np.nan (not the string 'NaN') so that pandas treats these entries as missing
df_1 = pd.DataFrame({'chromosome': [1, 1, 5, 4],
                     'strand': ['-', '-', '+', '-'],
                     'elementloc': [4991, 8870, 2703, 9674],
                     'Patient1_Reads': [np.nan, 25, 50, np.nan],
                     'Patient2_Reads': [35, 200, np.nan, 500]})

print(df_1)
   chromosome strand  elementloc  Patient1_Reads  Patient2_Reads
0           1      -        4991             NaN            35.0
1           1      -        8870            25.0           200.0
2           5      +        2703            50.0             NaN
3           4      -        9674             NaN           500.0


df_2 = pd.DataFrame({'chromosome': [1, 1, 5, 4],
                     'strand': ['-', '-', '+', '-'],
                     'elementloc': [4991, 8870, 2703, 9674],
                     'Patient1_PSI': [0.76, 0.35, 0.04, np.nan],
                     'Patient2_PSI': [0.89, 0.15, 0.47, 0.32]})

print(df_2)
   chromosome strand  elementloc  Patient1_PSI  Patient2_PSI
0           1      -        4991          0.76          0.89
1           1      -        8870          0.35          0.15
2           5      +        2703          0.04          0.47
3           4      -        9674           NaN          0.32

I would like the new df_2 to look like:

   chromosome strand  elementloc  Patient1_PSI  Patient2_PSI
0           1      -        4991           NaN          0.89
1           1      -        8870          0.35          0.15
2           5      +        2703          0.04           NaN
3           4      -        9674           NaN          0.32

Use:

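# Align the two frames row by row on the shared key columns.
df3 = df_1.merge(df_2, on=['chromosome', 'strand', 'elementloc'])

# Pair every *_Reads column with its *_PSI counterpart.
r_cols = df3.columns[df3.columns.str.endswith('_Reads')]
p_cols = r_cols.str.replace('_Reads$', '_PSI', regex=True)

# Set a PSI value to NaN wherever the matching Reads value is missing.
df3[p_cols] = df3[p_cols].mask(df3[r_cols].isna().to_numpy())
df3 = df3.drop(columns=r_cols)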
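
One caveat on the missing-value check: isna only flags real missing values (np.nan, None, pd.NA), not text such as 'NaN'. If the Reads columns were built or loaded with the string 'NaN', the mask would be all False and nothing would be filtered. The sketch below shows one way to coerce such columns first; the small frame is illustrative only, and pd.to_numeric with errors='coerce' turns anything unparseable into NaN, which may or may not be what you want for your data.

```python
import pandas as pd

# Illustrative frame whose missing entries are the *string* 'NaN'.
df = pd.DataFrame({'Patient1_Reads': ['NaN', 25, 50, 'NaN'],
                   'Patient2_Reads': [35, 200, 'NaN', 500]})

# The strings are not recognised as missing.
before = int(df.isna().to_numpy().sum())

# Convert each column to a numeric dtype; unparseable entries become NaN.
cleaned = df.apply(pd.to_numeric, errors='coerce')
after = int(cleaned.isna().to_numpy().sum())

print(before, after)
```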

Details:

Step A: Use DataFrame.merge to create a merged DataFrame df3, obtained by merging df_1 and df_2 on ['chromosome', 'strand', 'elementloc'].

# print(df3)
   chromosome strand  elementloc  Patient1_Reads  Patient2_Reads  Patient1_PSI  Patient2_PSI
0           1      -        4991             NaN            35.0          0.76          0.89
1           1      -        8870            25.0           200.0          0.35          0.15
2           5      +        2703            50.0             NaN          0.04          0.47
3           4      -        9674             NaN           500.0           NaN          0.32
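
The merge in step A uses the default inner join, so a row whose key appears in only one frame is dropped without warning. If that matters for your data, the following sketch keeps every row of the PSI frame and reports keys that found no match. The two small frames and the names reads, psi and unmatched are made up for illustration; how, validate and indicator are standard DataFrame.merge parameters.

```python
import numpy as np
import pandas as pd

keys = ['chromosome', 'strand', 'elementloc']

# Illustrative frames: the last PSI row has no counterpart in reads.
reads = pd.DataFrame({'chromosome': [1, 1, 5],
                      'strand': ['-', '-', '+'],
                      'elementloc': [4991, 8870, 2703],
                      'Patient1_Reads': [np.nan, 25, 50]})
psi = pd.DataFrame({'chromosome': [1, 1, 4],
                    'strand': ['-', '-', '-'],
                    'elementloc': [4991, 8870, 9674],
                    'Patient1_PSI': [0.76, 0.35, 0.12]})

# how='left' keeps every PSI row, validate raises MergeError if a key is
# duplicated on either side, indicator adds a '_merge' column per row.
merged = psi.merge(reads, on=keys, how='left',
                   validate='one_to_one', indicator=True)

# Rows present only in psi; their Reads value is NaN after the merge.
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[keys])
```

With a left join, an unmatched row ends up with NaN in its Reads column, so the later masking step would also set its PSI value to NaN.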

Step B: Use .str.endswith to get the columns in df3 that end with _Reads, which we call r_cols, then use these _Reads columns to derive the corresponding _PSI columns, which we call p_cols.

# print(r_cols)
Index(['Patient1_Reads', 'Patient2_Reads'], dtype='object')

# print(p_cols)
Index(['Patient1_PSI', 'Patient2_PSI'], dtype='object')
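
A pitfall when deriving the _PSI names from the _Reads names: str.strip treats its argument as a set of characters to remove from both ends, not as a suffix. It works for names like Patient1_Reads, but it can eat leading characters of other names. The sketch below compares it with an anchored regex replacement; the column name sample_Reads is hypothetical and only there to trigger the problem.

```python
import pandas as pd

cols = pd.Index(['sample_Reads', 'Patient1_Reads'])

# strip('Reads') removes any of the characters R, e, a, d, s from both ends.
stripped = cols.str.strip('Reads') + 'PSI'

# An anchored regex only touches the literal '_Reads' suffix.
replaced = cols.str.replace('_Reads$', '_PSI', regex=True)

print(list(stripped))
print(list(replaced))
```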

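Step C: Use DataFrame.isna on the _Reads columns to obtain a boolean mask, then use this mask with DataFrame.mask to set the corresponding values in the _PSI columns to NaN. Finally, use DataFrame.drop to remove the _Reads columns from the merged DataFrame df3 to get the desired result: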

# print(df3)
   chromosome strand  elementloc  Patient1_PSI  Patient2_PSI
0           1      -        4991           NaN          0.89
1           1      -        8870          0.35          0.15
2           5      +        2703          0.04           NaN
3           4      -        9674           NaN          0.32
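
As an alternative to merging, the same result can be sketched with index alignment: put the three key columns in the index of both frames, rename the Reads columns so their labels match the PSI columns, and let DataFrame.where keep a PSI value only where the aligned Reads value is present. This assumes the missing entries are real NaN values (np.nan) and that each key combination is unique; reads, psi and result are names chosen here for illustration.

```python
import numpy as np
import pandas as pd

keys = ['chromosome', 'strand', 'elementloc']

df_1 = pd.DataFrame({'chromosome': [1, 1, 5, 4],
                     'strand': ['-', '-', '+', '-'],
                     'elementloc': [4991, 8870, 2703, 9674],
                     'Patient1_Reads': [np.nan, 25, 50, np.nan],
                     'Patient2_Reads': [35, 200, np.nan, 500]})
df_2 = pd.DataFrame({'chromosome': [1, 1, 5, 4],
                     'strand': ['-', '-', '+', '-'],
                     'elementloc': [4991, 8870, 2703, 9674],
                     'Patient1_PSI': [0.76, 0.35, 0.04, np.nan],
                     'Patient2_PSI': [0.89, 0.15, 0.47, 0.32]})

# Move the key columns into the index so rows align by key, not by position.
reads = df_1.set_index(keys)
psi = df_2.set_index(keys)

# Give the Reads columns the same labels as their PSI counterparts.
reads.columns = reads.columns.str.replace('_Reads$', '_PSI', regex=True)

# Keep a PSI value only where the aligned Reads value is not missing.
result = psi.where(reads.notna()).reset_index()
print(result)
```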

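Do the two DataFrames always contain corresponding columns for each patient? For example, if PatientX_Reads appears in df_1, will the corresponding PatientX_PSI always appear in df_2?

@ShubhamSharma, yes! That is, to compare PatientX_Reads with PatientX_PSI, I need to make sure that the entries for chromosome, strand and elementloc are the same.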
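
If that guarantee ever needs to be checked rather than assumed, a small helper can list the _PSI columns that are expected from the _Reads columns but missing. This is only a sketch; missing_psi_columns is a hypothetical helper name, not part of pandas, and the column lists are illustrative.

```python
import pandas as pd

def missing_psi_columns(reads_columns, psi_columns):
    """Return *_PSI names implied by the *_Reads columns but absent from psi_columns."""
    reads_columns = pd.Index(reads_columns)
    r_cols = reads_columns[reads_columns.str.endswith('_Reads')]
    expected = r_cols.str.replace('_Reads$', '_PSI', regex=True)
    # Index.difference returns the labels in expected that psi_columns lacks.
    return expected.difference(pd.Index(psi_columns)).tolist()

missing = missing_psi_columns(
    ['chromosome', 'Patient1_Reads', 'Patient2_Reads'],
    ['chromosome', 'Patient1_PSI'])
print(missing)
```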