Python 3.x: What is the fastest way to merge these dataframes?

python-3.x pandas

I would like to perform the following task in the fastest way possible:

# From df1 (this is a dataframe in memory)
#         C1      C2      
# Row1    TRUE    TRUE    
# Row2    TRUE    TRUE    
# Row3    TRUE    TRUE    
# Row4    TRUE    TRUE    

# from df2 (this is a dataframe on disk)
#        C2       C5      C6
# Row1   TRUE     TRUE    FALSE
# Row2   FALSE    FALSE   TRUE
# Row5   TRUE     FALSE   TRUE
# Row6   FALSE    TRUE    FALSE

# To new df3 
#         C1       C2      C5       C6
# Row1    TRUE     TRUE    TRUE     FALSE
# Row2    TRUE     TRUE    FALSE    TRUE
# Row3    TRUE     TRUE    FALSE    FALSE
# Row4    TRUE     TRUE    FALSE    FALSE
# Row5    FALSE    TRUE    FALSE    TRUE
# Row6    FALSE    FALSE   TRUE     FALSE

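df3 is the new file to be saved. All columns of df1 are added if they are not already present in df2. Every cell whose row and column both appear in df1 should be True. Otherwise, if the cell already exists in df2 on disk, its stored value is kept; if the cell is new, it defaults to False.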

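In other words: the default value of a cell (a boolean row/column value) added to the file on disk (df2) is False, unless a value already exists on disk. If the cell is in memory (df1), it should be set to True.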

The following code does this:

    import pandas as pd

    df1 = pd.DataFrame({'row': ['row1', 'row2', 'row3', 'row4'],
                        'c1': [True, True, True, True],
                        'c2': [True, True, True, True],
                        })


    df2 = pd.DataFrame({'row': ['row1', 'row2', 'row5', 'row6'],
                        'c2': [True, False, True, False],
                        'c5': [True, False, False, True],
                        'c6': [False, True, True, False],
                        })
    print(df1)
    print(df2)

    df3 = df1.merge(df2, how="outer", on='row', validate='one_to_one')
    df3 = df3.fillna(False)
    df3['c2_x'] = df3['c2_x'] | df3['c2_y']
    df3 = df3.drop('c2_y', axis=1)
    df3.columns = ['c2' if x=='c2_x' else x for x in df3.columns]

    print(df3)

It produces the following output:

        row    c1    c2
    0  row1  True  True
    1  row2  True  True
    2  row3  True  True
    3  row4  True  True
        row     c2     c5     c6
    0  row1   True   True  False
    1  row2  False  False   True
    2  row5   True  False   True
    3  row6  False   True  False
        row     c1     c2     c5     c6
    0  row1   True   True   True  False
    1  row2   True   True  False   True
    2  row3   True   True  False  False
    3  row4   True   True  False  False
    4  row5  False   True  False   True
    5  row6  False  False   True  False

I am particularly concerned about this part of the code:

    df3['c2_x'] = df3['c2_x'] | df3['c2_y']
    df3 = df3.drop('c2_y', axis=1)
    df3.columns = ['c2' if x=='c2_x' else x for x in df3.columns]

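In the real case there may be more than 50 such columns, so handling each overlapping column by name does not scale. In addition, all values in df1 are always True. So an element of df2 basically becomes True if it is mentioned in df1; if it has no value at all, it is False. Not all columns of df2 are in df1, and vice versa.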
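One way to avoid naming each overlapping column is sketched below. It is only a sketch, and it relies on two statements from the question: every value in df1 is True, and the row labels are unique. It assumes the `row` column can be used as the index, which the original code does not do. The idea is to expand df2 to the union of all rows and columns with False as the fill value, then set every cell covered by df1 to True in one assignment.

```python
import pandas as pd

df1 = pd.DataFrame({'row': ['row1', 'row2', 'row3', 'row4'],
                    'c1': [True, True, True, True],
                    'c2': [True, True, True, True],
                    }).set_index('row')

df2 = pd.DataFrame({'row': ['row1', 'row2', 'row5', 'row6'],
                    'c2': [True, False, True, False],
                    'c5': [True, False, False, True],
                    'c6': [False, True, True, False],
                    }).set_index('row')

# Union of row labels and of column labels (Index.union sorts by default).
rows = df2.index.union(df1.index)
cols = df2.columns.union(df1.columns)

# Start from the on-disk values; cells that did not exist yet default to False.
df3 = df2.reindex(index=rows, columns=cols, fill_value=False).astype(bool)

# Assumption from the question: every cell of df1 is True,
# so the whole df1 block can be overwritten at once.
df3.loc[df1.index, df1.columns] = True

print(df3)
```

Because nothing here refers to a column by name, the same lines apply unchanged to 50 or more columns. If df1 could ever contain False values, the final assignment would have to combine the two frames instead of overwriting the block.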

The values row1, row2, ..., row6 are always unique.

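If "all values in df1 are always True", what is the information value of this frame? It could be dropped entirely.

The columns of df1 should be added to the new df, and for rows not mentioned in df1 they should have the default value False.

If that is the case, your expected output seems to contradict it: for row 1, columns 5 and 6 are not in df1, yet the output for row 1 is True in column 5 and False in column 6?

That is because row 1 in df2 already had those values.

Let's take a break.

df2 is a file on disk, df1 is in memory. All cells of df1 should be added to the df2 file on disk and set to True.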