Python: reduce a DataFrame to only the rows that changed

python pandas

I have an old DataFrame and a new DataFrame, as follows:

import pandas as pd
import numpy as np

df_old = pd.DataFrame({
        "col1": ["a", "b", "c", "d", "e"],
        "col2": [1.0, 2.0, 3.0, 4.0, 5.0],
        "col3": [1.0, 2.0, 3.0, 4.0, 5.0]
    }, columns=["col1", "col2", "col3"])

df_new = pd.DataFrame({
        "col1": ["a", "b", "c", "e", "f"],
        "col2": [1.0, 2.0, 3.5, 5.0, 6.0],
        "col3": [1.0, 4.2, 3.0, 5.0, 6.0]
    }, columns=["col1", "col2", "col3"])

# Expected data
df_changed = pd.DataFrame({
        "col1": ["b", "c", "d", "f"],
        "col2": [2.0, 3.5, np.nan, 6.0],  # np.nan: the np.NaN alias was removed in NumPy 2.0
        "col3": [4.2, 3.0, np.nan, 6.0]
    }, columns=["col1", "col2", "col3"])

print(df_old)
print(df_new)
print(df_changed)

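I want the rows that were changed (in col2 or col3), added, or deleted between the old and the new df. In my real data col1 is unique, so it can serve as the index if needed.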

Edit: if I set col1 as the index

df_old.set_index('col1', inplace=True)
df_new.set_index('col1', inplace=True)

I can run

print(df_new.ne(df_old))

       col2   col3
col1
a     False  False
b     False   True
c      True  False
d      True   True
e     False  False
f      True   True

Then I can build a diff df like this

df_diff = df_new.ne(df_old)
df_diff = df_diff[df_diff.col2 | df_diff.col3]

However, I don't know how to relate this mask back to the DataFrames and their data.

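You are not far from the solution. After the set_index and ne operations, call any along the columns (axis=1) to get a Series that is True for every row with at least one True, then reindex df_new with the flagged labels so it contains only the rows you want. Labels that exist only in df_old (deleted rows) come back as NaN.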
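One caveat with ne: NaN compared with NaN counts as "not equal", so a cell that is missing in both frames would be flagged as changed. The sketch below shows a NaN-aware mask. The frames old and new are hypothetical and are assumed to share the same index already; frames with added or deleted rows would need to be reindexed to a common index first.

```python
import numpy as np
import pandas as pd

# Hypothetical frames indexed by a unique key; row "b" has a NaN that did not change.
old = pd.DataFrame({"col2": [1.0, np.nan], "col3": [1.0, 2.0]}, index=["a", "b"])
new = pd.DataFrame({"col2": [1.0, np.nan], "col3": [1.0, 2.0]}, index=["a", "b"])

# Plain ne: NaN != NaN evaluates to True, so row "b" is flagged.
naive = new.ne(old).any(axis=1)

# Treat a cell that is NaN in both frames as unchanged.
both_nan = new.isna() & old.isna()
nan_aware = (new.ne(old) & ~both_nan).any(axis=1)

print(naive.tolist())      # [False, True]
print(nan_aware.tolist())  # [False, False]
```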

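g = df_new.set_index('col1')  # index df_new by col1

# Subtract the frames after indexing both by col1; a row is unchanged when every difference is 0.
# The original expression was cut off: the "== 0" / all(axis=1) part is a reconstruction
# that reproduces the output shown below.
diff = df_old.set_index('col1').sub(g)
g.loc[~(diff == 0).all(axis=1), ['col2', 'col3']].reset_index()
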
  col1  col2  col3
0    b   2.0   4.2
1    c   3.5   3.0
2    f   6.0   6.0
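The output above leaves out the deleted row d. As an alternative, here is a minimal sketch that uses an outer merge with indicator=True instead of subtraction. It is not the method from either answer, and it assumes col1 is unique and that unchanged rows match exactly on every column (exact float equality). The names merged, only_new, and deleted are illustrative.

```python
import pandas as pd

df_old = pd.DataFrame({"col1": ["a", "b", "c", "d", "e"],
                       "col2": [1.0, 2.0, 3.0, 4.0, 5.0],
                       "col3": [1.0, 2.0, 3.0, 4.0, 5.0]})
df_new = pd.DataFrame({"col1": ["a", "b", "c", "e", "f"],
                       "col2": [1.0, 2.0, 3.5, 5.0, 6.0],
                       "col3": [1.0, 4.2, 3.0, 5.0, 6.0]})

# Outer merge on all shared columns: identical rows are marked "both",
# every other row is marked as coming from one side only.
merged = df_old.merge(df_new, how="outer", indicator=True)

# Rows present only in df_new are either changed or added.
only_new = merged.loc[merged["_merge"] == "right_only", ["col1", "col2", "col3"]]

# Keys present only in df_old and absent from df_new were deleted.
only_old_keys = merged.loc[merged["_merge"] == "left_only", "col1"]
deleted = only_old_keys[~only_old_keys.isin(df_new["col1"])]

# Deleted keys get NaN in col2/col3 because the frame below has no such columns.
result = (pd.concat([only_new, pd.DataFrame({"col1": deleted})])
            .sort_values("col1")
            .reset_index(drop=True))
print(result)
```

For the sample data this yields rows b, c, d and f, with NaN values for d.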

df_old = df_old.set_index('col1')
df_new = df_new.set_index('col1')

s = df_new.ne(df_old).any(axis=1) # get True for rows with at least one True
print(s)
# col1
# a    False
# b     True
# c     True
# d     True
# e    False
# f     True
# dtype: bool

df_changed = df_new.reindex(s.index[s]).reset_index()
print(df_changed)
  col1  col2  col3
0    b   2.0   4.2
1    c   3.5   3.0
2    d   NaN   NaN
3    f   6.0   6.0
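The same steps can be wrapped in a small helper that also labels why each row was kept. This is only a sketch: the function name changed_rows and the status column are illustrative additions, not part of the answer, and the key column is assumed to be unique in both frames.

```python
import pandas as pd

def changed_rows(old, new, key):
    """Return rows of `new` that differ from `old`; deleted keys appear as NaN rows."""
    old_i = old.set_index(key)
    new_i = new.set_index(key)

    # True for every label where at least one column differs (ne aligns on the union of labels).
    differs = new_i.ne(old_i).any(axis=1)
    out = new_i.reindex(differs.index[differs])

    # Hypothetical extra column describing the kind of change.
    out["status"] = "changed"
    out.loc[~out.index.isin(old_i.index), "status"] = "added"
    out.loc[~out.index.isin(new_i.index), "status"] = "deleted"
    return out.rename_axis(key).reset_index()

df_old = pd.DataFrame({"col1": ["a", "b", "c", "d", "e"],
                       "col2": [1.0, 2.0, 3.0, 4.0, 5.0],
                       "col3": [1.0, 2.0, 3.0, 4.0, 5.0]})
df_new = pd.DataFrame({"col1": ["a", "b", "c", "e", "f"],
                       "col2": [1.0, 2.0, 3.5, 5.0, 6.0],
                       "col3": [1.0, 4.2, 3.0, 5.0, 6.0]})

result = changed_rows(df_old, df_new, "col1")
print(result)
```

Leaving the inputs untouched (no inplace set_index) means the helper can be called repeatedly on the same frames.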

What have you tried?
Sorry @nidabdella. Please see the edit for what I have so far. I had only just figured most of it out.