Python: compare DataFrames/CSVs and return only the columns that differ, including the key values
I have two CSV files that I compare, and I want to return only the columns with differing values, side by side.
df1
Country 1980 1981 1982 1983 1984
Bermuda 0.00793 0.00687 0.00727 0.00971 0.00752
Canada 9.6947 9.58952 9.20637 9.18989 9.78546
Greenland 0.00791 0.00746 0.00722 0.00505 0.00799
Mexico 3.72819 4.11969 4.33477 4.06414 4.18464
df2
Country 1980 1981 1982 1983 1984
Bermuda 0.77777 0.00687 0.00727 0.00971 0.00752
Canada 9.6947 9.58952 9.20637 9.18989 9.78546
Greenland 0.00791 0.00746 0.00722 0.00505 0.00799
Mexico 3.72819 4.11969 4.33477 4.06414 4.18464
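For anyone who wants to reproduce this without the two files on disk, here is a minimal sketch that loads the tables above from in-memory strings. It assumes the files are comma-separated with the header row shown (the question does not show the delimiter); `CSV1` and `CSV2` are names made up for this example.

```python
import io

import pandas as pd

# Contents of the first table, written as comma-separated text (assumed delimiter).
CSV1 = """Country,1980,1981,1982,1983,1984
Bermuda,0.00793,0.00687,0.00727,0.00971,0.00752
Canada,9.6947,9.58952,9.20637,9.18989,9.78546
Greenland,0.00791,0.00746,0.00722,0.00505,0.00799
Mexico,3.72819,4.11969,4.33477,4.06414,4.18464
"""
# The second table differs from the first only in the Bermuda/1980 cell.
CSV2 = CSV1.replace("Bermuda,0.00793", "Bermuda,0.77777")

# read_csv accepts any file-like object, so StringIO stands in for the files.
df1 = pd.read_csv(io.StringIO(CSV1))
df2 = pd.read_csv(io.StringIO(CSV2))
```

Because `read_csv` is called without `index_col`, both frames get the default 0, 1, 2, ... row index and `Country` stays an ordinary column.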
import pandas as pd
import numpy as np
df1=pd.read_csv('csv1.csv')
df2=pd.read_csv('csv2.csv')
def diff_pd(df1, df2):
    """Identify differences between two pandas DataFrames"""
    assert (df1.columns == df2.columns).all(), \
        "DataFrame column names are different"
    if any(df1.dtypes != df2.dtypes):
        print("Data Types are different, trying to convert")
        df2 = df2.astype(df1.dtypes)
    if df1.equals(df2):
        print("Dataframes are the same")
        return None
    else:
        # np.nan != np.nan evaluates to True, so exclude cells that are NaN in both frames
        diff_mask = (df1 != df2) & ~(df1.isnull() & df2.isnull())
        ne_stacked = diff_mask.stack()
        changed = ne_stacked[ne_stacked]
        changed.index.names = ['Country', 'Column']
        difference_locations = np.where(diff_mask)
        # values at the differing positions in each frame
        changed_from = df1.values[difference_locations]
        changed_to = df2.values[difference_locations]
        result = pd.DataFrame({'From': changed_from, 'To': changed_to},
                              index=changed.index)
        print(result)
        return result

diff_pd(df1, df2)
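My current output is:
                   From       To
Country Column
0       1980    0.00793  0.77777
So instead of the index 0, I would like to get the country name of the row that contains the mismatched value. Here is an example.
I would like my output to be:
                   From       To
Country Column
Bermuda 1980    0.00793  0.77777
Thanks to anyone who can provide a solution.

A slightly shorter approach, renaming along the way: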
def process_df(df):
    # index by country, then stack the year columns into a (Country, Column) MultiIndex
    res = df.set_index('Country').stack()
    res.index.rename('Column', level=1, inplace=True)
    return res

df1 = process_df(df1)
df2 = process_df(df2)
# differing cells; a NaN in both Series counts as equal
mask = (df1 != df2) & ~(df1.isnull() & df2.isnull())
df3 = pd.concat([df1[mask], df2[mask]], axis=1).rename({0:'From', 1:'To'}, axis=1)
df3
                   From       To
Country Column
Bermuda 1980    0.00793  0.77777
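
The index 0 in the question's output comes from the default row index that `read_csv` creates; once `Country` is moved into the index, the same masking idea reports the country name. Below is a minimal sketch of that, not a definitive implementation. `diff_by_key`, `CSV1` and `CSV2` are hypothetical names introduced here, the comma delimiter is an assumption, and both frames are assumed to hold the same countries and columns in the same order.

```python
import io

import numpy as np
import pandas as pd

# Sample data from the question, inlined so the sketch runs without files.
CSV1 = """Country,1980,1981,1982,1983,1984
Bermuda,0.00793,0.00687,0.00727,0.00971,0.00752
Canada,9.6947,9.58952,9.20637,9.18989,9.78546
Greenland,0.00791,0.00746,0.00722,0.00505,0.00799
Mexico,3.72819,4.11969,4.33477,4.06414,4.18464
"""
CSV2 = CSV1.replace("Bermuda,0.00793", "Bermuda,0.77777")


def diff_by_key(left, right, key="Country"):
    """Return a From/To table of differing cells, indexed by (key, Column).

    Hypothetical helper: assumes both frames share the same key values
    and columns, in the same order.
    """
    # Move the key column into the index so it labels the rows.
    a = left.set_index(key)
    b = right.set_index(key)
    # Cells that differ; a NaN in both frames counts as equal.
    mask = (a != b) & ~(a.isna() & b.isna())
    # Row and column positions of every differing cell.
    rows, cols = np.where(mask.to_numpy())
    index = pd.MultiIndex.from_arrays(
        [a.index[rows], a.columns[cols]], names=[key, "Column"]
    )
    return pd.DataFrame(
        {"From": a.to_numpy()[rows, cols], "To": b.to_numpy()[rows, cols]},
        index=index,
    )


df1 = pd.read_csv(io.StringIO(CSV1))
df2 = pd.read_csv(io.StringIO(CSV2))
result = diff_by_key(df1, df2)
print(result)
```

On pandas 1.1 or later, `df1.set_index('Country').compare(df2.set_index('Country'))` is a built-in alternative. It reports the same cells in a wide layout with `self`/`other` sub-columns instead of From/To rows.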