Sorting and comparing csv files/dataframes with Python and a DataCompy report
I am trying to compare two csv files (unsorted) and would like a report similar to SAS Proc Compare. I am using datacompy and sorting the dataframes before comparing, but the datacompy report says "no rows in common".
Please let me know what I am missing in the snippet below.
I have tried sorting and reindexing, and I have also tried on_index=True instead of join_columns.
from io import StringIO
import pandas as pd
import datacompy
data1 = """name,age,loc
ABC,123,LON
EFG,456,MAA
"""
data2 = """name,age,loc
EFG,457,MAA
ABC,124,LON
"""
df1 = pd.read_csv(StringIO(data1))
df2 = pd.read_csv(StringIO(data2))
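# sort_values returns a new frame, so the result has to be assigned back
df1 = df1.sort_values(by=['name','age','loc']).reset_index(drop=True)
df2 = df2.sort_values(by=['name','age','loc']).reset_index(drop=True)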
compare = datacompy.Compare(
    df1,
    df2,
    join_columns=['name','age','loc'],  # You can also specify a list of columns
    abs_tol=0.0001,
    rel_tol=0,
    df1_name='original',
    df2_name='new')
compare.matches()
print(compare.report())
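The expected result is
data1
name,age,loc
ABC,123,LON
EFG,456,MAA
data2
name,age,loc
ABC,123,LON
EFG,457,MAA
The report should say something like: the age column has a max difference of 1, and all other columns match.

You are joining on all three columns, and you should only be joining on name. Change the join to the following: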
compare = datacompy.Compare(
    df1,
    df2,
    join_columns=['name'],  # You can also specify a list of columns
    abs_tol=0.0001,
    rel_tol=0,
    df1_name='original',
    df2_name='new')
compare.matches()
print(compare.report())
This will produce the following output:
DataFrame Summary
-----------------
DataFrame Columns Rows
0 original 3 2
1 new 3 2
Column Summary
--------------
Number of columns in common: 3
Number of columns in original but not in new: 0
Number of columns in new but not in original: 0
Row Summary
-----------
Matched on: name
Any duplicates on match values: No
Absolute Tolerance: 0.0001
Relative Tolerance: 0
Number of rows in common: 2
Number of rows in original but not in new: 0
Number of rows in new but not in original: 0
Number of rows with some compared columns unequal: 2
Number of rows with all compared columns equal: 0
Column Comparison
-----------------
Number of columns compared with some values unequal: 1
Number of columns compared with all values equal: 2
Total number of values which compare unequal: 2
Columns with Unequal Values or Types
------------------------------------
Column original dtype new dtype # Unequal Max Diff # Null Diff
0 age int64 int64 2 1.0 0
Sample Rows with Unequal Values
-------------------------------
name age (original) age (new)
1 EFG 456 457
0 ABC 123 124
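
To see why joining on all three columns reports no rows in common, it may help to reproduce the row matching by hand with plain pandas. This is only a rough sketch of the idea, not datacompy's actual implementation: it assumes an outer merge on the key columns approximates how rows are paired up, and the helper name count_common_rows is made up for illustration.

```python
from io import StringIO

import pandas as pd

data1 = """name,age,loc
ABC,123,LON
EFG,456,MAA
"""
data2 = """name,age,loc
EFG,457,MAA
ABC,124,LON
"""
df1 = pd.read_csv(StringIO(data1))
df2 = pd.read_csv(StringIO(data2))


def count_common_rows(left, right, keys):
    """Hypothetical helper: count rows whose key values appear in both frames."""
    merged = left.merge(
        right,
        on=keys,
        how="outer",
        indicator=True,  # adds a _merge column: left_only / right_only / both
        suffixes=("_original", "_new"),
    )
    return int((merged["_merge"] == "both").sum())


# age is part of the key here, and every age differs, so nothing pairs up
all_cols = count_common_rows(df1, df2, ["name", "age", "loc"])

# name alone identifies a row, so both rows pair up and age can be compared
name_only = count_common_rows(df1, df2, ["name"])

print(all_cols, name_only)  # 0 2
```

Note that row order plays no part in the result: the rows are paired by key value, so sorting the frames beforehand is not needed for the match.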
Thanks moe_95, it works well with a single-column join. Can I join on multiple columns?

No problem, you can specify multiple columns in join_columns, i.e. join_columns=['name','age'] would work. You just can't join on all of the columns in each df. If you could accept this as the correct answer if it worked for you, that would be great too!
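
As a small illustration of a multi-column key, the sketch below pairs rows on two columns and compares the remaining one, again with plain pandas rather than datacompy itself. It assumes that name and loc together identify a row in the sample data; that choice of key is an assumption made for the example, not something stated in the thread.

```python
from io import StringIO

import pandas as pd

data1 = """name,age,loc
ABC,123,LON
EFG,456,MAA
"""
data2 = """name,age,loc
EFG,457,MAA
ABC,124,LON
"""
df1 = pd.read_csv(StringIO(data1))
df2 = pd.read_csv(StringIO(data2))

# Assumed composite key: (name, loc). The non-key column age is what gets compared.
keys = ["name", "loc"]
merged = df1.merge(df2, on=keys, how="inner", suffixes=("_original", "_new"))

# Largest absolute difference in the compared column across the paired rows
max_age_diff = int((merged["age_original"] - merged["age_new"]).abs().max())

print(len(merged), max_age_diff)  # 2 1
```

The same list could be passed as join_columns to datacompy.Compare. Whichever columns go into the key, at least one column should be left out of it, otherwise rows that differ in any value will not be paired and there is nothing left to compare.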