Sorting csv/dataframes with Python and a DataCompy report

python csv sas

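I am trying to compare two unsorted csv files and want a report similar to SAS Proc compare. I use datacompy and sort the dataframes before comparing, but the datacompy report says there are no rows in common.

Please let me know what I am missing in the snippet below.

I have tried sorting and reindexing, and I have also tried on_index=True instead of join_columns.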

from io import StringIO
import pandas as pd
import datacompy

data1 = """name,age,loc
ABC,123,LON
EFG,456,MAA
"""

data2 = """name,age,loc
EFG,457,MAA
ABC,124,LON
"""

df1 = pd.read_csv(StringIO(data1))
df2 = pd.read_csv(StringIO(data2))

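# sort_values returns a new DataFrame, so assign the result back;
# reset_index(drop=True) renumbers the sorted rows from 0.
df1 = df1.sort_values(by=['name','age','loc']).reset_index(drop=True)
df2 = df2.sort_values(by=['name','age','loc']).reset_index(drop=True)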

compare = datacompy.Compare(
    df1,
    df2,
    join_columns=['name','age','loc'],  #You can also specify a list of columns
    abs_tol=0.0001,
    rel_tol=0,
    df1_name='original',
    df2_name='new')
compare.matches()

print(compare.report())

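The expected result is

data1

name,age,loc
ABC,123,LON
EFG,456,MAA

data2

name,age,loc
ABC,123,LON
EFG,457,MAA

The report should show something like a max diff of 1 for the age column, with all other columns matching.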

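You are joining on all three columns, when you should join only on name. Because age is one of the join keys, rows whose ages differ never pair up, which is why the report finds no rows in common. Change the join to the following: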

compare = datacompy.Compare(
    df1,
    df2,
    join_columns=['name'],  #You can also specify a list of columns
    abs_tol=0.0001,
    rel_tol=0,
    df1_name='original',
    df2_name='new')
compare.matches()

print(compare.report())

This produces the following output:

DataFrame Summary
-----------------

  DataFrame  Columns  Rows
0  original        3     2
1       new        3     2

Column Summary
--------------

Number of columns in common: 3
Number of columns in original but not in new: 0
Number of columns in new but not in original: 0

Row Summary
-----------

Matched on: name
Any duplicates on match values: No
Absolute Tolerance: 0.0001
Relative Tolerance: 0
Number of rows in common: 2
Number of rows in original but not in new: 0
Number of rows in new but not in original: 0

Number of rows with some compared columns unequal: 2
Number of rows with all compared columns equal: 0

Column Comparison
-----------------

Number of columns compared with some values unequal: 1
Number of columns compared with all values equal: 2
Total number of values which compare unequal: 2

Columns with Unequal Values or Types
------------------------------------

  Column original dtype new dtype  # Unequal  Max Diff  # Null Diff
0    age          int64     int64          2       1.0            0

Sample Rows with Unequal Values
-------------------------------

  name  age (original)  age (new)
1  EFG             456        457
0  ABC             123        124
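
To see why the choice of join columns matters, the pairing step can be imitated with a plain pandas merge. This is only an illustrative sketch that reuses the data from the question; it is not meant to describe how datacompy is implemented, and the names all_keys, by_name and max_age_diff are made up for the example.

```python
from io import StringIO

import pandas as pd

df1 = pd.read_csv(StringIO("name,age,loc\nABC,123,LON\nEFG,456,MAA\n"))
df2 = pd.read_csv(StringIO("name,age,loc\nEFG,457,MAA\nABC,124,LON\n"))

# Every column is a key: a row pairs up only if name, age AND loc are all equal.
all_keys = df1.merge(df2, on=["name", "age", "loc"], how="inner")

# Only the identifier is a key: rows pair up by name, leaving age and loc to compare.
by_name = df1.merge(df2, on="name", how="inner", suffixes=("_original", "_new"))
max_age_diff = (by_name["age_original"] - by_name["age_new"]).abs().max()

print(len(all_keys))   # rows in common when all columns are keys
print(len(by_name))    # rows in common when only name is the key
print(max_age_diff)    # largest absolute difference in age
```

Row order plays no part in either merge, so sorting the dataframes beforehand should not change which rows are paired.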

Thanks moe_95, it works well when joining on a single column. Can I join on multiple columns?

No problem, you can specify multiple columns in join_columns, i.e. join_columns=['name','age'] will work. You just cannot join on all of the columns in each df. It would also be great if you could accept this as the correct answer if it worked for you!