How can pandas DataFrames look identical but fail equals()?


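To confirm that I understand what Pandas df.groupby() and df.reset_index() do, I tried a round trip from a data frame to a grouped version of the same data and back. After the round trip, the columns and rows have to be sorted again, because groupby() affects the row order and reset_index() affects the column order. After two quick operations to restore the column and index order, the data frames look identical: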

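The same list of column names
The same dtype for every column
Corresponding index values are strictly equal
Corresponding data values are strictly equal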

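Yet after all of these checks succeed, df1.equals(df5) returns the astonishing value False.

What difference between these data frames is equals() uncovering that I have not yet figured out how to check for myself?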

Test code:

csv_text = """\
Title,Year,Director
North by Northwest,1959,Alfred Hitchcock
Notorious,1946,Alfred Hitchcock
The Philadelphia Story,1940,George Cukor
To Catch a Thief,1955,Alfred Hitchcock
His Girl Friday,1940,Howard Hawks
"""

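import io

import pandas as pd

# Parse the CSV text defined above rather than an external sample.csv file
df1 = pd.read_csv(io.StringIO(csv_text))
# Lower-case the column names
df1.columns = df1.columns.str.lower()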
print(df1)

df2 = df1.groupby(['director', df1.index]).first()
df3 = df2.reset_index('director')
df4 = df3[['title', 'year', 'director']]
df5 = df4.sort_index()
print(df5)

print()
print(repr(df1.columns))
print(repr(df5.columns))
print()
print(df1.dtypes)
print(df5.dtypes)
print()
print(df1 == df5)
print()
print(df1.index == df5.index)
print()
print(df1.equals(df5))

The output I get when I run the script is:

                    title  year          director
0      North by Northwest  1959  Alfred Hitchcock
1               Notorious  1946  Alfred Hitchcock
2  The Philadelphia Story  1940      George Cukor
3        To Catch a Thief  1955  Alfred Hitchcock
4         His Girl Friday  1940      Howard Hawks
                    title  year          director
0      North by Northwest  1959  Alfred Hitchcock
1               Notorious  1946  Alfred Hitchcock
2  The Philadelphia Story  1940      George Cukor
3        To Catch a Thief  1955  Alfred Hitchcock
4         His Girl Friday  1940      Howard Hawks

Index(['title', 'year', 'director'], dtype='object')
Index(['title', 'year', 'director'], dtype='object')

title       object
year         int64
director    object
dtype: object
title       object
year         int64
director    object
dtype: object

  title  year director
0  True  True     True
1  True  True     True
2  True  True     True
3  True  True     True
4  True  True     True

[ True  True  True  True  True]

False

Thanks for any help!

This looks like a bug to me, but it could simply be that I'm misunderstanding something. The blocks are listed in a different order:

>>> df1._data
BlockManager
Items: Index(['title', 'year', 'director'], dtype='object')
Axis 1: Int64Index([0, 1, 2, 3, 4], dtype='int64')
IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64
ObjectBlock: slice(0, 4, 2), 2 x 5, dtype: object
>>> df5._data
BlockManager
Items: Index(['title', 'year', 'director'], dtype='object')
Axis 1: Int64Index([0, 1, 2, 3, 4], dtype='int64')
ObjectBlock: slice(0, 4, 2), 2 x 5, dtype: object
IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64

In core/internals.py, we have the BlockManager method:

def equals(self, other):
    self_axes, other_axes = self.axes, other.axes
    if len(self_axes) != len(other_axes):
        return False
    if not all (ax1.equals(ax2) for ax1, ax2 in zip(self_axes, other_axes)):
        return False
    self._consolidate_inplace()
    other._consolidate_inplace()
    return all(block.equals(oblock) for block, oblock in
               zip(self.blocks, other.blocks))

The final all assumes that the blocks in self and other correspond to each other position by position. But if we add some print calls just before it, we see:

>>> df1.equals(df5)
blocks self: (IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64, ObjectBlock: slice(0, 4, 2), 2 x 5, dtype: object)
blocks other: (ObjectBlock: slice(0, 4, 2), 2 x 5, dtype: object, IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64)
False

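So we're comparing the wrong things: the IntBlock of one frame against the ObjectBlock of the other. The reason I'm not sure whether this is a bug is that I'm not sure whether equals is supposed to be this picky. If it is, I think there is at least a documentation bug, because equals should then say loudly that it is not meant for what you would expect from its name and docstring.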
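To make the failure mode concrete, here is a minimal sketch in plain Python. It does not use pandas internals: Block, equals_positional and equals_canonical are hypothetical names invented for this illustration, and sorting the blocks by dtype and column positions is just one plausible way to make the comparison independent of storage order, not necessarily what pandas does.

```python
from typing import List, NamedTuple, Tuple


class Block(NamedTuple):
    """Toy stand-in for a pandas block: one dtype, some column positions."""
    dtype: str
    columns: Tuple[int, ...]
    values: Tuple[tuple, ...]


def equals_positional(left: List[Block], right: List[Block]) -> bool:
    """Pair blocks by list position, as the final all(...) above does
    (with an added length check)."""
    return len(left) == len(right) and all(a == b for a, b in zip(left, right))


def equals_canonical(left: List[Block], right: List[Block]) -> bool:
    """Sort blocks by (dtype, columns) first so storage order is irrelevant."""
    def key(block: Block):
        return (block.dtype, block.columns)
    return equals_positional(sorted(left, key=key), sorted(right, key=key))


# Same data, mirroring the layout printed above: one int64 block for
# column 1 and one object block for columns 0 and 2.
int_block = Block("int64", (1,), ((1959, 1946, 1940, 1955, 1940),))
obj_block = Block("object", (0, 2),
                  (("a", "b", "c", "d", "e"), ("v", "w", "x", "y", "z")))

stored_one_way = [int_block, obj_block]    # order seen in df1
stored_other_way = [obj_block, int_block]  # order seen in df5

print(equals_positional(stored_one_way, stored_other_way))  # False
print(equals_canonical(stored_one_way, stored_other_way))   # True
```

The positional version reports a difference even though both lists hold exactly the same blocks, which is the behaviour observed with df1.equals(df5).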

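Another way to check is to use pandas.util.testing.assert_frame_equal. It may give you a report of what pandas thinks is different.

Good idea! I tried it. If assert_frame_equal(df1, df5) is added as the last line of the script above, no exception is raised. So it seems to consider the frames equal, even though .equals() does not.

Can you file an issue about this?

Originally reported here: ; block ordering is an implementation detail at the moment. There are some options at the end of the issue for making this consistent.

Thanks, everyone! I have attached this sample code to the issue to let them know that this is not specific to HDF but can affect anyone.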
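For anyone who wants a check that does not depend on block order, the sketch below repeats the round trip and compares the frames in two ways. It assumes a pandas version that exposes assert_frame_equal under pandas.testing (the newer home of the pandas.util.testing function mentioned above). frames_equal_by_column is a hypothetical helper written for this example, not a pandas function. Whether df1.equals(df5) itself still returns False depends on the pandas version, so the sketch does not rely on it.

```python
import io

import pandas as pd
from pandas.testing import assert_frame_equal

CSV_TEXT = """\
Title,Year,Director
North by Northwest,1959,Alfred Hitchcock
Notorious,1946,Alfred Hitchcock
The Philadelphia Story,1940,George Cukor
To Catch a Thief,1955,Alfred Hitchcock
His Girl Friday,1940,Howard Hawks
"""


def frames_equal_by_column(left, right):
    """Hypothetical helper: compare labels first, then each column alone."""
    if not left.columns.equals(right.columns):
        return False
    if not left.index.equals(right.index):
        return False
    # Each column is compared as its own Series, so the order in which
    # the frame stores its blocks cannot influence the result.
    return all(left[name].equals(right[name]) for name in left.columns)


df1 = pd.read_csv(io.StringIO(CSV_TEXT))
df1.columns = df1.columns.str.lower()

# Same round trip as in the question, written as one chain.
df5 = (
    df1.groupby(["director", df1.index]).first()
    .reset_index("director")[["title", "year", "director"]]
    .sort_index()
)

assert_frame_equal(df1, df5)  # raises AssertionError if the frames differ
print(frames_equal_by_column(df1, df5))
```

Comparing column by column is slower than a block-wise comparison on wide frames, but it only relies on what is visible from the outside: labels, dtypes and values.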