Python pandas: hierarchical index membership behavior issue

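I can't make sense of some strange behavior from the hierarchical index of a DataFrame. In short, what I want to do is simple: I am trying to find out whether a tuple is in the index of a DataFrame.

This is the behavior I expect: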

import datetime as dt
import pandas as pd
from numpy.random import randn

# Two-level index: a date level and a 'one'/'two' label level.
arrays = [[dt.date(2014,6,4), dt.date(2014,6,4), dt.date(2014,6,21), dt.date(2014,6,21),
           dt.date(2014,6,13), dt.date(2014,6,13), dt.date(2014,6,7), dt.date(2014,6,7)],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
s = pd.Series(randn(8), index=index)
# Membership tests: only the first tuple is actually in the index.
print((dt.date(2014,6,4),'one') in s.index)
print((dt.date(2014,6,4),'fifty') in s.index)
print((dt.date(2014,1,1),'one') in s.index)

yields:

True
False
False

This is what I am facing:

import datetime as dt
import numpy as np
import pandas as pd

# Non-unique index: all five rows carry the same (day, hour) pair.
# Note: the labels keyword was renamed to codes in pandas 0.24.
WeirdIdx = pd.MultiIndex(
    levels=[[dt.date(2014,7,4), dt.date(2014,7,5), dt.date(2014,7,6),
             dt.date(2014,7,7), dt.date(2014,7,8), dt.date(2014,7,9)],
            list(range(24))],
    labels=[[0, 0, 0, 0, 0], [8, 8, 8, 8, 8]],
    names=[u'day', u'hour'])
frame = pd.DataFrame({'a': np.random.normal(0, 1, 5)}, index=WeirdIdx)
print(type(frame))
print(frame.index)
print(frame)

yields:

<class 'pandas.core.frame.DataFrame'>
day         hour
2014-07-04  8   
            8   
            8   
            8   
            8   
                        a
day        hour          
2014-07-04 8     0.335840
           8     0.801193
           8    -0.092492
           8     0.610675
           8    -0.044947

while the tuple membership tests on frame.index yield:

True
True
True



Finally:

frame.index

yields:


MultiIndex(levels=[[2014-07-04, 2014-07-05, 2014-07-06, 2014-07-07, 2014-07-08, 2014-07-09], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]],
       labels=[[0, 0, 0, 0, 0], [8, 8, 8, 8, 8]],
       names=[u'day', u'hour'])

One issue is that (dt.date(2014,8,4), 1) in frame.index should be False.
What am I missing here?
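The problem seems to be that your MultiIndex is non-unique. Pandas behaves strangely in this situation, and I would consider it a bug. The problem has nothing to do with dates, or even with DataFrames; it is purely a MultiIndex issue. Here is a simpler example: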
import pandas

WeirdIdx = pandas.MultiIndex(
    levels=[[0], [1]],
    labels=[[0, 0], [0, 0]],  # both rows are the same (0, 1) pair
    names=[u'X', u'Y']
)

Any tuple of the right size and type is then reported as being contained in the MultiIndex:
>>> (0, 0) in WeirdIdx
True
>>> (1, 0) in WeirdIdx
True
>>> (100, 0) in WeirdIdx
True
>>> (100, 100) in WeirdIdx
True

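Looking at the source code, I can see how these results arise: if the MultiIndex is non-unique, indexing falls back to slicing, and slicing always succeeds even if the value is not present (it just returns a zero-length slice). But I don't understand why things are implemented this way.
I couldn't find a bug on the pandas bug tracker that matches this, although there are various bugs related to duplicated MultiIndexes. Some comments suggest this kind of issue was supposed to be fixed in pandas 0.14, but I don't know whether that fix landed while this particular bug remains. My impression from the various bug reports is that MultiIndexes basically don't work unless they are unique. I would suggest opening a bug report and/or asking on the pandas mailing list.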
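For the membership question itself, one possible workaround is to compare whole tuples against the index contents instead of relying on the `in` operator. The following is only a minimal sketch: the helper name `tuple_in_index` is hypothetical, and the index is built with `from_tuples` so the example does not depend on the version-specific `labels`/`codes` keyword.

```python
import pandas as pd

# Non-unique MultiIndex: both rows are the same (0, 1) pair.
idx = pd.MultiIndex.from_tuples([(0, 1), (0, 1)], names=["X", "Y"])

def tuple_in_index(index, key):
    """Hypothetical helper: test membership by comparing full tuples.

    Building a set costs O(n) per call, so cache the set if many
    lookups are needed against the same index.
    """
    return key in set(index.tolist())

present = tuple_in_index(idx, (0, 1))
absent = tuple_in_index(idx, (100, 100))
print(idx.is_unique, present, absent)
```

Checking `idx.is_unique` first is a cheap way to detect whether an index is in the situation described above at all.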
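You gave a case that works, but not one that doesn't. Can you provide an example DataFrame/Series that shows the unwanted behavior you are seeing?
The later part, under "This is what I am facing": the print for (dt.date(2014,1,4), 8) should not be True.
Using datetime.date "works", but it can't take advantage of any of pandas's features. Use datetime.datetime (or, better yet, use date_range to create the dates).
@tipanverella: Yes, but what I mean is that the example data you provided doesn't match the example you say doesn't work. What is "the given dataframe df"?
Good point. But keep in mind that the first example works as expected. It is the second example where the problem shows up.
I am running pandas.__version__ 0.14, so I guess it hasn't been fixed. How can I reference you on github?
@tipanverella: You can link to this question.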
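The date_range suggestion from the comments could look roughly like the following. This is a sketch under assumptions, not the asker's code: it builds a unique (day, hour) index with `MultiIndex.from_product`, so the membership results here say nothing about the non-unique case discussed in the answer.

```python
import pandas as pd

# Six consecutive days as pandas timestamps instead of datetime.date objects.
days = pd.date_range("2014-07-04", periods=6, freq="D")

# Every (day, hour) combination appears exactly once, so the index is unique.
index = pd.MultiIndex.from_product([days, range(24)], names=["day", "hour"])

hit = (pd.Timestamp("2014-07-04"), 8) in index
miss = (pd.Timestamp("2014-08-04"), 1) in index
print(index.is_unique, hit, miss)
```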