Python: strange behaviour when a function tries to append a row to each group of a groupby object

Tags: python, python-2.7, pandas, dataframe, pandas-groupby

This question is about a function that behaves unexpectedly when applied to two different dataframes (more precisely, groupby objects). Either I'm missing something obvious, or there is a bug in pandas.


I wrote the function below to append a row to each group of a groupby object. There is another question related to this function.

def myfunction(g, now):
    '''This function appends a row to each group and populates the DTM column
       value of that row with the current timestamp. Other columns of the new
       row will have NaNs.

       g: one group of a groupby object (a dataframe)
       now: current timestamp

       returns a dataframe that has the current timestamp appended in the DTM column
    '''
    g.loc[g.shape[0], 'DTM'] = now  # append the current timestamp to the DTM column of this group
    return g

We'll run two tests of this function.


Test 1

It works correctly on the dataframe a from the linked question (demonstrated in the question above). For extra clarity, here is a slightly augmented replay (mostly copy-pasted from the linked question).
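The construction of a is not reproduced in this post. A minimal sketch that builds a dataframe with the same index layout (the random values and the variable now below are assumptions, so the numbers will not match the output exactly):

import numpy as np
import pandas as pd

# Hypothetical reconstruction of a: the index layout is read off the output below,
# the values are random and will differ from run to run
tuples = [('bar', 'one'), ('bar', 'one'), ('bar', 'two'),
          ('baz', 'one'), ('baz', 'two'),
          ('foo', 'one'), ('foo', 'two'), ('foo', 'two'),
          ('qux', 'one'), ('qux', 'two')]
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
a = pd.DataFrame(np.random.rand(len(tuples)), index=index)  # single column labelled 0

now = pd.Timestamp.now()  # assumed definition of the timestamp passed to myfunction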

Applying the function

a = a.reset_index().groupby(['first', 'second']).apply(lambda x: myfunction(x, now))
It appended a new row to each group, and a new DTM column was added because it was not in the original a. A group is a first-second pair.

a
Out[52]: 
               first second         0                     DTM
first second                                                 
bar   one    0   bar    one  0.134379                     NaT
             1   bar    one  0.967928                     NaT
             2   NaN    NaN       NaN 2017-07-03 18:56:33.183
      two    2   bar    two  0.067502                     NaT
             1   NaN    NaN       NaN 2017-07-03 18:56:33.183
baz   one    3   baz    one  0.182887                     NaT
             1   NaN    NaN       NaN 2017-07-03 18:56:33.183
      two    4   baz    two  0.926932                     NaT
             1   NaN    NaN       NaN 2017-07-03 18:56:33.183
foo   one    5   foo    one  0.806225                     NaT
             1   NaN    NaN       NaN 2017-07-03 18:56:33.183
      two    6   foo    two  0.718322                     NaT
             7   foo    two  0.932114                     NaT
             2   NaN    NaN       NaN 2017-07-03 18:56:33.183
qux   one    8   qux    one  0.772494                     NaT
             1   NaN    NaN       NaN 2017-07-03 18:56:33.183
      two    9   qux    two  0.141510                     NaT
             1   NaN    NaN       NaN 2017-07-03 18:56:33.183
Some refinement

a = a.reset_index(level = 2).drop(['level_2', 'first', 'second'], axis = 1).loc[:, [0, 'DTM']]

This gives the final a as

a
Out[62]: 
                     0                     DTM
first second                                  
bar   one     0.371683                     NaT
      one     0.327870                     NaT
      one          NaN 2017-07-03 18:56:33.183
      two     0.048794                     NaT
      two          NaN 2017-07-03 18:56:33.183
baz   one     0.462747                     NaT
      one          NaN 2017-07-03 18:56:33.183
      two     0.758674                     NaT
      two          NaN 2017-07-03 18:56:33.183
foo   one     0.238607                     NaT
      one          NaN 2017-07-03 18:56:33.183
      two     0.156104                     NaT
      two     0.594270                     NaT
      two          NaN 2017-07-03 18:56:33.183
qux   one     0.091088                     NaT
      one          NaN 2017-07-03 18:56:33.183
      two     0.795864                     NaT
      two          NaN 2017-07-03 18:56:33.183
So far so good; this is the expected behaviour. A new row has been appended for every first-second pair, with its DTM column populated with the current timestamp.


Test 2

Surprisingly, I could not reproduce this behaviour with the dataframe df below. A group is an ID-SEQ combination.

df can be reproduced as follows:

1. Copy the following

    C1  572  5/9/2017 10:13  PE
    C1  572  5/9/2017 12:24  OK
    C1  579  5/9/2017 10:19  PE
    C1  579  5/9/2017 13:25  OK
    C1  587  5/9/2017 10:20  PE
    C1  587  5/9/2017 12:25  OK
    C1  590  5/9/2017 10:21  PE
    C1  590  5/9/2017 13:09  OK
    C1  604  5/9/2017 10:38  PE
    C1  604  5/9/2017 12:32  OK
    C1  609  5/9/2017 10:39  PE
    C1  609  5/9/2017 13:29  OK
    C1  613  5/9/2017 10:39  PE
    C1  613  5/9/2017 13:08  OK
    C1  618  5/9/2017 10:40  PE
    C1  618  5/9/2017 13:33  OK
    C1  636  5/9/2017 10:54  PE
    C1  636  5/9/2017 13:36  OK
    C1  642  5/9/2017 10:55  PE
    C1  642  5/9/2017 13:35  OK
    C1  643  5/9/2017 10:56  PE
    C1  643  5/9/2017 13:34  OK
    C1  656  5/9/2017 10:55  PE
    C1  656  5/9/2017 13:36  OK
    C2  86  9/5/2016 19:45   PE
    C2  86  9/6/2016 11:55   OK
    C3  10  4/17/2017 12:23  PE
    C3  10  4/17/2017 14:51  OK
    C4  38  3/25/2017 10:35  PE
    C4  38  3/25/2017 10:51  OK
2. Then execute these

  df = pd.read_clipboard(sep = '[ ]{2,}')
  df.columns = ['ID', 'SEQ', 'DTM', 'STATUS']
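read_clipboard depends on what happens to be on the clipboard, so as a self-contained alternative (my sketch, not part of the original question) the same frame can be built from a string with read_csv; only the first few rows of the table are repeated here:

from io import StringIO
import pandas as pd

# Subset of the table from step 1; paste the remaining rows in the same way
raw = """C1  572  5/9/2017 10:13  PE
C1  572  5/9/2017 12:24  OK
C1  579  5/9/2017 10:19  PE
C1  579  5/9/2017 13:25  OK
C1  587  5/9/2017 10:20  PE
C1  587  5/9/2017 12:25  OK
C2  86  9/5/2016 19:45   PE
C2  86  9/6/2016 11:55   OK"""

df = pd.read_csv(StringIO(raw), sep='[ ]{2,}', engine='python',
                 header=None, names=['ID', 'SEQ', 'DTM', 'STATUS'])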

Setting the MultiIndex

d = df.set_index(['ID', 'SEQ', 'DTM']) # I have three index levels this time in the original dataframe
This is what d looks like:

d
Out[40]: 
                       STATUS
ID SEQ DTM                   
C1 572 5/9/2017 10:13      PE
       5/9/2017 12:24      OK
   579 5/9/2017 10:19      PE
       5/9/2017 13:25      OK
   587 5/9/2017 10:20      PE
       5/9/2017 12:25      OK
   590 5/9/2017 10:21      PE
       5/9/2017 13:09      OK
   604 5/9/2017 10:38      PE
       5/9/2017 12:32      OK
   609 5/9/2017 10:39      PE
       5/9/2017 13:29      OK
   613 5/9/2017 10:39      PE
       5/9/2017 13:08      OK
   618 5/9/2017 10:40      PE
       5/9/2017 13:33      OK
   636 5/9/2017 10:54      PE
       5/9/2017 13:36      OK
   642 5/9/2017 10:55      PE
       5/9/2017 13:35      OK
   643 5/9/2017 10:56      PE
       5/9/2017 13:34      OK
   656 5/9/2017 10:55      PE
       5/9/2017 13:36      OK
C2 86  9/5/2016 19:45      PE
       9/6/2016 11:55      OK
C3 10  4/17/2017 12:23     PE
       4/17/2017 14:51     OK
C4 38  3/25/2017 10:35     PE
       3/25/2017 10:51     OK
Applying the function

dd = d.reset_index().groupby(['ID', 'SEQ']).apply(lambda x: myfunction(x, now)) # a group is a unique combination of ID-SEQ pairs
This returns dd (note its fourth row).
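The full dd is not reproduced in this post. As an illustrative check (my addition, assuming dd was built as above), the rows that myfunction appended can be listed by looking for NaN in the original columns; any group missing from this listing did not receive a new row:

# appended rows have NaN in every original column, including STATUS
print(dd[dd['STATUS'].isnull()])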

Some refinement

ddd = dd.reset_index(level = 2).drop(['level_2', 'ID', 'SEQ'], axis = 1).loc[:, ['DTM', 'STATUS']]

ddd
Out[39]: 
                               DTM STATUS
ID SEQ                                   
C1 572              5/9/2017 10:13     PE
   572              5/9/2017 12:24     OK
   572  2017-07-03 18:56:33.183000    NaN
   579  2017-07-03 18:56:33.183000     PE
   579              5/9/2017 13:25     OK
   587              5/9/2017 10:20     PE
   587              5/9/2017 12:25     OK
   587  2017-07-03 18:56:33.183000    NaN
   590              5/9/2017 10:21     PE
   590              5/9/2017 13:09     OK
   590  2017-07-03 18:56:33.183000    NaN
   604              5/9/2017 10:38     PE
   604              5/9/2017 12:32     OK
   604  2017-07-03 18:56:33.183000    NaN
   609              5/9/2017 10:39     PE
   609              5/9/2017 13:29     OK
   609  2017-07-03 18:56:33.183000    NaN
   613              5/9/2017 10:39     PE
   613              5/9/2017 13:08     OK
   613  2017-07-03 18:56:33.183000    NaN
   618              5/9/2017 10:40     PE
   618              5/9/2017 13:33     OK
   618  2017-07-03 18:56:33.183000    NaN
   636              5/9/2017 10:54     PE
   636              5/9/2017 13:36     OK
   636  2017-07-03 18:56:33.183000    NaN
   642              5/9/2017 10:55     PE
   642              5/9/2017 13:35     OK
   642  2017-07-03 18:56:33.183000    NaN
   643              5/9/2017 10:56     PE
   643              5/9/2017 13:34     OK
   643  2017-07-03 18:56:33.183000    NaN
   656              5/9/2017 10:55     PE
   656              5/9/2017 13:36     OK
   656  2017-07-03 18:56:33.183000    NaN
C2 86               9/5/2016 19:45     PE
   86               9/6/2016 11:55     OK
   86   2017-07-03 18:56:33.183000    NaN
C3 10              4/17/2017 12:23     PE
   10              4/17/2017 14:51     OK
   10   2017-07-03 18:56:33.183000    NaN
C4 38              3/25/2017 10:35     PE
   38              3/25/2017 10:51     OK
   38   2017-07-03 18:56:33.183000    NaN
The problem

A new row containing the current timestamp has been appended to every ID-SEQ group except the C1-579 group! (See the fourth row of dd and ddd.)


Questions

  • What causes this behaviour?
  • What is the additional index level introduced in dd?


Answer

After a lot of debugging, the problem was found: the repeated number in level 3 is the issue. In the last sample it is the shape of the group, 2, but that value already exists as a row label within the group, so no new row is added; the existing row is overwritten instead:

                ID    SEQ                        DTM STATUS
    ID SEQ                                                 
    C1 572 0    C1  572.0 2017-05-09 10:13:00.000000     PE
           1    C1  572.0 2017-05-09 12:24:00.000000     OK
           2   NaN    NaN 2017-07-06 08:46:02.341472    NaN
       579 2    C1  579.0 2017-07-06 08:46:02.341472     PE <- overwritten values in row
           3    C1  579.0 2017-05-09 13:25:00.000000     OK
       587 4    C1  587.0 2017-05-09 10:20:00.000000     PE
           5    C1  587.0 2017-05-09 12:25:00.000000     OK
           2   NaN    NaN 2017-07-06 08:46:02.341472    NaN
    
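The mechanism is easy to see with plain .loc assignment on a toy frame (a standalone sketch, not the question's data): assigning to a label that already exists in the index overwrites that row, while assigning to a missing label appends one (setting with enlargement).

import pandas as pd

# Like group C1-579: it keeps its original integer labels 2 and 3 after reset_index(),
# and g.shape[0] == 2 collides with the existing label 2 -> that row is overwritten
g = pd.DataFrame({'STATUS': ['PE', 'OK']}, index=[2, 3])
g.loc[g.shape[0], 'DTM'] = 'now'
print(g)       # label 2 gets DTM='now', no new row appears

# Like group C1-572: labels 0 and 1, so label 2 is new -> a row is appended
h = pd.DataFrame({'STATUS': ['PE', 'OK']}, index=[0, 1])
h.loc[h.shape[0], 'DTM'] = 'now'
print(h)       # three rows: 0, 1 and the new label 2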
The same problem with a:

    print (a)
                   first second         0                        DTM
    first second                                                    
    bar   one    0   bar    one  0.366258                        NaT
                 1   NaN    NaN       NaN 2017-07-06 08:47:55.610671
          two    1   bar    two  0.583205                        NaT
             2   bar    two  0.159388 2017-07-06 08:47:55.610671 <- overwritten
    baz   one    3   baz    one  0.598198                        NaT
                 1   NaN    NaN       NaN 2017-07-06 08:47:55.610671
          two    4   baz    two  0.274027                        NaT
    
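A possible workaround (my sketch, not part of the original answer) is to write the new row under a label that can never collide with the labels already present in the group, for example one past the group's largest original label, so that .loc always enlarges rather than overwrites:

def myfunction_fixed(g, now):
    # g keeps its labels from the original RangeIndex, so max() + 1 is never
    # already present inside this group and .loc always appends a new row
    g.loc[g.index.max() + 1, 'DTM'] = now
    return g

# dd = d.reset_index().groupby(['ID', 'SEQ']).apply(lambda x: myfunction_fixed(x, now))

Resetting the group's index at the top of the function (g = g.reset_index(drop=True)) achieves the same thing, at the cost of discarding the original row labels.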

Comments

If you make the question contain all the information needed, rather than linking to other sources and saying "I'm trying to do this", you are more likely to get an answer. Also, reduce your code and examples to the simplest case in which the problem still occurs, so that it is easier for people to understand.

@mjp Thank you for the suggestion. I have rearranged the question. This is a question I would really like to get answered.

First of all, thank you very much for your answer! I'm not sure I understand it correctly. Could you edit the following part to make it clearer? "The same number in level 3 is the problem - in your last sample it is the shape of the group, 2, but this value existed before, so no new row was added and the row was overwritten instead."

@akilat90 - thanks for the suggestion. If you think some of the grammar could be better and you can improve my answer, that's no problem ;) You can edit it without any issue ;)

Thanks - English isn't my first language either :) I will edit it once I fully understand how this happens. Regarding the newly added index level (question 2 in my post): can you explain why the new level is introduced, and what determines its value? Maybe I need to read more about df.loc

@akilat90 - no, it's simpler. The numbers come from the original index - a = a.reset_index() creates index = 0, 1, 2, ... If you need to remove the index created from the groups, you can use a.reset_index().groupby(['first', 'second'], group_keys=False).apply(lambda x: myfunction(x, now)).

Then shouldn't a.reset_index().groupby(['first', 'second']).head() display three index levels? (It doesn't.) By the way, what I understand from your answer is that "this value existed before" means the index C1-579-2 is repeated, and the bar-two-2 in the a you created is also repeated. Is that right?