Python: strange behavior when trying to append a row to each group of a groupby object
This question is about a function that behaves unexpectedly when applied to two different dataframes (more precisely, groupby objects). Either I am missing something obvious, or there is a bug in pandas.
I wrote the function below to append a row to each group of a groupby object. A separate, linked question deals with the same function.
def myfunction(g, now):
    '''This function appends a row to each group and populates the DTM column value of that row with the current timestamp. Other columns of the new row will have NaNs.
    g: a groupby object
    now: current timestamp
    returns a dataframe that has the current timestamp appended in the DTM column for each group
    '''
    g.loc[g.shape[0], 'DTM'] = now # Appending the current timestamp to a DTM column in each group
    return g
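We will run two tests of the function.
Test 1
It works correctly on the dataframe a from the linked question. For clarity, here is a slightly extended replay (mostly copy-pasted from the linked question).
Apply the function: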
a = a.reset_index().groupby(['first', 'second']).apply(lambda x: myfunction(x, now))
This appended a new row to each group. A new DTM column was added because the original a did not have one. A group is a first-second pair.
a
Out[52]:
               first second         0                     DTM
first second
bar   one    0   bar    one  0.134379                     NaT
             1   bar    one  0.967928                     NaT
             2   NaN    NaN       NaN 2017-07-03 18:56:33.183
      two    2   bar    two  0.067502                     NaT
             1   NaN    NaN       NaN 2017-07-03 18:56:33.183
baz   one    3   baz    one  0.182887                     NaT
             1   NaN    NaN       NaN 2017-07-03 18:56:33.183
      two    4   baz    two  0.926932                     NaT
             1   NaN    NaN       NaN 2017-07-03 18:56:33.183
foo   one    5   foo    one  0.806225                     NaT
             1   NaN    NaN       NaN 2017-07-03 18:56:33.183
      two    6   foo    two  0.718322                     NaT
             7   foo    two  0.932114                     NaT
             2   NaN    NaN       NaN 2017-07-03 18:56:33.183
qux   one    8   qux    one  0.772494                     NaT
             1   NaN    NaN       NaN 2017-07-03 18:56:33.183
      two    9   qux    two  0.141510                     NaT
             1   NaN    NaN       NaN 2017-07-03 18:56:33.183
A bit of cleanup:
a = a.reset_index(level=2).drop(['level_2', 'first', 'second'], axis=1).loc[:, [0, 'DTM']]
This gives the final a:
a
Out[62]:
                     0                     DTM
first second
bar   one     0.371683                     NaT
      one     0.327870                     NaT
      one          NaN 2017-07-03 18:56:33.183
      two     0.048794                     NaT
      two          NaN 2017-07-03 18:56:33.183
baz   one     0.462747                     NaT
      one          NaN 2017-07-03 18:56:33.183
      two     0.758674                     NaT
      two          NaN 2017-07-03 18:56:33.183
foo   one     0.238607                     NaT
      one          NaN 2017-07-03 18:56:33.183
      two     0.156104                     NaT
      two     0.594270                     NaT
      two          NaN 2017-07-03 18:56:33.183
qux   one     0.091088                     NaT
      one          NaN 2017-07-03 18:56:33.183
      two     0.795864                     NaT
      two          NaN 2017-07-03 18:56:33.183
So far so good; this is the expected behavior. Every first-second pair got a new row, and the DTM column of that row holds the current timestamp.
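Test 2
Surprisingly, I cannot reproduce this behavior with the dataframe df below. Here a group is an ID-SEQ combination.
This df can be reproduced as follows:
1. Copy the following: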
C1  572  5/9/2017 10:13  PE
C1  572  5/9/2017 12:24  OK
C1  579  5/9/2017 10:19  PE
C1  579  5/9/2017 13:25  OK
C1  587  5/9/2017 10:20  PE
C1  587  5/9/2017 12:25  OK
C1  590  5/9/2017 10:21  PE
C1  590  5/9/2017 13:09  OK
C1  604  5/9/2017 10:38  PE
C1  604  5/9/2017 12:32  OK
C1  609  5/9/2017 10:39  PE
C1  609  5/9/2017 13:29  OK
C1  613  5/9/2017 10:39  PE
C1  613  5/9/2017 13:08  OK
C1  618  5/9/2017 10:40  PE
C1  618  5/9/2017 13:33  OK
C1  636  5/9/2017 10:54  PE
C1  636  5/9/2017 13:36  OK
C1  642  5/9/2017 10:55  PE
C1  642  5/9/2017 13:35  OK
C1  643  5/9/2017 10:56  PE
C1  643  5/9/2017 13:34  OK
C1  656  5/9/2017 10:55  PE
C1  656  5/9/2017 13:36  OK
C2  86  9/5/2016 19:45  PE
C2  86  9/6/2016 11:55  OK
C3  10  4/17/2017 12:23  PE
C3  10  4/17/2017 14:51  OK
C4  38  3/25/2017 10:35  PE
C4  38  3/25/2017 10:51  OK
2. Then run these:
df = pd.read_clipboard(sep = '[ ]{2,}')
df.columns = ['ID', 'SEQ', 'DTM', 'STATUS']
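For readers who would rather not go through the clipboard, the same kind of frame can be built from an inline string. This is only a sketch: it uses just the first four rows of the data above, and passing header=None together with names=[...] is an assumption made here so that the first data row is not consumed as a header.

```python
import io

import pandas as pd

# First four rows of the sample data; columns are separated by two spaces
# so that the single space inside each timestamp is not treated as a separator.
raw = """\
C1  572  5/9/2017 10:13  PE
C1  572  5/9/2017 12:24  OK
C1  579  5/9/2017 10:19  PE
C1  579  5/9/2017 13:25  OK
"""

df = pd.read_csv(
    io.StringIO(raw),
    sep=r"[ ]{2,}",      # same separator regex as in the question
    engine="python",     # regex separators need the python parser
    header=None,         # assumption: the pasted text has no header row
    names=["ID", "SEQ", "DTM", "STATUS"],
)
print(df)
```

The DTM column stays as plain text here, as it does in the question; parsing it with pd.to_datetime would be a separate step.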
Set a MultiIndex:
d = df.set_index(['ID', 'SEQ', 'DTM']) # I have three index levels this time in the original dataframe
d looks like this:
d
Out[40]:
                       STATUS
ID SEQ DTM
C1 572 5/9/2017 10:13      PE
       5/9/2017 12:24      OK
   579 5/9/2017 10:19      PE
       5/9/2017 13:25      OK
   587 5/9/2017 10:20      PE
       5/9/2017 12:25      OK
   590 5/9/2017 10:21      PE
       5/9/2017 13:09      OK
   604 5/9/2017 10:38      PE
       5/9/2017 12:32      OK
   609 5/9/2017 10:39      PE
       5/9/2017 13:29      OK
   613 5/9/2017 10:39      PE
       5/9/2017 13:08      OK
   618 5/9/2017 10:40      PE
       5/9/2017 13:33      OK
   636 5/9/2017 10:54      PE
       5/9/2017 13:36      OK
   642 5/9/2017 10:55      PE
       5/9/2017 13:35      OK
   643 5/9/2017 10:56      PE
       5/9/2017 13:34      OK
   656 5/9/2017 10:55      PE
       5/9/2017 13:36      OK
C2 86  9/5/2016 19:45      PE
       9/6/2016 11:55      OK
C3 10  4/17/2017 12:23     PE
       4/17/2017 14:51     OK
C4 38  3/25/2017 10:35     PE
       3/25/2017 10:51     OK
Apply the function:
dd = d.reset_index().groupby(['ID', 'SEQ']).apply(lambda x: myfunction(x, now)) # a group is a unique combination of ID-SEQ pairs
This returns the following (note the fourth row):
A bit of cleanup:
ddd = dd.reset_index(level=2).drop(['level_2', 'ID', 'SEQ'], axis=1).loc[:, ['DTM', 'STATUS']]
ddd
Out[39]:
                               DTM STATUS
ID SEQ
C1 572              5/9/2017 10:13     PE
   572              5/9/2017 12:24     OK
   572  2017-07-03 18:56:33.183000    NaN
   579  2017-07-03 18:56:33.183000     PE
   579              5/9/2017 13:25     OK
   587              5/9/2017 10:20     PE
   587              5/9/2017 12:25     OK
   587  2017-07-03 18:56:33.183000    NaN
   590              5/9/2017 10:21     PE
   590              5/9/2017 13:09     OK
   590  2017-07-03 18:56:33.183000    NaN
   604              5/9/2017 10:38     PE
   604              5/9/2017 12:32     OK
   604  2017-07-03 18:56:33.183000    NaN
   609              5/9/2017 10:39     PE
   609              5/9/2017 13:29     OK
   609  2017-07-03 18:56:33.183000    NaN
   613              5/9/2017 10:39     PE
   613              5/9/2017 13:08     OK
   613  2017-07-03 18:56:33.183000    NaN
   618              5/9/2017 10:40     PE
   618              5/9/2017 13:33     OK
   618  2017-07-03 18:56:33.183000    NaN
   636              5/9/2017 10:54     PE
   636              5/9/2017 13:36     OK
   636  2017-07-03 18:56:33.183000    NaN
   642              5/9/2017 10:55     PE
   642              5/9/2017 13:35     OK
   642  2017-07-03 18:56:33.183000    NaN
   643              5/9/2017 10:56     PE
   643              5/9/2017 13:34     OK
   643  2017-07-03 18:56:33.183000    NaN
   656              5/9/2017 10:55     PE
   656              5/9/2017 13:36     OK
   656  2017-07-03 18:56:33.183000    NaN
C2 86               9/5/2016 19:45     PE
   86               9/6/2016 11:55     OK
   86   2017-07-03 18:56:33.183000    NaN
C3 10              4/17/2017 12:23     PE
   10              4/17/2017 14:51     OK
   10   2017-07-03 18:56:33.183000    NaN
C4 38              3/25/2017 10:35     PE
   38              3/25/2017 10:51     OK
   38   2017-07-03 18:56:33.183000    NaN
Problem
A new row containing the current timestamp was appended to every ID-SEQ group except C1-579! (See the fourth row of dd and ddd.)
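One way to spot groups that did not gain a row is to compare group sizes before and after the apply. The sketch below is not part of the original post: it uses a cut-down frame with only two groups and parses DTM as datetimes (both assumptions made for brevity), and myfunction carries an extra copy() so the group is not modified in place.

```python
import pandas as pd

now = pd.Timestamp("2017-07-03 18:56:33")

# Cut-down stand-in for d.reset_index(): two groups, default row labels 0..3.
df = pd.DataFrame({
    "ID": ["C1", "C1", "C1", "C1"],
    "SEQ": [572, 572, 579, 579],
    "DTM": pd.to_datetime(["2017-05-09 10:13", "2017-05-09 12:24",
                           "2017-05-09 10:19", "2017-05-09 13:25"]),
    "STATUS": ["PE", "OK", "PE", "OK"],
})

def myfunction(g, now):
    g = g.copy()                    # added here: work on a copy of the group
    g.loc[g.shape[0], "DTM"] = now  # same statement as in the question
    return g

keys = ["ID", "SEQ"]
before = df.groupby(keys).size()
dd = df.groupby(keys)[["DTM", "STATUS"]].apply(lambda g: myfunction(g, now))
after = dd.groupby(level=keys).size()

# Groups whose size did not grow by exactly one row.
suspicious = after[(after - before) != 1]
print(suspicious)
```

On this reduced data only the C1-579 group is reported, which matches the row pointed out above.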
Questions
What is causing this?
What is the additional index level introduced in dd?
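After a lot of debugging, I found the problem.
The issue is a duplicate label in the third index level. In your last example the label used for the new row is the group's shape, 2, but that label already exists in the group, so no new row is added and the existing row is overwritten instead.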
           ID    SEQ                        DTM STATUS
ID SEQ
C1 572 0   C1  572.0 2017-05-09 10:13:00.000000     PE
       1   C1  572.0 2017-05-09 12:24:00.000000     OK
       2  NaN    NaN 2017-07-06 08:46:02.341472    NaN
   579 2   C1  579.0 2017-07-06 08:46:02.341472     PE  <- overwritten values in row
       3   C1  579.0 2017-05-09 13:25:00.000000     OK
   587 4   C1  587.0 2017-05-09 10:20:00.000000     PE
       5   C1  587.0 2017-05-09 12:25:00.000000     OK
       2  NaN    NaN 2017-07-06 08:46:02.341472    NaN
The same problem occurs in a:
print (a)
               first second         0                        DTM
first second
bar   one    0   bar    one  0.366258                        NaT
             1   NaN    NaN       NaN 2017-07-06 08:47:55.610671
      two    1   bar    two  0.583205                        NaT
             2   bar    two  0.159388 2017-07-06 08:47:55.610671  <- overwritten
baz   one    3   baz    one  0.598198                        NaT
             1   NaN    NaN       NaN 2017-07-06 08:47:55.610671
      two    4   baz    two  0.274027                        NaT
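The overwrite can be reproduced on a single group without any groupby. The following is a minimal sketch, assuming a stand-in frame for the C1-579 group with row labels 2 and 3; append_now is a hypothetical variant of myfunction, not code from the question, and it assumes an integer index.

```python
import pandas as pd

now = pd.Timestamp("2017-07-06 08:46:02")

# Stand-in for the C1-579 group: after reset_index() its row labels are 2 and 3.
g = pd.DataFrame(
    {"DTM": pd.to_datetime(["2017-05-09 10:19", "2017-05-09 13:25"]),
     "STATUS": ["PE", "OK"]},
    index=[2, 3],
)

# g.shape[0] is 2 and label 2 already exists, so .loc overwrites that row.
overwritten = g.copy()
overwritten.loc[overwritten.shape[0], "DTM"] = now

def append_now(group, now):
    """Hypothetical variant of myfunction that always picks an unused label."""
    group = group.copy()
    group.loc[group.index.max() + 1, "DTM"] = now  # max label + 1 cannot exist yet
    return group

appended = append_now(g, now)
print(len(overwritten), len(appended))
```

With .loc, assigning to an existing label updates that row, while assigning to a missing label enlarges the frame. Using the group length as the label is therefore only safe when the group's labels happen to be 0..n-1.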
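You are more likely to get an answer if the question contains all the information needed, rather than linking to other sources and saying "I am trying to do this". Also, reduce your code and example to the simplest case in which the problem still occurs, so that it is easier for people to understand.
@mjp Thanks for the suggestion. I have rearranged the question. This is one I would really like to get an answer to.
First of all, thank you very much for the answer! I am not sure I understood it correctly. Could you edit the following so it is clearer? "The same number in level 3 is a problem: in your last example it is the shape of the group, 2, but this value already existed, so no new row was added and the row was overwritten."
@akilat90 - Thanks for the suggestion. If you think the grammar could be better, feel free to improve my answer ;) You can edit it, no problem ;)
Thanks - English is not my first language either :) I will edit it once I fully understand how this happens. About the newly added level (question 2 in my post): can you explain why a new level is introduced? What determines its values? Maybe I need to read more about df.loc.
@akilat90 - No, it is simpler than that. The numbers come from the original index: a = a.reset_index() creates the index 0, 1, 2, ... If you need to remove the index level created from the groups, you can use a.reset_index().groupby(['first','second'], group_keys=False).apply(lambda x: myfunction(x, now)).
So should a.reset_index().groupby(['first','second']).head() show three index levels? (It does not.) By the way, my understanding is that "this value existed before" in your answer means that the index C1-579-2 is being repeated. In the a you created, bar-two-2 is repeated as well. Is that right?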
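The effect of group_keys mentioned in the comments can be checked on a small frame. This is a sketch under assumptions: the three-row frame and the add_row helper are made up for illustration, and only the value column is passed to apply so the result does not depend on how a given pandas version treats the grouping columns.

```python
import pandas as pd

df = pd.DataFrame({
    "first": ["bar", "bar", "baz"],
    "second": ["one", "one", "two"],
    "value": [0.1, 0.2, 0.3],
})

def add_row(g):
    """Hypothetical helper: append one row under an unused integer label."""
    g = g.copy()
    g.loc[g.index.max() + 1, "value"] = -1.0
    return g

keys = ["first", "second"]
with_keys = df.groupby(keys, group_keys=True)[["value"]].apply(add_row)
without_keys = df.groupby(keys, group_keys=False)[["value"]].apply(add_row)

print(with_keys.index.nlevels)     # group keys plus the original row labels
print(without_keys.index.nlevels)  # original row labels only
```

Without the group keys, the row labels of the combined result are no longer unique (label 2 appears in both groups here), so a reset_index(drop=True) afterwards may be worth considering.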