Python: applying a function on a groupby object to append a row to each group
I have a fairly large dataset, but for reproducibility, suppose I have the following multi-index DataFrame:
import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'one', 'two', 'one', 'two', 'one', 'two', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
a = pd.DataFrame(np.random.random((10,)), index=index)          # column 0: random values
a[1] = pd.date_range('2017-07-02', periods=10, freq='5min')     # column 1: timestamps
a
Out[68]:
                     0                   1
first second
bar   one     0.705488 2017-07-02 00:00:00
      one     0.715645 2017-07-02 00:05:00
      two     0.194648 2017-07-02 00:10:00
baz   one     0.129729 2017-07-02 00:15:00
      two     0.449889 2017-07-02 00:20:00
foo   one     0.031531 2017-07-02 00:25:00
      two     0.320757 2017-07-02 00:30:00
      two     0.876243 2017-07-02 00:35:00
qux   one     0.443682 2017-07-02 00:40:00
      two     0.802774 2017-07-02 00:45:00
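I want to append the current timestamp as a new row to each group identified by the combination of the first and second index levels (for example bar one, bar two, and so on).
What I did:
A function that appends the timestamp to each group: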
def myfunction(g, now):
    g.loc[g.shape[0], 1] = now  # current timestamp
    return g
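Applying the function to the groupby object:
# current timestamp (pd.datetime was removed in pandas 2.0; pd.Timestamp.now() is the equivalent)
now = pd.Timestamp.now()
a = a.reset_index().groupby(['first', 'second']).apply(lambda x: myfunction(x, now))
This returns: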
                first second         0                       1
first second
bar   one    0    bar    one  0.705488 2017-07-02 00:00:00.000
             1    bar    one  0.715645 2017-07-02 00:05:00.000
             2    NaN    NaN       NaN 2017-07-02 02:05:06.442
      two    2    bar    two  0.194648 2017-07-02 00:10:00.000
             1    NaN    NaN       NaN 2017-07-02 02:05:06.442
baz   one    3    baz    one  0.129729 2017-07-02 00:15:00.000
             1    NaN    NaN       NaN 2017-07-02 02:05:06.442
      two    4    baz    two  0.449889 2017-07-02 00:20:00.000
             1    NaN    NaN       NaN 2017-07-02 02:05:06.442
foo   one    5    foo    one  0.031531 2017-07-02 00:25:00.000
             1    NaN    NaN       NaN 2017-07-02 02:05:06.442
      two    6    foo    two  0.320757 2017-07-02 00:30:00.000
             7    foo    two  0.876243 2017-07-02 00:35:00.000
             2    NaN    NaN       NaN 2017-07-02 02:05:06.442
qux   one    8    qux    one  0.443682 2017-07-02 00:40:00.000
             1    NaN    NaN       NaN 2017-07-02 02:05:06.442
      two    9    qux    two  0.802774 2017-07-02 00:45:00.000
             1    NaN    NaN       NaN 2017-07-02 02:05:06.442
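I don't understand why a new index level was introduced, but I can get rid of it and end up with what I want:
# move the extra (unnamed) index level into a column, then keep only columns 0 and 1
a = a.reset_index(level=2).drop(columns=['level_2', 'first', 'second']).loc[:, [0, 1]]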
                     0                       1
first second
bar   one     0.705488 2017-07-02 00:00:00.000
      one     0.715645 2017-07-02 00:05:00.000
      one          NaN 2017-07-02 02:05:06.442
      two     0.194648 2017-07-02 00:10:00.000
      two          NaN 2017-07-02 02:05:06.442
baz   one     0.129729 2017-07-02 00:15:00.000
      one          NaN 2017-07-02 02:05:06.442
      two     0.449889 2017-07-02 00:20:00.000
      two          NaN 2017-07-02 02:05:06.442
foo   one     0.031531 2017-07-02 00:25:00.000
      one          NaN 2017-07-02 02:05:06.442
      two     0.320757 2017-07-02 00:30:00.000
      two     0.876243 2017-07-02 00:35:00.000
      two          NaN 2017-07-02 02:05:06.442
qux   one     0.443682 2017-07-02 00:40:00.000
      one          NaN 2017-07-02 02:05:06.442
      two     0.802774 2017-07-02 00:45:00.000
      two          NaN 2017-07-02 02:05:06.442
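On the extra index level: as far as I can tell, it appears because groupby(...).apply(...) prepends the group keys to whatever index each group returns, and after reset_index() each group carries an integer index, which survives as a third level. The sketch below is only an illustration under assumptions, not the code from the question: append_row is a hypothetical variant of myfunction, the four-row frame and the fixed timestamp are made up, and group_keys=False is used to ask apply not to prepend the keys.

```python
import numpy as np
import pandas as pd

# Small frame shaped like the one in the question (values are made up).
index = pd.MultiIndex.from_tuples(
    [("bar", "one"), ("bar", "one"), ("bar", "two"), ("baz", "one")],
    names=["first", "second"],
)
a = pd.DataFrame({0: [0.1, 0.2, 0.3, 0.4]}, index=index)
a[1] = pd.date_range("2017-07-02", periods=4, freq="5min")

now = pd.Timestamp("2017-07-02 02:05:06")  # fixed stand-in for the current time


def append_row(g, now):
    # Hypothetical variant of myfunction: build the extra row explicitly,
    # reusing the group's own (first, second) key as its index entry.
    extra = pd.DataFrame({0: [np.nan], 1: [now]}, index=g.index[:1])
    return pd.concat([g, extra])


# Grouping by index level keeps the MultiIndex on each group;
# group_keys=False stops apply from adding the keys as extra levels.
out = a.groupby(level=["first", "second"], group_keys=False).apply(append_row, now=now)
```

With this, out keeps the two original index levels, so no reset_index/drop cleanup is needed afterwards.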
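Question:
I would like to know whether there is an elegant, more general way to achieve this: adding a new row to each group and, although it is not shown here, conditionally filling the remaining fields of the new row besides the timestamp.
Simply:
b = a.groupby(level=[0, 1]).max()      # one row per group: the new rows
b[0] = np.nan                          # value column left empty
b[1] = pd.Timestamp.now()              # timestamp column set to the current time
a = pd.concat([a, b]).sort_index()     # appended and sorted (DataFrame.append was removed in pandas 2.0)
Grouping by level preserves the structure, so it is easier to manage.
You can first group by the index, build the extra row needed for each group, then concatenate it back onto the DataFrame and sort: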
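import datetime as dt

# one new row per (first, second) group: NaN value, current timestamp
groups = a.groupby(level=[0, 1]).first().index
new_rows = pd.DataFrame({0: np.nan, 1: dt.datetime.now()}, index=groups)
pd.concat([a, new_rows]).sort_index()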
Out[538]:
                     0                          1
first second
bar   one     0.587648 2017-07-02 00:00:00.000000
      one     0.974524 2017-07-02 00:05:00.000000
      one          NaN 2017-07-02 15:18:57.503371
      two     0.555171 2017-07-02 00:10:00.000000
      two          NaN 2017-07-02 15:18:57.503371
baz   one     0.832874 2017-07-02 00:15:00.000000
      one          NaN 2017-07-02 15:18:57.503371
      two     0.956891 2017-07-02 00:20:00.000000
      two          NaN 2017-07-02 15:18:57.503371
foo   one     0.872959 2017-07-02 00:25:00.000000
      one          NaN 2017-07-02 15:18:57.503371
      two     0.056546 2017-07-02 00:30:00.000000
      two     0.359184 2017-07-02 00:35:00.000000
      two          NaN 2017-07-02 15:18:57.503371
qux   one     0.301327 2017-07-02 00:40:00.000000
      one          NaN 2017-07-02 15:18:57.503371
      two     0.891815 2017-07-02 00:45:00.000000
      two          NaN 2017-07-02 15:18:57.503371
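The question also mentions conditionally filling the other fields of the new row. The same "build the new rows, then concat and sort" pattern might extend to that; the following is a rough sketch under assumptions. The rule itself (groups with more than one row get their mean in column 0, single-row groups get NaN) is purely hypothetical, and the four-row frame and fixed timestamp are made up for illustration.

```python
import numpy as np
import pandas as pd

# Small frame shaped like the one in the question (values are made up).
index = pd.MultiIndex.from_tuples(
    [("bar", "one"), ("bar", "one"), ("bar", "two"), ("baz", "one")],
    names=["first", "second"],
)
a = pd.DataFrame({0: [0.1, 0.2, 0.3, 0.4]}, index=index)
a[1] = pd.date_range("2017-07-02", periods=4, freq="5min")

now = pd.Timestamp("2017-07-02 02:05:06")  # fixed stand-in for the current time

grouped = a.groupby(level=["first", "second"])
sizes = grouped.size()  # number of rows in each (first, second) group

# Hypothetical rule: keep the group mean only where the group has more
# than one row; Series.where leaves NaN everywhere else.
fill_values = grouped[0].mean().where(sizes > 1)

# One new row per group; the scalar timestamp is broadcast over the index.
new_rows = pd.DataFrame({0: fill_values, 1: now}, index=sizes.index)

result = pd.concat([a, new_rows]).sort_index()
```

Any per-group aggregate could stand in for the mean here; the point is only that the condition is computed once for all groups rather than inside a function applied to each group.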
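"Pandonic": I had never heard that before ;-) Thanks, this is simpler. Could you comment on the function I wrote?
Thanks! This is simple; I did not know about sort_index(). Could you comment on the function I wrote?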