Difference in pandas groupby.apply between 0.23.4 and 0.24.2 when returning a deep copy
When I upgraded pandas from 0.23.4 to 0.24.2, I noticed some strange behavior. The following snippet demonstrates it.
The input is a CSV file named my_data_new.csv.
Snippet:
import pandas as pd

print("PANDAS-VERSION:", pd.__version__)

def my_func(d):
    # Return an independent deep copy of the group
    d_copy = d.copy(deep=True)
    return d_copy

# Parse the 'date' column as datetimes and use it as a sorted index
data = pd.read_csv("~/my_data_new.csv", parse_dates=['date'], index_col=['date']).sort_index()
result = data.groupby('name').apply(my_func)
print(result)
Output:
In version 0.23.4:
PANDAS-VERSION: 0.23.4
                          name  id  roll    sub_1    sub_2    sub_3
name date
AAA  2016-11-30 08:00:00   AAA   A     1  123.456  123.456  123.456
     2016-11-30 09:00:00   AAA   A     1  123.457  123.457  123.457
     2016-11-30 10:00:00   AAA   A     1  123.458  123.458  123.458
     2016-11-30 11:00:00   AAA   A     1  123.459  123.459  123.459
BBB  2016-11-30 12:00:00   BBB   B     2  123.451  123.456  123.456
     2016-11-30 13:00:00   BBB   B     2  123.452  123.457  123.457
     2016-11-30 14:00:00   BBB   B     2  123.453  123.458  123.458
     2016-11-30 15:00:00   BBB   B     2  123.454  123.459  123.459
In version 0.24.2:
PANDAS-VERSION: 0.24.2
                          name  id  roll    sub_1    sub_2    sub_3
name date
AAA  2016-11-30 12:00:00   AAA   A     1  123.456  123.456  123.456
     2016-11-30 13:00:00   AAA   A     1  123.457  123.457  123.457
     2016-11-30 14:00:00   AAA   A     1  123.458  123.458  123.458
     2016-11-30 15:00:00   AAA   A     1  123.459  123.459  123.459
BBB  2016-11-30 12:00:00   BBB   B     2  123.451  123.456  123.456
     2016-11-30 13:00:00   BBB   B     2  123.452  123.457  123.457
     2016-11-30 14:00:00   BBB   B     2  123.453  123.458  123.458
     2016-11-30 15:00:00   BBB   B     2  123.454  123.459  123.459
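Since the CSV itself is not shown, here is a self-contained variant of the snippet that can be run as-is. The file contents are an assumption rebuilt from the rows printed above (the column order in the real file may differ), and `load_data` is a hypothetical helper name.

```python
import io

import pandas as pd

# Assumed file contents, rebuilt from the printed 0.23.4 output.
# The real my_data_new.csv is not shown in the question.
CSV_TEXT = """date,name,id,roll,sub_1,sub_2,sub_3
2016-11-30 08:00:00,AAA,A,1,123.456,123.456,123.456
2016-11-30 09:00:00,AAA,A,1,123.457,123.457,123.457
2016-11-30 10:00:00,AAA,A,1,123.458,123.458,123.458
2016-11-30 11:00:00,AAA,A,1,123.459,123.459,123.459
2016-11-30 12:00:00,BBB,B,2,123.451,123.456,123.456
2016-11-30 13:00:00,BBB,B,2,123.452,123.457,123.457
2016-11-30 14:00:00,BBB,B,2,123.453,123.458,123.458
2016-11-30 15:00:00,BBB,B,2,123.454,123.459,123.459
"""


def my_func(d):
    # Same function as in the question: return a deep copy of the group.
    return d.copy(deep=True)


def load_data(text=CSV_TEXT):
    # Hypothetical helper: read the inline CSV the same way the question reads the file.
    return pd.read_csv(
        io.StringIO(text), parse_dates=['date'], index_col=['date']
    ).sort_index()


data = load_data()
result = data.groupby('name').apply(my_func)
print("PANDAS-VERSION:", pd.__version__)
print(result)
```

Running this under each pandas version should make it easy to compare the timestamps shown for the AAA group.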
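My observation is as follows:
In pandas 0.24.2, the index of the last group's DataFrame ("BBB" in this case) is applied to all previous groups' DataFrames ("AAA" in this case), whereas in pandas 0.23.4 each group keeps its own index.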
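The observation can also be checked programmatically instead of by eye. Below is a minimal sketch: `index_preserved` is a hypothetical helper (not a pandas function), the small frame is a trimmed-down stand-in for the real file, and `symptom` is built by hand only to mimic the 0.24.2 output described above.

```python
import pandas as pd


def my_func(d):
    return d.copy(deep=True)


def index_preserved(original, applied, level="date"):
    """Return True if `applied` still carries exactly the timestamps of `original`."""
    got = applied.index.get_level_values(level).sort_values()
    want = original.index.sort_values()
    return got.equals(want)


# Trimmed-down stand-in for the data in the question (assumed, not the real file).
dates = pd.date_range("2016-11-30 08:00:00", periods=8, freq="60min")
data = pd.DataFrame(
    {
        "name": ["AAA"] * 4 + ["BBB"] * 4,
        "sub_1": [123.456, 123.457, 123.458, 123.459,
                  123.451, 123.452, 123.453, 123.454],
    },
    index=pd.DatetimeIndex(dates, name="date"),
)

# Hand-built frame that mimics the symptom: the BBB timestamps repeated for AAA.
symptom = data.copy()
symptom.index = pd.DatetimeIndex(list(data.index[4:]) * 2, name="date")

result = data.groupby("name").apply(my_func)
print("index preserved by apply:", index_preserved(data, result))
print("index preserved in mimic:", index_preserved(data, symptom))
```

The first line printed depends on the installed pandas version. The second should always report that the mimicked frame does not preserve the index.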
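Is this documented behavior? If so, please point me to the release notes or to the change in the repo's code. This issue has also been reported here:
Another observation: this only happens when the index dtype is datetime64[ns]. It does not happen when the index dtype is object.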
data = pd.read_csv("~/my_data_new.csv")
data['date'] = pd.to_datetime(data['date'])
data = data.set_index(['date']).sort_index()
result = data.groupby('name').apply(my_func)
print(result)
The result of the above is:
PANDAS-VERSION: 0.24.2
                          name  id  roll    sub_1    sub_2    sub_3
name date
AAA  2016-11-30 12:00:00   AAA   A     1  123.456  123.456  123.456
     2016-11-30 13:00:00   AAA   A     1  123.457  123.457  123.457
     2016-11-30 14:00:00   AAA   A     1  123.458  123.458  123.458
     2016-11-30 15:00:00   AAA   A     1  123.459  123.459  123.459
BBB  2016-11-30 12:00:00   BBB   B     2  123.451  123.456  123.456
     2016-11-30 13:00:00   BBB   B     2  123.452  123.457  123.457
     2016-11-30 14:00:00   BBB   B     2  123.453  123.458  123.458
     2016-11-30 15:00:00   BBB   B     2  123.454  123.459  123.459
If I do the following instead, leaving the index as object dtype, it does not happen:
data = pd.read_csv("~/my_data_new.csv")
data = data.set_index(['date']).sort_index()
result = data.groupby('name').apply(my_func)
print(result)
The result of the above code is:
PANDAS-VERSION: 0.24.2
                          name  id  roll    sub_1    sub_2    sub_3
name date
AAA  2016-11-30 08:00:00   AAA   A     1  123.456  123.456  123.456
     2016-11-30 09:00:00   AAA   A     1  123.457  123.457  123.457
     2016-11-30 10:00:00   AAA   A     1  123.458  123.458  123.458
     2016-11-30 11:00:00   AAA   A     1  123.459  123.459  123.459
BBB  2016-11-30 12:00:00   BBB   B     2  123.451  123.456  123.456
     2016-11-30 13:00:00   BBB   B     2  123.452  123.457  123.457
     2016-11-30 14:00:00   BBB   B     2  123.453  123.458  123.458
     2016-11-30 15:00:00   BBB   B     2  123.454  123.459  123.459
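The variants above differ only in how the date column ends up in the index, so it may help to confirm the index dtype each one produces. A minimal sketch, assuming a two-row inline CSV in place of the real file:

```python
import io

import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Two assumed rows standing in for my_data_new.csv.
CSV_TEXT = """date,name,sub_1
2016-11-30 08:00:00,AAA,123.456
2016-11-30 12:00:00,BBB,123.451
"""

# Variant 1: parse the dates while reading.
parsed = pd.read_csv(io.StringIO(CSV_TEXT), parse_dates=['date'], index_col=['date'])

# Variant 2: convert with to_datetime, then set the index.
converted = pd.read_csv(io.StringIO(CSV_TEXT))
converted['date'] = pd.to_datetime(converted['date'])
converted = converted.set_index(['date'])

# Variant 3: leave the dates as plain strings.
raw = pd.read_csv(io.StringIO(CSV_TEXT)).set_index(['date'])

for label, frame in [("parse_dates", parsed), ("to_datetime", converted), ("raw strings", raw)]:
    dtype = frame.index.dtype
    print(label, "->", dtype, "| datetime index:", is_datetime64_any_dtype(dtype))
```

The exact dtype names printed can vary between pandas versions, which is why the check uses `is_datetime64_any_dtype` rather than comparing against a fixed string.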
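You should really update to the current version (1.1.1). You are many versions behind. This is not an issue in the current version.
Maybe this is related.
Thanks for your comment. But my question is about the index. Why is the index of the last group's DataFrame copied to the other groups' DataFrames?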
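This does not explain why it happens, but if upgrading is not possible right away, one option is to avoid relying on how apply reassembles the groups: loop over the groups and concatenate the pieces explicitly. This is only a sketch. It has not been confirmed on 0.24.2, and the inline frame is a trimmed-down stand-in for the real file.

```python
import pandas as pd


def my_func(d):
    return d.copy(deep=True)


# Trimmed-down stand-in for the data in the question (assumed, not the real file).
dates = pd.date_range("2016-11-30 08:00:00", periods=8, freq="60min")
data = pd.DataFrame(
    {
        "name": ["AAA"] * 4 + ["BBB"] * 4,
        "sub_1": [123.456, 123.457, 123.458, 123.459,
                  123.451, 123.452, 123.453, 123.454],
    },
    index=pd.DatetimeIndex(dates, name="date"),
)

# Call my_func on each group separately, keyed by the group name.
pieces = {name: my_func(group) for name, group in data.groupby("name")}

# The dict keys become the outer index level, named 'name' here.
result = pd.concat(pieces, names=["name"])
print(result)
```

Because each piece keeps the index it had when it was sliced out of `data`, the AAA rows should keep their own timestamps in the concatenated result.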