Python: Keep every bin when pivoting a DataFrame

python pandas

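I have 12 different DataFrames that, because of system limitations, I can only load into memory one at a time.

The goal is to update the bin counts on each iteration over the different datasets. The data going into the preprocessing pipeline looks like this for each one: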

           ID         duration_vs_delay_30    delay_hour_vs_delay_30
0          1                  1.12                    1.12
1          4                  1.13                    1.13
2          5                  1.21                    1.21
3          6                  2.1                     1.7
4         10                  1.95                    1.9

So I bin the two numeric columns using two predefined lists of bin edges, and use a pivot table to sum all the values in each combination:

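import numpy as np
import pandas as pd

# Define bin edges
y_axis = [1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5]
x_axis = [1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0]

# Assign each row to a bin on each axis (values outside the edges become NaN)
chunk['bin_y_axis'] = pd.cut(chunk.delay_hour_vs_delay_30, y_axis)
chunk['bin_x_axis'] = pd.cut(chunk.duration_vs_delay_30, x_axis)

# Helper column so the pivot table can sum it into a count
chunk["count_route"] = 1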

Then, on the first iteration, I use the following:

bin_df = chunk.pivot_table(index='bin_x_axis', columns='bin_y_axis', values='count_route',
                           aggfunc='sum', fill_value=0)

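Otherwise, on later iterations, I update the existing bin_df with the pivot table of the current chunk.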

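However, if the current chunk does not contain values for every defined bin, I get a DataFrame with fewer rows and columns than it should have.

How can I correct this so that the rows and columns include every bin and the main bin_df is updated sequentially?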

While looping over the files, keep a running total for each (bin_x, bin_y) tuple, and save the pivot for last:

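summary = None

# Loop through your files
for file in ...:

    # ... process the file ...

    # observed=False keeps every (bin_x, bin_y) combination, even empty ones
    chunk_summary = chunk.groupby(['bin_x_axis', 'bin_y_axis'], observed=False).size()
    if summary is None:
        summary = chunk_summary
    else:
        # fill_value=0 avoids NaN if a combination is missing on one side
        summary = summary.add(chunk_summary, fill_value=0)

# Save the pivot operation for last
bin_df = summary.unstack()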
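The loop above can be tried end to end with a small, self-contained sketch. The two in-memory chunks and the helper name summarize are hypothetical stand-ins for the sequentially loaded files; the bin edges and column names come from the question. It assumes a pandas version in which groupby(...).size() on categorical keys with observed=False reports unobserved combinations as 0.

```python
import pandas as pd

# Bin edges taken from the question.
y_axis = [1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5]
x_axis = [1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0]


def summarize(chunk):
    """Count rows per (bin_x_axis, bin_y_axis) pair, keeping empty bins."""
    binned = pd.DataFrame({
        "bin_x_axis": pd.cut(chunk["duration_vs_delay_30"], x_axis),
        "bin_y_axis": pd.cut(chunk["delay_hour_vs_delay_30"], y_axis),
    })
    # observed=False keeps every category combination, even unseen ones.
    # Rows whose bin is NaN (value outside the edges) are skipped by groupby.
    return binned.groupby(["bin_x_axis", "bin_y_axis"], observed=False).size()


# Hypothetical stand-ins for two of the files (assumption, not real data).
chunks = [
    pd.DataFrame({"duration_vs_delay_30": [1.12, 1.13, 1.21, 2.1, 1.95],
                  "delay_hour_vs_delay_30": [1.12, 1.13, 1.21, 1.7, 1.9]}),
    pd.DataFrame({"duration_vs_delay_30": [1.25, 1.25, 1.85],
                  "delay_hour_vs_delay_30": [1.25, 2.45, 1.35]}),
]

summary = None
for chunk in chunks:
    chunk_summary = summarize(chunk)
    if summary is None:
        summary = chunk_summary
    else:
        summary = summary.add(chunk_summary, fill_value=0)

# Pivot once, after all chunks have been accumulated.
bin_df = summary.unstack()
print(bin_df.shape)
```

Because every chunk summary carries the full set of bin combinations, the final table has one row per x bin and one column per y bin regardless of which bins a given chunk happened to hit.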

Note: your bins do not cover all values, so some rows are assigned to the NaN bin. groupby ignores rows where bin_x or bin_y is NaN.
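To see how many rows are affected, one option is to count the NaN bins before grouping. The sketch below uses the sample duration_vs_delay_30 values from the question; the open-ended edges at the end are an assumption about what might be wanted, not part of the original approach.

```python
import numpy as np
import pandas as pd

# x-axis bin edges and sample values taken from the question.
x_axis = [1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0]
values = pd.Series([1.12, 1.13, 1.21, 2.1, 1.95], name="duration_vs_delay_30")

# Values at or below 1.2, or above 2.0, fall outside every bin and become NaN.
binned = pd.cut(values, x_axis)
n_dropped = int(binned.isna().sum())  # rows a groupby on this column would skip

# Optional (assumption): add open-ended outer bins so that no row is dropped.
open_edges = [-np.inf] + x_axis + [np.inf]
binned_open = pd.cut(values, open_edges)

print(n_dropped, int(binned_open.isna().sum()))
```

Whether the out-of-range rows should be dropped or collected in catch-all bins depends on what the final table is meant to show.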
