Python: How do I create an offset from the start in pandas, given the segment length and the offset within the segment?

python pandas numpy dataframe

The title may not be the most informative.

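I have the following working code, which I would like to vectorize using native pandas (no for loops).
Basically, for each row it should return its cumulative offset from 0, given the length of each segment and the relative offset within that segment.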

import pandas as pd
import numpy as np

df = pd.DataFrame({"id":     [0, 1,  2,  2,  2,  3,  3,  4,  5,  6,  6,   7,   9],  # notice no 8
                   "length": [0, 10, 20, 20, 20, 30, 30, 40, 50, 60, 60,  70,  90],
                   "offset": [0, 0,  1,  3,  4,  0,  7,  0,  0,  0,  1,   0,   0]})


result = np.zeros((len(df),))
current_abs = df.loc[0, "id"]
for i in range(1, len(df)):
    if current_abs == df.loc[i, "id"]:
        result[i] = result[i - 1]
    else:
        current_abs = df.loc[i, "id"]
        result[i] = result[i - 1] + df.loc[i, "length"]

df["offset_from_start"] = result + df["offset"]

print(df)

This looks like a fancy cumsum operation, but I don't know how to do it efficiently.

Let's try using mask on the duplicated rows, and then cumsum:

df['offset_from_start'] = (df['length'].mask(df.duplicated('id'),0)
                                       .cumsum() + df['offset']
                          )

Output:

    id  length  offset  offset_from_start
0    0       0       0                  0
1    1      10       0                 10
2    2      20       1                 31
3    2      20       3                 33
4    2      20       4                 34
5    3      30       0                 60
6    3      30       7                 67
7    4      40       0                100
8    5      50       0                150
9    6      60       0                210
10   6      60       1                211
11   7      70       0                280
12   9      90       0                370
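
To make the one-liner easier to follow, here is a sketch that splits it into named intermediate steps on the first seven rows of the question's data. The names is_repeat, first_only and base are introduced here for illustration only and do not appear in the answer.

```python
import pandas as pd

# First seven rows of the sample data from the question.
df = pd.DataFrame({"id":     [0, 1, 2, 2, 2, 3, 3],
                   "length": [0, 10, 20, 20, 20, 30, 30],
                   "offset": [0, 0, 1, 3, 4, 0, 7]})

# True for every row whose id has already appeared earlier in the frame.
is_repeat = df.duplicated("id")

# Keep the segment length only on the first row of each id, zero elsewhere.
first_only = df["length"].mask(is_repeat, 0)

# The running total of those lengths is the base position of each row.
base = first_only.cumsum()

df["offset_from_start"] = base + df["offset"]
print(df)
```

Because each segment's length is counted once, on its first row, every later row of the same id inherits the same base and only its own offset differs.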

Another approach, same principle:

df['offset_from_start'] = (~df['id'].duplicated() * df['length']).cumsum() + df['offset']
print(df)

Output:

    id  length  offset  offset_from_start
0    0       0       0                  0
1    1      10       0                 10
2    2      20       1                 31
3    2      20       3                 33
4    2      20       4                 34
5    3      30       0                 60
6    3      30       7                 67
7    4      40       0                100
8    5      50       0                150
9    6      60       0                210
10   6      60       1                211
11   7      70       0                280
12   9      90       0                370
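
One caveat worth checking: both answers use duplicated, which flags any earlier occurrence of an id, while the loop in the question only compares each row with the one directly before it. It also never adds the first row's length. The results agree on the sample data because equal ids are contiguous and the first length is 0. If that might not hold, a sketch that mirrors the loop more closely could look like this. The function name offset_from_start_runs and the frame df2 are made up for this example, and a non-empty frame is assumed.

```python
import pandas as pd

def offset_from_start_runs(df):
    # A row starts a new segment when its id differs from the previous row's id.
    new_segment = df["id"].ne(df["id"].shift())
    # The loop in the question never adds the first row's length, so mirror that.
    new_segment.iloc[0] = False
    # Keep the length only on segment-starting rows, then accumulate.
    base = df["length"].where(new_segment, 0).cumsum()
    return base + df["offset"]

# Hypothetical data in which id 1 reappears after id 2.
df2 = pd.DataFrame({"id":     [1, 2, 1],
                    "length": [10, 20, 10],
                    "offset": [0, 0, 3]})

by_runs = offset_from_start_runs(df2)
by_duplicated = (~df2["id"].duplicated() * df2["length"]).cumsum() + df2["offset"]

print(by_runs.tolist())        # [0, 20, 33], same as the loop in the question
print(by_duplicated.tolist())  # [10, 30, 33]
```

Which behavior is wanted depends on the data: if ids are always sorted and the first length should count, the duplicated-based versions are simpler.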

Here are the timings for each approach:

%timeit fun_dani_duplicated(df2)
647 µs ± 49.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit fun_quang_hoang(df3)
1.31 ms ± 264 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

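Can you explain why this approach is faster? Would it also be faster for large DataFrames?

@Gulzar My guess is that the mask approach hides some extra complexity. I can't guarantee that it will also be faster for larger DataFrames.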
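
Whether the difference holds for larger frames can be measured rather than guessed. The sketch below is one way to do that with the standard timeit module. The bodies of fun_dani_duplicated and fun_quang_hoang are not shown in the thread, so the ones here are assumptions that simply wrap the two answers. make_frame is a hypothetical helper that generates sorted ids. No timings are quoted because they depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

def fun_quang_hoang(df):
    # Assumed body: the mask-based answer.
    return df["length"].mask(df.duplicated("id"), 0).cumsum() + df["offset"]

def fun_dani_duplicated(df):
    # Assumed body: the boolean-multiplication answer.
    return (~df["id"].duplicated() * df["length"]).cumsum() + df["offset"]

def make_frame(n_rows, seed=0):
    # Hypothetical test data: ids are sorted so equal ids are contiguous.
    rng = np.random.default_rng(seed)
    ids = np.sort(rng.integers(0, max(n_rows // 2, 1), size=n_rows))
    return pd.DataFrame({"id": ids,
                         "length": ids * 10,
                         "offset": rng.integers(0, 10, size=n_rows)})

for n_rows in (13, 10_000, 100_000):
    big = make_frame(n_rows)
    # Both functions should agree before their speed is compared.
    assert np.array_equal(fun_quang_hoang(big).to_numpy(),
                          fun_dani_duplicated(big).to_numpy())
    for fn in (fun_dani_duplicated, fun_quang_hoang):
        # Best of 3 repeats, 10 calls each, reported per call.
        seconds = min(timeit.repeat(lambda: fn(big), number=10, repeat=3)) / 10
        print(f"{n_rows:>7} rows  {fn.__name__:<20} {seconds * 1e6:10.1f} us per call")
```

Taking the minimum over repeats reduces noise from other processes; the row counts are arbitrary and can be raised as memory allows.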
