Pandas: splitting a DataFrame by a threshold


print(df)

You can use pd.cut on the cumulative sum of column B:

import pandas as pd

th = 50

# find the cumulative sum of B
cumsum = df.B.cumsum()

# create the bins with spacing of th (threshold); the upper bound is
# cumsum.max() + th so that the largest cumulative sum still falls inside a bin
bins = list(range(0, cumsum.max() + th, th))

# group by (split by) the bins
groups = pd.cut(cumsum, bins)

for key, group in df.groupby(groups):
    print(group)
    print()

Output

Here is an approach that uses numba to speed up a for loop: we check when the limit is reached, reset the running total, and then assign a new group:

import numpy as np
from numba import njit

@njit
def cumsum_reset(array, limit):
    total = 0     # running sum since the last reset
    counter = 0   # current group label
    groups = np.empty(array.shape[0])
    for idx, i in enumerate(array):
        total += i
        # idx > 0 keeps array[idx-1] from wrapping around to the last element
        if total >= limit or (idx > 0 and array[idx-1] == limit):
            counter += 1
            groups[idx] = counter
            total = 0
        else:
            groups[idx] = counter
    
    return groups

grps = cumsum_reset(df['B'].to_numpy(), 50)

for _, grp in df.groupby(grps):
    print(grp, '\n')

Output
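To see what labels the reset loop produces without installing numba, here is a plain-Python sketch of the same logic. The helper name `cumsum_reset_py`, the integer dtype, the sample values, and the `idx > 0` guard (which avoids reading `array[-1]` on the first row) are assumptions added for illustration.

```python
import numpy as np

def cumsum_reset_py(array, limit):
    """Plain-Python mirror of the reset loop (hypothetical helper, no numba)."""
    total = 0      # running sum since the last reset
    counter = 0    # current group label
    groups = np.empty(array.shape[0], dtype=np.int64)
    for idx, value in enumerate(array):
        total += value
        # Start a new group when the running total reaches the limit, or when
        # the previous value on its own was exactly equal to the limit.
        if total >= limit or (idx > 0 and array[idx - 1] == limit):
            counter += 1
            total = 0
        groups[idx] = counter
    return groups

values = np.array([10, 30, 10, 40, 20, 10])  # hypothetical data
labels = cumsum_reset_py(values, 50)
print(labels.tolist())  # [0, 0, 1, 1, 2, 2]
```

Note that the row on which the total reaches the limit opens the new group and the total restarts from zero, so the grouping can differ from the one produced by binning the cumulative sum.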

Timings:

# create dataframe of 600k rows
dfbig = pd.concat([df]*100000, ignore_index=True)
dfbig.shape

(600000, 2)

# Erfan
%%timeit
cumsum_reset(dfbig['B'].to_numpy(), 50)

4.25 ms ± 46.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# Daniel Mesejo
def daniel_mesejo(th, column):
    cumsum = column.cumsum()
    bins = list(range(0, cumsum.max() + 1, th))
    groups = pd.cut(cumsum, bins)
    
    return groups

%%timeit
daniel_mesejo(50, dfbig['B'])

10.3 s ± 2.17 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Conclusion: the numba for loop is roughly 2400 times faster (4.25 ms versus 10.3 s).
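The ratio can be checked directly from the two reported means. This is only arithmetic on the timings quoted above; the variable names are illustrative.

```python
# Mean run times reported by %%timeit above, converted to seconds.
numba_loop_s = 4.25e-3   # 4.25 ms
pd_cut_s = 10.3          # 10.3 s

speedup = pd_cut_s / numba_loop_s
print(f"speedup: about {speedup:.0f}x")
```

The pd.cut measurement has a large spread (± 2.17 s), so the figure is best read as an order of magnitude rather than an exact factor.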

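Hi Daniel, do you know how I can give these groups different names?

You can put them in a dictionary.

@DanielMesejo you might be interested in the timings in my answer. I did not expect a for loop with numba to be so much faster.

@Erfan Indeed, much faster. I guess numba is a really nice tool.

If you use a NumPy array instead of a list for groups, you can get quite a speedup in the numba function.

I tried that, but got an error saying the type of the variable could not be determined. @max9111

It should work if you allocate the array with groups = np.empty(array.shape[0], dtype=np.uint64) instead of groups = [], and write the results with groups[idx] = counter instead of groups.append(counter).

I see, that does work; I will edit the answer. I had tried groups = np.array([]) followed by groups = np.append(groups, counter), which gave me an error. @max9111

Hi @Erfan, I appreciate your answer, but I need to always split the data into 6 bins based on the threshold. Is that possible? I tried editing your code, but it did not work: if total >= 50 or array[idx-1] == 50 and goups == 3: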
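The dictionary suggestion from the comments could look roughly like the sketch below. The sample data and the "group_1", "group_2", ... naming pattern are assumptions; any hashable key would work as a name.

```python
import pandas as pd

# Hypothetical data and naming scheme; neither comes from the thread.
df = pd.DataFrame({"A": list("abcdef"), "B": [10, 30, 10, 40, 20, 10]})

th = 50
cumsum = df["B"].cumsum()
bins = list(range(0, int(cumsum.max()) + th, th))
labels = pd.cut(cumsum, bins)

# Map a readable name to each sub-frame, numbering the groups from 1.
named = {
    f"group_{i}": part
    for i, (_, part) in enumerate(df.groupby(labels, observed=True), start=1)
}

print(named["group_2"])
```

Each value is an ordinary DataFrame, so a named group can be looked up and processed independently of the others.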