Python: How to speed up pandas concat

python pandas performance

I want to expand my DataFrame by duplicating rows at regular intervals:

import pandas as pd 
import numpy as np 
def expandData(data, timeStep=2, sampleLen= 5):
    dataEp = pd.DataFrame()
    for epoch in range(int(len(data)/sampleLen)):
        dataSample = data.iloc[epoch*sampleLen:(epoch+1)*sampleLen, :]
        for num in range(int(sampleLen-timeStep +1)):
            tempDf = dataSample.iloc[num:timeStep+num,:]
            dataEp = pd.concat([dataEp, tempDf],axis= 0)
    return dataEp

df = pd.DataFrame({'a':list(np.arange(5))+list(np.arange(15,20)),
'other':list(np.arange(100,110))})
dfEp = expandData(df, 3, 5)

Output:

df
     a  other
0   0    100
1   1    101
2   2    102
3   3    103
4   4    104
5  15    105
6  16    106
7  17    107
8  18    108
9  19    109

dfEp
     a  other
0   0    100
1   1    101
2   2    102
1   1    101
2   2    102
3   3    103
2   2    102
3   3    103
4   4    104
5  15    105
6  16    106
7  17    107
6  16    106
7  17    107
8  18    108
7  17    107
8  18    108
9  19    109

Expected:

I am hoping for a better-performing way to do this, because when the DataFrame has many rows, e.g. 40,000, my code takes about 20 minutes to run.
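Calling pd.concat inside the loop builds a new DataFrame on every iteration, so the accumulated rows are copied again and again. A minimal sketch of the usual first fix is to collect the slices in a list and concatenate once at the end. The name expand_data_list is hypothetical; the windowing logic is meant to be the same as in expandData above.

```python
import numpy as np
import pandas as pd


def expand_data_list(data, time_step=2, sample_len=5):
    """Same windows as expandData, but with a single concat at the end."""
    pieces = []
    for epoch in range(len(data) // sample_len):
        sample = data.iloc[epoch * sample_len:(epoch + 1) * sample_len]
        for num in range(sample_len - time_step + 1):
            pieces.append(sample.iloc[num:num + time_step])
    # pd.concat raises on an empty list, so return an empty slice in that case.
    return pd.concat(pieces, axis=0) if pieces else data.iloc[0:0]


df = pd.DataFrame({'a': list(np.arange(5)) + list(np.arange(15, 20)),
                   'other': list(np.arange(100, 110))})
df_ep = expand_data_list(df, 3, 5)
print(df_ep)
```

This still loops in Python, so it is only a partial improvement; it mainly removes the repeated copying.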

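Edit:

Actually, I want to repeat a short sequence of length timeStep. I have changed expandData(df, 2, 5) to expandData(df, 3, 5).

If the values are evenly spaced, you can test for breaks in the sequence and then duplicate the rows within each consecutive run accordingly.

Sample output: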

    a  other  start/stop
0   0    100         NaN
1   1    101         0.0
1   1    101         0.0
2   2    102         0.0
2   2    102         0.0
3   3    103         0.0
3   3    103         0.0
4   4    104        10.0
5  15    105       -10.0
6  16    106         0.0
6  16    106         0.0
7  17    107         0.0
7  17    107         0.0
8  18    108         0.0
8  18    108         0.0
9  19    109         NaN
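The values in the start/stop column above are consistent with a second difference of column a: 0 inside an evenly spaced run, non-zero at a break, and NaN at the two ends. The sketch below is a reconstruction under that assumption and may differ from the answerer's original code; the helper names step_next, step_prev and repeats are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': list(np.arange(5)) + list(np.arange(15, 20)),
                   'other': list(np.arange(100, 110))})

# Second difference of `a`: (next - current) - (current - previous).
step_next = df['a'].shift(-1) - df['a']
step_prev = df['a'] - df['a'].shift(1)
df['start/stop'] = step_next - step_prev

# Rows strictly inside a run are emitted twice, boundary rows once.
# NaN == 0 is False, so the first and last rows are emitted once.
repeats = np.where(df['start/stop'] == 0, 2, 1)
out = df.loc[df.index.repeat(repeats)]
print(out)
```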

If it only depends on the epoch length (you did not state the rule explicitly), it is even simpler:

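import numpy as np
import pandas as pd

df = pd.DataFrame({'a':list(np.arange(5))+list(np.arange(15,20)),
'other':list(np.arange(100,110))})

sampleLen = 5
# Repeat every row of an epoch twice, except its first and last rows
repeat = np.repeat([2], sampleLen)
repeat[0] = repeat[-1] = 1
# Apply the same pattern to every epoch
repeat = np.tile(repeat, len(df)//sampleLen)

df = df.loc[np.repeat(df.index.values, repeat)]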

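The index values 0, 4, 15, 19 are not duplicated. Can you show the constraint for the merge? It looks like you are trying to split continuous intervals into stepwise intervals.

Is the step always 1? Are you sure you need this? It sounds like an XY problem. What is the next computation you want to run on these newly defined intervals?

I am using it for an RNN model: one run of numbers is treated as one sample and another run as another sample. Sorry for misleading you.

Hi, thanks! When timeStep takes a different value, the result may not be what I expected.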
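For an arbitrary timeStep, one option is to compute the row positions of every window with NumPy broadcasting and pass them to a single iloc call. This is a minimal sketch rather than code from the thread: the name expand_data_vectorized is hypothetical, and it assumes the same rule as expandData in the question (only complete epochs of sampleLen rows are used, and trailing rows are dropped).

```python
import numpy as np
import pandas as pd


def expand_data_vectorized(data, time_step=2, sample_len=5):
    """Build all row positions at once, then index with a single iloc call."""
    n_epochs = len(data) // sample_len
    n_windows = sample_len - time_step + 1
    # First row position of each epoch: 0, sample_len, 2 * sample_len, ...
    epoch_start = np.arange(n_epochs) * sample_len
    # Offset of each window inside an epoch: 0, 1, ..., n_windows - 1
    window_start = np.arange(n_windows)
    # Offset of each row inside a window: 0, 1, ..., time_step - 1
    within = np.arange(time_step)
    # Broadcast to shape (n_epochs, n_windows, time_step), flatten in C order
    pos = (epoch_start[:, None, None]
           + window_start[None, :, None]
           + within[None, None, :]).ravel()
    return data.iloc[pos]


df = pd.DataFrame({'a': list(np.arange(5)) + list(np.arange(15, 20)),
                   'other': list(np.arange(100, 110))})
df_ep3 = expand_data_vectorized(df, 3, 5)
df_ep2 = expand_data_vectorized(df, 2, 5)
print(df_ep3)
```

With time_step=2 this yields the same rows as the np.repeat approach above; with time_step=3 it yields the windows shown for dfEp in the question. Because no Python-level loop or repeated concat is involved, it should scale far better on large frames, though actual timings depend on the data.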
