Python: How to optimize changing values in a DataFrame column

python pandas

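I'm trying to compute how much a stock changes from one day to n days into the future. The only problem is that running this over 1,000 rows takes about a minute, and I have millions of rows. I think the slowdown is caused by this line: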

stocks[0][i][string][line[index]] = adjPctChange(line[adjClose], line[num])

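I suspect that each time this line runs, the entire 3D DataFrame of 500 stocks may be copied, but I'm not sure, and I don't know how to make it faster. It also emits the following warning: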

SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
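For context, this warning is typically triggered by chained indexing such as df[col][row] = value, where the assignment may land on a temporary copy. A minimal sketch of the single-step alternative follows; the small frame is a hypothetical stand-in for one stock's data, not the asker's actual structure.

```python
import pandas as pd

# Hypothetical stand-in for a single stock's frame
df = pd.DataFrame({"adjClose": [0.984507, 1.034904, 0.994047],
                   "closeShift1": [1.034904, 0.994047, None]})

# Chained indexing (df["closeShift1"][0] = ...) first selects a column and
# then assigns into that result, which may be a temporary copy.
# A single .loc call addresses row and column in one step instead.
df.loc[0, "closeShift1"] = 5.0

# For a single scalar cell, .at is the lighter-weight accessor
df.at[1, "closeShift1"] = -4.0

print(df)
```

This removes the ambiguity behind the warning, but assigning one cell at a time is still slow on millions of rows.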

Here is my code:

daysForeward = 2
for days in range(1,daysForeward+1):
    string = 'closeShift'+str(days)
    stocks[0][i][string] = stocks[0][i]['adjClose'].shift(days-(days*2))

for line in stocks[0][i].itertuples():
    num = 6  # index of the first closeShift column
    for days in range(1,daysForeward+1):
        string = 'closeShift'+str(days)
        stocks[0][i][string][line[index]] = adjPctChange(line[adjClose],line[num])
        num+=1

Here is the data before and after applying the percent change:

       date     open    close  adjClose  closeShift1  closeShift2
0  19980102  20.3835  20.4417       NaN          NaN     0.984507
1  19980105  20.5097  20.5679       NaN     0.984507     1.034904
2  19980106  20.1408  20.0826  0.984507     1.034904     0.994047
3  19980107  20.1408  20.9950  1.034904     0.994047     0.982926
4  19980108  21.1115  20.0244  0.994047     0.982926     0.989441

       date     open    close  adjClose  closeShift1  closeShift2
0  19980102  20.3835  20.4417       NaN          NaN          NaN
1  19980105  20.5097  20.5679       NaN          NaN          NaN
2  19980106  20.1408  20.0826  0.984507     4.869735     0.959720
3  19980107  20.1408  20.9950  1.034904    -3.947904    -5.022423
4  19980108  21.1115  20.0244  0.994047    -1.118683    -0.463311

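Some explanation: in stocks[0][i], the [0] is only there to reach the right level of the 3D DataFrame, and [i] is the stock name currently being iterated over in an outer for loop. The adjClose column is just a modified version of close, which I prefer to use instead of close. adjPctChange() is a custom percent-change function that switches the equation depending on direction, so that going from 100 to 50 yields a change of the same magnitude as going from 50 to 100. That way the results can be averaged without skewing upward.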

def adjPctChange(startPoint, currentPoint):
    if startPoint < currentPoint:
        x = abs(((float(startPoint)-currentPoint)/float(currentPoint))*100.0)
    else:
        x = ((float(currentPoint)-startPoint)/float(startPoint))*100.0    
    return x
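To make the symmetry concrete, here is a small check using the function exactly as posted. An ordinary percent change would give +100% for 50 to 100 and -50% for 100 to 50, which averages to +25%; with this function the two moves cancel.

```python
def adjPctChange(startPoint, currentPoint):
    # Copied from the question: always divide by the larger of the two values
    if startPoint < currentPoint:
        x = abs(((float(startPoint)-currentPoint)/float(currentPoint))*100.0)
    else:
        x = ((float(currentPoint)-startPoint)/float(startPoint))*100.0
    return x

up = adjPctChange(50, 100)    # rise: |50 - 100| / 100 * 100
down = adjPctChange(100, 50)  # fall: (50 - 100) / 100 * 100
print(up, down)               # 50.0 -50.0
print((up + down) / 2)        # 0.0
```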

Thanks to anyone who can help.

IIUC：

I started with this DataFrame:

print(df)

       date     open    close  adjclose
0  19980102  20.3835  20.4417  0.984507
1  19980105  20.5097  20.5679  1.034904
2  19980106  20.1408  20.0826  0.994047
3  19980107  20.1408  20.9950  0.982926
4  19980108  21.1115  20.0244  0.989441

Then I created these functions:

import pandas as pd

def get_lags(s, n):
    # Columns 0..n hold the series shifted down by 0..n rows
    return pd.concat([s.shift(i) for i in range(n + 1)],
                     axis=1, keys=range(n + 1))

def get_comps(lags):
    comps = []
    for i in range(1, lags.shape[1]):
        # Compare the unshifted column with lag i: larger value over smaller
        max_ = lags.iloc[:, [0, i]].max(axis=1)
        min_ = lags.iloc[:, [0, i]].min(axis=1)
        comps.append((max_ / min_ - 1) * 100)
    return pd.concat(comps, axis=1)

Then I get the lags and compare them:

print(get_comps(get_lags(df.adjclose, 2)))

          0         1
0  0.000000  0.000000
1  5.119009  0.000000
2  4.110168  0.969013
3  1.131418  5.288089
4  0.662817  0.465515

Finally, I concatenate them with df:

print(pd.concat([df, get_comps(get_lags(df.adjclose, 2))], axis=1))

       date     open    close  adjclose         0         1
0  19980102  20.3835  20.4417  0.984507  0.000000  0.000000
1  19980105  20.5097  20.5679  1.034904  5.119009  0.000000
2  19980106  20.1408  20.0826  0.994047  4.110168  0.969013
3  19980107  20.1408  20.9950  0.982926  1.131418  5.288089
4  19980108  21.1115  20.0244  0.989441  0.662817  0.465515

Modify as needed.
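One possible modification, since the question asks about changes into the future: shift in the opposite direction. The sketch below is only an illustration. get_leads, get_forward_comps, and the change1/change2 column names are hypothetical, and skipna=False is an added choice so that rows with no future value stay NaN instead of showing 0.

```python
import pandas as pd

def get_leads(s, n):
    # Hypothetical forward-looking counterpart of get_lags:
    # column k holds the value k rows later (NaN past the end)
    return pd.concat([s.shift(-k) for k in range(n + 1)],
                     axis=1, keys=range(n + 1))

def get_forward_comps(leads):
    comps = {}
    for k in leads.columns[1:]:
        pair = leads[[0, k]]
        # skipna=False keeps NaN where no future value exists
        max_ = pair.max(axis=1, skipna=False)
        min_ = pair.min(axis=1, skipna=False)
        comps["change" + str(k)] = (max_ / min_ - 1) * 100
    return pd.DataFrame(comps)

# adjclose values from the sample frame in this answer
adjclose = pd.Series([0.984507, 1.034904, 0.994047, 0.982926, 0.989441])
result = get_forward_comps(get_leads(adjclose, 2))
print(result)
```

Like the original get_comps, this yields an unsigned ratio of the larger value to the smaller, so it does not reproduce the signed output of adjPctChange.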

You shouldn't loop over the rows of a DataFrame; use vectorized array operations instead.

Before:

In [30]: df
Out[30]:
       date     open    close  adjClose  closeShift1  closeShift2
0  19980102  20.3835  20.4417       NaN          NaN     0.984507
1  19980105  20.5097  20.5679       NaN     0.984507     1.034904
2  19980106  20.1408  20.0826  0.984507     1.034904     0.994047
3  19980107  20.1408  20.9950  1.034904     0.994047     0.982926
4  19980108  21.1115  20.0244  0.994047     0.982926     0.989441

Array notation:

import numpy as np

daysForeward = 2
for day in range(1, daysForeward+1):
    column = 'closeShift' + str(day)
    # Signed difference divided by the larger of the two values, as in adjPctChange
    df[column] = (df[column] - df.adjClose) / np.maximum(df[column], df.adjClose) * 100.0

After:

In [33]: df
Out[33]:
       date     open    close  adjClose  closeShift1  closeShift2
0  19980102  20.3835  20.4417       NaN          NaN          NaN
1  19980105  20.5097  20.5679       NaN          NaN          NaN
2  19980106  20.1408  20.0826  0.984507     4.869727     0.959713
3  19980107  20.1408  20.9950  1.034904    -3.947902    -5.022495
4  19980108  21.1115  20.0244  0.994047    -1.118760    -0.463358
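For completeness, here is a self-contained sketch that builds the shifted columns and applies the array formula in one pass, then spot-checks a cell against the scalar function from the question. It assumes a single flat frame with positive prices (the five adjClose values are sample numbers from this thread) rather than the asker's nested stocks[0][i] structure.

```python
import numpy as np
import pandas as pd

def adjPctChange(startPoint, currentPoint):
    # Scalar function from the question, kept only to check the result
    if startPoint < currentPoint:
        return abs((float(startPoint) - currentPoint) / float(currentPoint)) * 100.0
    return (float(currentPoint) - startPoint) / float(startPoint) * 100.0

# Sample adjClose values from this thread
df = pd.DataFrame({"adjClose": [0.984507, 1.034904, 0.994047, 0.982926, 0.989441]})

daysForeward = 2
for day in range(1, daysForeward + 1):
    future = df["adjClose"].shift(-day)  # value `day` rows ahead; NaN at the tail
    df["closeShift" + str(day)] = (
        (future - df["adjClose"]) / np.maximum(future, df["adjClose"]) * 100.0
    )

# Spot check one cell against the scalar version
expected = adjPctChange(df["adjClose"][0], df["adjClose"][1])
print(np.isclose(df["closeShift1"][0], expected))  # True
```

Note that shift(-day) is the same shift the question writes as shift(days-(days*2)). Each column is computed in a single array operation, so there is no per-cell assignment and nothing to trigger SettingWithCopyWarning.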