Vectorizing data processing in a Python numpy threshold analysis


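Basically, I have some gridded meteorological data with dimensions (time, lat, lon).

I need to go through the time series at each gridsquare, identify runs of consecutive days ("events") when the variable is above a threshold, and store them in a new variable (THdays).

Then I look through the new variable and find the events that last longer than a certain duration (THevents).

Currently I have a super scrappy (non-vectorized) iteration, and I would really appreciate your advice on how to speed it up. Thanks!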

import numpy as np
import itertools as it
##### Parameters
lg = 2000  # length to initialise array (must be long to store large number of events)
rl = 180  # e.g latitude
cl = 360  # longitude
pcts = [95, 97, 99] # percentiles which are the thresholds that will be compared
dt = [1,2,3] #duration thresholds, i.e. consecutive values (days) above threshold

##### Data
data = np.random.rand(1000, rl, cl)  # gridded data with shape (time, lat, lon); random values stand in for the real data
# From this data calculate the percentiles at each gridsquare (lat/lon combination) which will act as our thresholds
histpcts = np.percentile(data, q=pcts, axis = 0)


##### Initialize arrays to store the results
THdays = np.ndarray((rl, cl, lg, len(pcts)), dtype='int16') #Array to store consecutive threshold timesteps
THevents = np.ndarray((rl,cl,lg,len(pcts),len(dt)),dtype='int16')

##### Start iteration to identify events
for p in range(len(pcts)):  # for each threshold value
    br = data>histpcts[p,:,:]  # Make boolean array where data is bigger than threshold

    # for every lat/lon combination
    for r,c in it.product(range(rl),range(cl)): 
        if br[:,r,c].any()==True: # This is to skip timeseries with nans only and so the iteration is skipped. Important to keep this or something that ignores an array of nans
            a = [ sum( 1 for _ in group ) for key, group in it.groupby( br[:,r,c] ) if key ] # Find the consecutive instances
            tm = np.full(lg-len(a), np.nan)   # create an array of nans to fill in the rest


            # Assign to new array
            THdays[r,c,0:len(a),p] = a  # Consecutive Thresholds days
            THdays[r,c,len(a):,p] = tm  # Fill the rest of array

            # Now cycle through and identify events 
            # (consecutive values) longer than a certain duration (dt)
            for d in range(len(dt)):
                b = THdays[r,c,THdays[r,c,:,p]>=dt[d],p]
                THevents[r,c,0:len(b),p,d] = b

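Have you tried Numba? It can give you a large speedup when you use simple loops with numpy. You just need to put your code into a function and decorate it with @jit. That's all: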

from numba import jit

@jit
def myfun(inputs):
    ## crazy nested loops here
    pass

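Of course, the more information you provide, the better the speedup you will get.

More information is available in the Numba documentation.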
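As a rough sketch of how this advice might apply to the inner step of the question, the function below counts the lengths of consecutive runs of True in a 1-D boolean array with a plain loop, which is the kind of code Numba tends to compile well. The name run_lengths and the preallocated output array are illustrative assumptions, not part of the original post. The import falls back to a do-nothing decorator so the sketch still runs where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # assumption: fall back to plain Python if Numba is missing
    def njit(func):
        return func


@njit
def run_lengths(mask):
    """Return the lengths of consecutive runs of True in a 1-D boolean array."""
    out = np.zeros(mask.size, dtype=np.int64)  # there can be at most mask.size runs
    n_runs = 0
    current = 0
    for i in range(mask.size):
        if mask[i]:
            current += 1
        elif current > 0:
            out[n_runs] = current  # a run just ended
            n_runs += 1
            current = 0
    if current > 0:  # close a run that reaches the end of the series
        out[n_runs] = current
        n_runs += 1
    return out[:n_runs]


mask = np.array([True, True, False, True, False, False, True, True, True])
lengths = run_lengths(mask)
```

Calling this once per gridsquare still needs an outer loop over lat/lon, so it is a starting point for experimenting with Numba rather than a full replacement for the question's code.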

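Just work from the inside out. You have a triple nested for loop. Extract the inner loop into a function and figure out how to vectorize it. As with any other programming task, break the problem into manageable parts. I have done something along the same lines before.

@JohnZwinck thanks, I'll take a look.

If you want the best speedup, you have to use njit() rather than jit(), and then you have to deal with some limitations. The OP's code won't magically turn into optimal code. It may be faster, but not optimal.

I agree. But jit is a better way to get started with Numba, also because the error messages from njit are hard to interpret the first time you run into them.

Thanks @giolelm and @JohnZwinck. I tried jit and njit and then got stuck on an error message. Do you know what this means?

raise patched_exception LoweringError: make_function(closure=None, code=, name=None, defaults=None)

Try refactoring everything to use numpy arrays (instead of lists) and numpy functions (e.g. np.sum instead of sum).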
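Following the suggestion to rely on numpy arrays and numpy functions, here is a minimal sketch of one way to find run lengths for every gridsquare without a Python loop. It is not the original poster's method: the function name event_lengths, the small random test data, and the min_days variable are assumptions made for illustration. The idea is to pad the boolean array with False along time, take np.diff, and pair each False-to-True step with the next True-to-False step.

```python
import numpy as np


def event_lengths(above):
    """Find runs of True along the time axis of a (time, lat, lon) boolean array.

    Returns (rows, cols, lengths) with one entry per run, grouped by grid cell.
    """
    # Put time last so np.nonzero lists runs cell by cell, in time order.
    a = np.moveaxis(above, 0, -1).astype(np.int8)
    # Pad with False on both ends so runs touching the edges are closed.
    padded = np.pad(a, [(0, 0), (0, 0), (1, 1)], mode="constant")
    step = np.diff(padded, axis=-1)
    rows, cols, starts = np.nonzero(step == 1)   # False -> True transitions
    _, _, ends = np.nonzero(step == -1)          # True -> False transitions
    return rows, cols, ends - starts


rng = np.random.default_rng(0)
data = rng.random((50, 4, 5))                    # small stand-in for (time, lat, lon)
threshold = np.percentile(data, 95, axis=0)      # per-cell threshold
rows, cols, lengths = event_lengths(data > threshold)

# Keep only events lasting at least `min_days` consecutive time steps.
min_days = 2
long_events = lengths >= min_days
```

Because every run has exactly one start and one end, the two np.nonzero calls return matching entries in the same order. Comparisons with NaN evaluate to False, so a gridsquare containing only NaNs simply contributes no events, which plays the role of the .any() check in the question. The results come back as flat arrays indexed by (rows, cols) rather than a fixed-size padded array.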