
Vectorizing data processing in a Python numpy threshold analysis


Basically I have some gridded meteorological data with dimensions (time, latitude, longitude).

  • I need to check the time series at every gridsquare, identify runs of consecutive days ("events") when the variable is above a threshold, and store them in a new variable (THdays).
  • Then I go through the new variable and find events longer than a certain duration (THevents).
  • Currently I have a super clunky (non-vectorized) iteration, and I would greatly appreciate suggestions on how to speed it up. Thanks.

    import numpy as np
    import itertools as it

    ##### Parameters
    lg = 2000  # length to initialise array (must be long enough to store a large number of events)
    rl = 180   # e.g. latitude
    cl = 360   # longitude
    pcts = [95, 97, 99]  # percentiles which are the thresholds that will be compared
    dt = [1, 2, 3]       # duration thresholds, i.e. consecutive values (days) above threshold

    ##### Data
    # Gridded data of shape (time, lat, lon); random placeholder standing in for the real dataset
    data = np.random.rand(1000, rl, cl)
    # From this data calculate the percentiles at each gridsquare (lat/lon combination),
    # which will act as our thresholds
    histpcts = np.percentile(data, q=pcts, axis=0)

    ##### Initialise arrays to store the results
    # np.zeros rather than np.ndarray, so unused slots are 0 instead of uninitialised memory;
    # an int16 array cannot hold np.nan, so 0 marks "no event" here
    THdays = np.zeros((rl, cl, lg, len(pcts)), dtype='int16')  # consecutive threshold timesteps
    THevents = np.zeros((rl, cl, lg, len(pcts), len(dt)), dtype='int16')

    ##### Start iteration to identify events
    for p in range(len(pcts)):  # for each threshold value
        br = data > histpcts[p, :, :]  # boolean array where data is bigger than threshold

        # for every lat/lon combination
        for r, c in it.product(range(rl), range(cl)):
            # Skip all-NaN timeseries: NaN comparisons are False, so .any() is False there.
            # Important to keep this (or something that ignores an array of NaNs).
            if br[:, r, c].any():
                # Lengths of the consecutive above-threshold runs
                a = [sum(1 for _ in group) for key, group in it.groupby(br[:, r, c]) if key]

                # Assign to the new array; the rest of the slot stays 0
                THdays[r, c, 0:len(a), p] = a  # consecutive threshold days

                # Now cycle through and identify events
                # (consecutive runs) at least as long as each duration dt
                for d in range(len(dt)):
                    b = THdays[r, c, THdays[r, c, :, p] >= dt[d], p]
                    THevents[r, c, 0:len(b), p, d] = b
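As a point of comparison, one common way to vectorize the run-length step (the `itertools.groupby` line) is to diff a padded boolean array and subtract run starts from run ends. `run_lengths` below is an illustrative helper, not part of the question's code:

```python
import numpy as np

def run_lengths(b):
    """Lengths of consecutive True runs in a 1-D boolean array."""
    # Pad with False so every run has a clear start and end
    padded = np.concatenate(([False], b, [False]))
    # In the diff, +1 marks a run start and -1 marks a run end
    edges = np.diff(padded.astype(np.int8))
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)
    return ends - starts

b = np.array([True, True, False, True, True, True, False, True])
print(run_lengths(b))  # [2 3 1]
```

This stays inside numpy for each timeseries, so only the lat/lon loop remains in Python.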
    
    Have you tried numba? When you use plain loops with numpy, it will give you a huge speedup. You just put your code into a function and apply the decorator
    @jit
    to the function. That's all:

    from numba import jit

    @jit
    def myfun(inputs):
        ## crazy nested loops here
        ...

    Of course, the more information you provide, the better the speedup you will get.
    You can find more information in the numba documentation.

    Just work from the inside out. You have a triple-nested for loop. Extract the inner loop into a function and figure out how to vectorize it. Like any other programming task, break the problem into manageable pieces. Here is some work I did earlier along the same lines:

    @JohnZwinck Thanks, I'll take a look.

    If you want the best speedup you have to use
    njit()
    rather than
    jit()
    , and then you have to deal with some restrictions. The OP's code won't magically turn into optimal code. Maybe faster, but not optimal.

    I agree. But it's a better way to start using numba's jit. Also because njit's error messages will be hard to interpret the first time you stumble on them.

    Thanks @giolelm and @JohnZwinck, I tried
    jit
    and
    njit
    and then looked at the error messages. Do you know what this means?
    LoweringError: make_function(closure=None, code=, name=None, defaults=None)

    Try refactoring everything to use numpy arrays (instead of lists) and numpy functions (e.g. np.sum instead of sum).