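Vectorizing threshold analysis of data in Python with NumPy
Tags: python, numpy, vectorization

I have some gridded meteorological data with dimensions (time, latitude, longitude).
I need to go through the time series at every grid square, identify runs of consecutive days ("events") when the variable is above a threshold, and store them in a new variable (THdays).
I then look through that new variable to find the events that last longer than a given duration (THevents).
At the moment I have a very clunky (non-vectorized) iteration, and I would greatly appreciate any advice on how to speed it up. Thanks.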
import numpy as np
import itertools as it
##### Parameters
lg = 2000 # length to initialise array (must be long to store large number of events)
rl = 180 # e.g latitude
cl = 360 # longitude
pcts = [95, 97, 99] # percentiles which are the thresholds that will be compared
dt = [1,2,3] #duration thresholds, i.e. consecutive values (days) above threshold
##### Data
data = np.random.rand(1000, rl, cl) # gridded data with shape (time, lat, lon); random values stand in for the real dataset here
# From this data calculate the percentiles at each gridsquare (lat/lon combination) which will act as our thresholds
histpcts = np.percentile(data, q=pcts, axis = 0)
##### Initialize arrays to store the results
THdays = np.ndarray((rl, cl, lg, len(pcts)), dtype='int16') #Array to store consecutive threshold timesteps
THevents = np.ndarray((rl,cl,lg,len(pcts),len(dt)),dtype='int16')
##### Start iteration to identify events
for p in range(len(pcts)): # for each threshold value
    br = data>histpcts[p,:,:] # Make boolean array where data is bigger than threshold
    # for every lat/lon combination
    for r,c in it.product(range(rl),range(cl)):
        if br[:,r,c].any()==True: # This is to skip timeseries with nans only and so the iteration is skipped. Important to keep this or something that ignores an array of nans
            a = [ sum( 1 for _ in group ) for key, group in it.groupby( br[:,r,c] ) if key ] # Find the consecutive instances
            tm = np.full(lg-len(a), np.nan) # create an array of nans to fill in the rest
            # Assign to new array
            THdays[r,c,0:len(a),p] = a # Consecutive Thresholds days
            THdays[r,c,len(a):,p] = tm # Fill the rest of array
            # Now cycle through and identify events
            # (consecutive values) longer than a certain duration (dt)
            for d in range(len(dt)):
                b = THdays[r,c,THdays[r,c,:,p]>=dt[d],p]
                THevents[r,c,0:len(b),p,d] = b
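
Have you tried Numba?
It can give you a substantial speedup when you use simple loops over NumPy arrays.
You only need to put your code in a function and decorate that function with @jit. That's all: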
from numba import jit

@jit
def myfun(inputs):
    ## crazy nested loops here
    ...
Of course, the more information you give it, the better the speedup you will get.
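
As a rough illustration of the suggestion above, here is a minimal sketch of how the counting step for a single gridsquare might look as a compiled function. The name `run_lengths` and the toy series are hypothetical and not taken from the question. The explicit loop stands in for `itertools.groupby` and the generator expression, which Numba's nopython mode may not be able to compile. The `try`/`except` fallback is only there so the sketch still runs where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption: fall back to plain Python if Numba is unavailable.
    def njit(func):
        return func


@njit
def run_lengths(above):
    """Return the lengths of consecutive True runs in a 1-D boolean array."""
    out = np.empty(above.size, dtype=np.int64)  # upper bound on the number of runs
    n_runs = 0
    current = 0
    for i in range(above.size):
        if above[i]:
            current += 1          # still inside a run
        elif current > 0:
            out[n_runs] = current  # a run has just ended
            n_runs += 1
            current = 0
    if current > 0:               # a run that reaches the end of the series
        out[n_runs] = current
        n_runs += 1
    return out[:n_runs]


# Hypothetical toy series and threshold, for illustration only.
series = np.array([0.1, 0.9, 0.95, 0.2, 0.99, 0.3, 0.97, 0.98, 0.96])
print(run_lengths(series > 0.8))  # [2 1 3]
```

Filtering by duration is then a plain boolean mask on the result, e.g. `lengths[lengths >= 2]`.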
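More information is available in the Numba documentation.

Just work from the inside out. You have a triple-nested for loop: extract the innermost loop into a function and figure out how to vectorize it. As with any other programming task, break the problem down into manageable pieces. I have done some earlier work along the same lines.

@JohnZwinck Thanks, I'll take a look.

If you want the best speedup, you have to use njit() rather than jit(), and then you have to deal with some limitations. The OP's code will not magically turn into optimal code. It may get faster, but it will not be optimal.

I agree, but jit is a better way to get started with Numba, particularly because njit's error messages are hard to interpret the first time you run into them.

Thanks @giolelm and @JohnZwinck. I tried jit and njit and then got stuck on an error message. Do you know what it means? raise patched_exception LoweringError: make_function(closure=None, code=, name=None, defaults=None)

Also try refactoring everything to use NumPy arrays (instead of lists) and NumPy functions (e.g. np.sum instead of sum).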
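
To make the "NumPy arrays and NumPy functions" advice concrete, here is a hedged sketch of two loop-free building blocks. Both function names (`run_lengths_1d`, `count_events`) and the test data are hypothetical. The first extracts run lengths for one time series with `np.diff`. The second counts, for every gridsquare at once, the runs along the time axis that last at least a given number of steps. It returns event counts rather than filling padded arrays such as THdays and THevents, which is a different design from the question's code, and it allocates several temporary arrays the size of the input.

```python
import numpy as np


def run_lengths_1d(above):
    """Lengths of consecutive True runs in a 1-D boolean array."""
    # Pad with False on both sides so every run has a start and an end edge.
    padded = np.concatenate(([False], above, [False]))
    edges = np.diff(padded.astype(np.int8))
    starts = np.flatnonzero(edges == 1)   # index where each run begins
    ends = np.flatnonzero(edges == -1)    # index one past where each run stops
    return ends - starts


def count_events(above, min_duration):
    """Count True runs along axis 0 lasting at least min_duration steps."""
    csum = np.cumsum(above, axis=0)
    # Cumulative count at the most recent False step (0 if there is none yet).
    last_reset = np.maximum.accumulate(np.where(above, 0, csum), axis=0)
    run = csum - last_reset               # length of the current run so far
    # A run ends where the next step is False or the series stops.
    is_end = above.copy()
    is_end[:-1] &= ~above[1:]
    return np.sum(is_end & (run >= min_duration), axis=0)


# Hypothetical small grid (time, lat, lon), for illustration only.
rng = np.random.default_rng(0)
data = rng.random((200, 4, 5))
thresholds = np.percentile(data, 95, axis=0)
above = data > thresholds                 # NaN compares as False, so all-NaN cells yield no events
events = {d: count_events(above, d) for d in (1, 2, 3)}
print(events[1].shape)                    # (4, 5)
```

The running-length trick relies on the cumulative sum never decreasing: subtracting its value at the last False step leaves the number of True steps since then.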