Speeding up distance and summary calculations between two large multidimensional arrays in Python
I have only a year of experience with Python. I want to compute summary statistics based on two multidimensional arrays, DF_All and DF_On. Both have X and Y values. I wrote a function that computes the distance as sqrt((X-X0)^2+(Y-Y0)^2) and produces the summary shown in the code below. My question: is there any way to make this code run faster? I would prefer a native Python approach, but other strategies (such as numba) are also welcome.
The (toy) example code below takes only 50 ms to run on my Windows 7 x64 desktop. But my DF_All has more than 10,000 rows, and I also need to run a large number of these calculations, which makes the execution time far too long.
import random

import numpy as np
import pandas as pd

# create data
KY = ['ER', 'WD', 'DF']
DS = ['On', 'Off']
DF_All = pd.DataFrame({'KY': np.random.choice(KY, 20, replace=True),
                       'DS': np.random.choice(DS, 20, replace=True),
                       'X': random.sample(range(1, 100), 20),
                       'Y': random.sample(range(1, 100), 20)})
DF_On = DF_All[DF_All['DS'] == 'On']

# function
def get_values(DF_All, X=list(DF_On['X'])[0], Y=list(DF_On['Y'])[0]):
    dist_vector = np.sqrt((DF_All['X'] - X)**2 + (DF_All['Y'] - Y)**2)  # computes distance
    DF_All = DF_All[dist_vector < 35]  # keeps rows whose distance is < 35
    # print(DF_All.shape)
    DS_summary = [sum(DF_All['DS'] == x) for x in ['On', 'Off']]  # count per DS value
    KY_summary = [sum(DF_All['KY'] == x) for x in ['ER', 'WD', 'DF']]  # count per KY value
    joined_summary = DS_summary + KY_summary  # join the two summary lists
    return joined_summary

Array_On = DF_On.values.tolist()  # convert to array, then to list
# DS and KY summary for every row of Array_On (ZZ[2] is X, ZZ[3] is Y)
Values = [get_values(DF_All, ZZ[2], ZZ[3]) for ZZ in Array_On]
Array_Updated = [x + y for x, y in zip(Array_On, Values)]  # append each summary to its row
Array_Updated = pd.DataFrame(Array_Updated)  # convert to pandas dataframe
print(Array_Updated)
Here's an approach that leverages vectorization by eliminating the loops:
import numpy as np
from scipy.spatial.distance import cdist

def get_values_vectorized(DF_All, Array_On):
    a = DF_All[['X','Y']].values                 # coordinates of all rows
    b = np.array(Array_On)[:,2:].astype(int)     # X, Y of the 'On' rows
    v_mask = (cdist(b,a) < 35).astype(int)       # 1 where a row lies within distance 35

    DF_DS = DF_All.DS.values
    DS_sums = v_mask.dot(DF_DS[:,None] == ['On','Off'])       # counts per DS value

    DF_KY = DF_All.KY.values
    KY_sums = v_mask.dot(DF_KY[:,None] == ['ER','WD','DF'])   # counts per KY value
    return np.column_stack(( DS_sums, KY_sums ))
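To make the counting step easier to follow, here is a minimal sketch of the same idea using only NumPy broadcasting in place of cdist. The points, labels and radius are made-up toy values (not the asker's data), and the variable names are illustrative assumptions.

```python
import numpy as np

# Hypothetical toy data: four points, each with a category label.
points = np.array([[0, 0], [3, 4], [30, 40], [100, 100]], dtype=float)
labels = np.array(['On', 'Off', 'On', 'Off'])
centers = points[labels == 'On']   # reference points, shape (2, 2)
radius = 35

# Pairwise Euclidean distances via broadcasting:
# dist[i, j] is the distance from centers[i] to points[j].
diff = centers[:, None, :] - points[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=2))

# 1 where a point lies within the radius of a reference point, else 0.
mask = (dist < radius).astype(int)

# One-hot encoding of the labels, one column per category ('On', 'Off').
one_hot = (labels[:, None] == np.array(['On', 'Off'])).astype(int)

# Matrix product: counts[i, k] is the number of points within the radius
# of centers[i] that carry category k.
counts = mask.dot(one_hot)
print(counts)
```

Each row of counts corresponds to one reference point and each column to one category, which is the layout that get_values_vectorized stacks for DS and KY.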
Case 2: sample size 2000
In [420]: %timeit [get_values(DF_All,ZZ[2],ZZ[3]) for ZZ in Array_On]
1 loops, best of 3: 1.39 s per loop
In [421]: %timeit get_values_vectorized(DF_All, Array_On)
100 loops, best of 3: 18 ms per loop
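I had to experiment with and adjust the code quite a bit to get it running on my real data. The (***==x)[:,None] & v_mask).sum(0) code is the key step in speeding things up. Thanks!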
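The expression quoted in the comment appears to be a boolean-mask variant of the same counting step. The sketch below shows one plausible reading of it on made-up toy values; the array names and the orientation of the mask (rows along the first axis, reference points along the second) are assumptions, not something the thread confirms.

```python
import numpy as np

# Hypothetical labels for four rows.
labels = np.array(['On', 'Off', 'On', 'Off'])

# Assumed orientation: near[j, i] is True when row j lies within the
# distance threshold of reference point i.
near = np.array([[True, False],
                 [True, False],
                 [False, True],
                 [False, False]])

# For one category: AND the per-row label test with the mask, then sum over rows.
on_counts = ((labels == 'On')[:, None] & near).sum(0)
off_counts = ((labels == 'Off')[:, None] & near).sum(0)
by_mask = np.column_stack((on_counts, off_counts))

# The same counts from a single matrix product with a one-hot label matrix.
one_hot = (labels[:, None] == np.array(['On', 'Off'])).astype(int)
by_dot = near.T.astype(int).dot(one_hot)

print(by_mask)
print(by_dot)
```

Both formulations count, for every reference point, how many nearby rows carry each label; the matrix product simply handles all categories at once.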
Case 1:
In [417]: %timeit [get_values(DF_All,ZZ[2],ZZ[3]) for ZZ in Array_On]
100 loops, best of 3: 16.3 ms per loop
In [418]: %timeit get_values_vectorized(DF_All, Array_On)
1000 loops, best of 3: 386 µs per loop
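The %timeit lines are IPython magics. Outside IPython, a comparison of this kind can be sketched with the standard timeit module. The example below uses random toy data and two simplified NumPy-only functions (loop_version and vectorized_version are hypothetical stand-ins, not the functions from this thread), so its timings are not comparable to the figures quoted here.

```python
import timeit

import numpy as np

# Made-up data: 2000 points, the first 200 of which act as reference points.
rng = np.random.default_rng(0)
pts = rng.integers(1, 100, size=(2000, 2)).astype(float)
refs = pts[:200]

def loop_version():
    # One distance computation per reference point, in a Python-level loop.
    return [int((np.sqrt(((pts - r) ** 2).sum(axis=1)) < 35).sum()) for r in refs]

def vectorized_version():
    # All pairwise distances at once via broadcasting.
    diff = refs[:, None, :] - pts[None, :, :]
    return (np.sqrt((diff ** 2).sum(axis=2)) < 35).sum(axis=1).tolist()

# Check that both versions agree before timing them.
assert loop_version() == vectorized_version()

runs = 3
loop_time = min(timeit.repeat(loop_version, number=runs, repeat=3)) / runs
vec_time = min(timeit.repeat(vectorized_version, number=runs, repeat=3)) / runs
print(f"loop: {loop_time * 1e3:.2f} ms per call")
print(f"vectorized: {vec_time * 1e3:.2f} ms per call")
```

Taking the minimum over several repeats mirrors the "best of 3" figure that %timeit reports.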