Optimizing Python code: faster groupby and for loops

I want to make the for loop below run faster in Python.

import math

import numpy as np
import pandas as pd
import scipy.stats

np.random.seed(1)
xl = pd.DataFrame({'Concat': np.arange(101, 999),
                   'ships_x': np.random.randint(1001, 3000, size=898)})
yl = pd.DataFrame({'PickDate': np.random.randint(1, 8, size=10000),
                   'Concat': np.random.randint(101, 999, size=10000),
                   'ships_x': np.random.randint(101, 300, size=10000),
                   'ships_y': np.random.randint(1001, 3000, size=10000)})
tempno = [np.random.randint(1, 100, size=5)]
k = 1
p = pd.DataFrame(0, index=np.arange(len(xl)), columns=['temp', 'cv']).astype(object)

for ib in range(len(xl)):
    # Initial indices plus the one candidate index for this iteration
    tempno1 = np.append(tempno, ib)
    temp = list(set(tempno1))
    # Subset yl to the selected Concat values, then total the ships per PickDate
    temptab = yl[yl['Concat'].isin(np.array(xl['Concat'][tempno1]))].groupby('PickDate')[['ships_x', 'ships_y']].sum().reset_index()
    temptab['contri'] = temptab['ships_x'] / temptab['ships_y']
    # Coefficient of variation of the ratio; fall back to 1 when it is NaN
    cv = scipy.stats.variation(temptab['contri'])
    p.at[k - 1, 'cv'] = 1 if math.isnan(cv) else cv
    p.at[k - 1, 'temp'] = temp
    k = k + 1

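where

xl, yl - the two dataframes I am working with, which have columns such as Concat, ships_x and ships_y

tempno - an initial list of indices into the xl dataframe, referring to a list of 'Concat' values

So, in each iteration of the for loop, we add one extra index to tempno and then subset the 'yl' dataframe to the rows whose 'Concat' values match those selected from the 'xl' dataframe. We then compute the coefficient of variation (taken from the scipy library) and record it in the new dataframe 'p'.

The problem is that the number of iterations of the for loop varies widely and the loop takes far too long. The groupby line takes the most time. I have tried a few changes, and the code now looks like the version below, with the changes noted in the comments. It is somewhat faster, but that does not solve my problem. Please suggest the fastest way to implement this. Many thanks.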
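For reference, the coefficient of variation used here is the standard deviation divided by the mean. The sketch below is a minimal illustration, not the asker's code: the helper name `coefficient_of_variation` and the sample values are made up, and it uses the population standard deviation (ddof=0), which is the default of `scipy.stats.variation`.

```python
import numpy as np


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (ddof=0)."""
    values = np.asarray(values, dtype=float)
    return values.std(ddof=0) / values.mean()


# Hypothetical ships_x / ships_y ratios, one per PickDate
contri = np.array([0.10, 0.12, 0.08, 0.10])
cv = coefficient_of_variation(contri)
print(cv)
```

A value near 0 means the ratio is almost the same on every PickDate; larger values mean it varies more between dates.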

# Needs numpy as np, pandas as pd, and xl, yl, tempno from the code above
import math
import scipy.stats

# Build every tempno1 array in one step
tempno1 = [np.append(tempno, ib) for ib in range(len(xl))]
temp = [list(set(tempk)) for tempk in tempno1]
# Keep only the needed columns from the x and y dataframes
xtemp = xl[['Concat']]
ytemp = yl[['Concat', 'ships_x', 'ships_y', 'PickDate']]
# Subset the y dataframe and run the groupby as two separate steps
ytemp = [ytemp[ytemp['Concat'].isin(np.array(xtemp['Concat'][tempnokk]))] for tempnokk in tempno1]
temptab = [ytempk.groupby('PickDate')[['ships_x', 'ships_y']].sum().reset_index() for ytempk in ytemp]
# Add the ships_x / ships_y ratio per PickDate as the 'contri' column
temptab = [tempk.assign(contri=tempk['ships_x'] / tempk['ships_y']) for tempk in temptab]
# Coefficient of variation per table; fall back to 1 when it is NaN
pcv = [scipy.stats.variation(temptabkk['contri']) for temptabkk in temptab]
pcv = [1 if math.isnan(cv) else cv for cv in pcv]
p = pd.DataFrame({'temp': temp, 'cv': pcv})
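One possible direction, offered as a sketch rather than a tested answer to this exact workload: since every iteration uses the same base indices plus one candidate, the groupby can be run once per (Concat, PickDate) pair and the per-candidate totals obtained by adding one row to the base totals. The function name `cv_per_candidate` is hypothetical. The sketch assumes xl has the default 0..n-1 index, that 'Concat' values in xl are unique, and that the summed ships_y is never zero for a PickDate that has rows.

```python
import numpy as np
import pandas as pd


def cv_per_candidate(xl, yl, base_idx):
    """CV of ships_x/ships_y across PickDate for base_idx plus each row of xl."""
    # Sum and count once per (Concat, PickDate) instead of once per iteration
    grouped = yl.groupby(['Concat', 'PickDate'])
    sums = grouped[['ships_x', 'ships_y']].sum()
    counts = grouped.size()

    concat = xl['Concat'].to_numpy()
    # Rows follow xl's order, columns are PickDate values, missing pairs are 0
    X = sums['ships_x'].unstack(fill_value=0).reindex(index=concat, fill_value=0).to_numpy(dtype=float)
    Y = sums['ships_y'].unstack(fill_value=0).reindex(index=concat, fill_value=0).to_numpy(dtype=float)
    C = counts.unstack(fill_value=0).reindex(index=concat, fill_value=0).to_numpy()

    base = np.unique(np.asarray(base_idx).ravel())
    in_base = np.zeros(len(concat), dtype=bool)
    in_base[base] = True
    # A candidate that is already in the base set contributes nothing extra
    add = ~in_base[:, None]
    tot_x = X[base].sum(axis=0) + X * add
    tot_y = Y[base].sum(axis=0) + Y * add
    # A PickDate only counts if the subset has at least one row for it
    present = (C[base].sum(axis=0) + C * add) > 0

    with np.errstate(divide='ignore', invalid='ignore'):
        ratio = np.where(present, tot_x / tot_y, 0.0)
        n = present.sum(axis=1)
        mean = ratio.sum(axis=1) / n
        var = np.where(present, (ratio - mean[:, None]) ** 2, 0.0).sum(axis=1) / n
        cv = np.sqrt(var) / mean
    # Same fallback as the loop: NaN becomes 1
    return np.where(np.isnan(cv), 1.0, cv)


# Usage with data shaped like the question's
np.random.seed(1)
xl = pd.DataFrame({'Concat': np.arange(101, 999),
                   'ships_x': np.random.randint(1001, 3000, size=898)})
yl = pd.DataFrame({'PickDate': np.random.randint(1, 8, size=10000),
                   'Concat': np.random.randint(101, 999, size=10000),
                   'ships_x': np.random.randint(101, 300, size=10000),
                   'ships_y': np.random.randint(1001, 3000, size=10000)})
tempno = [np.random.randint(1, 100, size=5)]
p = pd.DataFrame({'cv': cv_per_candidate(xl, yl, tempno)})
```

This only fills the 'cv' column; the 'temp' column can still be built with the list comprehension shown earlier. If the assumptions above do not hold for the real data, the results would need to be checked against the original loop on a small sample first.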

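In your for loop, why do you have for ib in [xb for xb in range(0,len(x))] rather than for ib in range(0,len(x))?

1) You almost never need loops with pandas; 2) without data, this problem is impossible to troubleshoot. So please read this and edit your question to make it reproducible.

@jambrothers: The actual query is for ib in [xb for xb in range(0,len(x)) if xb not in myconcat(group[rs]['temp']) and xb not in tempno]. Can I do it better?

@Akhileshchander That is a bit difficult to read and, as Paul says, it is not possible to reproduce what you are seeing, so I'm afraid I can't comment any further.

@PaulH: I have edited as suggested. Please help.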