Python - for loop over millions of rows
I have a dataframe `c` with many different columns. Also, `arr` is a dataframe corresponding to a subset of `c`: `arr = c[c['A_D'] == 'A']`.

The main idea of my code is to iterate over all rows of the `c` dataframe and search for all the possible cases (in the `arr` dataframe) where some specific conditions hold:

- It is only necessary to iterate over rows where `c['A_D'] == 'D'` and `c['Already_linked'] == 0`
- The `hour` in the `arr` dataframe must be smaller than the `hour_aux` in the `c` dataframe
- The `Already_linked` column of the `arr` dataframe must be zero: `arr.Already_linked == 0`
- The `Terminal` and the `Operator` need to be the same in the `c` and `arr` dataframes
- Group the `arr` dataframe by `Operator` and `Terminal` and pick the matching group: `g = groups.get_group((row.Operator, row.Terminal))`
- Select only the arrivals whose hour is smaller than the hour in the `c` dataframe and which are not yet linked: `vb = g[(g.Already_linked == 0) & (g.hour < row.hour_aux)]`
vb=g[(g.ready_linked==0)和(g.hour您的问题是,是否有一种方法可以将for循环矢量化,但我认为这个问题隐藏了您真正想要的东西,这是一种加速代码的简单方法。。对于性能问题,一个好的起点始终是评测。然而,我强烈怀疑,您代码中的主要操作是
。如果。query(row.query\u string)
很大,则为每一行运行该命令的成本很高 对于任意查询,如果不删除迭代之间的依赖关系并并行化昂贵的步骤,则无法真正改善该运行时。不过,您可能会幸运一些。您的查询字符串总是检查两个不同的列,以查看它们是否等于您关心的内容。但是,对于每一行,都需要进行检查您的整个arr
片段。由于片段每次都会更改,这可能会导致问题,但以下是一些想法:arr
- 由于每次都要对
进行切片,因此只需维护arr
行的视图,这样就可以在较小的对象上进行迭代arr.ready_Linked==0
- 更好的是,在执行任何循环之前,您应该首先通过
终端和
操作员对
进行分组。然后,不要运行所有arr
,而是首先选择所需的组,然后进行切片和筛选。这需要重新考虑arr
的确切实现一点点,但优点是如果你有很多终端和操作符,你通常会在一个比query\u string
小得多的对象上工作。此外,你甚至不需要查询这个对象,因为这是由groupby隐式完成的arr
- 根据
通常与aux.hour
相关的方式,您可以通过在开始时对row.hour\u aux
进行排序来改进aux
。仅使用不等式运算符,您可能看不到任何增益,但您可以将其与对截止点和n只要切到截止点hour
- 依此类推。我再次怀疑,对所有
的每一行重新构造查询的任何方法都将提供比切换框架或矢量化比特和片段多得多的收益arr
```python
groups = arr.groupby(['Operator', 'Terminal'])
for row in c[(c.A_D == 'D') & (c.Already_linked == 0)].itertuples():
    g = groups.get_group((row.Operator, row.Terminal))
    vb = g[(g.Already_linked == 0) & (g.hour < row.hour_aux)]
    try:
        aux = (vb.START - row.x).abs().idxmin()
        print(row.x)
        c.loc[row.Index, 'a'] = vb.loc[aux, 'FlightID']
        g.loc[aux, 'Already_linked'] = 1
        continue
    except:
        continue
```
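The sort-then-binary-search idea from the list above can be sketched as follows. This is a toy example with numeric hours (the column `val` is made up for illustration), not the question's actual data:

```python
import numpy as np
import pandas as pd

# Toy group, pre-sorted by 'hour'. With a sorted column, the rows satisfying
# hour < cutoff form a contiguous prefix, found in O(log n) by searchsorted.
g = pd.DataFrame({'hour': [1, 3, 5, 7, 9], 'val': list('abcde')}).sort_values('hour')

cut = np.searchsorted(g['hour'].to_numpy(), 6)  # first position with hour >= 6
vb = g.iloc[:cut]                               # all rows with hour < 6, no full boolean scan
print(list(vb['val']))                          # ['a', 'b', 'c']
```

In the real loop, `g` would be the per-(`Operator`, `Terminal`) group sorted once up front, and the cutoff would be `row.hour_aux` (which must be of a type comparable to `hour`).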
While this is not a vectorized solution, it should speed things up considerably if your sample data set mimics your true data set. Currently you are wasting time looping over every row, but you only care about rows where `['A_D'] == 'D'` and `['Already_linked'] == 0`. Instead, remove the `if`s and loop over the truncated dataframe, which is only 30% of the initial dataframe:
```python
for row in c[(c.A_D == 'D') & (c.Already_linked == 0)].itertuples():
    vb = arr[(arr.Already_linked == 0) & (arr.hour < row.hour_aux)].copy().query(row.query_string)
    try:
        aux = (vb.START - row.x).abs().idxmin()
        print(row.x)
        c.loc[row.Index, 'a'] = vb.loc[aux, 'FlightID']
        arr.loc[aux, 'Already_linked'] = 1
        continue
    except:
        continue
```
Your problem looks like one of the most common problems of database operation. I do not fully understand what you want to get, because you have not formulated the task. Now to the possible solution: avoid loops altogether.

You have a very long table with the columns `time`, `FlightID`, `Operator`, `Terminal`, `A_D`. The other columns and the dates do not matter, if I understand you correctly. Also, the start time and end time are the same in every row. By the way, you can get the `time` column with the code `table.loc[:, 'time'] = table.loc[:, 'START'].dt.time`.

- `table = table.drop_duplicates(subset=['time', 'FlightID', 'Operator', 'Terminal'])`, and your table becomes significantly shorter.
- Split `table` into `table_arr` and `table_dep` according to the `A_D` value: `table_arr = table.loc[table.loc[:, 'A_D'] == 'A', ['FlightID', 'Operator', 'Terminal', 'time']]`, and likewise `table_dep` with `'D'`.
- Merge `table_dep` and `table_arr` on `['Operator', 'Terminal']`; the full code is shown below.
```python
groups = arr.groupby(['Operator', 'Terminal'])
for row in c[(c.A_D == "D") & (c.Already_linked == 0)].itertuples():
    try:
        g = groups.get_group((row.Operator, row.Terminal))
        vb = g[(g.Already_linked == 0) & (g.hour < row.hour_aux)]
        aux = (vb.START - row.x).abs().idxmin()
        c.loc[row.Index, 'a'] = vb.loc[aux].FlightID
        arr.loc[aux, 'Already_linked'] = 1
        continue
    except:
        continue

c['Already_linked'] = np.where((c.a != 0) & (c.a != 'No_link_found') & (c.A_D == 'D'), 1, c['Already_linked'])
c.Already_linked.loc[arr.Already_linked.index] = arr.Already_linked
c['a'] = np.where((c.Already_linked == 0) & (c.A_D == 'D'), 'No_link_found', c['a'])
```
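One caveat for loops built around `groups.get_group`: it raises a `KeyError` when an `(Operator, Terminal)` pair has no arrivals at all, which is one reason the body above is wrapped in a broad `try`/`except`. A minimal sketch (toy data, hypothetical values) of checking group membership first instead:

```python
import pandas as pd

df = pd.DataFrame({'Operator': ['DL', 'DL', 'VS'],
                   'Terminal': ['3', '3', '4'],
                   'FlightID': ['DL401', 'DL402', 'VS900']})
groups = df.groupby(['Operator', 'Terminal'])

key = ('QR', '4')                  # a pair with no rows in df
if key in groups.groups:           # membership test avoids the KeyError
    g = groups.get_group(key)
else:
    g = df.iloc[0:0]               # empty frame with the same columns
print(len(g))                      # 0
```

This keeps the `except` free to catch only genuinely unexpected errors instead of silently swallowing missing groups.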
```python
import numpy as np
import pandas as pd
import io

s = '''
 A_D  Operator  FlightID  Terminal  TROUND_ID   tot
 A   QR  QR001  4   QR002  70
 D   DL  DL001  3   "    "  84
 D   DL  DL001  3   "    "  78
 D   VS  VS001  3   "    "  45
 A   DL  DL401  3   "    "  9
 A   DL  DL401  3   "    "  19
 A   DL  DL401  3   "    "  3
 A   DL  DL401  3   "    "  32
 A   DL  DL401  3   "    "  95
 A   DL  DL402  3   "    "  58
'''

data_aux = pd.read_table(io.StringIO(s), delim_whitespace=True)
data_aux.Terminal = data_aux.Terminal.astype(str)
data_aux.tot = data_aux.tot.astype(str)

d = {'START': ['2017-03-26 16:55:00', '2017-03-26 09:30:00', '2017-03-27 09:30:00', '2017-10-08 15:15:00',
               '2017-03-26 06:50:00', '2017-03-27 06:50:00', '2017-03-29 06:50:00', '2017-05-03 06:50:00',
               '2017-06-25 06:50:00', '2017-03-26 07:45:00'],
     'END': ['2017-10-28 16:55:00', '2017-06-11 09:30:00', '2017-10-28 09:30:00', '2017-10-22 15:15:00',
             '2017-06-11 06:50:00', '2017-10-28 06:50:00', '2017-04-19 06:50:00', '2017-10-25 06:50:00',
             '2017-10-22 06:50:00', '2017-10-28 07:45:00']}

aux_df = pd.DataFrame(data=d)
aux_df.START = pd.to_datetime(aux_df.START)
aux_df.END = pd.to_datetime(aux_df.END)

c = pd.concat([aux_df, data_aux], axis=1)
c['A_D'] = c['A_D'].astype(str)
c['Operator'] = c['Operator'].astype(str)
c['Terminal'] = c['Terminal'].astype(str)

c['hour'] = pd.to_datetime(c['START'], format='%H:%M').dt.time
c['hour_aux'] = pd.to_datetime(c['START'] - pd.Timedelta(15, unit='m'), format='%H:%M').dt.time
c['start_day'] = c['START'].astype(str).str[0:10]
c['end_day'] = c['END'].astype(str).str[0:10]
c['x'] = c.START - pd.to_timedelta(c.tot.astype(int), unit='m')
c["a"] = 0
c["Already_linked"] = np.where(c.TROUND_ID != " ", 1, 0)

arr = c[c['A_D'] == 'A']
```
```text
  FlightID_arr Operator Terminal  time_arr FlightID_dep  time_dep
0        DL401       DL        3  06:50:00        DL001  09:30:00
1        DL402       DL        3  07:45:00        DL001  09:30:00
2          NaN       VS        3       NaN        VS001  15:15:00
```
```python
import io
import pandas as pd

data = '''
START,END,A_D,Operator,FlightID,Terminal,TROUND_ID,tot
2017-03-26 16:55:00,2017-10-28 16:55:00,A,QR,QR001,4,QR002,70
2017-03-26 09:30:00,2017-06-11 09:30:00,D,DL,DL001,3,,84
2017-03-27 09:30:00,2017-10-28 09:30:00,D,DL,DL001,3,,78
2017-10-08 15:15:00,2017-10-22 15:15:00,D,VS,VS001,3,,45
2017-03-26 06:50:00,2017-06-11 06:50:00,A,DL,DL401,3,,9
2017-03-27 06:50:00,2017-10-28 06:50:00,A,DL,DL401,3,,19
2017-03-29 06:50:00,2017-04-19 06:50:00,A,DL,DL401,3,,3
2017-05-03 06:50:00,2017-10-25 06:50:00,A,DL,DL401,3,,32
2017-06-25 06:50:00,2017-10-22 06:50:00,A,DL,DL401,3,,95
2017-03-26 07:45:00,2017-10-28 07:45:00,A,DL,DL402,3,,58
'''

table = pd.read_csv(io.StringIO(data), parse_dates=[0, 1])
table.loc[:, 'time'] = table.loc[:, 'START'].dt.time
table = table.drop_duplicates(subset=['time', 'FlightID', 'Operator', 'Terminal'])
table_arr = table.loc[table.loc[:, 'A_D'] == 'A', ['FlightID', 'Operator', 'Terminal', 'time']]
table_dep = table.loc[table.loc[:, 'A_D'] == 'D', ['FlightID', 'Operator', 'Terminal', 'time']]

table_result = table_arr.merge(
    table_dep,
    how='right',
    on=['Operator', 'Terminal'],
    suffixes=('_arr', '_dep'))
print(table_result)
```
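If the goal is specifically "for each departure, find the arrival with the nearest `START`", `pandas.merge_asof` may do the whole matching without a Python loop. This is a hedged sketch on made-up rows in the spirit of the question's data; note it does not reproduce the `Already_linked` bookkeeping (each arrival being used at most once):

```python
import pandas as pd

arr = pd.DataFrame({'START': pd.to_datetime(['2017-03-26 06:50', '2017-03-26 07:45']),
                    'FlightID': ['DL401', 'DL402'],
                    'Operator': ['DL', 'DL'],
                    'Terminal': ['3', '3']})
dep = pd.DataFrame({'x': pd.to_datetime(['2017-03-26 08:06']),
                    'FlightID': ['DL001'],
                    'Operator': ['DL'],
                    'Terminal': ['3']})

# Both frames must be sorted on their time keys; 'by' restricts matches to
# the same (Operator, Terminal), and direction='nearest' picks the closest START.
res = pd.merge_asof(dep.sort_values('x'), arr.sort_values('START'),
                    left_on='x', right_on='START',
                    by=['Operator', 'Terminal'],
                    direction='nearest', suffixes=('_dep', '_arr'))
print(res['FlightID_arr'].iloc[0])  # DL402 (07:45 is closer to 08:06 than 06:50)
```

`direction='backward'` would instead restrict the match to arrivals before the departure, which is closer to the `g.hour < row.hour_aux` condition in the question.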
```python
def apply_do_g(it_row):
    """This is your function, but using isin and apply."""
    keep = {'Operator': [it_row.Operator], 'Terminal': [it_row.Terminal]}  # dict for isin combined mask
    holder1 = arr[list(keep)].isin(keep).all(axis=1)  # boolean mask: same Operator and Terminal
    holder2 = arr.Already_linked.isin([0])            # boolean mask: not yet linked
    holder3 = arr.hour < it_row.hour_aux              # boolean mask: earlier hour
    holder = holder1 & holder2 & holder3              # combine the masks
    holder = arr.loc[holder]
    if not holder.empty:
        aux = np.absolute(holder.START - it_row.x).idxmin()
        c.loc[it_row.name, 'a'] = holder.loc[aux].FlightID  # with apply, use 'it_row.name'
        arr.loc[aux, 'Already_linked'] = 1

def new_way_2():
    keep = {'A_D': ['D'], 'Already_linked': [0]}
    df_test = c[c[list(keep)].isin(keep).all(axis=1)].copy()  # the truncated dataframe
    df_test.apply(lambda row: apply_do_g(row), axis=1)

# call the function
new_way_2()
```