How can I implement parallel processing in Python?
I am trying to do parallel processing in Python. I have a huge dataframe with more than 4M rows, so, as in the example below, I want to split the dataframe (df will be split into df1 and df2) and apply the same set of transpose operations to each resulting dataframe. Thanks to Jezrael for helping me get this far. Please find my input dataframe below:
import pandas as pd

df = pd.DataFrame({
    'subject_id': [1,1,1,1,2,2,2,2,3,3,4,4,4,4,4],
    'readings': ['READ_1','READ_2','READ_1','READ_3','READ_1','READ_5','READ_6','READ_8','READ_10','READ_12','READ_11','READ_14','READ_09','READ_08','READ_07'],
    'val': [5,6,7,11,5,7,16,12,13,56,32,13,45,43,46],
})
Code to split the dataframe and process it in parallel:
N = 2  # dividing into two dataframes
# dfs is a list which will hold the two dataframes
dfs = [x for _, x in df.groupby(pd.factorize(df['subject_id'])[0] // N)]

import multiprocessing as mp

pool = mp.Pool(mp.cpu_count())
results = []

def transpose_ope(df):  # this function does the transformation like I want
    df_op = (df.groupby(['subject_id','readings'])['val']
               .describe()
               .unstack()
               .swaplevel(0, 1, axis=1)
               .reindex(df['readings'].unique(), axis=1, level=0))
    df_op.columns = df_op.columns.map('_'.join)
    df_op = df_op.reset_index()

results.append(pool.map(transpose_ope, [df for df in dfs]))  # am I storing the output correctly here?
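As a quick check of the split step above: pd.factorize assigns each subject_id a 0-based code, and integer division by N puts N consecutive subjects into each chunk. A minimal sketch on the sample subject_id column (other columns omitted for brevity):

```python
import pandas as pd

df = pd.DataFrame({'subject_id': [1,1,1,1,2,2,2,2,3,3,4,4,4,4,4]})

N = 2
# factorize maps subject_ids 1..4 to codes 0..3; // N yields 0,0,1,1 -> two chunks
codes = pd.factorize(df['subject_id'])[0]
dfs = [chunk for _, chunk in df.groupby(codes // N)]

print(len(dfs))                       # 2
print(dfs[0]['subject_id'].unique())  # [1 2]
print(dfs[1]['subject_id'].unique())  # [3 4]
```

So each worker gets a chunk containing whole subjects, which is what the per-subject groupby in transpose_ope requires.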
Actually, I want to append the output of each stage to a main dataframe. Can you help me with this? My code keeps running even with only about 10-15 records.

Answer:

The function used in map needs to return the desired object.
I would also use the more idiomatic context manager for the pool.

Edit: fixed imports
import multiprocessing as mp

def transpose_ope(df):  # this function does the transformation like I want
    df_op = (df.groupby(['subject_id','readings'])['val']
               .describe()
               .unstack()
               .swaplevel(0, 1, axis=1)
               .reindex(df['readings'].unique(), axis=1, level=0))
    df_op.columns = df_op.columns.map('_'.join)
    df_op = df_op.reset_index()
    return df_op

def main():
    # dfs is the list of chunks built earlier
    with mp.Pool(mp.cpu_count()) as pool:
        res = pool.map(transpose_ope, [df for df in dfs])

if __name__ == '__main__':
    main()
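Regarding appending each stage's output back to a main dataframe: since map returns a list of DataFrames, pd.concat can stitch them together. A minimal sketch, using a toy per-chunk transform (summarize, a hypothetical stand-in for transpose_ope) and a plain list comprehension in place of pool.map; the concat step is identical on the real results:

```python
import pandas as pd

def summarize(chunk):
    # toy stand-in for transpose_ope: one summary row per subject_id
    return chunk.groupby('subject_id', as_index=False)['val'].mean()

df = pd.DataFrame({
    'subject_id': [1, 1, 2, 2, 3, 3],
    'val': [5, 7, 6, 8, 10, 20],
})
dfs = [df[df['subject_id'] <= 2], df[df['subject_id'] > 2]]

# res would come from pool.map(summarize, dfs) in the parallel version
res = [summarize(chunk) for chunk in dfs]
combined = pd.concat(res, ignore_index=True)
print(combined['val'].tolist())  # [6.0, 7.0, 15.0]
```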
Not sure why you are appending one list to another list... but if you just want a final list of [transformed(df) for df in dfs], map will return that list.

Comments:

- This is not a direct answer to your question, but have you tried using pandarallel? - QuantChristo
- @QuantChristo - Unfortunately, I am on Windows now. It looks like it only works on Linux and macOS.
- @QuantChristo - Hi, have you tried using this? When I tried one of the examples given in their documentation (30,000,000 rows x 2 columns), it is still running. May I know how long pandarallel takes for a dataframe of that size?
- I used it about 2 months ago; a simple parallel apply worked fine, but it also had to run on a Windows machine, so I had to stop using it. Also, to check it you just import modin.pandas as pd.
- Have you tried their example? Is it fast, say within 2-3 minutes? Well, I guess so with mp.Pool.
- Did you actually try running this on your side? It is still running for the sample dataframe. It could be a __main__ issue depending on your OS; I would edit it to use main() under the if __name__ == '__main__' guard and see if that works. Is there anything wrong with the code, I mean the transpose_ope function?
- It works fine on my side.
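On the Windows point raised in the comments: Windows (and macOS since Python 3.8) uses the spawn start method, which re-imports the main module in each worker process. The pool must therefore be created under an if __name__ == '__main__' guard, and the mapped function must be defined at module top level so it can be pickled. A minimal self-contained sketch:

```python
import multiprocessing as mp

def square(x):
    # top-level function: picklable, importable by spawned workers
    return x * x

def main():
    # the context manager closes and joins the pool automatically
    with mp.Pool(2) as pool:
        return pool.map(square, [1, 2, 3])

if __name__ == '__main__':
    print(main())  # [1, 4, 9]
```

Without the guard, each spawned worker would re-execute the module body, try to create its own pool, and the program would hang or error, which matches the "keeps running" symptom described above.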