Python "TypeError: 'type' object is not subscriptable" when doing multiprocessing. What am I doing wrong?

I am trying to "multi"-process the function func, but I always get the following error:

File "c:\...programs\python\python37\lib\multiprocessing\pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()

  File "c:\...\programs\python\python37\lib\multiprocessing\pool.py", line 657, in get
    raise self._value

TypeError: 'type' object is not subscriptable
What am I doing wrong? Every job is a dictionary containing the parameters for func.

Minimal reproducible sample:

import multiprocessing as mp,pandas as pd
def func(name, raw_df=pd.DataFrame, df={}, width=0):
    # 3. do some column operations. (actually there's more than just this operation)
    seriesF =  raw_df[[name]].dropna()
    afterDropping_indices = seriesF.index.copy(deep=True) 
    list_ = list(raw_df[name])[width:]  
    df[name]=pd.Series(list_.copy(), index=afterDropping_indices[width:]) 
       
def preprocess_columns(raw_df ):
 
    # get all inputs.
    df, width = {}, 137 
    args = {"raw_df":raw_df, "df":df, 'width': width }  
    column_names = raw_df.columns

    # get input-dict for every single job.
    jobs=[]
    for i in range(len(column_names)):
        job = {"name":column_names[i]}
        job.update(args) 
        jobs.append(job) 

    # multiprocessing
    pool = mp.Pool(len(column_names))  
    pool.map(func, jobs)    
    
    # create df from dict and reindex 
    df=pd.concat(df,axis=1) 
    df=df.reindex(df.index[::-1])
    return df 

if __name__=='__main__': 
    raw_df = pd.DataFrame({"A":[ 1.1 ]*100000, "B":[ 2.2 ]*100000, "C":[ 3.3 ]*100000}) 
    raw_df = preprocess_columns(raw_df ) 
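
For context on the traceback: pool.map(func, jobs) calls func(job) with each dictionary as the single positional argument name, so raw_df keeps its default value pd.DataFrame, i.e. the class itself rather than a DataFrame instance. Subscripting the class is what raises the error. A minimal sketch (not part of the original post):

import pandas as pd

# raw_df defaulted to pd.DataFrame (the class, not an instance); subscripting
# a class that defines no __class_getitem__ raises the reported error:
try:
    pd.DataFrame[["A"]]
except TypeError as err:
    print(err)  # 'type' object is not subscriptable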
Edit: a version that passes only the column instead of the whole raw DataFrame

import multiprocessing as mp,pandas as pd
def func(name, series, df, width):
    # 3. do some column operations. (actually there's more than just this operation)
    seriesF =  series.dropna()
    afterDropping_indices = seriesF.index.copy(deep=True) 
    list_ = list(series)[width:]  
    df[name]=pd.Series(list_.copy(), index=afterDropping_indices[width:]) 
       
def preprocess_columns(raw_df ):
 
    df, width = {}, 137 
    args = {"df":df, 'width': width } 
     
    column_names = raw_df.columns
    jobs=[]
    for i in range(len(column_names)):
        job = {"name":column_names[i], "series":raw_df[column_names[i]]}
        job.update(args)  
        jobs.append(job)
    
    pool = mp.Pool(len(column_names))  
    pool.map(func, jobs)    
    
    # create df from dict and reindex 
    df=pd.concat(df,axis=1) 
    df=df.reindex(df.index[::-1])
    return df 

if __name__=='__main__': 
    raw_df = pd.DataFrame({"A":[ 1.1 ]*100000, "B":[ 2.2 ]*100000, "C":[ 3.3 ]*100000}) 
    raw_df = preprocess_columns(raw_df ) 
which results in:

TypeError: func() missing 3 required positional arguments: 'series', 'df', and 'width'
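
The cause is the same calling convention as before: Pool.map hands each element of jobs to func as one positional argument, so only name is bound and series, df and width are reported missing. Besides the wrapper solution below, a hedged alternative is Pool.starmap with positional tuples; the helper shift_column here is a hypothetical stand-in for func that returns its column, since a worker process only mutates its own copy of df and cannot pass results back that way:

import multiprocessing as mp, pandas as pd

def shift_column(name, series, width):
    # hypothetical stand-in for func: returns the shifted column instead of
    # writing into a shared dict
    seriesF = series.dropna()
    return pd.Series(list(series)[width:],
                     index=seriesF.index[width:], name=name)

if __name__ == '__main__':
    raw_df = pd.DataFrame({"A": [1.1]*1000, "B": [2.2]*1000, "C": [3.3]*1000})
    jobs = [(name, raw_df[name], 137) for name in raw_df.columns]
    with mp.Pool(3) as pool:
        # starmap unpacks each tuple into positional arguments
        columns = pool.starmap(shift_column, jobs)
    print(pd.concat(columns, axis=1))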
I found a solution. Summary:

  • Added an expandCall() function (see below)
  • Iterate over the outputs and append the elements to a plain list
  • Note: this only handles multiple processes

    
    import multiprocessing as mp,pandas as pd
    def func(name, raw_df, df, width):
        # 3. do some column operations. (actually there's more than just this operation)
        seriesF =  raw_df[name].dropna()
        afterDropping_indices = seriesF.index.copy(deep=True) 
        list_ = list(raw_df[name])[width:]  
        df[name]=pd.Series(list_.copy(), index=afterDropping_indices[width:])  
        df[name].name = name
        return df
    
    def expandCall(kargs): 
        # Expand the arguments of a callback function, kargs['func']
        func=kargs['func'] 
        del kargs['func']  
        out=func(**kargs)  
        return out
     
    def preprocess_columns(raw_df ): 
        df, width = pd.DataFrame(), 137
        args = {"df":df, "raw_df":raw_df, 'width': width }
         
        column_names = raw_df.columns
        jobs=[]
        for i in range(len(column_names)):
            job = {"func":func,"name":column_names[i]}
            job.update(args)
            jobs.append(job)
        
        pool = mp.Pool(len(column_names))
        task=jobs[0]['func'].__name__
        outputs= pool.imap_unordered(expandCall, jobs)
        
        out = [];  
        for i,out_ in enumerate(outputs,1):
            out.append(out_)  
        pool.close(); pool.join() # this is needed to prevent memory leaks
          
        # create df from dict and reindex
        df=pd.concat(out,axis=1)  
        df=df.reindex(df.index[::-1]) 
        print(df)
        return df 
    
    if __name__=='__main__': 
        raw_df = pd.DataFrame({"A":[ 1.1 ]*100000, "B":[ 2.2 ]*100000, "C":[ 3.3 ]*100000}) 
        raw_df = preprocess_columns(raw_df ) 
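
A side note on the close/join housekeeping above: a with-block performs equivalent cleanup automatically. Pool.__exit__ calls terminate(), so the imap_unordered iterator has to be consumed inside the block. A minimal sketch, assuming expandCall and jobs as defined in the solution:

import multiprocessing as mp

# expandCall and jobs are assumed to exist as in the solution above
with mp.Pool(len(jobs)) as pool:
    out = list(pool.imap_unordered(expandCall, jobs))  # consume before exit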
    
    

Comments:

raw_df=pd.DataFrame makes no sense. Your workers need the actual DataFrame, not pd.DataFrame. (In fact, they only need the column they are going to work on, and you should change the code to pass only that column, to reduce the inter-process communication overhead.)

@user2357112supportsMonica Pardon me, I forgot that I had left those keywords in there before posting the question. Unfortunately, the keywords are not the cause of the error. Your suggestion to pass only the column sounds good, but is there a way to pass only the name as the element that gets parallelized? The edited code produces a completely different error. @user2357112supportsMonica Could you tell me what I am doing wrong? (edited again). Regarding the previous comment: raw_df is in the args dictionary.
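
Regarding the question raised in the comments (parallelizing over just the column name): one common pattern, sketched here as an assumption rather than something taken from the thread, is a pool initializer that delivers the DataFrame to every worker once, so that only the column name is pickled per task:

import multiprocessing as mp, pandas as pd

_raw_df = None  # populated once per worker by the initializer

def _init_worker(raw_df):
    global _raw_df
    _raw_df = raw_df

def process_name(name, width=137):
    # only name travels through the job queue; the DataFrame was delivered
    # once at worker start-up
    series = _raw_df[name]
    return pd.Series(list(series)[width:],
                     index=series.dropna().index[width:], name=name)

if __name__ == '__main__':
    raw_df = pd.DataFrame({"A": [1.1]*1000, "B": [2.2]*1000, "C": [3.3]*1000})
    with mp.Pool(3, initializer=_init_worker, initargs=(raw_df,)) as pool:
        out = pool.map(process_name, raw_df.columns)
    print(pd.concat(out, axis=1))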