Python: running pandas DataFrame applymap in parallel

python pandas dataframe parallel-processing


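I have the following function, which applies a set of regular expressions to every element of a DataFrame. The DataFrame I am applying the regexes to is a 5 MB chunk.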

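import re
from functools import partial

def apply_all_regexes(data, regexes):
    # Apply every regex to every cell of the DataFrame and collect the matches.
    # Note: pandas 2.1+ deprecates applymap in favor of DataFrame.map.
    new_df = data.applymap(
        partial(apply_re_to_cell, regexes))
    return new_df

def apply_re_to_cell(regexes, cell):
    # Convert the cell to text so non-string values can be searched too.
    cell = str(cell)
    regex_matches = []
    for regex in regexes:
        regex_matches.extend(re.findall(regex, cell))
    return regex_matches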

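Because applymap runs serially, the total processing time is roughly the number of elements multiplied by the time needed to run all the regexes on one element. Is there a way to parallelize this? I tried ProcessPoolExecutor, but it seemed to take longer than the serial version.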

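Have you tried splitting the large DataFrame into several smaller DataFrames (one per worker), applying the regex mapping to them in parallel, and then concatenating the small DataFrames back together?

I was able to do something similar with a DataFrame of gene expression data. I would run it on a small sample first and check that you get the expected output.

Unfortunately, I don't have enough reputation to comment.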

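from multiprocessing import Pool, cpu_count

import numpy as np
import pandas as pd

# Assumed settings (not defined in the original snippet): one worker and one
# partition per CPU core.
num_cores = cpu_count()
num_partitions = num_cores

def parallelize_dataframe(df, func):
    # Split the DataFrame into row chunks, one task per chunk.
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    for x in df_split:
        print(x.shape)
    # Run func on every chunk in a worker process and stitch the results together.
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df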

This is the general function I use.
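To connect this to the regex question, here is a minimal sketch of how the chunked approach might be wired up. The helper names `apply_regexes_to_chunk`, `split_rows`, and `parallel_apply_regexes` are hypothetical and not from the original post, and the choice to compile the patterns once per chunk is my own assumption about what helps. The worker functions live at module level so that `multiprocessing` can pickle them.

```python
import re
from functools import partial
from multiprocessing import Pool

import numpy as np
import pandas as pd


def apply_re_to_cell(compiled, cell):
    # Same logic as in the question, but using precompiled patterns.
    cell = str(cell)
    matches = []
    for rx in compiled:
        matches.extend(rx.findall(cell))
    return matches


def apply_regexes_to_chunk(patterns, chunk):
    # Compile inside the worker so only plain pattern strings are pickled.
    compiled = [re.compile(p) for p in patterns]
    cell_func = partial(apply_re_to_cell, compiled)
    # Column-wise Series.map gives an element-wise result without applymap.
    return chunk.apply(lambda col: col.map(cell_func))


def split_rows(df, n_parts):
    # Hypothetical helper: cut the frame into contiguous row blocks.
    n_parts = max(1, min(n_parts, len(df)))
    bounds = np.linspace(0, len(df), n_parts + 1).astype(int)
    return [df.iloc[a:b] for a, b in zip(bounds[:-1], bounds[1:])]


def parallel_apply_regexes(df, patterns, n_workers=4):
    # One task per chunk; the results come back in the original row order.
    chunks = split_rows(df, n_workers)
    worker = partial(apply_regexes_to_chunk, patterns)
    with Pool(n_workers) as pool:
        return pd.concat(pool.map(worker, chunks))
```

A call would look like `parallel_apply_regexes(df, [r"\d+"], n_workers=4)`, placed under an `if __name__ == "__main__":` guard on platforms that spawn worker processes. Each chunk has to be pickled and sent to a worker, so for small frames that overhead can outweigh the gain. That may be one reason a per-cell or per-regex split ends up slower than the serial version.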


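Nice. Rather than splitting the DataFrame, I had been trying to parallelize the work across the regexes, and I could not get any speedup that way. If you know a way to make that work, please let me know. Splitting the DataFrame is a bit of a hassle.

Execution did speed up. It took some tuning of num_partitions and num_cores. In my case I kept the two equal, matching the number of cores available. I saw a reduction of about 50%.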
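Since the speedup depends on how num_partitions and num_cores are chosen, it can help to time a few settings rather than guess. The helpers below are a hypothetical sketch (the names `time_call` and `compare_settings` are mine, not from the thread). They use only `time.perf_counter` and assume the function being timed takes its settings as keyword arguments.

```python
import time


def time_call(func, *args, repeats=3, **kwargs):
    # Return (best_seconds, last_result) over `repeats` runs of func.
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best, result


def compare_settings(func, settings, repeats=3):
    # Time func(**kw) for each kwargs dict and return (seconds, kw) pairs,
    # fastest first.
    timings = [(time_call(func, repeats=repeats, **kw)[0], kw) for kw in settings]
    return sorted(timings, key=lambda t: t[0])
```

For example, wrapping the parallel function so that it accepts `num_partitions` and `num_cores` as keyword arguments and passing a list such as `[{"num_partitions": 4, "num_cores": 4}, {"num_partitions": 8, "num_cores": 4}]` would rank the candidate settings on your own data and machine.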