Python: How to apply a function to multiple columns of a DataFrame in parallel

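I have a pandas DataFrame with several hundred thousand rows, and I want to apply a time-consuming function to multiple columns of that DataFrame in parallel.

I know how to apply the function serially. For example: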

import hashlib

import pandas as pd


df = pd.DataFrame(
    {'col1': range(100_000), 'col2': range(100_000, 200_000)},
    columns=['col1', 'col2'])


def foo(col1, col2):
    # This function is actually much more time consuming in real life
    return hashlib.md5(f'{col1}-{col2}'.encode('utf-8')).hexdigest()


df['md5'] = df.apply(lambda row: foo(row.col1, row.col2), axis=1)

df.head()
# Out[5]: 
#    col1    col2                               md5
# 0     0  100000  92e2a2c7a6b7e3ee70a1c5a5f2eafd13
# 1     1  100001  01d14f5020a8ba2715cbad51fd4c503d
# 2     2  100002  c0e01b86d0a219cd71d43c3cc074e323
# 3     3  100003  d94e31d899d51bc00512938fc190d4f6
# 4     4  100004  7710d81dc7ded13326530df02f8f8300

But how can I apply the function foo in parallel, using all the cores available on my machine?

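The simplest way is to use concurrent.futures.

Specifying chunksize=1_000 makes this run faster because rows are sent to the worker processes in batches of 1,000, so the cost of handing work to a process (pickling and inter-process communication) is paid once per 1,000 rows rather than once per row.

Note that concurrent.futures requires Python 3.2 or later. The chunksize argument to map was added in Python 3.5, and the f-string and the 1_000 literal used here need Python 3.6 or later.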

import concurrent.futures

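# 16 worker processes; omit the argument to let the library choose a default based on the CPU count.
with concurrent.futures.ProcessPoolExecutor(16) as pool:
    # map pairs the two columns element by element; chunksize sends 1,000 rows per task.
    df['md5'] = list(pool.map(foo, df['col1'], df['col2'], chunksize=1_000))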

df.head()
# Out[10]: 
#    col1    col2                               md5
# 0     0  100000  92e2a2c7a6b7e3ee70a1c5a5f2eafd13
# 1     1  100001  01d14f5020a8ba2715cbad51fd4c503d
# 2     2  100002  c0e01b86d0a219cd71d43c3cc074e323
# 3     3  100003  d94e31d899d51bc00512938fc190d4f6
# 4     4  100004  7710d81dc7ded13326530df02f8f8300
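
If this is run as a standalone script rather than in an interactive session, it may help to wrap the pool in a function and start it from under an if __name__ == '__main__': guard, since platforms that start workers with the spawn method re-import the main module in each worker. The following is only a sketch under those assumptions: apply_parallel and its executor_cls parameter are hypothetical names introduced for illustration, not part of pandas or concurrent.futures, and plain ranges stand in for the DataFrame columns.

```python
import concurrent.futures
import hashlib


def foo(col1, col2):
    # Stand-in for the time-consuming function from the question.
    return hashlib.md5(f'{col1}-{col2}'.encode('utf-8')).hexdigest()


def apply_parallel(func, *columns, max_workers=None, chunksize=1_000,
                   executor_cls=concurrent.futures.ProcessPoolExecutor):
    """Hypothetical helper: apply func element-wise across aligned columns.

    max_workers=None leaves the pool size to the library default.
    executor_cls is an illustrative hook so a different executor
    (for example ThreadPoolExecutor) can be swapped in.
    """
    with executor_cls(max_workers=max_workers) as pool:
        # Executor.map preserves input order, so the results line up with the rows.
        return list(pool.map(func, *columns, chunksize=chunksize))


if __name__ == '__main__':
    # The guard keeps worker processes that re-import this module
    # from starting pools of their own.
    col1 = range(100_000)
    col2 = range(100_000, 200_000)
    digests = apply_parallel(foo, col1, col2)
    print(len(digests))
```

With a DataFrame, the columns can be passed the same way, for example df['md5'] = apply_parallel(foo, df['col1'], df['col2']), because map only needs iterables of equal length.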