Python: apply a function to every cell of a DataFrame in a multithreaded fashion
Is it possible to apply a function to every cell of a DataFrame in a multithreaded fashion?

I know about DataFrame.applymap, but it does not seem to support multithreading natively:
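import numpy as np
import pandas as pd

np.random.seed(1)
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print(frame)

# Format every cell as a string with two decimal places.
# Note: applymap is deprecated since pandas 2.1 in favor of DataFrame.map.
format = lambda x: '%.2f' % x
frame = frame.applymap(format)
print(frame)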
Returns:
               b         d         e
Utah    1.624345 -0.611756 -0.528172
Ohio   -1.072969  0.865408 -2.301539
Texas   1.744812 -0.761207  0.319039
Oregon -0.249370  1.462108 -2.060141
            b      d      e
Utah     1.62  -0.61  -0.53
Ohio    -1.07   0.87  -2.30
Texas    1.74  -0.76   0.32
Oregon  -0.25   1.46  -2.06
Instead, I would like to use multiple cores to perform the operation, since the applied function may be complex.

Split by column:
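from multiprocessing import Pool

def format(col):
    # Format every value of one column as a string with two decimals.
    return col.apply(lambda x: '%.2f' % x)

cores = 5
pool = Pool(cores)
# Each column is sent to a worker; imap yields the results in column order.
for out_col in pool.imap(format, [frame[i] for i in frame]):
    frame[out_col.name] = out_col
pool.close()
pool.join()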
Or split by partition size, as mentioned in the comments:
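# func stands for the function applied to each chunk; pool and frame are
# the ones created above (the pool must not have been closed yet).
size = 10  # number of chunks passed to np.array_split
frame_split = np.array_split(frame, size)
frame = pd.concat(pool.imap(func, frame_split))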
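The fragment above leaves func undefined and reuses pool and frame from the previous snippet. A minimal self-contained sketch of the same idea could look like the following. The names format_chunk, iter_row_chunks and apply_by_chunk are hypothetical and introduced only for illustration, and slicing a fixed number of rows with iloc (instead of calling np.array_split on the DataFrame) is an assumption made here so that every chunk keeps its index and column labels.

```python
from multiprocessing import Pool

import numpy as np
import pandas as pd


def format_chunk(chunk):
    # Hypothetical worker: format every cell of one chunk as a string.
    # It lives at module level so it can be pickled for the workers.
    return chunk.apply(lambda col: col.map(lambda x: '%.2f' % x))


def iter_row_chunks(frame, rows_per_chunk):
    # Yield consecutive row slices; each one is still a labelled DataFrame.
    for start in range(0, len(frame), rows_per_chunk):
        yield frame.iloc[start:start + rows_per_chunk]


def apply_by_chunk(frame, func, rows_per_chunk, map_func=map):
    # map_func defaults to the serial built-in map; pass pool.imap to
    # process the chunks in worker processes instead.
    return pd.concat(map_func(func, iter_row_chunks(frame, rows_per_chunk)))


if __name__ == "__main__":
    np.random.seed(1)
    frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                         index=['Utah', 'Ohio', 'Texas', 'Oregon'])
    with Pool(2) as pool:
        print(apply_by_chunk(frame, format_chunk, 2, map_func=pool.imap))
```

Keeping the mapping function as a parameter makes it easy to compare the serial and the parallel run on the same data before committing to a pool size.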
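Note: on Microsoft Windows, to avoid errors when the worker processes are started, the code that creates the pool must be placed under the if __name__ == "__main__": guard, for example: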
import numpy as np
import pandas as pd
from multiprocessing import Pool

def format(col):
    # Must be defined at module level so worker processes can import it.
    return col.apply(lambda x: '%.2f' % x)

if __name__ == "__main__":
    np.random.seed(1)
    frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                         index=['Utah', 'Ohio', 'Texas', 'Oregon'])
    print(frame)
    cores = 2
    pool = Pool(cores)
    for out_col in pool.imap(format, [frame[i] for i in frame]):
        frame[out_col.name] = out_col
    pool.close()
    pool.join()
    print(frame)
Regarding the use of np.array_split: because the DataFrame is converted to a NumPy array, this approach only works for numeric data. For example:
import numpy as np
import pandas as pd
from multiprocessing import Pool

def myfunc(a, b):
    '''
    Return a-b if a>b, otherwise return a+b
    Taken from https://docs.scipy.org/doc/numpy/reference/generated/numpy.vectorize.html
    '''
    if a > b:
        return a - b
    else:
        return a + b

def format(col):
    # col is a 2-D NumPy block of rows; apply myfunc(x, 2) element-wise.
    vfunc = np.vectorize(myfunc)
    return pd.DataFrame(vfunc(col, 2))

if __name__ == "__main__":
    np.random.seed(1)
    frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                         index=['Utah', 'Ohio', 'Texas', 'Oregon'])
    print(frame)
    cores = 2
    size = 2
    pool = Pool(cores)
    # as_matrix() was removed in pandas 1.0; to_numpy() is the replacement.
    frame_split = np.array_split(frame.to_numpy(), size)
    print(frame_split)
    columns = frame.columns
    # Labels are lost in the NumPy round trip, so restore index and columns.
    frame = pd.concat(pool.imap(format, frame_split)).set_index(frame.index)
    frame.columns = columns
    pool.close()
    pool.join()
    print(frame)
Returns:
               b         d         e
Utah    1.624345 -0.611756 -0.528172
Ohio   -1.072969  0.865408 -2.301539
Texas   1.744812 -0.761207  0.319039
Oregon -0.249370  1.462108 -2.060141
[array([[ 1.62434536, -0.61175641, -0.52817175],
       [-1.07296862,  0.86540763, -2.3015387 ]]), array([[ 1.74481176, -0.7612069 ,  0.3190391 ],
       [-0.24937038,  1.46210794, -2.06014071]])]
               b         d         e
Utah    3.624345  1.388244  1.471828
Ohio    0.927031  2.865408 -0.301539
Texas   3.744812  1.238793  2.319039
Oregon  1.750630  3.462108 -0.060141
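Use np.array_split to divide the DataFrame into chunks and send those chunks to different processes. The returned results can then be concatenated (with pd.concat) back into the large DataFrame you started with.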
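As a rough illustration of that split, process and concatenate pattern on a DataFrame that also contains text, the sketch below applies np.array_split to the row positions rather than to the values, so no column has to be converted to a NumPy array. The names tidy_chunk and split_concat are hypothetical, and the use of concurrent.futures.ProcessPoolExecutor instead of multiprocessing.Pool is an assumption made for this example, not something the answer prescribes.

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np
import pandas as pd


def tidy_chunk(chunk):
    # Hypothetical worker that handles both numeric and text cells.
    def tidy(x):
        return x.upper() if isinstance(x, str) else round(x, 2)
    return chunk.apply(lambda col: col.map(tidy))


def split_concat(frame, func, n_chunks, executor=None):
    # Split the row positions, not the values, so dtypes and labels survive.
    parts = np.array_split(np.arange(len(frame)), n_chunks)
    chunks = [frame.iloc[part] for part in parts if len(part)]
    # Executor.map returns results in input order, so row order is kept.
    mapper = map if executor is None else executor.map
    return pd.concat(mapper(func, chunks))


if __name__ == "__main__":
    frame = pd.DataFrame({'state': ['utah', 'ohio', 'texas', 'oregon'],
                          'value': [1.624345, -1.072969, 1.744812, -0.249370]})
    with ProcessPoolExecutor(max_workers=2) as executor:
        print(split_concat(frame, tidy_chunk, n_chunks=2, executor=executor))
```

Calling split_concat without an executor runs the same chunks serially, which is a simple way to check that the parallel result matches.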