Error in Python multiprocessing code: 'DataFrame' object is not callable

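I am trying to use multiprocessing to speed up a distance calculation on a large file. I have put together the code below, but can anyone explain why it throws the error ['DataFrame' object is not callable]?

It seems to be related to the "map" in parallelize_dataframe, possibly because of how I designed test_func, but I'm not sure how to fix it. Thanks in advance for your help.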

import multiprocessing as mp
import numpy as np
import pandas as pd

nearest_calc3 = None
nearest_calc3 = postcodes.head(1000).copy() # Test top 1000

partitions = 5
cores = mp.cpu_count()

def parallelize_dataframe(data, func):
    data_split = np.array_split(data, partitions)
    pool = mp.Pool(cores)
    data = pd.concat(pool.map(func, data_split)) # <-- Problem here?
    pool.close()
    pool.join()
    return data

def nearest(inlat1, inlon1, inlat2, inlon2, store, postcode):
    lat1 = np.radians(inlat1)
    lat2 = np.radians(inlat2)
    longdif = np.radians(inlon2 - inlon1)
    r = 6371.1009 # gives d in kilometers
    d = np.arccos(np.sin(lat1)*np.sin(lat2) + np.cos(lat1)*np.cos(lat2) * np.cos(longdif)) * r
    near = pd.DataFrame({'store': store, 'postcode': postcode, 'distance': d})
    near_min = near.loc[near['distance'].idxmin()]
    x = str(near_min['store']) + '~' + str(near_min['postcode']) + '~' + str(near_min['distance'])
    return x

def test_func(data, stores): # <-- Or maybe here?
    data['appended'] = data['lat'].apply(nearest, args=(data['long'], stores['lat'], stores['long'], stores['index'], stores['pcds']))
    data[['store','store_postcode','distance_km']] = data['appended'].str.split("~",expand=True)
    return data

if __name__ == '__main__':
    test = parallelize_dataframe(nearest_calc3, test_func(nearest_calc3, stores))

---> 11     data = pd.concat(pool.map(func, data_split))
     12     pool.close()
     13     pool.join()

~\AppData\Local\Continuum\anaconda3\lib\multiprocessing\pool.py in map(self, func, iterable, chunksize)
    266         in a list that is returned.
    267         '''
--> 268         return self._map_async(func, iterable, mapstar, chunksize).get()
    269
    270     def starmap(self, func, iterable, chunksize=None):

~\AppData\Local\Continuum\anaconda3\lib\multiprocessing\pool.py in get(self, timeout)
    655             return self._value
    656         else:
--> 657             raise self._value
    658
    659     def _set(self, i, obj):

TypeError: 'DataFrame' object is not callable

The problem is in the last line:

test = parallelize_dataframe(nearest_calc3, test_func(nearest_calc3, stores))

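Writing test_func(...) calls the function immediately and returns a DataFrame, and that DataFrame is what you pass to parallelize_dataframe as func. But pool.map needs a callable there, which is why it raises the TypeError.

You want something like this: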

test = parallelize_dataframe(nearest_calc3, test_func)
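
The difference between passing a function and passing the result of calling it can be reproduced without pandas or multiprocessing. The sketch below is only an illustration: double and apply_to_all are hypothetical stand-ins for test_func and parallelize_dataframe, and the built-in map plays the role of pool.map.

```python
def double(x):
    """Hypothetical stand-in for test_func: takes one item, returns a result."""
    return x * 2


def apply_to_all(items, func):
    """Hypothetical stand-in for parallelize_dataframe: func must be callable."""
    return list(map(func, items))


# Passing the function object works: map calls it once per item.
good = apply_to_all([1, 2, 3], double)

# Passing the *result* of a call fails: double(3) is the int 6, not a function.
try:
    apply_to_all([1, 2, 3], double(3))
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(good)           # [2, 4, 6]
print(error_message)  # 'int' object is not callable
```

The same thing happens in the question, only with a DataFrame in place of the int.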

Since you always want to pass stores to test_func in addition to the chunk of nearest_calc3, you can use functools.partial to do this:

test_func_with_stores = functools.partial(test_func, stores)

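test_func_with_stores is a callable that takes only one argument.

Unfortunately, partial fills positional arguments from left to right, so you have to change test_func so that stores is its first parameter. Alternatively, keep the signature as it is and bind the argument by keyword with functools.partial(test_func, stores=stores).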
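
Both ways of binding the fixed argument might look like the sketch below. It uses plain lists instead of DataFrames so that it runs with the standard library alone. The names scale_chunk, scale_chunk_data_first and run_parallel are made up for illustration, with factor playing the role of stores and each chunk playing the role of one piece of the split DataFrame.

```python
import functools
import multiprocessing as mp


def scale_chunk(factor, chunk):
    """Fixed argument first, so partial can bind it positionally."""
    return [factor * value for value in chunk]


def scale_chunk_data_first(chunk, factor):
    """Per-chunk argument first, like the original test_func(data, stores)."""
    return [factor * value for value in chunk]


def run_parallel(chunks, func, processes=2):
    """Map a one-argument callable over the chunks and flatten the results."""
    with mp.Pool(processes) as pool:
        results = pool.map(func, chunks)
    return [value for part in results for value in part]


# Option 1: reorder the parameters and bind the fixed one positionally.
positional_worker = functools.partial(scale_chunk, 10)

# Option 2: keep the parameter order and bind the fixed one by keyword.
keyword_worker = functools.partial(scale_chunk_data_first, factor=10)

if __name__ == "__main__":
    chunks = [[1, 2], [3, 4], [5]]
    print(run_parallel(chunks, positional_worker))
    print(run_parallel(chunks, keyword_worker))
```

Either partial object takes a single argument, which is what pool.map needs. The worker functions are defined at module level so that they can be pickled and sent to the worker processes.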