Python won't store the return values of a function
I'm using `multiprocessing.imap_unordered` to run a function concurrently, but my RAM usage keeps increasing.

The problem is as follows: I have millions of combinations of data (generated with `itertools.product`) that need to be passed to a function. The function computes a score using a support vector machine, then stores that score together with the current combination. The function doesn't return anything; it only computes the score and stores it in a shared value. I don't need all the other combinations, only the best one.

When using `imap_unordered`, RAM usage keeps increasing until the program crashes for lack of memory. I suspect this happens because `imap` stores the results of the function, which doesn't return anything but may still hold a `None` (or null) value.
Here is some sample code:
```python
from functools import partial
import itertools
import multiprocessing
import time


def svm(input_data, params):
    # Copy the data to avoid changing the original data,
    # as input_data is a reference to a pandas dataframe
    # and I need to shift columns up and down.
    dataset = input_data.copy()
    # Use svm here to analyse data
    score = sum(dataset) + sum(params)  # simulate score of svm
    # Simulate a process that takes a bit of time
    time.sleep(0.5)
    return (score, params)


if __name__ == "__main__":
    # Without this, multiprocessing gives an error on Windows
    multiprocessing.freeze_support()
    # Set the number of worker processes:
    # empty for all the cores, an int for that number of processes
    pool = multiprocessing.Pool()
    # iterable settings
    total_combinations = 2
    total_features = 45
    # Keep track of best score
    best_score = -1000
    best_param = [0 for _ in range(total_features)]
    input_data = [x * x for x in range(10000)]
    # Create a partial function with the necessary args
    func = partial(svm, input_data)
    params = itertools.product(range(total_combinations), repeat=total_features)
    # Calculate scores concurrently.
    # As the iterable is in the order of millions, this value
    # will get continuously large until it uses all available
    # memory, as the map stores the results, which in this case
    # are not needed.
    for score, param in pool.imap_unordered(func, iterable=params, chunksize=100):
        if score > best_score:
            best_score = score
            best_param = param
    # Wait for all the processes to terminate their tasks
    pool.close()
    pool.join()
    print(best_score)
    print(best_param)
```
In this example you will notice that RAM usage grows over time. Although it isn't much in this case, if you leave it running for a day or so (by increasing the range of the iterable) it will reach gigabytes of RAM. As I said, I have millions of combinations.
How should I solve this problem? Is there an alternative to `imap` that doesn't store anything from the function at all? Should I create `Process`es instead of using a `Pool`? Could it be that, because I'm copying the dataset, it is never cleaned up by the garbage collector afterwards?

I tracked memory usage by importing `objgraph` and printing `objgraph.show_most_common_types(limit=20)`. I noticed that the number of tuples and lists kept increasing for the lifetime of the child processes. To solve this, I set `maxtasksperchild` on the pool to force-close the worker processes after a number of tasks, thereby freeing their memory:
```python
from functools import partial
import itertools
import multiprocessing
import random
import time

# Tracing memory leaks
import objgraph


def svm(input_data, params):
    # Copy the data to avoid changing the original data,
    # as input_data is a reference to a pandas dataframe.
    dataset = input_data.copy()
    # Use svm here to analyse data
    score = sum(dataset) + sum(params)  # simulate score of svm
    # Simulate a process that takes a bit of time
    time.sleep(0.5)
    return (score, params)


if __name__ == "__main__":
    # iterable settings
    total_combinations = 2
    total_features = 12
    # Keep track of best score
    best_score = -1000
    best_param = [0 for _ in range(total_features)]
    # Simulate a dataframe with random data
    input_data = [random.random() for _ in range(100000)]
    # Create a partial function with the necessary args
    func = partial(svm, input_data)
    params = itertools.product(range(total_combinations), repeat=total_features)
    # Without this, multiprocessing gives an error on Windows
    multiprocessing.freeze_support()
    # Set the number of worker processes:
    # empty for all the cores, an int for that number of processes
    with multiprocessing.Pool(maxtasksperchild=5) as pool:
        # Calculate scores concurrently.
        # As the iterable is in the order of millions, this value
        # will get continuously large until it uses all available
        # memory, as the map stores the results, which in this case
        # are not needed.
        for score, param in pool.imap_unordered(func, iterable=params, chunksize=10):
            if score > best_score:
                best_score = score
                best_param = param
            # print(best_score)
            # Count the number of objects in memory.
            # If the number of objects keeps increasing, it's a memory leak.
            print(objgraph.show_most_common_types(limit=20))
        # Wait for all the processes to terminate their tasks
        pool.close()
        pool.join()
    print(best_score)
    print(best_param)
```
From what I've read, `apply` doesn't seem to accept an iterable. I mean, it would try to evaluate the entire list instead of loading it lazily, which is a problem because my list is too large. Let me try to explain myself.

You can iterate over all the parameters with `for param in params:` and submit each one to the pool with `pool.apply_async(func=your_func, args=(param,))`. You can also use the `callback` argument to filter out the best score, e.g. `def callback_func(score, param): if score > best_score: ...`. You should also use a lock in the callback function to avoid problems when assigning the best score and best parameters.

I could use a `Manager.Value` to keep the best score. This solution works. The other solution I tried was setting `maxtasksperchild` on the pool, since it closes a worker process and opens a new one.