Python won't store the return values of a function


I'm running a function concurrently with multiprocessing.imap_unordered, but my RAM usage keeps increasing.

The problem is as follows: I have millions of combinations of data (generated with itertools.product) that need to be passed to a function. The function computes a score using a support vector machine and then stores the score together with the current combination. The function does not return any value; it only computes the score and stores it in a shared value. I don't need all the other combinations, only the best one.

When using imap_unordered, RAM usage keeps growing until the program crashes for lack of memory. I think this happens because imap stores the results of the function, which does not return any value but may still yield a None/null.

Here is some example code:

from functools import partial
import itertools
import multiprocessing
import time


def svm(input_data, params):

    # Copy the data to avoid changing the original data
    # as input_data is a reference to a pandas dataframe
    # and I need to shift columns up and down.
    dataset = input_data.copy()

    # Use svm here to analyse data
    score = sum(dataset) + sum(params)  # simulate score of svm

    # Simulate a process that takes a bit of time
    time.sleep(0.5)

    return (score, params)


if __name__ == "__main__":
    
    # freeze_support() is only needed for frozen
    # Windows executables; it is harmless otherwise
    multiprocessing.freeze_support()

    # Set the number of worker processes
    # Empty for all the cores
    # Int for number of processes
    pool = multiprocessing.Pool()

    # iterable settings
    total_combinations = 2
    total_features = 45

    # Keep track of best score
    best_score = -1000
    best_param = [0 for _ in range(total_features)]

    input_data = [x * x for x in range(10000)]

    # Create a partial function with the necessary args
    func = partial(svm, input_data)
    params = itertools.product(range(total_combinations), repeat=total_features)

    # Calculate scores concurrently
    # As the iterable is in the order of millions, this value
    # will get continuously large until it uses all available
    # memory as the map stores the results, that in this case
    # it's not needed.
    for score, param in pool.imap_unordered(func, iterable=params, chunksize=100):
        if score > best_score:
            best_score = score
            best_param = param

    # Wait for all the processes to terminate their tasks
    pool.close()
    pool.join()

    print(best_score)
    print(best_param)
In this example you will notice that RAM usage grows over time. It isn't much here, but if you leave it running for a day or so (by increasing the range of the iterable) it will reach gigabytes of RAM. As I said, I have millions of combinations.
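(As a side note, the iterable itself is not what fills the RAM here: itertools.product builds its combinations lazily, one tuple per next() call, so even 2**45 combinations cost almost nothing up front. A quick sanity check; the exact byte count is CPython-specific:)

```python
import itertools
import sys

params = itertools.product(range(2), repeat=45)  # 2**45 combinations

# The generator object stays tiny no matter how many tuples
# it can yield; only the tuples actually consumed ever exist.
print(sys.getsizeof(params) < 4096)  # True

first = next(params)  # tuples are produced one at a time
print(first == (0,) * 45)  # True
```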


How should I solve this? Is there an alternative to imap that doesn't store anything from the function at all? Should I create Processes instead of using a Pool? Could it be that, because I am copying the dataset, the copies are not being cleaned up later by the garbage collector?

You can use , or

I have tracked memory usage by importing objgraph and calling objgraph.show_most_common_types(limit=20). I noticed that the number of tuples and lists kept increasing while the child processes ran. To solve this, I set maxtasksperchild on the Pool to forcibly close worker processes after a number of tasks, freeing their memory:

from functools import partial
import itertools
import multiprocessing
import random
import time

# Tracing memory leaks
import objgraph


def svm(input_data, params):

    # Copy the data to avoid changing the original data
    # as input_data is a reference to a pandas dataframe.
    dataset = input_data.copy()

    # Use svm here to analyse data
    score = sum(dataset) + sum(params)  # simulate score of svm

    # Simulate a process that takes a bit of time
    time.sleep(0.5)

    return (score, params)


if __name__ == "__main__":

    # iterable settings
    total_combinations = 2
    total_features = 12

    # Keep track of best score
    best_score = -1000
    best_param = [0 for _ in range(total_features)]

    # Simulate a dataframe with random data
    input_data = [random.random() for _ in range(100000)]

    # Create a partial function with the necessary args
    func = partial(svm, input_data)
    params = itertools.product(range(total_combinations), repeat=total_features)

    # Without this, multiprocessing gives error
    multiprocessing.freeze_support()

    # Set the number of worker processes
    # Empty for all the cores
    # Int for number of processes
    with multiprocessing.Pool(maxtasksperchild=5) as pool:

        # Calculate scores concurrently
        # As the iterable is in the order of millions, this value
        # will get continuously large until it uses all available
        # memory as the map stores the results, that in this case
        # it's not needed.
        for score, param in pool.imap_unordered(func, iterable=params, chunksize=10):
            if score > best_score:
                best_score = score
                best_param = param
                # print(best_score)

            # Count the number of objects in memory.
            # If the counts keep increasing, it's a memory leak.
            # show_most_common_types() prints its report itself
            # and returns None, so there is no need to print() it.
            objgraph.show_most_common_types(limit=20)

    # The with-statement already terminates the pool on exit,
    # so no explicit close()/join() is needed here.

    print(best_score)
    print(best_param)

From what I've read, apply doesn't seem to take an iterable. I mean, it would try to evaluate the whole list instead of loading it lazily, which would be a problem because my list is too large. I'll try to explain myself. You can iterate over all the parameters with for param in params: and submit each one to the pool with pool.apply_async(func=your_func, args=(param,)). You can also filter out the best score using the callback parameter: def callback_func(score, param): if score > best_score: ... You should also use a lock in the callback function to avoid problems when assigning the best score and best parameters.

I could use a Manager.Value and keep the best score. This solution works. Another solution I tried was setting maxtasksperchild on the Pool, since it closes a worker process and opens a new one.