Python: PyMongo concurrency and the multiprocessing module


I am trying to understand the best way to process queries, or query results, in parallel with PyMongo.

Everything I have read says there should be a small number of MongoClient() objects. Suppose I have two different implementations of a module data_interface.py:

from pymongo import MongoClient

# one module-level client, shared by every caller of execute_query
client = MongoClient('localhost', 27017)

def execute_query(id_to_find):
    db = client['mydatabase']
    my_collection = db.my_collection
    data_cursor = my_collection.find({'_id': id_to_find})
    return data_cursor
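Worth noting here: PyMongo's documentation warns that MongoClient is not fork-safe, so a client created at import time in the parent is shared with every forked worker. A minimal sketch of a lazier variant, where each process builds its own client on first use (the _get_client helper is my illustration, not part of the question):

from pymongo import MongoClient

_client = None  # created on first use, so each forked worker builds its own

def _get_client():
    global _client
    if _client is None:
        _client = MongoClient('localhost', 27017)
    return _client

def execute_query(id_to_find):
    # assumes the parent does not query (and so does not create a client)
    # before forking its workers
    collection = _get_client()['mydatabase'].my_collection
    return collection.find({'_id': id_to_find})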

Suppose a function process_document performs some simple computation on a single document, and that the collection is many-to-one (a query for a single id returns about a thousand results). Now suppose I have:

import data_interface
from multiprocessing import Pool

def process_document(document):
    # placeholder for the simple per-document computation
    return document

def process_data(ids_to_process):
    # ids_to_process is a list of ids to query; each worker process runs
    # one query and processes the documents it returns
    pool = Pool(processes=4)
    results = pool.map(query_and_process_data, ids_to_process)
    return results

def query_and_process_data(id_to_query):
    cursor = data_interface.execute_query(id_to_query)
    processed_results = []
    for result in cursor:
        processed_results.append(process_document(result))
    return processed_results
Or:

import data_interface
from multiprocessing import Pool

def process_document(document):
    # placeholder for the simple per-document computation
    return document

def process_data(ids_to_process):
    # the queries run one at a time in the parent process; the documents
    # each query returns are then processed in parallel
    pool = Pool(processes=4)
    results = []
    for id_to_query in ids_to_process:
        cursor = data_interface.execute_query(id_to_query)
        data_returned = list(cursor)  # drain the cursor before mapping
        results.extend(pool.map(process_document, data_returned))
    return results

So there are four different implementations here (the two versions of data_interface crossed with the two versions of process_data). Are there obvious flaws in any of them? In each one, I believe a MongoClient ends up being created per process when the pool spawns a new Python interpreter. Does the second data_interface implementation support parallel queries, or do I need a new instance of the collection object for that? The difference between the two process_data implementations is whether the queries are executed in parallel, or whether the documents returned by each query are processed in parallel.
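One way to make the one-client-per-worker behavior explicit, rather than relying on what happens at fork time, is the initializer hook of multiprocessing.Pool. This is a sketch of my own, not code from the question:

from multiprocessing import Pool
from pymongo import MongoClient

client = None  # assigned in each worker by init_worker

def init_worker():
    # runs once in every worker process, so each worker gets its own client
    global client
    client = MongoClient('localhost', 27017)

def process_document(document):
    # placeholder for the simple per-document computation
    return document

def query_and_process_data(id_to_query):
    my_collection = client['mydatabase'].my_collection
    return [process_document(doc) for doc in my_collection.find({'_id': id_to_query})]

if __name__ == '__main__':
    pool = Pool(processes=4, initializer=init_worker)
    results = pool.map(query_and_process_data, ['id1', 'id2'])  # sample ids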


Note: none of this code has been tested, and it may contain mistakes. I hope it is clear enough to convey the idea.

From the comments:

"Which one is the most efficient? Measure them."

"Measuring them would take months of work. I will edit the question to make it clear that I am looking for someone's intuition, based on experience or on something I am not understanding."

"Nobody's intuition about Python with some random module, talking to a non-trivial database on an arbitrary configuration, is very good. Months of work? Look it up."

"I said months of work because we are at the development stage, not because it is a hard thing to do."
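For reference, a minimal harness for timing the two process_data variants might look like the sketch below; parallel_queries and parallel_documents are hypothetical module names for the two implementations shown above:

import time
import parallel_queries    # hypothetical module holding the first implementation
import parallel_documents  # hypothetical module holding the second

sample_ids = ['id1', 'id2']  # replace with real ids

for name, module in [('parallel queries', parallel_queries),
                     ('parallel documents', parallel_documents)]:
    start = time.perf_counter()
    module.process_data(sample_ids)
    print(name, time.perf_counter() - start, 'seconds')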