Python pymongo concurrency with the multiprocessing module
I'm trying to understand the best way to process queries, or query results, in parallel with pymongo. Everything I've read says there should be a small number of MongoClient() objects. Suppose I have two different implementations of a module data_interface.py:
from pymongo import MongoClient

# a single client shared at module level, created once on import
client = MongoClient('localhost', 27017)

def execute_query(id_to_find):
    db = client['mydatabase']
    my_collection = db.my_collection
    # find() returns a lazy cursor; documents are fetched as it is iterated
    data_cursor = my_collection.find({'_id': id_to_find})
    return data_cursor
and
Suppose the function process_data performs some simple computation, and the collection is many-to-one (a query for one id returns a thousand results). Suppose I have:
import data_interface
from multiprocessing import Pool

def process_data(ids_to_process):
    # ids_to_process is a list of ids to query
    pool = Pool(processes=4)
    results = pool.map(query_and_process_data, ids_to_process)
    pool.close()
    pool.join()
    return results

def query_and_process_data(id_to_query):
    cursor = data_interface.execute_query(id_to_query)
    processed_results = []
    for result in cursor:
        # process_result: placeholder for the per-document computation
        # (assumed defined elsewhere; the original reused the name
        # process_data here, which would recurse into the driver above)
        processed_result = process_result(result)
        processed_results.append(processed_result)
    return processed_results
Or:
That gives four different implementations. Are there obvious flaws in any of them? In each one, I believe a MongoClient is created per process when the pool spawns a new Python interpreter. Does the second implementation of data_interface support parallel queries, or do I need a new instance of the collection object for that? The difference between the two implementations of process_data is whether the queries themselves run in parallel, or whether each document is processed in parallel.
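On the client-per-process question, a common pattern (a sketch under assumptions, not code from this post) is to give each worker its own client through a Pool initializer, since pymongo's MongoClient is not fork-safe and should be created after the worker process starts. The stand-in object below replaces a real MongoClient so the sketch runs without a server:

```python
from multiprocessing import Pool

_client = None  # per-process handle; in real code this would be a MongoClient

def _init_worker():
    # Runs once in each worker process, after the fork/spawn, so every
    # process gets its own client instead of sharing the parent's sockets.
    global _client
    # Real code: _client = MongoClient('localhost', 27017)
    _client = object()  # stand-in so the sketch runs without a server

def query_and_process(id_to_query):
    # Real code would query through _client here, e.g.
    # _client['mydatabase'].my_collection.find({'_id': id_to_query})
    assert _client is not None  # each worker was initialized
    return id_to_query

def process_data(ids_to_process):
    with Pool(processes=4, initializer=_init_worker) as pool:
        return pool.map(query_and_process, ids_to_process)
```

In real code, _init_worker would call MongoClient('localhost', 27017), so each of the four processes holds exactly one client.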
Note: none of this code has been tested and it may contain errors; I hope it conveys the idea clearly enough.

Comments:
"Which is the most efficient? Measure them."
"Measuring them would take months of work. I'll edit the question to make it clear I'm looking for someone's intuition, based on experience or on something I don't understand."
"Nobody's intuition for Python with random modules, talking to a non-trivial database on an arbitrary configuration, is going to be very good."
"Months of work? Look it up."
"I said months of work because we're in the development phase, not because it's a hard thing to do."
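Following the "measure them" suggestion, a minimal timing harness (a hypothetical helper, not from the post) is enough to compare the variants on the same id list:

```python
import time

def timed(fn, *args):
    # Run fn(*args) once and return its result plus the elapsed wall time.
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# usage: _, seconds = timed(process_data, ids_to_process)
```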
import data_interface
from multiprocessing import Pool

def process_data(ids_to_process):
    # ids_to_process is a list of ids to query
    pool = Pool(processes=4)
    results = []
    for id_to_query in ids_to_process:
        cursor = data_interface.execute_query(id_to_query)
        # materialize the cursor; slicing a pymongo cursor does not copy it
        data_returned = list(cursor)
        # process_result: placeholder for the per-document computation;
        # accumulate instead of overwriting results on each iteration
        results.extend(pool.map(process_result, data_returned))
    pool.close()
    pool.join()
    return results
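The second approach can be sketched end-to-end with stand-in data (fake_query and process_document are hypothetical placeholders for the real query and the "simple computation"), showing the queries running serially while only the per-document work is parallelized:

```python
from multiprocessing import Pool

def process_document(doc):
    # Stand-in for the simple per-document computation.
    return doc * 2

def fake_query(id_to_find):
    # Stand-in for data_interface.execute_query(); returns an iterable of
    # "documents" so the sketch runs without MongoDB.
    return [id_to_find, id_to_find + 1]

def process_ids(ids_to_process):
    all_results = []
    with Pool(processes=4) as pool:
        for one_id in ids_to_process:
            docs = list(fake_query(one_id))  # materialize the "cursor"
            # only the per-document computation is fanned out to workers
            all_results.extend(pool.map(process_document, docs))
    return all_results
```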