
Python 3.x: Parallel processing in Python on the GPU (MXNet) and the CPU


I have a data-processing pipeline that I would like to optimize by running some processing threads on the CPU while an MXNet prediction model (Python 3.6) runs on the GPU.

The idea I want to apply is as follows (assume my machine has N GPUs):

  • A GPU job dispatcher reads a sequence of N frames from a video and sends each frame to one GPU
  • Each GPU processes its frame and uses MXNet to predict its content
  • Once all N GPUs have finished predicting, I want to do the following simultaneously:
  • Send the prediction outputs to a queue
  • Read and process the next N frames on the GPUs
  • The queue is consumed by a multithreaded process running on the CPU
Here is a visual depiction of the workflow:

The idea is to use the otherwise idle CPU while the GPUs are busy processing frames.
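The GPU-to-CPU hand-off described above is a standard producer/consumer pattern. A minimal sketch with queue.Queue, using a -1 sentinel to signal the end of the stream (the integer batches and the doubling step are placeholders, not the real MXNet predictions):

```python
import queue
import threading

def producer(q, batches):
    # Stand-in for the GPU dispatcher: push each batch's "predictions"
    for batch in batches:
        q.put(batch)
    q.put(-1)  # sentinel: no more work

def consumer(q, results):
    # Stand-in for the CPU-side workers: drain the queue until the sentinel
    while True:
        item = q.get()
        if item == -1:
            break
        results.append(item * 2)  # placeholder "processing"

q = queue.Queue()
results = []
t = threading.Thread(target=consumer, args=(q, results))
t.start()
producer(q, [1, 2, 3])
t.join()
print(results)  # [2, 4, 6]
```

The queue decouples the two sides: the dispatcher can start on the next batch as soon as it has enqueued the previous outputs.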

Using the threading library, I managed to read and process the first N frames, but the GPUs never process the next batch.

Note that the source code below has been simplified to clarify the workflow.

Here is the code of the function that reads frames, dispatches them to the GPUs, and then sends the outputs to the CPU queue:

import logging
import time
from threading import Thread

def dispatch_jobs(video_capture, detection_workers, number_of_gpu, cpu_queue):
    # detection_workers is a list of N similar MXNet models, each one works on a different GPU
    is_last_frame = False
    while not is_last_frame:
        frames_batch = []
        for i in range(0, number_of_gpu):
            success, frame = read_frame_from_video(video_capture)
            if not success:
                logging.warning("Can't receive frame. Exiting.")
                is_last_frame = True
                break
            frames_batch.append(frame)

        workers = []
        for detection_worker_id in range(0, len(frames_batch)):
            frame_image = frames_batch[detection_worker_id]
            thread = Thread(target=detection_workers[detection_worker_id].predict, kwargs={'image': frame_image})
            workers.append(thread)

        for w in workers: w.start()
        for w in workers: w.join()

        # sending to the CPU queue
        for detection_worker_id in range(0, len(frames_batch)):
            detector_output = detection_workers[detection_worker_id].output
            cpu_queue.put(detector_output)

    logging.info("While loop is broken... putting -1 in the queue")
    cpu_queue.put(-1)

    return
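As a side note, the per-batch Thread creation inside the loop above can also be written with concurrent.futures.ThreadPoolExecutor, which reuses a fixed pool of threads across batches. A minimal sketch, where predict_stub is a placeholder for detection_workers[worker_id].predict:

```python
from concurrent.futures import ThreadPoolExecutor

def predict_stub(worker_id, frame):
    # Placeholder for detection_workers[worker_id].predict(image=frame)
    return f"gpu{worker_id}:{frame}"

frames_batch = ["frame0", "frame1"]
with ThreadPoolExecutor(max_workers=len(frames_batch)) as pool:
    # submit() returns immediately; result() waits for completion,
    # so this collects outputs in batch order
    futures = [pool.submit(predict_stub, i, f) for i, f in enumerate(frames_batch)]
    outputs = [fut.result() for fut in futures]
print(outputs)  # ['gpu0:frame0', 'gpu1:frame1']
```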
As mentioned above, a consumer thread reads the outputs from cpu_queue and sends them to a multithreaded function (on the CPU). Here is the code of the consumer functions:

def consume_cpu_queue(cpu_queue, number_of_process=2):
    # number_of_process defaults to 2 here only for illustration
    while cpu_queue.empty():
        logging.info("Sleeping 1 second")
        time.sleep(1)

    prediction_output = cpu_queue.get()
    if prediction_output == -1:
        return

    process_output_multithread(prediction_output, number_of_process)
    consume_cpu_queue(cpu_queue, number_of_process)

def process_output_multithread(pred_output, number_of_process):
    workers = []
    for i in range(0, number_of_process):
        thread = Thread(target=process, kwargs={'pred_output': pred_output})
        workers.append(thread)

    for w in workers: w.start()
    for w in workers: w.join()
    return
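Incidentally, the sleep-poll loop and the recursion in consume_cpu_queue can be avoided entirely: queue.Queue.get() blocks until an item is available, and a plain while loop sidesteps Python's recursion limit on long videos. A sketch under the same -1 sentinel convention (handle_output stands in for process_output_multithread):

```python
import queue

def consume_cpu_queue_iterative(cpu_queue, handle_output):
    # Blocking get() removes the 1-second polling, and iteration
    # avoids hitting Python's recursion limit on long videos
    while True:
        prediction_output = cpu_queue.get()  # blocks until an item arrives
        if prediction_output == -1:          # sentinel: dispatcher is done
            return
        handle_output(prediction_output)

# usage sketch: handle_output would be process_output_multithread above
q = queue.Queue()
for item in ("pred_a", "pred_b", -1):
    q.put(item)
seen = []
consume_cpu_queue_iterative(q, seen.append)
print(seen)  # ['pred_a', 'pred_b']
```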

# Here is how the consumer thread is initiated
cpu_consumer_thread = Thread(target=consume_cpu_queue, args=(cpu_queue,))

# Here is how I run the application
cpu_consumer_thread.start()
dispatch_jobs(video_capture, detection_workers, number_of_gpu, cpu_queue)
cpu_consumer_thread.join()
I have looked into it, but I am not sure whether Numba can solve my problem.


Any suggestions or pointers would be very helpful.

This might help: have you tried / looked at async/await? @RyabchenkoAlexander I have no hands-on experience with it; any example would be worth a look.
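For the async/await suggestion in the comment above, blocking calls such as a model's predict can be wrapped with asyncio's run_in_executor so that several run concurrently in a thread pool. A minimal sketch; predict_stub is a placeholder, not the MXNet API:

```python
import asyncio
import time

def predict_stub(frame):
    # Placeholder for a blocking GPU predict call
    time.sleep(0.1)
    return f"prediction:{frame}"

async def main():
    loop = asyncio.get_running_loop()
    frames = ["f0", "f1", "f2"]
    # Run the blocking calls concurrently in the default thread pool;
    # gather() preserves the order of the submitted tasks
    tasks = [loop.run_in_executor(None, predict_stub, f) for f in frames]
    return await asyncio.gather(*tasks)

print(asyncio.run(main()))  # ['prediction:f0', 'prediction:f1', 'prediction:f2']
```

Note this still uses threads under the hood; asyncio mainly gives a cleaner way to express the "wait for all N predictions, then fan out" step.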