
Fastest way to process large files with concurrent Python 3.5

Tags: Python, Multithreading, Python 3.x, Multiprocessing, Concurrent.futures


I am trying to get a handle on multithreading/multiprocessing using concurrent.futures.

I have tried the following sets of code. I understand I will always have the disk I/O problem, but I want to maximize my RAM and CPU usage.

What is the most used/best approach for large-scale processing?

How do you use concurrent.futures to process large datasets?

Is there a preferred approach over the ones below?

Approach 1:

for folder in os.listdir(path):                      # iterate over the entries in path
    p = multiprocessing.Process(target=process_largeFiles, args=(folder,))
    jobs.append(p)
    p.start()                                        # one process per folder
Approach 2:

with concurrent.futures.ThreadPoolExecutor(max_workers=100) as executor:
    for folder in os.listdir(path):
        executor.submit(process_largeFiles, folder)  # pass the callable and its argument separately
Approach 3:

with concurrent.futures.ProcessPoolExecutor(max_workers=10) as executor:
    for folder in os.listdir(path):
        executor.submit(process_largeFiles, folder)
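For what it's worth, the same loop can also be written with Executor.map (a sketch only, reusing process_largeFiles and path from the snippets above), which pairs the callable with each argument and submits the calls for you:

import concurrent.futures
import os

# Sketch: Executor.map submits process_largeFiles(entry) for every entry
# and returns the results in submission order.
with concurrent.futures.ProcessPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(process_largeFiles, os.listdir(path)))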
Should I try to use a process pool and a thread pool together?

Approach (idea):

What is the most efficient way to maximize RAM and CPU for the broadest range of use cases?


I know starting processes takes a bit of time, but would that cost be outweighed by the size of the files being processed?
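One rough way to check that (a sketch only, measuring nothing from the real workload) is to time how long a process pool takes to start up and run a few no-op tasks, and compare that against the time one file takes to process:

import concurrent.futures
import time

def noop(x):
    # trivial task: all measured time is pool start-up and dispatch overhead
    return x

if __name__ == '__main__':          # guard needed for ProcessPoolExecutor on spawn-based platforms
    start = time.perf_counter()
    with concurrent.futures.ProcessPoolExecutor(max_workers=10) as executor:
        list(executor.map(noop, range(10)))
    print('pool start-up + 10 no-op tasks: %.3f s' % (time.perf_counter() - start))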

Use a ThreadPoolExecutor to open and read the files, then a ProcessPoolExecutor to process the data.

import concurrent.futures
from collections import deque

TPExecutor = concurrent.futures.ThreadPoolExecutor
PPExecutor = concurrent.futures.ProcessPoolExecutor

def get_file(path):
    # I/O-bound: read the whole file in a thread
    with open(path) as f:
        data = f.read()
    return data

def process_large_file(s):
    # CPU-bound placeholder work: sum the character codes
    return sum(ord(c) for c in s)

# placeholders for the actual file paths
files = [filename1, filename2, filename3, filename4, filename5,
         filename6, filename7, filename8, filename9, filename0]

results = []
completed_futures = deque()

def callback(future, completed=completed_futures):
    # record each process-pool future as it finishes
    completed.append(future)

with TPExecutor(max_workers=4) as thread_pool_executor:
    data_futures = [thread_pool_executor.submit(get_file, path) for path in files]
with PPExecutor() as process_pool_executor:
    for data_future in concurrent.futures.as_completed(data_futures):
        future = process_pool_executor.submit(process_large_file, data_future.result())
        future.add_done_callback(callback)
        # collect any that have finished
        while completed_futures:
            results.append(completed_futures.pop().result())

# every callback has fired once the ProcessPoolExecutor has shut down;
# drain anything that finished after the last loop iteration
while completed_futures:
    results.append(completed_futures.pop().result())

A done callback is used so the loop doesn't have to wait on completed futures. I don't know how that affects efficiency - it is mainly there to simplify the logic/code in the as_completed loop.


If you need to limit file or data submissions because of memory constraints, this would need to be refactored. Depending on file read time versus processing time, it is hard to say how much data will be in memory at any given moment. I think collecting results in your done callback should help mitigate that. data_futures may start completing while the ProcessPoolExecutor is still being set up - that sequencing may need optimizing.
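One possible refactoring along those lines (a sketch only, reusing get_file, process_large_file, and files from the code above) is to submit the reads in small batches, so only a bounded number of file contents are held in memory at once:

import concurrent.futures

# Sketch: read and process `batch_size` files at a time, so memory holds at
# most one batch of file contents while the process pool works through it.
batch_size = 4
results = []
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as readers, \
     concurrent.futures.ProcessPoolExecutor() as workers:
    for i in range(0, len(files), batch_size):
        batch = files[i:i + batch_size]
        data_futures = [readers.submit(get_file, p) for p in batch]
        work_futures = [workers.submit(process_large_file, f.result())
                        for f in concurrent.futures.as_completed(data_futures)]
        results.extend(f.result() for f in work_futures)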

Do you have any test data and functions that could be used for comparison? Have you run comparative tests yourself? What were the results? What conclusions did you draw? This is a very broad question with many unknowns that will affect the outcome of any comparison. Does your solution work? Have you profiled the code to find out which parts are slow?
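A minimal way to run such a comparison (a sketch, reusing process_largeFiles from the question and the files list from the answer) is to time each executor variant on the same workload:

import concurrent.futures
import time

def run_with(executor_cls, paths, max_workers):
    # time one executor variant over the same set of files
    start = time.perf_counter()
    with executor_cls(max_workers=max_workers) as executor:
        list(executor.map(process_largeFiles, paths))
    return time.perf_counter() - start

if __name__ == '__main__':                    # required for process pools on spawn-based platforms
    test_files = files                        # reuse the file list defined above
    print('threads:   %.2f s' % run_with(concurrent.futures.ThreadPoolExecutor, test_files, 10))
    print('processes: %.2f s' % run_with(concurrent.futures.ProcessPoolExecutor, test_files, 10))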