Is it possible to use Python multithreading to read multiple files from multiple folders and process them to get a combined result?


python python-2.7


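I have to parse a number of log files from a number of folders, and I want to speed up the parsing. I need to find some specific strings in the lines of all these files and produce final statistics in combined form. I am not sure how to do this with Python multithreading, or how efficient it would be. I have read different tutorials, but it is still not clear to me how to handle the files in multiple threads when the set of files varies. Any suggestions on this would be great. Thanks a lot in advance.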

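I think the easiest way to learn to use threads is the ThreadPoolExecutor class in the concurrent.futures module, because it takes only a few more lines than the usual synchronous for-loop. That is especially true in Python 3, but it can be adapted to Python 2.7, where concurrent.futures is available through the futures backport package on PyPI.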

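Basically, you have a pool (a bunch) of threads waiting for work. Work is usually just a method or function that you send to the pool together with its arguments, and the pool handles everything else (assigning tasks to available threads and scheduling them).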

Suppose my log directory is structured like this:

~ ❯ tree log
log
├── 1.log
├── 2.log
├── 3.log
└── schedules
    ├── 1.log
    ├── 2.log
    └── 3.log

1 directory, 6 files

So first you would get the list of files (Python 3).
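One way to build that list, offered as a sketch rather than as the answer's own code: walk the directory tree with os.walk and keep every file whose name ends in .log. The helper name list_log_files and the ".log" suffix are assumptions made for illustration; os.walk behaves the same way on Python 2.7 and Python 3.

```python
import os


def list_log_files(root, suffix=".log"):
    """Return the sorted paths of all files under root whose names end with suffix."""
    paths = []
    # os.walk visits root and every subdirectory below it
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                paths.append(os.path.join(dirpath, name))
    return sorted(paths)


# "log" is the example directory shown above; a missing directory yields an empty list
list_of_files = list_log_files("log")
```

Each entry in list_of_files is a plain path string, which is all the worker function needs.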

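Each file (for now just a string variable) is something you want a thread to process. So you write one generic method that takes a file argument and looks for the strings of interest in that file. It is basically the same code you would write in a normal Python program, for example: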

def find_string(file):
    # insert your specific code to find your string
    # including opening the file and such
    # returning values also possible see further down
    print(file)
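To make the placeholder concrete, here is a minimal sketch of a worker that returns counts instead of printing. The name count_strings, the search strings "ERROR" and "WARNING", and the UTF-8 encoding are all assumptions; substitute whatever your logs actually contain.

```python
import io
from collections import Counter


def count_strings(path, needles=("ERROR", "WARNING")):
    """Count, for each needle, the lines of one file that contain it."""
    counts = Counter()
    # io.open accepts encoding/errors on both Python 2.7 and Python 3
    with io.open(path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            for needle in needles:
                if needle in line:
                    counts[needle] += 1
    return counts
```

Returning a Counter keeps each task independent of the others, so no locking is needed; the per-file counters can simply be added together afterwards.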

So now you just send this work to the thread pool:

from concurrent.futures import ThreadPoolExecutor

# We can use a with statement to ensure threads are cleaned up promptly
with ThreadPoolExecutor() as executor:
    # Basically the same as if you did the normal for-loop
    for file in list_of_files:
        # But you submit your method to the pool instead.
        future = executor.submit(find_string, file)  # see future.result() too

# Leaving the with block waits for every submitted task to finish
print("All tasks complete")
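Since the question asks for a combined result, the futures have to be kept and their results collected. The following is a minimal sketch of that, assuming a single search string per run; count_lines_with, combined_counts, and the max_workers value of 4 are hypothetical names and settings, not part of the original answer.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def count_lines_with(path, needle):
    """Return how many lines of the file at path contain needle."""
    with open(path) as handle:
        return sum(1 for line in handle if needle in line)


def combined_counts(paths, needle, max_workers=4):
    """Count matches per file in a thread pool and return (per_file, total)."""
    per_file = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the path it was submitted for
        futures = {executor.submit(count_lines_with, p, needle): p for p in paths}
        for future in as_completed(futures):
            # result() re-raises any exception raised inside the worker
            per_file[futures[future]] = future.result()
    return per_file, sum(per_file.values())
```

Only the main thread touches per_file, so the workers never share mutable state. A call such as combined_counts(list_of_files, "ERROR") would return the per-file counts together with their sum.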

There is a good complete example (search for "ThreadPoolExecutor example") that opens a list of websites and prints the size of each page in bytes. You could adapt it to a file search.

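The bottleneck here may be that the number of files is huge and your disk reads are too slow. Having the log files spread over multiple disks would be one solution.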

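Another note: multithreading is usually used for network operations or I/O, so reading files is a good use for it. However, your application also does some processing. Depending on how CPU-intensive that processing is, you may want to look at ProcessPoolExecutor, which uses the multiprocessing module to spread the work over several processors. It shares the same interface as ThreadPoolExecutor.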
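Because the two executors share an interface, switching between them can be a one-argument change. The sketch below assumes a hypothetical worker count_matches and a hypothetical helper total_matches; whether processes actually beat threads depends on how much CPU work each file needs, so measure before committing.

```python
from concurrent.futures import ProcessPoolExecutor


def count_matches(path, needle):
    """Return how many lines of the file at path contain needle.

    Defined at module level so that worker processes can import it.
    """
    with open(path) as handle:
        return sum(1 for line in handle if needle in line)


def total_matches(paths, needle, executor_cls=ProcessPoolExecutor):
    """Sum the matches over all paths using the given executor class."""
    paths = list(paths)
    with executor_cls() as executor:
        # map() pairs each path with the same needle and yields results in order
        results = executor.map(count_matches, paths, [needle] * len(paths))
        return sum(results)
```

Passing executor_cls=ThreadPoolExecutor runs the same code on threads. With the process-based default, call total_matches from under an `if __name__ == "__main__":` guard, since platforms that start workers by spawning a fresh interpreter re-import the main module.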

Hope this makes sense.

Thank you very much for your suggestions. I will look into it.
