
Can Python concurrent.futures speed up PyPDF2?


I created a program that searches all PDF files in a directory for a word or phrase. If the phrase is found in a given PDF, the pages containing it are extracted and saved as a new PDF.

The program is slow. I will need to run it over 1000+ PDFs, so speeding it up with multiprocessing/concurrent.futures would be very helpful. However, I can't seem to get it working properly.

Is there a straightforward way to enable multiprocessing in the code below?

import PyPDF2
import re
import os
import glob
from pathlib import Path

String = input("Enter search string: ")
inputDir = Path(input("Enter path to directory containing PDFs to search: "))
outputDir = Path(input("Enter path to directory where you would like PDFs saved: "))
outputAppend = input("Text (including separator) to be appended to end of filenames (blank if none): ")
inputDir_glob = str(inputDir) + "/*.pdf"

PDFlist = sorted(glob.glob(inputDir_glob))

if not os.path.exists(str(outputDir)):
    os.makedirs(str(outputDir))


for filename in PDFlist:

    object = PyPDF2.PdfFileReader(filename, strict=False)

    # Get number of pages in the pdf
    NumPages = object.getNumPages()

    # Setup the file writer
    output = PyPDF2.PdfFileWriter()

    # Do the search
    for i in range(0, NumPages):
        PageObj = object.getPage(i)
        Text = PageObj.extractText()
        if re.search(String, Text):
            print("File: " + filename + "  |  " + "Page: " + str(i))
            output.addPage(object.getPage(i))
            outputStream = open(str(outputDir) + "/" + os.path.splitext(os.path.basename(filename))[0] + outputAppend + ".pdf", "wb")
            output.write(outputStream)
            outputStream.close()
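One caveat with the search as written: `re.search(String, Text)` interprets the user's input as a regular expression, so a literal phrase containing metacharacters (e.g. `C++` or `3.5 (rev)`) can raise an error or match the wrong thing. A minimal illustration of escaping the phrase first (the sample strings here are made up for demonstration):

```python
import re

phrase = "C++ (draft)"  # user input containing regex metacharacters
text = "Notes on the C++ (draft) standard."

# re.search(phrase, text) would raise re.error here, because +, (, )
# are regex operators; re.escape turns the phrase into a literal pattern.
match = re.search(re.escape(phrase), text)
print(match is not None)  # True
```

If regex input is actually intended, the original call is fine; otherwise wrapping `String` in `re.escape()` makes the search behave like a plain substring match.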

I eventually figured this out and thought I would share it in case anyone else runs into a similar problem. The solution below is much faster than the original code posted above:

import PyPDF2
import re
import os
import glob
from pathlib import Path
import concurrent.futures

# Enter the search term here:
String = input("Enter search string: ")

#Enter directory containing original PDFs:
inputDir = Path(input("Enter path to directory containing PDFs to search: "))
outputDir = Path(input("Enter path to directory where you would like PDFs saved: "))
outputAppend = input("Text (including separator) to be appended to end of filenames (blank if none): ")
inputDir_glob = str(inputDir) + "/*.pdf"

PDFlist = sorted(glob.glob(inputDir_glob))

if not os.path.exists(str(outputDir)):
    os.makedirs(str(outputDir))

def process_file(filename):
    object = PyPDF2.PdfFileReader(filename, strict=False)
    NumPages = object.getNumPages()
    output = PyPDF2.PdfFileWriter()

    # Do the search
    for i in range(0, NumPages):
        PageObj = object.getPage(i)
        Text = PageObj.extractText()
        if re.search(String, Text):
            print("File: " + filename + "  |  " + "Page: " + str(i))
            output.addPage(object.getPage(i))
            outputStream = open(str(outputDir) + "/" + os.path.splitext(os.path.basename(filename))[0] + outputAppend + ".pdf", "wb")
            output.write(outputStream)
            outputStream.close()
            #os.rename(filename, Path(str(outputDir) + "/Originals/" + str(os.path.basename(filename))))

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    result = executor.map(process_file, PDFlist)
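Since PyPDF2's text extraction is CPU-bound, threads share Python's GIL and may not parallelize the heavy work; swapping in `ProcessPoolExecutor` is worth trying. A minimal sketch of the same `executor.map` pattern with processes, using a toy stand-in for `process_file` (any picklable, top-level function works the same way):

```python
import concurrent.futures

def count_matches(text):
    # Toy CPU-bound work standing in for PDF text extraction + search.
    # In the real program this would be process_file(filename).
    return sum(1 for line in text.splitlines() if "needle" in line)

if __name__ == "__main__":  # required guard for process pools on spawn platforms
    docs = ["needle in line\nno match", "nothing here", "needle\nneedle"]
    # Processes sidestep the GIL, so CPU-bound work runs in parallel.
    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(count_matches, docs))
    print(results)  # [1, 0, 2]
```

Note that the worker function and its arguments must be picklable with `ProcessPoolExecutor`, which is why `process_file` is defined at module level in the answer above.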