PyPDF2 write doesn't work on some PDF files (Python 3.5.1)

python python-3.x pdf

First of all, I am using Python 3.5.1 (32-bit version). I wrote the following program, which uses PyPDF2 and reportlab to add page numbers to every page of my PDF files:

#import modules
from os import listdir
from PyPDF2 import PdfFileWriter, PdfFileReader
import io
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import A4
#initial values of variable declarations
PDFlist=[]
X_value=460
Y_value=820
# Make a list of all files in the directory
filelist = listdir()
# Make a list of all PDF files in the directory
for filename in filelist:
    # compare the extension directly instead of scanning character by character
    if filename.lower().endswith('.pdf'):
        PDFlist.append(filename)
# Give the horizontal position for the page number (Enter = use default value of 460)
User = input('Give horizontal position page number (ENTER = default 460): ')
if User != "":
    X_value=int(User)
# Give the vertical position for the page number (Enter = use default value of 820)
User = input('Give vertical position page number (ENTER = default 820): ')
if User != "":
    Y_value=int(User)

for i in range(0,len(PDFlist)):
    filename=PDFlist[i]

    # read the PDF
    existing_pdf = PdfFileReader(open(filename, "rb"))
    print("File: "+filename)
    # count the number of pages
    number_of_pages = existing_pdf.getNumPages()
    print("Number of pages detected:"+str(number_of_pages))
    output = PdfFileWriter()

    for k in range(0,number_of_pages):
        packet = io.BytesIO()

        # create a new PDF with Reportlab
        can = canvas.Canvas(packet, pagesize=A4)
        Pagenumber=" Page "+str(k+1)+"/"+str(number_of_pages)
        # we first make a white rectangle to cover any existing text in the pdf
        can.setFillColorRGB(1,1,1)
        can.setStrokeColorRGB(1,1,1)
        can.rect(X_value-10,Y_value-5,120,20,fill=1)
        # set the font and size
        can.setFont("Helvetica",14)
        # choose color of page numbers (red)
        can.setFillColorRGB(1,0,0)
        can.drawString(X_value, Y_value, Pagenumber)
        can.save()
        print(Pagenumber)

        # move to the beginning of the BytesIO buffer
        packet.seek(0)
        new_pdf = PdfFileReader(packet)
        # add the "watermark" (which is the new pdf) on the existing page
        page = existing_pdf.getPage(k)
        page.mergePage(new_pdf.getPage(0))
        output.addPage(page)
    # finally, write "output" to a real file
    ResultPDF="Output/"+filename
    with open(ResultPDF, "wb") as outputStream:
        output.write(outputStream)
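The file-discovery step at the top of the program can also be written by comparing the extension directly. The following is a minimal sketch using only the standard library; the helper name `list_pdf_files` is made up for this illustration and is not part of the original program.

```python
import os

def list_pdf_files(directory="."):
    """Return the names of all regular files in `directory` ending in .pdf."""
    pdf_names = []
    for name in sorted(os.listdir(directory)):
        # splitext splits at the last dot only, so "a.pdf.bak" is not matched
        _, extension = os.path.splitext(name)
        full_path = os.path.join(directory, name)
        if extension.lower() == ".pdf" and os.path.isfile(full_path):
            pdf_names.append(name)
    return pdf_names
```

Calling `list_pdf_files()` with no argument scans the current working directory, as `listdir()` does in the program above.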

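The program works for quite a lot of PDF files, although it sometimes generates a warning such as

PdfReadWarning: Superfluous whitespace found in object header b'16' b'0' [pdf.py:1666]

but the generated output file looks fine to me. However, the program fails on some PDF files, even though my Adobe Acrobat can read and edit them without any problem. My impression is that the error occurs mostly with scanned PDF files, though not with all of them (I have also numbered scanned PDF files that produced no error). I get the following error message (the first 8 lines are the output of my own print statements):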

File: Scanned file.pdf
Number of pages detected:6
 Page 1/6
 Page 2/6
 Page 3/6
 Page 4/6
 Page 5/6
 Page 6/6
PdfReadWarning: Object 25 1 not defined. [pdf.py:1629]
Traceback (most recent call last):
  File "C:\Users\User\AppData\Local\Programs\Python35-32\Sourcecode\PDFPager.py", line 83, in <module>
    output.write(outputStream)
  File "C:\Users\User\AppData\Local\Programs\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 482, in write
    self._sweepIndirectReferences(externalReferenceMap, self._root)
  File "C:\Users\User\AppData\Local\Programs\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 571, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "C:\Users\User\AppData\Local\Programs\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 547, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "C:\Users\User\AppData\Local\Programs\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 571, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "C:\Users\User\AppData\Local\Programs\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 547, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "C:\Users\User\AppData\Local\Programs\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 556, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, data[i])
  File "C:\Users\User\AppData\Local\Programs\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 571, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "C:\Users\User\AppData\Local\Programs\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 547, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "C:\Users\User\AppData\Local\Programs\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 556, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, data[i])
  File "C:\Users\User\AppData\Local\Programs\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 577, in _sweepIndirectReferences
    newobj = data.pdf.getObject(data)
  File "C:\Users\User\AppData\Local\Programs\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 1631, in getObject
    raise utils.PdfReadError("Could not find object.")
PyPDF2.utils.PdfReadError: Could not find object.

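Apparently the pages are merged with the PDF created by reportlab (see the lines up to Page 6/6), but in the end PyPDF2 cannot produce an output PDF file (I get an unreadable 0-byte output file).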
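The 0-byte file is left behind because the output file is opened, and therefore created, before the write call raises. One way to avoid that, sketched here with only the standard library, is to let the writer fill an in-memory buffer first and touch the disk only if that succeeds. `save_if_writable` is a hypothetical helper name, and `write_func` stands for any callable that writes to a binary stream, such as the `output.write` method in the program above. This does not repair the PDF; it only keeps a failed write from producing an empty file.

```python
import io

def save_if_writable(write_func, path):
    """Call write_func(buffer); create `path` only if that call succeeds.

    Returns True if the file was written, False if write_func raised.
    """
    buffer = io.BytesIO()
    try:
        write_func(buffer)          # e.g. output.write in the program above
    except Exception as error:      # in the traceback above: PdfReadError
        print("Skipping '{}': {}".format(path, error))
        return False
    # only now is the target file created on disk
    with open(path, "wb") as target:
        target.write(buffer.getvalue())
    return True
```

In the program above, this would stand in for the final lines that open `ResultPDF` and call `output.write`, for example `save_if_writable(output.write, ResultPDF)`.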

Can anyone explain how to solve this problem? I searched the internet but could not find an answer.

Make the following changes in pdf.py:

On line 1633 of pdf.py, uncomment the "if self.strict" line,

and at line 501 of pdf.py add a try/except block.

Cheers.

Using strict=False worked for me:

from PyPDF2 import PdfFileMerger

pdfs = [r'file 1.pdf', r'file 2.pdf']

merger = PdfFileMerger(strict=False)

for pdf in pdfs:
    merger.append(pdf)

merger.write(r"thanks mate.pdf")

Here is my workaround. Try writing the file into a dummy BytesIO stream to check whether it is broken:

    # (runs inside a loop over the input files: input_file is the opened PDF,
    # file_path is its name)
    try:
        reader = PdfFileReader(input_file)
        print("Opening '{}', pages={}".format(file_path, reader.getNumPages()))
        # Try to write it into a dummy BytesIO stream to check whether the PDF is broken
        writer = PdfFileWriter()
        writer.addPage(reader.getPage(0))
        writer.write(io.BytesIO())
    except PdfReadError:
        print("Error reading '{}'".format(file_path))
        continue

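I got the same error message when calling the same function. Are your PDF files fillable? The problem went away when I converted the PDF into a "regular" read-only PDF.

In the meantime I also found a workaround: printing the PDF file through a PDF printer. That solved the problem.

Haha, yes, that is indeed equivalent.

I think you should first check whether the files are corrupted before merging them, and only then merge them. If a file is corrupted or was not fully downloaded, the merge will not succeed.

The files are not corrupted. I can read them with a PDF reader without any problem. However, I cannot merge them using Python code.

Hey, yes, I just re-ran it with it set to True and the document was created, just with a bunch of warnings. I thought it fixed the problem of a new document not being created, but my problem must be different.

I tried merging only some pages instead of the whole PDF files. I still get the error with strict=False.

Modifying pdf.py with the changes described works. So pdf.py was never corrected?

Cool. This patch should definitely be pushed to master. However, PyPDF2 seems to be unmaintained now :(

The same fix solved the same problem on PyPDF4; I posted a link to this thread on a related bug. PyPDF4 does not seem to be as active as PyPDF2.

@Watusimoto Thanks for letting me know! I added a comment below. Hopefully the repo owner notices it.

@bmg - I also posted this question on the related thread; feel free to reply either here or there. I am cross-posting it here: We would like to incorporate your solution to resolve this issue, but we are unsure of its consequences. It looks like errors are simply being ignored, and a conditional that one would assume was commented out on purpose is being uncommented. Do you understand why this fixes the problem?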

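The try/except block from the pdf.py answer (line 501):

    try:
        obj.writeToStream(stream, key)
        stream.write(b_("\nendobj\n"))
    except:
        # note: this silently ignores any error raised while writing the object
        pass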
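The concern raised in the comments is that a bare except followed by pass hides every failure, including ones unrelated to missing objects. A more cautious shape for the same idea, shown here on stand-in objects rather than on PyPDF2 itself, is to catch `Exception` only, record what was skipped, and report it afterwards. All names in this sketch (`write_objects`, `GoodObject`, `BrokenObject`) are invented for illustration; only the method name `writeToStream` mirrors the snippet above, and its signature is simplified here.

```python
import io

class GoodObject:
    """Stand-in for an object that serializes without problems."""
    def __init__(self, payload):
        self.payload = payload

    def writeToStream(self, stream):
        stream.write(self.payload)

class BrokenObject:
    """Stand-in for a dangling reference that cannot be serialized."""
    def writeToStream(self, stream):
        raise ValueError("Could not find object.")

def write_objects(objects, stream):
    """Write each object; return the indices of the ones that were skipped."""
    skipped = []
    for index, obj in enumerate(objects):
        scratch = io.BytesIO()
        try:
            obj.writeToStream(scratch)
        except Exception as error:   # narrower than a bare except
            print("Skipping object {}: {}".format(index, error))
            skipped.append(index)
            continue
        # copy to the real stream only after the object serialized completely
        stream.write(scratch.getvalue())
        stream.write(b"\nendobj\n")
    return skipped
```

Returning the skipped indices lets the caller decide whether a partially written file is acceptable, instead of finding out later from a damaged document.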
