Search and replace placeholder text in a PDF with Python

python pdf

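I need to generate customized PDF copies of a template document. The easiest way, I thought, was to create a source PDF containing placeholder text wherever customization is needed, and then replace those placeholders with the correct values.

I have searched everywhere, but is there really no way to take the source template PDF, replace the placeholders with actual values, and write the result to a new PDF?

I looked at PyPDF2 and ReportLab, but neither seems able to do this.

Any suggestions? Most of my searches turn up the Perl module CAM::PDF, but I would prefer to keep everything in Python.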

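There is no direct way to do this reliably. PDFs are not like HTML: they specify the position of text character by character. They may not even include the whole font used to render the text, only the characters needed for the specific text in the document. I have not found a library that reflows paragraphs after the text is updated. PDF is largely a display-only format, so you are much better off using a tool that converts markup to PDF than updating a PDF in place.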
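
To make the point about character-level positioning concrete, here is a small sketch. The content-stream fragment is made up for illustration (real streams are usually compressed and may use font-specific encodings), and `text_shown` is a hypothetical helper, not part of any PDF library. It shows why a plain byte search for a placeholder can fail even though the page visibly displays it.

```python
import re

# Hypothetical, uncompressed content-stream fragment. The TJ operator shows
# an array of string pieces separated by kerning adjustments, so the word
# "PLACEHOLDER" is never stored as one contiguous run of bytes.
stream = b"BT /F1 12 Tf 72 720 Td [(PLA) -15 (CEHO) 20 (LDER)] TJ ET"


def text_shown(content: bytes) -> str:
    """Join the literal strings inside every TJ array (simplified model).

    Assumes plain ASCII literal strings with no escapes, nested parentheses,
    or hex strings, which real PDFs frequently violate.
    """
    pieces = []
    for array in re.findall(rb"\[(.*?)\]\s*TJ", content, flags=re.S):
        pieces.extend(re.findall(rb"\((.*?)\)", array))
    return b"".join(pieces).decode("ascii")


print(b"PLACEHOLDER" in stream)  # naive search on the raw bytes
print(text_shown(stream))        # text reassembled from the pieces
```

Even when the text can be reassembled like this, writing a replacement back means regenerating the pieces and their spacing, which is where the missing glyphs and the lack of reflow become a problem.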

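If that is not an option, you can create a PDF form in something like Acrobat and then fill it in with a PDF manipulation library, one of which has a nice Clojure wrapper, that can handle some of this.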

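In my experience, Python's support for writing PDFs is quite limited. Java has by far the best language support. Also, you get what you pay for, so if you are using iText commercially, paying for an iText license is probably worth it. I have had fairly good results writing Python wrappers around PDF manipulation CLI tools such as pdfboxing and ghostscript. For your use case that may be much easier than shoehorning this into Python's PDF ecosystem.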
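
As a rough illustration of the wrapper idea, the sketch below builds and runs a Ghostscript command from Python. The function names are hypothetical, the binary is assumed to be called `gs` (it is named differently on Windows), and the command only rewrites a PDF through the `pdfwrite` device. It does not replace any text by itself; it only shows the shape such a wrapper might take.

```python
import shutil
import subprocess
from pathlib import Path


def ghostscript_command(src: Path, dst: Path, gs_binary: str = "gs") -> list[str]:
    """Build a Ghostscript command that rewrites src into dst via pdfwrite."""
    return [
        gs_binary,
        "-dBATCH",            # exit after processing instead of going interactive
        "-dNOPAUSE",          # do not pause between pages
        "-sDEVICE=pdfwrite",  # emit PDF output
        f"-sOutputFile={dst}",
        str(src),
    ]


def rewrite_pdf(src: Path, dst: Path) -> None:
    """Run Ghostscript, failing loudly if it is missing or returns an error."""
    if shutil.which("gs") is None:
        raise FileNotFoundError("Ghostscript ('gs') was not found on PATH")
    subprocess.run(ghostscript_command(src, dst), check=True)


if __name__ == "__main__":
    # Only prints the command; call rewrite_pdf() to actually run it.
    print(ghostscript_command(Path("template.pdf"), Path("out.pdf")))
```

Keeping the command construction separate from the `subprocess.run` call makes the wrapper easy to check without the tool installed.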

There is no definitive solution, but I found two that work in most cases.

In Python, pdf_redactor gives good results. Here is sample code:

import re

import pdf_redactor

options = pdf_redactor.RedactorOptions()

# Redact a specific name and anything that looks like a Social Security
# number, replacing the text with X's.
options.content_filters = [
        # First replace a literal name.
        (
                re.compile(u"Tom Xavier"),
                lambda m : "XXXXXXX"
        ),

        # Then do an actual SSN regex.
        # See https://github.com/opendata/SSN-Redaction for why this regex is complicated.
        (
                re.compile(r"(?<!\d)(?!666|000|9\d{2})([OoIli0-9]{3})([\s-]?)(?!00)([OoIli0-9]{2})\2(?!0{4})([OoIli0-9]{4})(?!\d)"),
                lambda m : "XXX-XX-XXXX"
        ),
]

# Perform the redaction using PDF on standard input and writing to standard output.
pdf_redactor.redactor(options)
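
The same (pattern, function) shape could in principle target placeholders instead of sensitive data. The sketch below does not call pdf_redactor at all. It is a plain-string model of how such a filter list behaves, using only `re`. The `{{name}}` placeholder syntax, the `values` dictionary, and `apply_filters` are assumptions made for illustration.

```python
import re

# Hypothetical placeholder values; the {{name}} syntax is an assumption.
values = {"first_name": "Ada", "last_name": "Lovelace"}

# Same (compiled regex, replacement function) shape as options.content_filters.
content_filters = [
    (
        re.compile(r"\{\{(\w+)\}\}"),
        # Leave unknown placeholders untouched rather than guessing a value.
        lambda m: values.get(m.group(1), m.group(0)),
    ),
]


def apply_filters(text: str, filters) -> str:
    """Apply each (pattern, function) pair in order, as a plain-string model."""
    for pattern, function in filters:
        text = pattern.sub(function, text)
    return text


result = apply_filters("Dear {{first_name}} {{last_name}}, ref {{unknown}}", content_filters)
print(result)  # Dear Ada Lovelace, ref {{unknown}}
```

In a real PDF the replacement text can only use characters that the embedded font actually contains, so a value with glyphs missing from the template may not render as expected.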

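The second is HexaPDF, in Ruby. The script below draws filled rectangles over each string passed on the command line:

require 'hexapdf'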

class ShowTextProcessor < HexaPDF::Content::Processor

  def initialize(page, to_hide_arr)
    super()
    @canvas = page.canvas(type: :overlay)
    @to_hide_arr = to_hide_arr
  end

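  # Called for every text-showing operator on the page.
  def show_text(str)
    boxes = decode_text_with_positioning(str)
    return if boxes.string.empty?
    if @to_hide_arr.include? boxes.string
        # The rectangles are filled, so set the fill color (not the stroke color) to black.
        @canvas.fill_color(0, 0, 0)

        # Cover each glyph box of the matching string with a filled rectangle.
        boxes.each do |box|
          x, y = *box.lower_left
          tx, ty = *box.upper_right
          @canvas.rectangle(x, y, tx - x, ty - y).fill
        end
    end

  end
  alias :show_text_with_positioning :show_text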

end

file_name = ARGV[0]
strings_to_black = ARGV[1].split("|")

doc = HexaPDF::Document.open(file_name)
puts "Blacken strings [#{strings_to_black}], inside [#{file_name}]."
doc.pages.each do |page|
  processor = ShowTextProcessor.new(page, strings_to_black)
  page.process_contents(processor)
end

new_file_name = "#{file_name.split('.').first}_updated.pdf"
doc.write(new_file_name, optimize: true)

puts "Writing updated file [#{new_file_name}]."

As another solution, you can try Aspose.PDF Cloud, which provides a feature for replacing text in PDF documents.
First, install the Aspose.PDF Cloud SDK for Python:
pip install asposepdfcloud

The sample code uploads a PDF file to cloud storage and replaces multiple strings in the PDF document:
import os 
import asposepdfcloud 
from asposepdfcloud.apis.pdf_api import PdfApi 
 
# Get App key and App SID from https://aspose.cloud 
pdf_api_client = asposepdfcloud.api_client.ApiClient( 
    app_key='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx', 
    app_sid='xxxxx-xxxx-xxxx-xxxx-xxxxxxxx') 
 
pdf_api = PdfApi(pdf_api_client) 
filename = '02_pages.pdf' 
remote_name = '02_pages.pdf' 
 
#upload PDF file to storage 
pdf_api.upload_file(remote_name,filename) 
 
#Replace Text 
text_replace1 = asposepdfcloud.models.TextReplace(old_value='origami',new_value='aspose',regex='true') 
text_replace2 = asposepdfcloud.models.TextReplace(old_value='candy',new_value='biscuit',regex='true') 
text_replace_list = asposepdfcloud.models.TextReplaceListRequest(text_replaces=[text_replace1,text_replace2]) 
 
response = pdf_api.post_document_text_replace(remote_name, text_replace_list) 
print(response)


I am a developer evangelist at Aspose.
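PDFs can take many forms and use compression, so this is not easy. If you are free to use another rich format, you could choose .docx or .rtf: .docx is zipped XML (and therefore text), and .rtf is text with markup. HTML is also a good choice for templates.

Still, I would recommend generating the PDF with ReportLab. Once you have the source for the PDF you want to generate, making it flexible is very simple. Take a look at the examples.

You can use ReportLab RML to create the template (it is just a text file) and then add content dynamically. Take a look at the "Going with the flow" section.

Have you tried pdf redactor in Python?

I found the examples on the HexaPDF site good, and there is a similar one. As far as I know (I have not done this yet), you can also write text onto the overlay with a solid white background so that it covers the old text, which would solve the problem.

When I log in, it shows the following message: Oops! Something went wrong. An error occurred while logging in with the external provider. The error message is: access_denied Request Id:904f9113-d60b-4a50-9645-7284f257a0fe

@FarhanKhan Please share more details about the issue you are facing, i.e. the link or code that triggers the error.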
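
On the suggestion in this thread that HTML makes a good template format, here is a minimal sketch of the substitution step using only the standard library. The template text, the `$first_name` and `$last_name` placeholder names, and the `render` helper are all hypothetical. Converting the resulting HTML to PDF is left to whichever converter you choose and is not shown.

```python
from html import escape
from string import Template

# Hypothetical HTML template; $first_name and $last_name are assumed names.
TEMPLATE = Template(
    "<html><body><p>Dear $first_name $last_name,</p></body></html>"
)


def render(first_name: str, last_name: str) -> str:
    """Fill the placeholders, escaping values so they cannot inject markup."""
    return TEMPLATE.substitute(
        first_name=escape(first_name),
        last_name=escape(last_name),
    )


html = render("Ada", "Lovelace & Co")
print(html)
```

Because the substitution happens on text before any layout, the converter can reflow the paragraph around values of any length, which is exactly what in-place PDF editing cannot do.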