How do I convert a PDF to Word using the Acrobat SDK?


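My .NET application needs to convert PDF documents to Word format programmatically.

I evaluated several products and found one that offers a Save As option, which can save a document in Word/Excel format. I tried using the Acrobat SDK, but could not find proper documentation on where to start.

I looked at their IAC samples, but I don't understand how to invoke the menu item and make it execute the "Save As" option.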

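Adobe does not support PDF-to-Word conversion unless you are using the Acrobat PDF client.
That means you can't do it with their SDK, and you can't call it from the command line. You can only do it manually.

You can do this with Acrobat X Pro, but you need to use the JavaScript API from C#.

Hope that helps.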

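I did something similar using WinPython x64 2.7.6.3 and Acrobat X Pro, using the JSObject interface to convert PDFs to DOCX. In essence, the solution is the same as

Here is the complete code for converting a set of PDFs to DOCX: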

# Gets all files under ROOT_INPUT_PATH that match INPUT_FILE_EXTENSION and converts each one into ROOT_OUTPUT_PATH, keeping the input file name but replacing its extension with OUTPUT_FILE_EXTENSION
from win32com.client import Dispatch
from win32com.client.dynamic import ERRORS_BAD_CONTEXT

import winerror

# try importing scandir and, if it is available, use it, since it is considerably faster than the stock os.walk
try:
    from scandir import walk
except ImportError:
    from os import walk

import fnmatch

import sys
import os

ROOT_INPUT_PATH = None
ROOT_OUTPUT_PATH = None
INPUT_FILE_EXTENSION = "*.pdf"
OUTPUT_FILE_EXTENSION = ".docx"

def acrobat_extract_text(f_path, f_path_out, f_basename, f_ext):
    avDoc = Dispatch("AcroExch.AVDoc") # Connect to Adobe Acrobat

    # Open the input file (as a pdf)
    ret = avDoc.Open(f_path, f_path)
    assert(ret) # FIXME: Documentation says "-1 if the file was opened successfully, 0 otherwise", but this is a bool in practice?

    pdDoc = avDoc.GetPDDoc()

    dst = os.path.join(f_path_out, ''.join((f_basename, f_ext)))

    # Adobe documentation says "For that reason, you must rely on the documentation to know what functionality is available through the JSObject interface. For details, see the JavaScript for Acrobat API Reference"
    jsObject = pdDoc.GetJSObject()

    # Here you can save as many other types by using, for instance: "com.adobe.acrobat.xml"
    jsObject.SaveAs(dst, "com.adobe.acrobat.docx") # NOTE: If you want to save the file as a .doc, use "com.adobe.acrobat.doc"

    pdDoc.Close()
    avDoc.Close(True) # We want this to close Acrobat, as otherwise Acrobat is going to refuse processing any further files after a certain threshold of open files are reached (for example 50 PDFs)
    del pdDoc

if __name__ == "__main__":
    assert(5 == len(sys.argv)), sys.argv # <script name>, <script_file_input_path>, <script_file_input_extension>, <script_file_output_path>, <script_file_output_extension>

    #$ python get.docx.from.multiple.pdf.py 'C:\input' '*.pdf' 'C:\output' '.docx' # NOTE: If you want to save the file as a .doc, use '.doc' instead of '.docx' here and ensure you use "com.adobe.acrobat.doc" in the jsObject.SaveAs call

    ROOT_INPUT_PATH = sys.argv[1]
    INPUT_FILE_EXTENSION = sys.argv[2]
    ROOT_OUTPUT_PATH = sys.argv[3]
    OUTPUT_FILE_EXTENSION = sys.argv[4]

    # tuples are of schema (path_to_file, filename)
    matching_files = ((os.path.join(_root, filename), os.path.splitext(filename)[0]) for _root, _dirs, _files in walk(ROOT_INPUT_PATH) for filename in fnmatch.filter(_files, INPUT_FILE_EXTENSION))

    # patch ERRORS_BAD_CONTEXT as per https://mail.python.org/pipermail/python-win32/2002-March/000265.html
    global ERRORS_BAD_CONTEXT
    ERRORS_BAD_CONTEXT.append(winerror.E_NOTIMPL)

    for filename_with_path, filename_without_extension in matching_files:
        print "Processing '{}'".format(filename_without_extension)
        acrobat_extract_text(filename_with_path, ROOT_OUTPUT_PATH, filename_without_extension, OUTPUT_FILE_EXTENSION)
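
The one-line generator that builds matching_files is dense, so here is a minimal sketch of the same file discovery written out as a function, and it runs without Acrobat. The helper names find_matching_files and output_path, and the demo file names, are made up for illustration; only os.walk, fnmatch.filter and os.path come from the script above.

```python
import fnmatch
import os
import tempfile


def find_matching_files(root, pattern):
    """Yield (full_path, name_without_extension) for files under root that match pattern."""
    for dirpath, _dirs, files in os.walk(root):
        for filename in fnmatch.filter(files, pattern):
            yield os.path.join(dirpath, filename), os.path.splitext(filename)[0]


def output_path(out_dir, basename, ext):
    """Build the destination path the same way the script does (flat output folder)."""
    return os.path.join(out_dir, basename + ext)


if __name__ == "__main__":
    # Throwaway demo tree; the file names are hypothetical.
    demo_root = tempfile.mkdtemp()
    os.mkdir(os.path.join(demo_root, "sub"))
    for rel in ("a.pdf", "notes.txt", os.path.join("sub", "b.pdf")):
        open(os.path.join(demo_root, rel), "w").close()
    for path, base in sorted(find_matching_files(demo_root, "*.pdf")):
        print(base + " -> " + output_path("out", base, ".docx"))
```

Written this way, one property of the script is easier to see: every result goes into a single output folder, so two PDFs with the same name in different subfolders would be written to the same destination file.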

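The solutions posted by jle or me show ways to do this programmatically. If you have Acrobat X Pro, you can try my script; once you have installed WinPython x64 2.7.6.3 (free), it works out of the box.

Hi, I don't have quite the same thing. Thanks for your answer. But this process seems to take quite a long time. If I have to process 1000 files, it would take more than 5 to 6 hours. Is there a faster way?

I added a pdfd.Close() at the end to unlock the file.

Thanks! Very useful. For those interested in exporting to Excel, just change newFile.doc to newFile.xlsx and "com.adobe.acrobat.doc" to "com.adobe.acrobat.xlsx".

What is the alternative to the Dispatch module on a Mac?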
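
Several notes in this thread say that the output extension and the conversion ID passed to SaveAs have to be changed together (.doc, .docx, .xlsx). One possible way to keep them in sync is a small lookup table, sketched below. The names CONVERSION_IDS and conversion_id_for are hypothetical, and the table lists only the IDs quoted in this thread; any other ID would need to be checked against the JavaScript for Acrobat API Reference.

```python
import os

# Conversion IDs quoted in this thread; this is not a complete list.
CONVERSION_IDS = {
    ".docx": "com.adobe.acrobat.docx",
    ".doc": "com.adobe.acrobat.doc",
    ".xlsx": "com.adobe.acrobat.xlsx",
    ".xml": "com.adobe.acrobat.xml",
}


def conversion_id_for(path_or_ext):
    """Return the conversion ID for an extension such as '.docx' or for a file name."""
    # os.path.splitext('.docx') returns ('.docx', ''), so fall back to the raw argument.
    ext = os.path.splitext(path_or_ext)[1] or path_or_ext
    try:
        return CONVERSION_IDS[ext.lower()]
    except KeyError:
        raise ValueError("no known conversion ID for %r" % path_or_ext)


if __name__ == "__main__":
    print(conversion_id_for("newFile.xlsx"))
    print(conversion_id_for(".doc"))
```

With a helper like this, the call in the script could become jsObject.SaveAs(dst, conversion_id_for(dst)), so that only the extension argument on the command line needs to change.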