Extracting Acrobat Document objects from a table in a .docx file in Python

Tags: python, python-docx

I have a .docx file that contains a table. Some of its cells contain embedded Acrobat Document objects.

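I am using the python-docx module to read and extract data from the .docx file, but it does not pick up those embedded documents (when asked for the value of such a cell, it returns an empty string).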
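To illustrate what "embedded in a cell" means at the file level, here is a minimal sketch that uses only the standard library to find which table cells hold an embedded object. It assumes the usual WordprocessingML layout (an `o:OLEObject` element inside the cell, linked through `word/_rels/document.xml.rels`); the function name `find_embedded_objects_in_tables` is made up for this example and is not part of python-docx.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used in word/document.xml and in the relationships part.
NS = {
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    "o": "urn:schemas-microsoft-com:office:office",
    "r": "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
}
PKG_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def _q(prefix, tag):
    """Build a fully qualified '{namespace}tag' name for ElementTree."""
    return f"{{{NS[prefix]}}}{tag}"


def find_embedded_objects_in_tables(docx_path):
    """Return (table, row, cell, prog_id, target) tuples, all zero-based.

    Hypothetical helper: tables are numbered in document order, and an
    object inside a nested table is reported for the outer cell as well.
    """
    with zipfile.ZipFile(docx_path) as zf:
        doc = ET.fromstring(zf.read("word/document.xml"))
        rels = ET.fromstring(zf.read("word/_rels/document.xml.rels"))

    # Map relationship ids (e.g. rId5) to part names (e.g. embeddings/oleObject1.bin).
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels.findall(f"{{{PKG_REL_NS}}}Relationship")
    }

    found = []
    for t_idx, tbl in enumerate(doc.iter(_q("w", "tbl"))):
        for r_idx, row in enumerate(tbl.findall("w:tr", NS)):
            for c_idx, cell in enumerate(row.findall("w:tc", NS)):
                for ole in cell.iter(_q("o", "OLEObject")):
                    rid = ole.get(_q("r", "id"))
                    found.append(
                        (t_idx, r_idx, c_idx, ole.get("ProgID"), targets.get(rid))
                    )
    return found
```

For an Acrobat object the `ProgID` typically starts with `AcroExch`, and the target names the .bin part that holds the data for that cell.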

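Since I don't consider myself skilled enough to try modifying the module's source code, I thought about extracting the embedded documents from the .docx file itself (by [changing the extension to .zip][1]), but the PDF files appear to be stored in .bin format.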
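The rename-to-.zip step does not have to be done by hand, since a .docx file is already a ZIP archive. Below is a rough sketch that copies the embedded parts out with the standard `zipfile` module. It assumes the embedded objects live under `word/embeddings/`, which is where Word normally puts them; `extract_embeddings` and `out_dir` are names chosen only for this example.

```python
import zipfile
from pathlib import Path


def extract_embeddings(docx_path, out_dir):
    """Copy every part under word/embeddings/ out of the .docx archive.

    Returns the list of files written (hypothetical helper).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(docx_path) as zf:
        for name in zf.namelist():
            # Skip directory entries; keep only real embedded parts.
            if name.startswith("word/embeddings/") and not name.endswith("/"):
                target = out_dir / Path(name).name
                target.write_bytes(zf.read(name))
                saved.append(target)
    return saved
```

This yields the same `oleObjectN.bin` files that show up after renaming the document to .zip, so the .bin problem still has to be dealt with separately.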

So I am considering four possible solutions that you could help me with:

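  • Get the python-docx module to recognize embedded Acrobat Document objects
  • Have Python convert the .bin files to PDF
  • Maybe recommend another Python module that can extract tables (and, obviously, their data) from .docx files and that also supports Acrobat Document objects
  • (The one I like least) I have seen some Visual Studio or macro code (I'm not sure which it is; I paste it at the end) that is said to work well for extracting embedded PDF files. The problem is that I'm not really into Visual Studio stuff, so I would need help running the macro from Python (in simple language, keeping in mind that I'm no Visual Studio expert)

The method must use Python (or another not-too-complicated automation method), because this process has to be run many times and manual extraction is completely ruled out.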
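For the .bin-to-PDF option, one naive approach is to carve the PDF out of the container by looking for its `%PDF-` header and the last `%%EOF` marker. This is only a sketch under a strong assumption: that the PDF bytes sit contiguously inside the .bin file. The .bin is an OLE compound file, which may store a stream in scattered sectors, in which case the carved result would be corrupt and a real compound-file parser (for example the third-party `olefile` package) would be the more robust route. `carve_pdf` is a made-up name.

```python
def carve_pdf(ole_bytes):
    """Return the bytes from the first '%PDF-' to the last '%%EOF', or None.

    Assumes the PDF is stored contiguously inside the container
    (hypothetical helper, not a full OLE parser).
    """
    start = ole_bytes.find(b"%PDF-")
    if start == -1:
        return None
    end = ole_bytes.rfind(b"%%EOF")
    if end == -1 or end < start:
        return None
    return ole_bytes[start:end + len(b"%%EOF")]
```

Usage would be along the lines of reading an extracted `oleObject1.bin` in binary mode, passing its contents to `carve_pdf`, and writing the result to a `.pdf` file when it is not `None`.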

Thanks a lot in advance.

The Visual Studio code I saw on the Internet (copy-pasted):

    Sub ExtractEmbeddedDocs()
    Dim MyObj As Object
    Dim xlApp As Object
    Dim xlWkb As Object
    
    Dim myshape As Word.InlineShape
    Dim myFormat As Word.WdSaveFormat
    Dim WDDoc As Word.Document
    '    Dim embedObj As OLEObject
    
    Dim FileExtStr As String
    Dim FileFormatNum As Long
    Dim StrInFold As String, StrOutFold As String
    Dim StrDocFile As String, Obj_App As Object, i As Long
    Dim StrFile As String, StrFileList As String, StrMediaFile As String, j As Long
    Dim outFileName As String
    Dim SBar As Boolean
    Dim exten As String
    Dim embedCount As Integer, wordCount As Integer, excelCount As Integer, visioCount As Integer, pptCount As Integer
    Dim pdfCount As Integer
    Dim msg As String, temp As String
    Dim ok As Boolean
    
    Dim docs As Variant, doc As Variant    ', temp As Variant
    Dim AcroApp As Acrobat.CAcroApp
    Dim AcroPDDoc As Acrobat.CAcroPDDoc
    Dim AcroAVDoc As Acrobat.CAcroAVDoc
    Dim jso As Object
    
    
    StrInFold = ActiveDocument.Path
    If StrInFold = "" Then Exit Sub
    ' Store current Status Bar status, then switch on
    SBar = Application.DisplayStatusBar
    Application.DisplayStatusBar = True
    StrOutFold = StrInFold & "\Embedded Files"
    
    Application.ScreenUpdating = False
    
    '    On Error GoTo error_handler
    
    'Test for existing output folder, create if they don't already exist
    If Dir(StrOutFold, vbDirectory) = "" Then MkDir StrOutFold
    
    embedCount = ActiveDocument.InlineShapes.Count
    
    ' This opens the embedded documents, each in their own instance of the program
    For Each myshape In ActiveDocument.InlineShapes
    If (myshape.Type = wdInlineShapeEmbeddedOLEObject) Then
    If (InStr(myshape.OLEFormat.ClassType, "Word") > 0) Then
    myshape.OLEFormat.DoVerb (wdOLEVerbOpen) ' Open the first embedded word doc
    ' Now I want to save it
    
    Set MyObj = GetObject(, "Word.Application")
    If MyObj Is Nothing Then
    ' Word is not running, create new instance
    Set MyObj = CreateObject("Word.Application")
    End If
    
    Set WDDoc = MyObj.ActiveDocument
    
    
    ' older Word files had " on each end of the IconLabel of the embedded file, resulting in a path name with "" in it
    ' this was causing an error 4148 when the SaveAs line executed. Go figure :)
    temp = Trim(Replace(Replace(myshape.OLEFormat.IconLabel, Chr(34), ""), Chr(34), ""))
    If Right(temp, 3) = "doc" Then
    myFormat = wdFormatDocument
    Else
    Select Case Right(temp, 4)
    Case "docx"
    myFormat = wdFormatXMLDocument
    Case "docm"
    myFormat = wdFormatXMLDocumentMacroEnabled
    Case Else
    myFormat = wdFormatDocument
    End Select
    End If
    
    outFileName = StrOutFold & "\" + temp '
    WDDoc.SaveAs2 FileName:=outFileName, FileFormat:=myFormat   'CompatibilityMode:=12    '
    WDDoc.Close savechanges:=False 'MyObj.Application.Documents.Item(1)
    
    wordCount = wordCount + 1
    
    Set WDDoc = Nothing
    '                MyObj.Quit ' will quit the existing instance of Word, don't do!
    Set MyObj = Nothing
    End If
    
    If (InStr(myshape.OLEFormat.ClassType, "Visio") > 0) Then
    myshape.OLEFormat.DoVerb (wdOLEVerbOpen) ' Open the first embedded Visio file
    ' Now I want to save it
    Set MyObj = GetObject(, "Visio.Application")
    If MyObj Is Nothing Then
    ' Visio is not running, create new instance
    Set MyObj = CreateObject("Visio.Application")
    End If
    
    'Debug.Print MyObj.Application.Documents.Item(1) ' Prints the filename
    ' older Word files had " on each end of the IconLabel of the embedded file, resulting in a path name with "" in it
    ' this was causing an error 4148 when the SaveAs line executed. Go figure :)
    temp = Trim(Replace(Replace(myshape.OLEFormat.IconLabel, Chr(34), ""), Chr(34), ""))
    
    outFileName = StrOutFold & "\" + temp '
    MyObj.Application.Documents.Item(1).SaveAs FileName:=outFileName
    MyObj.Application.Documents.Item(1).Close 'savechanges:=False
    
    visioCount = visioCount + 1
    MyObj.Quit
    Set MyObj = Nothing
    End If
    
    If (InStr(myshape.OLEFormat.ClassType, "Excel") > 0) Then
    myshape.OLEFormat.DoVerb (wdOLEVerbOpen) ' Open the first embedded Excel
    ' Now I want to save it
    Set xlApp = GetObject(, "Excel.Application")
    If xlApp Is Nothing Then
    ' Excel is not running, create new instance
    Set xlApp = CreateObject("Excel.Application")
    End If
    Set xlWkb = xlApp.Workbooks(1)
    
    ' older Word files had " on each end of the IconLabel of the embedded file, resulting in a path name with "" in it
    ' this was causing an error 4148 when the SaveAs line executed. Go figure :)
    temp = Trim(Replace(Replace(myshape.OLEFormat.IconLabel, Chr(34), ""), Chr(34), ""))
    
    outFileName = StrOutFold & "\" + temp '
    
    ' find file format from extension
    With xlApp.ActiveWorkbook
    If Val(xlApp.Application.Version) < 12 Then
    'You use Excel 97-2003
    FileExtStr = ".xls": FileFormatNum = -4143
    Else
    'You use Excel 2007-2013
    Select Case .FileFormat
    Case 51: FileExtStr = ".xlsx": FileFormatNum = 51
    Case 52:
    If .HasVBProject Then
    FileExtStr = ".xlsm": FileFormatNum = 52
    Else
    FileExtStr = ".xlsx": FileFormatNum = 51
    End If
    Case 56: FileExtStr = ".xls": FileFormatNum = 56
    Case Else: FileExtStr = ".xlsb": FileFormatNum = 50
    End Select
    End If
    End With
    
    xlApp.Application.DisplayAlerts = False
    xlApp.ActiveWorkbook.SaveAs FileName:=outFileName, FileFormat:=FileFormatNum 'xlWorkbookNormal -4143 xlWorkbookDefault 51
    xlApp.ActiveWorkbook.Close savechanges:=False
    Set xlWkb = Nothing
    xlApp.Quit
    Set xlApp = Nothing
    
    excelCount = excelCount + 1
    End If
    
    If (InStr(myshape.OLEFormat.ClassType, "PowerPoint") > 0) Then
    myshape.OLEFormat.DoVerb (wdOLEVerbOpen) ' Open the first powerpoint file
    ' Now I want to save it
    Set MyObj = GetObject(, "PowerPoint.Application")
    If MyObj Is Nothing Then
    ' Powerpoint is not running, create new instance
    Set MyObj = CreateObject("PowerPoint.Application")
    End If
    
    ' older Word files had " on each end of the IconLabel of the embedded file, resulting in a path name with "" in it
    ' this was causing an error 4148 when the SaveAs line executed. Go figure :)
    temp = Trim(Replace(Replace(myshape.OLEFormat.IconLabel, Chr(34), ""), Chr(34), ""))
    
    outFileName = StrOutFold & "\" + temp '
    myshape.OLEFormat.Object.SaveAs FileName:=outFileName
    myshape.OLEFormat.Object.Close 'savechanges:=False
    
    pptCount = pptCount + 1
    MyObj.Quit
    Set MyObj = Nothing
    End If
    
    If (InStr(myshape.OLEFormat.ClassType, "Acro") > 0) Then
    myshape.OLEFormat.DoVerb (wdOLEVerbOpen) ' Open the first embedded pdf
    
    myshape.OLEFormat.Activate  ' probably not needed
    
    Set AcroAVDoc = CreateObject("AcroExch.AVDoc")
    
    Set AcroApp = CreateObject("AcroExch.App")
    '                If AcroApp Is Nothing Then
    '                ' Acrobat is not running, create new instance
    '                    Set AcroApp = CreateObject("AcroExch.App")
    '                End If
    
    Set AcroAVDoc = AcroApp.GetActiveDoc    ' get the logical doc
    Set AcroPDDoc = AcroAVDoc.GetPDDoc      ' get the physical doc
    
    'some code I found (KHK) for working with the javascript bridge, not needed here
    '                Set jso = AcroPDDoc.GetJSObject     ' get the javascript bridge
    
    '                docs = jso.app.activeDocs       ' get array of active docs
    '
    '                For Each doc In docs
    '                    If doc.documentFileName = AcroPDDoc.GetFileName Then
    '                        ' insert template document
    '                    End If
    '
    '                Next
    
    ' older Word files had " on each end of the IconLabel of the embedded file, resulting in a path name with "" in it
    ' this was causing an error 4148 when the SaveAs line executed. Go figure :)
    temp = Trim(Replace(Replace(myshape.OLEFormat.IconLabel, Chr(34), ""), Chr(34), ""))
    
    outFileName = StrOutFold & "\" + temp '
    
    If AcroPDDoc.Save(PDSaveFull, outFileName) = False Then
    MsgBox "Cannot save document"
    End If
    AcroAVDoc.Close (1)
    AcroPDDoc.Close
    
    pdfCount = pdfCount + 1
    
    AcroApp.Exit
    Set AcroApp = Nothing
    Set AcroAVDoc = Nothing
    Set AcroPDDoc = Nothing
    End If
    
    End If
    Next myshape
    
    ' Clear the Status Bar
    Application.StatusBar = False
    ' Restore original Status Bar status
    Application.DisplayStatusBar = SBar
    Application.ScreenUpdating = True
    
    temp = "Embedded file counts" & vbCrLf & "Total " & vbTab & vbTab & embedCount & vbCrLf & "Word Files " & vbTab & wordCount & vbCrLf & _
    "Excel Files " & vbTab & vbTab & excelCount & vbCrLf & "Visio Files " & vbTab & vbTab & visioCount & vbCrLf & "PowerPoint Files " & vbTab & pptCount & vbCrLf
    temp = temp & "PDF Files " & vbTab & vbTab & pdfCount & vbCrLf & "Unknown files" & vbTab & embedCount - (wordCount + excelCount + visioCount + pptCount + pdfCount)
    msg = temp
    
    msg = msg & vbCrLf & vbCrLf & "You should have " & vbTab & (wordCount + excelCount + visioCount + pptCount + pdfCount) & " files"
    
    MsgBox msg, vbInformation + vbOKOnly
    
    Exit Sub
    
    error_handler:
    
    If Err.Number Then
    MsgBox Err.Number & "  " & Err.Description, vbCritical + vbOKOnly
    End If
    
    If Err.Number = 1004 Then
    If Err.Description = "No cells were found." Then
    '            GoTo get_filename
    ElseIf Err.Description = "You cannot save this workbook with the " & _
    "same name as another open workbook or " & _
    "add-in. Choose a different name, or " & _
    "close the other workbook or add-in " & _
    "before saving." Then
    MsgBox "There is another file with the same name " & _
    "already open.  Please choose a different name " & _
    "for this file."
    '            GoTo get_filename
    End If
    End If
    
    Exit Sub
    
    pdfError:
    MsgBox "PDF files require Adobe Acrobat (not Reader) to work"
    Resume
    End Sub
    
      [1]: https://www.howtogeek.com/50628/easily-extract-images-text-and-embedded-files-from-an-office-2007-document/