Extracting Acrobat Document objects from a table in a .docx file in Python
I have a .docx file that contains a table. Some of the cells hold embedded Acrobat Document objects.

I am using the python-docx module to read and extract data from the .docx file, but it does not pick up those embedded documents (when asked for the cell value, it returns an empty string).

Since I don't think I'm good enough to attempt modifying the module's source code, I thought about extracting the embedded documents from the .docx file itself (by [changing the extension to .zip][1]), but the PDF files appear to be stored in .bin format.

So I am considering four possible solutions that you could help me with:

Get the python-docx module to recognize the embedded Acrobat Document objects
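The renaming step is optional when working from Python: a .docx file is a ZIP archive, so the standard-library `zipfile` module can open it directly. The sketch below is only an illustration under assumptions. It assumes the embedded objects sit under `word/embeddings/` (the usual location in Word-generated files), and `read_embedded_parts` is a hypothetical helper name, not part of python-docx.

```python
import io
import zipfile


def read_embedded_parts(docx_file):
    """Return {part_name: bytes} for every part under word/embeddings/.

    docx_file may be a path or a binary file-like object.
    The folder name is an assumption about how Word lays out the package.
    """
    parts = {}
    with zipfile.ZipFile(docx_file) as archive:
        for name in archive.namelist():
            if name.startswith("word/embeddings/") and not name.endswith("/"):
                parts[name] = archive.read(name)
    return parts


if __name__ == "__main__":
    # Build a tiny stand-in archive in memory so the sketch runs
    # without a real document on disk.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", "<w:document/>")
        archive.writestr("word/embeddings/oleObject1.bin", b"\xd0\xcf\x11\xe0 demo")
    buffer.seek(0)
    for name, data in read_embedded_parts(buffer).items():
        print(name, len(data))
```

With a real file, `read_embedded_parts("report.docx")` would be the call, where `report.docx` is a placeholder path. The returned `.bin` payloads are still OLE containers, not PDFs.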
Sub ExtractEmbeddedDocs()
    Dim MyObj As Object
    Dim xlApp As Object
    Dim xlWkb As Object
    Dim myshape As Word.InlineShape
    Dim myFormat As Word.WdSaveFormat
    Dim WDDoc As Word.Document
    ' Dim embedObj As OLEObject
    Dim FileExtStr As String
    Dim FileFormatNum As Long
    Dim StrInFold As String, StrOutFold As String
    Dim StrDocFile As String, Obj_App As Object, i As Long
    Dim StrFile As String, StrFileList As String, StrMediaFile As String, j As Long
    Dim outFileName As String
    Dim SBar As Boolean
    Dim exten As String
    Dim embedCount As Integer, wordCount As Integer, excelCount As Integer, visioCount As Integer, pptCount As Integer
    Dim pdfCount As Integer
    Dim msg As String, temp As String
    Dim ok As Boolean
    Dim docs As Variant, doc As Variant ', temp As Variant
    Dim AcroApp As Acrobat.CAcroApp
    Dim AcroPDDoc As Acrobat.CAcroPDDoc
    Dim AcroAVDoc As Acrobat.CAcroAVDoc
    Dim jso As Object

    StrInFold = ActiveDocument.Path
    If StrInFold = "" Then Exit Sub
    ' Store current Status Bar status, then switch on
    SBar = Application.DisplayStatusBar
    Application.DisplayStatusBar = True
    StrOutFold = StrInFold & "\Embedded Files"
    Application.ScreenUpdating = False
    ' On Error GoTo error_handler
    ' Test for the output folder, create it if it doesn't already exist
    If Dir(StrOutFold, vbDirectory) = "" Then MkDir StrOutFold
    embedCount = ActiveDocument.InlineShapes.Count
    ' This opens the embedded documents, each in their own instance of the program
    For Each myshape In ActiveDocument.InlineShapes
        If (myshape.Type = wdInlineShapeEmbeddedOLEObject) Then
            If (InStr(myshape.OLEFormat.ClassType, "Word") > 0) Then
                myshape.OLEFormat.DoVerb (wdOLEVerbOpen) ' Open the first embedded word doc
                ' Now I want to save it
                Set MyObj = GetObject(, "Word.Application")
                If MyObj Is Nothing Then
                    ' Word is not running, create new instance
                    Set MyObj = CreateObject("Word.Application")
                End If
                Set WDDoc = MyObj.ActiveDocument
                ' older Word files had " on each end of the IconLabel of the embedded file, resulting in a path name with "" in it
                ' this was causing an error 4148 when the SaveAs line executed. Go figure :)
                temp = Trim(Replace(Replace(myshape.OLEFormat.IconLabel, Chr(34), ""), Chr(34), ""))
                If Right(temp, 3) = "doc" Then
                    myFormat = wdFormatDocument
                Else
                    Select Case Right(temp, 4)
                        Case "docx"
                            myFormat = wdFormatXMLDocument
                        Case "docm"
                            myFormat = wdFormatXMLDocumentMacroEnabled
                        Case Else
                            myFormat = wdFormatDocument
                    End Select
                End If
                outFileName = StrOutFold & "\" + temp
                WDDoc.SaveAs2 FileName:=outFileName, FileFormat:=myFormat 'CompatibilityMode:=12
                WDDoc.Close savechanges:=False 'MyObj.Application.Documents.Item(1)
                wordCount = wordCount + 1
                Set WDDoc = Nothing
                ' MyObj.Quit ' will quit the existing instance of Word, don't do!
                Set MyObj = Nothing
            End If
            If (InStr(myshape.OLEFormat.ClassType, "Visio") > 0) Then
                myshape.OLEFormat.DoVerb (wdOLEVerbOpen) ' Open the first embedded Visio file
                ' Now I want to save it
                Set MyObj = GetObject(, "Visio.Application")
                If MyObj Is Nothing Then
                    ' Visio is not running, create new instance
                    Set MyObj = CreateObject("Visio.Application")
                End If
                'Debug.Print MyObj.Application.Documents.Item(1) ' Prints the filename
                ' older Word files had " on each end of the IconLabel of the embedded file, resulting in a path name with "" in it
                ' this was causing an error 4148 when the SaveAs line executed. Go figure :)
                temp = Trim(Replace(Replace(myshape.OLEFormat.IconLabel, Chr(34), ""), Chr(34), ""))
                outFileName = StrOutFold & "\" + temp
                MyObj.Application.Documents.Item(1).SaveAs FileName:=outFileName
                MyObj.Application.Documents.Item(1).Close 'savechanges:=False
                visioCount = visioCount + 1
                MyObj.Quit
                Set MyObj = Nothing
            End If
            If (InStr(myshape.OLEFormat.ClassType, "Excel") > 0) Then
                myshape.OLEFormat.DoVerb (wdOLEVerbOpen) ' Open the first embedded Excel
                ' Now I want to save it
                Set xlApp = GetObject(, "Excel.Application")
                If xlApp Is Nothing Then
                    ' Excel is not running, create new instance
                    Set xlApp = CreateObject("Excel.Application")
                End If
                Set xlWkb = xlApp.Workbooks(1)
                ' older Word files had " on each end of the IconLabel of the embedded file, resulting in a path name with "" in it
                ' this was causing an error 4148 when the SaveAs line executed. Go figure :)
                temp = Trim(Replace(Replace(myshape.OLEFormat.IconLabel, Chr(34), ""), Chr(34), ""))
                outFileName = StrOutFold & "\" + temp
                ' find file format from extension
                With xlApp.ActiveWorkbook
                    If Val(xlApp.Application.Version) < 12 Then
                        'You use Excel 97-2003
                        FileExtStr = ".xls": FileFormatNum = -4143
                    Else
                        'You use Excel 2007-2013
                        Select Case .FileFormat
                            Case 51: FileExtStr = ".xlsx": FileFormatNum = 51
                            Case 52:
                                If .HasVBProject Then
                                    FileExtStr = ".xlsm": FileFormatNum = 52
                                Else
                                    FileExtStr = ".xlsx": FileFormatNum = 51
                                End If
                            Case 56: FileExtStr = ".xls": FileFormatNum = 56
                            Case Else: FileExtStr = ".xlsb": FileFormatNum = 50
                        End Select
                    End If
                End With
                xlApp.Application.DisplayAlerts = False
                xlApp.ActiveWorkbook.SaveAs FileName:=outFileName, FileFormat:=FileFormatNum 'xlWorkbookNormal -4143 xlWorkbookDefault 51
                xlApp.ActiveWorkbook.Close savechanges:=False
                Set xlWkb = Nothing
                xlApp.Quit
                Set xlApp = Nothing
                excelCount = excelCount + 1
            End If
            If (InStr(myshape.OLEFormat.ClassType, "PowerPoint") > 0) Then
                myshape.OLEFormat.DoVerb (wdOLEVerbOpen) ' Open the first powerpoint file
                ' Now I want to save it
                Set MyObj = GetObject(, "PowerPoint.Application")
                If MyObj Is Nothing Then
                    ' Powerpoint is not running, create new instance
                    Set MyObj = CreateObject("PowerPoint.Application")
                End If
                ' older Word files had " on each end of the IconLabel of the embedded file, resulting in a path name with "" in it
                ' this was causing an error 4148 when the SaveAs line executed. Go figure :)
                temp = Trim(Replace(Replace(myshape.OLEFormat.IconLabel, Chr(34), ""), Chr(34), ""))
                outFileName = StrOutFold & "\" + temp
                myshape.OLEFormat.Object.SaveAs FileName:=outFileName
                myshape.OLEFormat.Object.Close 'savechanges:=False
                pptCount = pptCount + 1
                MyObj.Quit
                Set MyObj = Nothing
            End If
            If (InStr(myshape.OLEFormat.ClassType, "Acro") > 0) Then
                myshape.OLEFormat.DoVerb (wdOLEVerbOpen) ' Open the first embedded pdf
                myshape.OLEFormat.Activate ' probably not needed
                Set AcroAVDoc = CreateObject("AcroExch.AVDoc")
                Set AcroApp = CreateObject("AcroExch.App")
                ' If AcroApp Is Nothing Then
                '     ' Acrobat is not running, create new instance
                '     Set AcroApp = CreateObject("AcroExch.App")
                ' End If
                Set AcroAVDoc = AcroApp.GetActiveDoc ' get the logical doc
                Set AcroPDDoc = AcroAVDoc.GetPDDoc ' get the physical doc
                ' some code I found (KHK) for working with the javascript bridge, not needed here
                ' Set jso = AcroPDDoc.GetJSObject ' get the javascript bridge
                ' docs = jso.app.activeDocs ' get array of active docs
                '
                ' For Each doc In docs
                '     If doc.documentFileName = AcroPDDoc.GetFileName Then
                '         ' insert template document
                '     End If
                '
                ' Next
                ' older Word files had " on each end of the IconLabel of the embedded file, resulting in a path name with "" in it
                ' this was causing an error 4148 when the SaveAs line executed. Go figure :)
                temp = Trim(Replace(Replace(myshape.OLEFormat.IconLabel, Chr(34), ""), Chr(34), ""))
                outFileName = StrOutFold & "\" + temp
                If AcroPDDoc.Save(PDSaveFull, outFileName) = False Then
                    MsgBox "Cannot save document"
                End If
                AcroAVDoc.Close (1)
                AcroPDDoc.Close
                pdfCount = pdfCount + 1
                AcroApp.Exit
                Set AcroApp = Nothing
                Set AcroAVDoc = Nothing
                Set AcroPDDoc = Nothing
            End If
        End If
    Next myshape
    ' Clear the Status Bar
    Application.StatusBar = False
    ' Restore original Status Bar status
    Application.DisplayStatusBar = SBar
    Application.ScreenUpdating = True
    temp = "Embedded file counts" & vbCrLf & "Total " & vbTab & vbTab & embedCount & vbCrLf & "Word Files " & vbTab & wordCount & vbCrLf & _
        "Excel Files " & vbTab & vbTab & excelCount & vbCrLf & "Visio Files " & vbTab & vbTab & visioCount & vbCrLf & "PowerPoint Files " & vbTab & pptCount & vbCrLf
    ' Unknown = everything that was not saved; PowerPoint and PDF files must be subtracted too
    temp = temp & "PDF Files " & vbTab & vbTab & pdfCount & vbCrLf & "Unknown files" & vbTab & embedCount - (wordCount + excelCount + visioCount + pptCount + pdfCount)
    msg = temp
    msg = msg & vbCrLf & vbCrLf & "You should have " & vbTab & (wordCount + excelCount + visioCount + pptCount + pdfCount) & " files"
    MsgBox msg, vbInformation + vbOKOnly
    Exit Sub
error_handler:
    If Err.Number Then
        MsgBox Err.Number & " " & Err.Description, vbCritical + vbOKOnly
    End If
    If Err.Number = 1004 Then
        If Err.Description = "No cells were found." Then
            ' GoTo get_filename
        ElseIf Err.Description = "You cannot save this workbook with the " & _
            "same name as another open workbook or " & _
            "add-in. Choose a different name, or " & _
            "close the other workbook or add-in " & _
            "before saving." Then
            MsgBox "There is another file with the same name " & _
                "already open. Please choose a different name " & _
                "for this file."
            ' GoTo get_filename
        End If
    End If
    Exit Sub
pdfError:
    MsgBox "PDF files require Adobe Acrobat (not Reader) to work"
    Resume
End Sub
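On the Python side, the `.bin` files mentioned in the question are OLE containers that wrap the PDF instead of being PDFs themselves. A rough, standard-library-only heuristic is to cut out the bytes between the first `%PDF-` marker and the last `%%EOF` marker. This is a sketch under a strong assumption: the PDF bytes are stored contiguously inside the container, which the compound-file format does not guarantee. `carve_pdf` is a hypothetical helper name. A parser for OLE compound files would be the more robust route.

```python
def carve_pdf(blob):
    """Return the bytes from the first '%PDF-' to the last '%%EOF', or None.

    Heuristic only: assumes the PDF is stored as one contiguous run
    inside the container bytes.
    """
    start = blob.find(b"%PDF-")
    if start == -1:
        return None
    end = blob.rfind(b"%%EOF")
    if end == -1 or end < start:
        return None
    return blob[start:end + len(b"%%EOF")]


if __name__ == "__main__":
    # Stand-in for the contents of an embedded .bin part:
    # some container bytes, a minimal PDF-like payload, then padding.
    payload = b"%PDF-1.4\n1 0 obj\n<<>>\nendobj\n%%EOF"
    blob = b"\xd0\xcf\x11\xe0" + b"\x00" * 32 + payload + b"\x00" * 16
    pdf_bytes = carve_pdf(blob)
    print(pdf_bytes == payload)
```

Any file produced this way should be opened in a PDF reader to confirm it is intact before relying on it.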
[1]: https://www.howtogeek.com/50628/easily-extract-images-text-and-embedded-files-from-an-office-2007-document/