How to extract the URL in hyperlinks from a docx file using Python


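I have been trying to find out how to get the URLs from a docx file using Python, but I haven't found anything. I tried python-docx and python-docx2txt, but python-docx seems to extract only the text, while python-docx2txt can extract the text of a hyperlink but not the URL itself.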

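from docx.opc.constants import RELATIONSHIP_TYPE as RT

def iter_hyperlink_rels(rels):
    # iterating over rels yields relationship ids, so look each one up
    for rel in rels:
        if rels[rel].reltype == RT.HYPERLINK:
            yield rels[rel]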

This gets rid of the error.

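I'm late to this, but if you want to extract all the links from a .docx file and make a spreadsheet (or return a list of them), I have a script that will do that for you. It includes both the URL and the link text, and you can feed it an entire folder if needed.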

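It uses BeautifulSoup and unicodecsv, both of which you can also get from the same repo. It runs in Python 3. Instructions are at the top of the file. It handles non-ASCII characters. So far it has only been tested on Mac and Ubuntu. Excel doesn't reliably import Unicode CSV, although Google Drive does. Void where prohibited.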
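Since the script itself is not reproduced here, the following is only a rough sketch of the same idea (link text plus URL, written to a CSV) and not that script. It uses only the standard library instead of BeautifulSoup and unicodecsv. The function names `iter_links` and `write_links_csv` are made up for illustration, and it assumes the main document part lives at `word/document.xml`, which is the usual layout but not guaranteed. It only sees `w:hyperlink` elements in the document body, so links in headers, footers, or field codes would be missed.

```python
import csv
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def iter_links(docx_path):
    """Yield (link_text, url) for each w:hyperlink in the main document body."""
    # Assumption: the main part is word/document.xml (the usual location).
    with zipfile.ZipFile(docx_path) as archive:
        body = ET.fromstring(archive.read("word/document.xml"))
        rels = ET.fromstring(archive.read("word/_rels/document.xml.rels"))
    # Map relationship ids (rId1, rId2, ...) to their targets.
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels.findall(f"{{{PKG_REL_NS}}}Relationship")
    }
    for link in body.iter(f"{{{W_NS}}}hyperlink"):
        r_id = link.get(f"{{{R_NS}}}id")
        if r_id not in targets:
            continue  # internal bookmark link (w:anchor) or unknown id
        # The visible text may be split over several runs; join the pieces.
        text = "".join(t.text or "" for t in link.iter(f"{{{W_NS}}}t"))
        yield text, targets[r_id]


def write_links_csv(docx_path, csv_path):
    """Write a header row and one (text, url) row per hyperlink."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["text", "url"])
        writer.writerows(iter_links(docx_path))
```

Calling `write_links_csv("test.docx", "links.csv")` would produce the spreadsheet; to handle a whole folder, loop over the `.docx` files and call `iter_links` on each.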

I solved it by using the following code to print the hyperlink targets from the docx:

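from docx import Document
from docx.opc.constants import RELATIONSHIP_TYPE as RT

document = Document('test.docx')
rels = document.part.rels

def iter_hyperlink_rels(rels):
    for rel in rels:
        if rels[rel].reltype == RT.HYPERLINK:
            # _target is a private attribute; for external links it holds the URL
            yield rels[rel]._target

# The function returns a generator, so iterate over it to print each URL
for url in iter_hyperlink_rels(rels):
    print(url)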
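If you would rather not depend on python-docx internals, the same relationship data can be read straight from the file, because a .docx is a zip archive. This is just a sketch under a few assumptions: the helper name `extract_hyperlink_urls` is made up, and it assumes the main document's relationships are stored at `word/_rels/document.xml.rels`, which is the usual location but not something the format guarantees.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by relationship parts inside the package.
PKG_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
# Relationship type that marks a hyperlink.
HYPERLINK_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
)


def extract_hyperlink_urls(docx_path):
    """Return the targets of all hyperlink relationships of the main document."""
    with zipfile.ZipFile(docx_path) as archive:
        try:
            # Assumption: the main part's relationships live at this path.
            rels_xml = archive.read("word/_rels/document.xml.rels")
        except KeyError:
            return []  # no relationships part, so no hyperlinks
    root = ET.fromstring(rels_xml)
    return [
        rel.get("Target")
        for rel in root.findall(f"{{{PKG_REL_NS}}}Relationship")
        if rel.get("Type") == HYPERLINK_TYPE
    ]
```

Usage would be something like `print(extract_hyperlink_urls("test.docx"))`. Like the python-docx version, it lists each relationship once, regardless of how many times the link is used in the text.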

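I'm a Python beginner and had a task to change every hyperlink in a .docx document using Python. Thanks to Kiran's code, which gave me some hints, I did some guessing and trial and error and finally got it working. Here is my code, which I'd like to share with other beginners: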

# python to change docx URL hyperlinks:
### see: https://stackoverflow.com/questions/40475757/how-to-extract-the-url-in-hyperlinks-from-a-docx-file-using-python

from docx import Document
from docx.opc.constants import RELATIONSHIP_TYPE as RT

print(" This program changes the hyperlinks detected in a word .docx file \n")

docx_file = input(" Pls input docx filename (without .docx): ")

document = Document(docx_file + ".docx")

rels = document.part.rels

# Ask for a replacement URL for every hyperlink relationship
for rel in rels:
    if rels[rel].reltype == RT.HYPERLINK:
        print("\n Original link id -", rel, "with detected URL: ", rels[rel]._target)
        new_url = input(" Pls input new URL: ")
        rels[rel]._target = new_url

out_file = docx_file + "-out.docx"

document.save(out_file)

print("\n File saved to: ", out_file)

Thanks,

Lapyiu Ho
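For anyone who wants to change links without typing each one at a prompt, here is a rough non-interactive sketch. It is a different technique from the code above: it does not use python-docx, but copies the zip archive and rewrites the relationships part with the standard library. The function name `replace_hyperlink_urls` and the `url_map` parameter (old URL to new URL) are made up for illustration, and it again assumes the relationships sit at `word/_rels/document.xml.rels`. It needs Python 3.8 or later, and the output path must be different from the input path.

```python
import zipfile
import xml.etree.ElementTree as ET

PKG_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
HYPERLINK_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
)
# Assumption: relationships of the main document part live at this path.
RELS_PART = "word/_rels/document.xml.rels"


def replace_hyperlink_urls(src_path, dst_path, url_map):
    """Copy src_path to dst_path, rewriting hyperlink targets found in url_map.

    Returns the number of relationships that were changed.
    """
    # Keep the default namespace unprefixed when the part is serialized again
    # (note: this registers the prefix globally for ElementTree).
    ET.register_namespace("", PKG_REL_NS)
    changed = 0
    with zipfile.ZipFile(src_path) as src, zipfile.ZipFile(
        dst_path, "w", zipfile.ZIP_DEFLATED
    ) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == RELS_PART:
                root = ET.fromstring(data)
                for rel in root.findall(f"{{{PKG_REL_NS}}}Relationship"):
                    old = rel.get("Target")
                    if rel.get("Type") == HYPERLINK_TYPE and old in url_map:
                        rel.set("Target", url_map[old])
                        changed += 1
                data = ET.tostring(root, encoding="UTF-8", xml_declaration=True)
            # Every other part is copied through unchanged.
            dst.writestr(item, data)
    return changed
```

For example, `replace_hyperlink_urls("in.docx", "in-out.docx", {"http://old.example/": "https://new.example/"})` would rewrite only the links whose current target matches a key in the mapping and leave the visible link text alone.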

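Can you be more specific about what your overall intent is? python-docx has hyperlink functionality, so the information you're looking for is in there. Do you just want to extract all the hyperlinks in the document, or extract them along with the rest of the text?

@scanny I just want the URLs; I don't care about the text.

Oops, sorry, python-docx doesn't support hyperlinks yet; unfortunately that is a request that has been put on hold. If you want to do this, you'll need to go down to the lxml/internals level. I'll put some ideas in an answer.

Please edit your answer so that the code is properly formatted (use Ctrl+K on the part you want formatted). Also, add an explanation to the code snippet saying why it works, so that future readers know what the answer does.

The link is broken. Answers that rely too heavily on external sites are discouraged.

@Jean-François T. Thanks for the note, I fixed the link. It's a 250-line file, so I'm not sure what the best way to include the code here would be.

Actually, I'm not sure either. You can probably find an answer on Stack Overflow's Meta: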