python - How to extract the text of a DOCX hyperlink?


Based on:

I need to get both the URL and the text of each hyperlink (for example, the URL is mydomain.com and the text is Go to My Domain).

Answering my own question: I had to go through html, as follows:

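from bs4 import BeautifulSoup

# Read the HTML file that was saved from the Word document
with open('my_word_file.htm', 'r') as file:
    page = file.read()
soup = BeautifulSoup(page, 'lxml')

# Collect the visible text and the target URL of every link
text_and_url = []
for link in soup.find_all('a'):
    text_and_url.append({'text': link.string, 'url': link.get('href')})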
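One caveat with the loop above: link.string is None whenever an anchor contains more than one child node, and HTML exported from Word often wraps link text in several nested tags. The following is a minimal sketch of the same extraction using only the standard library's html.parser, which joins all text found inside each anchor. The names LinkCollector and extract_links are hypothetical helpers introduced for illustration, and skipping anchors without an href (such as bookmarks) is an assumption about what is wanted.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the visible text and href of every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished {'text': ..., 'url': ...} entries
        self._href = None    # href of the <a> currently open, if any
        self._parts = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # Anchors without an href (e.g. bookmarks) leave _href as None
            self._href = dict(attrs).get('href')
            self._parts = []

    def handle_data(self, data):
        if self._href is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            # Join the fragments and collapse runs of whitespace
            text = ' '.join(''.join(self._parts).split())
            self.links.append({'text': text, 'url': self._href})
            self._href = None


def extract_links(html):
    """Return a list of {'text', 'url'} dicts for the links in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

It could be called as extract_links(page), with page being the string read from my_word_file.htm above.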

This requires converting the docx file to html first (for example, with Word's "Save As" Web Page option), so that my_word_file.htm exists before the code above runs.
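If the html step is inconvenient, the same information can probably be read from the docx file directly, since a docx is a zip archive of XML parts. This is a different technique from the one in the answer: a rough sketch, assuming the links are stored as w:hyperlink elements in word/document.xml whose r:id attribute points at a Relationship entry in word/_rels/document.xml.rels. The function name docx_hyperlinks is hypothetical. Links created through field codes, and links in headers, footers or footnotes, are not covered; internal anchors have no r:id and come back with a url of None.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by WordprocessingML and the package relationships part
W = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
R = 'http://schemas.openxmlformats.org/officeDocument/2006/relationships'
PKG = 'http://schemas.openxmlformats.org/package/2006/relationships'


def docx_hyperlinks(path_or_file):
    """Return [{'text': ..., 'url': ...}] for w:hyperlink elements in the body."""
    with zipfile.ZipFile(path_or_file) as docx:
        rels_xml = docx.read('word/_rels/document.xml.rels')
        body_xml = docx.read('word/document.xml')

    # Map relationship ids (e.g. 'rId5') to their target URLs
    targets = {
        rel.get('Id'): rel.get('Target')
        for rel in ET.fromstring(rels_xml).iter(f'{{{PKG}}}Relationship')
    }

    links = []
    for link in ET.fromstring(body_xml).iter(f'{{{W}}}hyperlink'):
        # The link text may be split across several runs; join every w:t
        text = ''.join(t.text or '' for t in link.iter(f'{{{W}}}t'))
        url = targets.get(link.get(f'{{{R}}}id'))
        links.append({'text': text, 'url': url})
    return links
```

A call such as docx_hyperlinks('my_word_file.docx') would return the same list-of-dicts shape as text_and_url; the file name here is only an example.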