python - How to extract the text of a DOCX hyperlink?
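I need to get both the URL and the text of a hyperlink (for example, the URL is mydomain.com and the text is Go to My Domain).
Answering my own question: I had to go through HTML to do this: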
from bs4 import BeautifulSoup

# Read the HTML version of the Word document.
with open('my_word_file.htm', 'r') as file:
    page = file.read()

soup = BeautifulSoup(page, 'lxml')

# Collect the visible text and the target URL of every <a> element.
text_and_url = []
for link in soup.find_all('a'):
    text_and_url.append({'text': link.string, 'url': link.get('href')})
Note that this requires converting the docx file to HTML first (the code above reads the result as my_word_file.htm).
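If converting to HTML is not an option, the same text and URL pairs can probably be read straight from the docx file, since a docx is a zip archive of XML parts. The following is only a minimal sketch and not the method used in the answer above. It assumes the links are stored as `w:hyperlink` elements in `word/document.xml` whose `r:id` points at an entry in `word/_rels/document.xml.rels`; links stored as HYPERLINK field codes, or located in headers, footers, and footnotes, would be missed. The function name `extract_hyperlinks` is made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the main document part and its relationships file.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG = "http://schemas.openxmlformats.org/package/2006/relationships"


def extract_hyperlinks(docx_path):
    """Return [{'text': ..., 'url': ...}] for each w:hyperlink in the document body.

    Hypothetical helper: docx_path may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_path) as archive:
        rels_root = ET.fromstring(archive.read("word/_rels/document.xml.rels"))
        doc_root = ET.fromstring(archive.read("word/document.xml"))

    # Map relationship ids (such as "rId4") to their targets (the URLs).
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels_root.iter(f"{{{PKG}}}Relationship")
    }

    links = []
    for hyperlink in doc_root.iter(f"{{{W}}}hyperlink"):
        # The visible text may be split across several runs, so join all w:t nodes.
        text = "".join(t.text or "" for t in hyperlink.iter(f"{{{W}}}t"))
        # Internal (bookmark) links have no r:id, so their url is None.
        url = targets.get(hyperlink.get(f"{{{R}}}id"))
        links.append({"text": text, "url": url})
    return links
```

Calling `extract_hyperlinks('my_word_file.docx')` (a hypothetical file name) returns a list of dictionaries in the same shape as `text_and_url` above.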