Using Python and BeautifulSoup, select only text nodes that are not wrapped in <a>
I am trying to parse some text so that I can urlize (wrap in <a> tags) links that are not already formatted. Here is some sample text:
text = '<p>This is a <a href="https://google.com">link</a>, this is also a link where the text is the same as the link: <a href="https://google.com">https://google.com</a>, and this is a link too but not formatted: https://google.com</p>'
But urlizing the whole string would also capture the middle link in the sample, so it ends up double-wrapped in <a> tags (a nested <a><a>...</a></a>), while the unformatted link at the end is wrapped correctly.
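How can I filter
textNodes = soup.findAll(text=True)
so that it only includes text nodes that are not already wrapped in <a> tags?
Text nodes keep a parent reference, so you can simply test whether the parent is an a tag: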
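for textNode in textNodes:
    # A text node whose direct parent is <a> is already a formatted link
    if textNode.parent and getattr(textNode.parent, 'name', None) == 'a':
        continue  # skip links
    # urlize is the helper from the question; it wraps bare URLs in <a> tags
    urlizedText = urlize(textNode)
    textNode.replaceWith(urlizedText)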
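For anyone on the newer bs4 package, here is a minimal self-contained sketch of the same idea. Some assumptions that are not in the original: the function name link_bare_urls and the URL_RE pattern are hypothetical stand-ins for the question's urlize helper, and the new links are built as real tags with soup.new_tag rather than inserted as an HTML string, since bs4 would escape markup passed in as plain text. It also uses find_parent('a') instead of checking only the direct parent, so text nested deeper inside a link is skipped as well.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, deliberately simple URL pattern (an assumption, not the
# question's urlize): it will also swallow trailing punctuation like "," or ".".
URL_RE = re.compile(r'https?://[^\s<>"]+')

def link_bare_urls(html):
    """Wrap bare URLs in <a> tags, skipping text already inside a link."""
    soup = BeautifulSoup(html, 'html.parser')
    # find_all returns a list, so it is safe to modify the tree while looping
    for node in soup.find_all(string=True):
        # Skip any text node that has an <a> ancestor
        if node.find_parent('a') is not None:
            continue
        text = str(node)
        pieces, pos = [], 0
        for match in URL_RE.finditer(text):
            pieces.append(text[pos:match.start()])  # plain text before the URL
            link = soup.new_tag('a', href=match.group(0))
            link.string = match.group(0)
            pieces.append(link)
            pos = match.end()
        if pos == 0:
            continue  # no URL in this text node, leave it untouched
        pieces.append(text[pos:])  # plain text after the last URL
        # Insert the new pieces in order, then drop the original text node
        for piece in pieces:
            if piece != '':
                node.insert_before(piece)
        node.extract()
    return str(soup)

text = ('<p>This is a <a href="https://google.com">link</a>, this is also a '
        'link where the text is the same as the link: '
        '<a href="https://google.com">https://google.com</a>, and this is a '
        'link too but not formatted: https://google.com</p>')
print(link_bare_urls(text))
```

With the sample text from the question, only the last, unformatted URL should gain a new <a> wrapper; the two existing links are left alone.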
BeautifulSoup is beautiful!
<p>This is a <a href="https://djangosnippets.org/snippets/2072/" target="_blank">link</a>, this is also a link where the test is the same as the link: <a href="https://djangosnippets.org/snippets/2072/" target="_blank"><a href="https://djangosnippets.org/snippets/2072/">https://djangosnippets.org/snippets/2072/</a></a>, and this is a link too but not formatted: <a href="https://djangosnippets.org/snippets/2072/">https://djangosnippets.org/snippets/2072/</a></p>