Using Python and BeautifulSoup, select only text nodes not wrapped in <a>


I am trying to parse some text so that I can urlize (wrap in <a> tags) links that are not already formatted. Here is some sample text:

text = '<p>This is a <a href="https://google.com">link</a>, this is also a link where the text is the same as the link: <a href="https://google.com">https://google.com</a>, and this is a link too but not formatted: https://google.com</p>'

But running urlize over every text node also catches the middle link in the example, so it ends up double-wrapped in <a> tags.

What can I do to

textNodes = soup.findAll(text=True)

so that it only includes text nodes that are not already wrapped in an <a> tag?

Text nodes retain a reference to their parent, so you can simply test for the <a> tag:

for textNode in textNodes:
    if textNode.parent and getattr(textNode.parent, 'name') == 'a':
        continue  # skip links
    urlizedText = urlize(textNode)
    textNode.replaceWith(urlizedText)
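
A self-contained sketch of the loop above might look like the following. It assumes the bs4 package is installed and, so that it runs without Django, swaps in a hypothetical regex-based simple_urlize in place of django.utils.html.urlize; the names urlize_html and URL_RE are likewise assumptions. One detail worth noting: handing a plain string to replaceWith inserts it as text, so any markup in it comes out escaped (&lt;a href=...). Parsing the urlized string first and inserting the resulting nodes avoids that.

```python
import re
from bs4 import BeautifulSoup

# Deliberately naive URL pattern, for illustration only.
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def simple_urlize(text):
    # Hypothetical stand-in for django.utils.html.urlize:
    # wraps each bare URL in an <a> tag.
    return URL_RE.sub(
        lambda m: '<a href="{0}">{0}</a>'.format(m.group(0)), text
    )


def urlize_html(html):
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a list, so modifying the tree inside the loop is safe.
    for text_node in soup.find_all(string=True):
        parent = text_node.parent
        if parent is not None and parent.name == "a":
            continue  # already inside a link, skip it
        original = str(text_node)
        urlized = simple_urlize(original)
        if urlized == original:
            continue  # no bare URL in this node
        # Parse the urlized string so real tags are inserted, not escaped text.
        fragment = BeautifulSoup(urlized, "html.parser")
        for child in list(fragment.contents):
            text_node.insert_before(child)
        text_node.extract()
    return str(soup)


sample = (
    '<p>This is a <a href="https://google.com">link</a>, this is also a link '
    'where the text is the same as the link: '
    '<a href="https://google.com">https://google.com</a>, and this is a link '
    'too but not formatted: https://google.com</p>'
)
print(urlize_html(sample))
```

With the sample text from the question, only the last, unformatted URL should gain a new <a> tag; the two existing links are left alone.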

BeautifulSoup is beautiful!

The double-wrapped result described in the question looks like this:

<p>This is a <a href="https://djangosnippets.org/snippets/2072/" target="_blank">link</a>, this is also a link where the test is the same as the link: <a href="https://djangosnippets.org/snippets/2072/" target="_blank">&lt;a href="https://djangosnippets.org/snippets/2072/"&gt;https://djangosnippets.org/snippets/2072/&lt;/a&gt;</a>, and this is a link too but not formatted: &lt;a href="https://djangosnippets.org/snippets/2072/"&gt;https://djangosnippets.org/snippets/2072/&lt;/a&gt;</p>

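
The parent check in the answer only looks at the direct parent, so a URL nested deeper inside a link, for example in <a><b>...</b></a>, would still be urlized. If that case matters, one option is to test every ancestor with find_parent, which is available on text nodes as well as tags. A minimal sketch, where the helper name text_nodes_outside_links and the example.com URLs are assumptions for illustration:

```python
from bs4 import BeautifulSoup


def text_nodes_outside_links(soup):
    # Keep only text nodes with no <a> anywhere among their ancestors.
    return [
        node
        for node in soup.find_all(string=True)
        if node.find_parent("a") is None
    ]


html = (
    '<p>plain https://example.com/one '
    '<a href="https://example.com/two"><b>https://example.com/two</b></a></p>'
)
soup = BeautifulSoup(html, "html.parser")

for node in text_nodes_outside_links(soup):
    print(repr(str(node)))
```

Here the text inside <b> has <b> as its direct parent, so a check on parent.name alone would not skip it, while the ancestor check does.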