Python find_all does not find text in mixed content


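I have a bit of Python screen-scraping code, using BeautifulSoup, that is giving me a headache. A small change to the HTML broke my code, and I can't see why it no longer works. This is basically a demo of how the HTML looked when it was parsed: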

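import re
from bs4 import BeautifulSoup

# No parser is named here, so bs4 picks the best one that is installed
soup = BeautifulSoup("""
<td>
    <a href="https://alink.com">
        Foo Some text Bar
    </a>
</td>
""")
links = soup.find_all('a', text=re.compile('Some text'))
links[0]['href']  # => "https://alink.com"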

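What is it about adding an img tag as a sibling of the text content that breaks the search done by BeautifulSoup, and is there a way to modify the first code so that it works?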
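The markup that triggers the problem is not shown above, so here is a minimal sketch that assumes the anchor now holds an img tag next to the text, as the question describes. It pins the parser to html.parser (an assumption, since the original call names no parser) and uses string=, the newer name for the text= argument. Under those assumptions the search should come back empty.

```python
import re
from bs4 import BeautifulSoup

# Assumed markup: the question only says an img tag was added next to the text.
html = """
<td>
    <a href="https://alink.com">
        <img src="dummy.gif" >
        Foo Some text Bar
    </a>
</td>
"""

# html.parser is chosen here only to make the sketch reproducible.
soup = BeautifulSoup(html, "html.parser")

# Same search as in the question; string= is the current spelling of text=.
links = soup.find_all("a", string=re.compile("Some text"))
print(links)
```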

The first example only works when a.string is not None, i.e. when the text is the tag's only child.
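A small sketch of that rule, using two made-up anchors parsed with html.parser (both the markup and the parser choice are assumptions for illustration): .string is set only when a tag has a single child, while get_text() still returns the combined text.

```python
from bs4 import BeautifulSoup

# One anchor with text only, one with an img next to the text (illustrative markup).
only_text = BeautifulSoup('<a href="#">Some text</a>', "html.parser").a
mixed = BeautifulSoup('<a href="#"><img src="dummy.gif">Some text</a>', "html.parser").a

print(only_text.string)  # the single text child
print(mixed.string)      # None, because the tag has more than one child
print(mixed.get_text())  # the concatenated text is still available
```

Because find_all('a', text=...) compares the pattern against .string, the second anchor can never match.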

As a workaround, you can use a function predicate:

a = soup.find(lambda tag: tag.name == 'a' and tag.has_attr('href') and 'Some text' in tag.text)
print(a['href'])
# -> 'https://alink.com'
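If several links may match, the same predicate idea can be wrapped in a small helper. This is only a sketch: find_links_by_text is a hypothetical name, the markup is made up, and html.parser is assumed so the example runs without extra installs.

```python
import re
from bs4 import BeautifulSoup

def find_links_by_text(soup, pattern):
    """Return the href of every <a> whose combined text matches pattern."""
    regex = re.compile(pattern)

    def matches(tag):
        # get_text() joins all nested strings, so sibling tags do not matter.
        return (
            tag.name == "a"
            and tag.has_attr("href")
            and regex.search(tag.get_text()) is not None
        )

    return [a["href"] for a in soup.find_all(matches)]

# Illustrative markup with an img next to the link text.
html = '<td><a href="https://alink.com"><img src="dummy.gif">Foo Some text Bar</a></td>'
soup = BeautifulSoup(html, "html.parser")

hrefs = find_links_by_text(soup, r"Some\s+text")
print(hrefs)
```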

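The difference is that the second example has an incomplete img tag. It should be either

<img src="dummy.gif" />
Foo Some text Bar

or

<img src="dummy.gif" > </img>
Foo Some text Bar

Instead, it is parsed as

<img src="dummy.gif" >
Foo Some text Bar
</img>

So the element found is no longer a but img, whose parent is a.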

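Why not next(link["href"] for link in soup.find_all('a') if "Some text" in link.text)?

What does the next() call do?

It returns only the first match, which will be the link you want.

It turns out the behavior is library-specific. I have parsing code that works with the Python distribution on my Mac but not with the one on Linux. The incomplete img tag is treated as a parent in one runtime and as a sibling in the other. Gotta love it.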
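One way to check which behavior a given setup shows is to ask for the tag that directly contains the text. The sketch below is an assumption-laden illustration: text_parent is a hypothetical helper, the markup is made up, and only html.parser is tried because it ships with Python. The result may be a or img depending on the parser and the bs4 version installed.

```python
import re
from bs4 import BeautifulSoup

# Illustrative markup with an unclosed img tag before the link text.
html = '<a href="https://alink.com"><img src="dummy.gif" > Foo Some text Bar </a>'

def text_parent(markup, parser):
    """Return the name of the tag that directly contains the matching text."""
    soup = BeautifulSoup(markup, parser)
    node = soup.find(string=re.compile("Some text"))
    return node.parent.name if node is not None else None

# "lxml" and "html5lib" are optional installs; add them to the tuple
# only if they are available on your machine.
for parser in ("html.parser",):
    print(parser, "->", text_parent(html, parser))
```

Passing the parser name explicitly to BeautifulSoup also keeps the result from changing between machines that have different parsers installed.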
