Python: Non-recursive find using the lxml builder

Tags: python, python-2.7, parsing, beautifulsoup, lxml

I've found that in Python 2.7 I can't perform a non-recursive bs4.BeautifulSoup.find_all if I use the lxml builder.

Take the following example HTML snippet:

<p> <b> Cats </b> are interesting creatures </p>

<p> <b> Dogs </b> are cool too </p>

<div>
<p> <b> Penguins </b> are pretty neat, but they're inside a div </p>
</div>

<p> <b> Llamas </b> don't live in New York </p>

With a normal find_all, both soups return the correct results (b is built with the lxml builder, a with a different parser):

>>> a.find_all("p")
[<p> <b> Cats </b> are interesting creatures </p>, <p> <b> Dogs </b> are cool too </p>, <p> <b> Penguins </b> are pretty neat, but they're inside a div </p>, <p> <b> Llamas </b> don't live in New York </p>]
>>> b.find_all("p")
[<p> <b> Cats </b> are interesting creatures </p>, <p> <b> Dogs </b> are cool too </p>, <p> <b> Penguins </b> are pretty neat, but they're inside a div </p>, <p> <b> Llamas </b> don't live in New York </p>]

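With recursive=False, however, a.find_all still returns the top-level paragraphs while b.find_all returns an empty list.

Why is this? Is this a bug, or am I doing something wrong? Does the lxml builder support non-recursive find_all?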

This is because the lxml parser puts your HTML code inside html/body if those elements are not present:

>>> b = bs4.BeautifulSoup(html, "lxml")
>>> print(b)
<html><body><p> <b> Cats </b> are interesting creatures </p>
<p> <b> Dogs </b> are cool too </p>
<div>
<p> <b> Penguins </b> are pretty neat, but they're inside a div </p>
</div>
<p> <b> Llamas </b> don't live in New York </p>
</body></html>
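To make the effect of that wrapping concrete, here is a minimal sketch that lists the direct child tags at each level of such a tree. It assumes the bs4 package is installed. So that it does not depend on lxml, the wrapped markup is written out by hand (a shortened copy of the printed output) and parsed with html.parser, which keeps the markup as given. The helper name child_tags is hypothetical.

```python
import bs4

# Hand-written markup that mirrors the html/body wrapping shown in print(b).
# Assumption: html.parser is used only so the sketch runs without lxml.
wrapped = (
    "<html><body><p> <b> Cats </b> are interesting creatures </p>\n"
    "<div>\n<p> <b> Penguins </b> are inside a div </p>\n</div>\n"
    "</body></html>"
)
soup = bs4.BeautifulSoup(wrapped, "html.parser")


def child_tags(node):
    """Return the names of the tags that are direct children of node."""
    # name=True matches every tag; recursive=False limits the search
    # to direct children instead of all descendants.
    return [child.name for child in node.find_all(True, recursive=False)]


top_level = child_tags(soup)        # direct children of the document
html_level = child_tags(soup.html)  # direct children of <html>
body_level = child_tags(soup.body)  # direct children of <body>
print(top_level, html_level, body_level)
```

The p tags only show up once the search starts at body, which is why a non-recursive search from the top of the document comes back empty.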

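This seems inconsistent to me. Why would different parsers behave differently in this way?

@LukeTaylor I agree, this can be confusing. There is some information about this in the relevant paragraph of the documentation. It all comes down to how different parsers turn non-well-formed HTML into valid HTML; they just do it differently.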

For reference, the non-recursive calls from the question:

>>> a.find_all("p", recursive=False)
[<p> <b> Cats </b> are interesting creatures </p>, <p> <b> Dogs </b> are cool too </p>, <p> <b> Llamas </b> don't live in New York </p>]
>>> b.find_all("p", recursive=False)
[]

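Therefore, a non-recursive find_all() on the soup only searches the direct children of the document, which here is just the html element. Calling it on b.body gives the expected result:

>>> print(b.find_all("p", recursive=False))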
[]
>>> print(b.body.find_all("p", recursive=False))
[<p> <b> Cats </b> are interesting creatures </p>, <p> <b> Dogs </b> are cool too </p>, <p> <b> Llamas </b> don't live in New York </p>]
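If the same code has to work whether or not the parser added html/body, one option is to pick the search root first. The following is only a sketch under stated assumptions: the helper name find_all_top_level is hypothetical, bs4 is assumed to be installed, and a hand-wrapped copy of the fragment parsed with html.parser stands in for an lxml-built soup so that the example runs without lxml.

```python
import bs4


def find_all_top_level(soup, name):
    """Non-recursive find_all on <body> if there is one, else on the document."""
    # soup.body is None when the parser did not create a <body> element.
    root = soup.body if soup.body is not None else soup
    return root.find_all(name, recursive=False)


fragment = "<p>Cats</p><div><p>Penguins</p></div><p>Llamas</p>"

# html.parser leaves the fragment unwrapped.
bare = bs4.BeautifulSoup(fragment, "html.parser")
# Hand-wrapped copy, standing in for a soup where html/body were added.
wrapped = bs4.BeautifulSoup(
    "<html><body>" + fragment + "</body></html>", "html.parser"
)

bare_texts = [p.get_text() for p in find_all_top_level(bare, "p")]
wrapped_texts = [p.get_text() for p in find_all_top_level(wrapped, "p")]
print(bare_texts, wrapped_texts)
```

Both calls skip the paragraph nested inside the div, matching the a.find_all("p", recursive=False) and b.body.find_all("p", recursive=False) results shown above.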