Python: order of <p> and <div> when scraping

python web-scraping

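I'm having trouble scraping a web page that has <div> tags embedded inside <p> tags. When I find a div, the output ends at the next </p> instead of continuing to the </div>. The output also appears to have converted the </p> from the source into a </div>. I tried using other enclosing div tags, but my output always ends before the text I want.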

HTML source

<p><div class="asdf">Text</p>
<p>More Text</p></div>

Output

output = <div class="asdf">Text</div>

output=Text

Desired output

<div class="asdf">Text</p><p>More Text</p></div>

Text
More Text

You are probably using the default parser (Python's built-in html.parser), which does not cope well with malformed HTML:

>>> BeautifulSoup("<div>Foo</p>Bar</div>", "html.parser").find("div")
<div>Foo</div>

With html5lib, by contrast:

>>> BeautifulSoup("<div>Foo</p>Bar</div>", "html5lib").find("div")
<div>Foo<p></p>Bar</div>

Read more about the different parsers in the Beautiful Soup documentation.

html5lib is a third-party package, so install it first:

pip install html5lib

>>> BeautifulSoup("<div>Foo</p>Bar</div>", "html5lib").find("div")
<div>Foo<p></p>Bar</div>
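
As a rough sketch, the same comparison can be run against the HTML from the question. The helper name `div_text` is made up for illustration, and the snippet assumes `beautifulsoup4` is installed. If `html5lib` is missing, that parser is simply reported as `None`.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Malformed markup from the question: the <div> opens inside a <p>,
# but the </p> arrives before the </div>.
HTML = '<p><div class="asdf">Text</p>\n<p>More Text</p></div>'


def div_text(html, parser):
    """Return the text of the first <div class="asdf"> as built by `parser`.

    Returns None if the parser is not installed or no such div is found.
    (Hypothetical helper, not part of Beautiful Soup.)
    """
    try:
        soup = BeautifulSoup(html, parser)
    except FeatureNotFound:
        return None
    div = soup.find("div", class_="asdf")
    return div.get_text() if div is not None else None


# Parse the same string with each parser and compare what the div contains.
results = {name: div_text(HTML, name) for name in ("html.parser", "html5lib")}
for name, text in results.items():
    print(f"{name:12} -> {text!r}")
```

With html.parser the div's text stops at Text, because the early </p> closes the div along with the paragraph. With html5lib the div should also contain More Text, since that parser repairs the markup the way a browser would.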