Python BeautifulSoup is missing nodes

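I am using Python and BeautifulSoup to parse HTML and extract the p tags from pages linked in an RSS feed. However, some URLs cause problems because the parsed soup object does not contain all of the document's nodes.

For example, I tried to parse the page at the URL in the code below, but after comparing the parsed object with the page source, I noticed that all nodes after ul class="nextgen left" are missing.

Here is how I parse the document: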

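# Python 2 code (urllib2 and cookielib are Python 2 modules)
import cookielib
import urllib2

from bs4 import BeautifulSoup as bs

url = 'http://feeds.chicagotribune.com/~r/ChicagoBreakingNews/~3/T2Zg3dk4L88/story01.htm'

# Keep cookies across redirects while fetching the page
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
request = urllib2.Request(url)

response = opener.open(request)

# Parse with lxml; this is the step that drops nodes
soup = bs(response, 'lxml')
print soup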

The input HTML is not well-formed, so you will have to use a different parser here. The html5lib parser handles this page correctly:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://feeds.chicagotribune.com/~r/ChicagoBreakingNews/~3/T2Zg3dk4L88/story01.htm')
>>> soup = BeautifulSoup(r.text, 'lxml')
>>> soup.find('div', id='story-body') is not None
False
>>> soup = BeautifulSoup(r.text, 'html5')
>>> soup.find('div', id='story-body') is not None
True

Try a different parser; the HTML in the feed is broken, and different parsers handle broken HTML differently.
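One way to apply that advice is to try several parsers in turn and keep the first one whose tree actually contains the element you need. The following is only a minimal sketch under assumptions: the helper name `parse_until_found`, the `CANDIDATE_PARSERS` tuple, and the order of parsers in it are hypothetical choices for illustration, not part of Beautiful Soup. It relies on `BeautifulSoup` raising `FeatureNotFound` when a requested parser library is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Parser names understood by Beautiful Soup, tried in this order.
# The order is an assumption for illustration, not a recommendation.
CANDIDATE_PARSERS = ("lxml", "html5lib", "html.parser")


def parse_until_found(markup, tag, attrs, parsers=CANDIDATE_PARSERS):
    """Return (parser_name, soup) for the first parser whose tree has the element.

    Returns (None, None) if no available parser yields the element.
    """
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # This parser library is not installed; try the next one.
            continue
        if soup.find(tag, attrs=attrs) is not None:
            return name, soup
    return None, None


if __name__ == "__main__":
    # Small, well-formed stand-in for the real page (hypothetical sample).
    sample = '<html><body><div id="story-body"><p>Hello</p></div></body></html>'
    name, soup = parse_until_found(sample, "div", {"id": "story-body"})
    print(name)
```

For the page in the question, the call would look like `parse_until_found(r.text, "div", {"id": "story-body"})`, where `r` is the `requests` response from the answer above. Which parser succeeds depends on the markup and on which parser libraries are installed.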