Python BeautifulSoup doesn't return correctly from h1

My code

from BeautifulSoup import BeautifulSoup

htmls = '''
<div class="main-content">
<h1 class="student">
    <p>Name: <br />
    Alex</p>
    <p>&nbsp;</p>
</h1>
</div>
<div class="department">
... more text
</div>
'''
soup = BeautifulSoup(htmls)
h1 = soup.find("h1", {"class": "student"})
print h1


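Expected result

<h1 class="student">
    <p>Name: <br />
    Alex</p>
    <p>&nbsp;</p>
</h1>

But unfortunately, it returns

<h1 class="student">
</h1>

My question is: why does it eat everything between the p tags? Is it calling renderContents()? Or is the parsing failing?

This happens because you used <p> tags inside the h1 tag. That is how these HTML tags behave, and that is the problem (read more here). For example, if you use <span> instead, you can see the children: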

>>> htmls = '''
... <div class="main-content">
... <h1 class="student">
...     <span>Name: <br />
...     Alex</span>
...     <span>&nbsp;</span>
... </h1>
... </div>
... <div class="department">
... ... more text
... </div>
... '''
>>> 
>>> htmls.contents
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'contents'
>>> soup = BeautifulSoup(htmls)
>>> h1 = soup.find("h1", {"class": "student"})
>>> 
>>> h1
<h1 class="student">
<span>Name: <br />
    Alex</span>
<span>&nbsp;</span>
</h1>
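
One way to check the "is the parsing failing?" suspicion is to look at the raw tag events before any tree-building rules are applied. The sketch below is only an illustration under stated assumptions, not part of BeautifulSoup: it uses Python 3's standard-library html.parser, and PathRecorder, text_paths and VOID_TAGS are hypothetical helper names. It records the stack of open tags around each non-blank piece of text.

```python
from html.parser import HTMLParser

# The markup from the question, shortened to the part that matters.
HTMLS = '''
<div class="main-content">
<h1 class="student">
    <p>Name: <br />
    Alex</p>
    <p>&nbsp;</p>
</h1>
</div>
'''

# Hypothetical helper set: tags that never have a closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class PathRecorder(HTMLParser):
    """Record, for every non-blank text node, the open tags around it."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, outermost first
        self.paths = []  # (tag path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop only on a matching close; no nesting rules are applied here.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.paths.append(("/".join(self.stack), text))


def text_paths(markup):
    """Return (tag path, text) pairs for the given markup."""
    recorder = PathRecorder()
    recorder.feed(markup)
    recorder.close()
    return recorder.paths


if __name__ == "__main__":
    for path, text in text_paths(HTMLS):
        print(path, "->", text)
```

If the text is reported under a path that still contains both h1 and p, the <p> tags are present in the token stream. That would suggest the contents are dropped later, when the tree builder decides which tags may nest, rather than by a tokenizing failure.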

Try passing a different parser to your BeautifulSoup (choosing a parser by name requires BeautifulSoup 4, imported from bs4):

pip install html5lib

>>> from bs4 import BeautifulSoup
>>> htmls = '''
... <div class="main-content">
... <h1 class="student">
...     <p>Name: <br />
...     Alex</p>
...     <p>&nbsp;</p>
... </h1>
... </div>
... <div class="department">
... ... more text
... </div>
... '''
>>> soup = BeautifulSoup(htmls, 'html5lib')
>>> h1 = soup.find('h1', 'student')
>>> print h1
<h1 class="student">
    <p>Name: <br/>
    Alex</p>
    <p> </p>
</h1>

I guess this is what you want. Otherwise, you should not put block elements inside <h1>, per the conformance requirements.

See: Installing a parser