Python BeautifulSoup doesn't return correctly from h1
Tags: python, python-2.7, beautifulsoup
My code:
from BeautifulSoup import BeautifulSoup
htmls = '''
<div class="main-content">
<h1 class="student">
<p>Name: <br />
Alex</p>
<p> </p>
</h1>
</div>
<div class="department">
... more text
</div>
'''
soup = BeautifulSoup(htmls)
h1 = soup.find("h1", {"class": "student"})
print h1
Expected result:
<h1 class="student">
<p>Name: <br />
Alex</p>
<p> </p>
</h1>
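But unfortunately, it returns:
<h1 class="student">
</h1>
My question is: why does it eat everything between the p tags? Does it call renderContents()? Or did parsing fail?
This is because you put p tags inside the h1 tag. That is how the HTML p tag behaves, and that is the problem. (Read more here.) For example, if you use span instead, you can see the children: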
>>> htmls = '''
... <div class="main-content">
... <h1 class="student">
... <span>Name: <br />
... Alex</span>
... <span> </span>
... </h1>
... </div>
... <div class="department">
... ... more text
... </div>
... '''
>>>
>>> htmls.contents
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'contents'
>>> soup = BeautifulSoup(htmls)
>>> h1 = soup.find("h1", {"class": "student"})
>>>
>>> h1
<h1 class="student">
<span>Name: <br />
Alex</span>
<span> </span>
</h1>
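Try passing a different parser to your BeautifulSoup, for example html5lib, as in the html5lib session below. The parser argument is a BeautifulSoup 4 feature (from bs4 import BeautifulSoup), so it does not apply to the BeautifulSoup 3 import used in the question.
I think this gives what you want. Otherwise, you should not put block elements inside h1, per the conformance requirements.
See: Installing a parser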
pip install html5lib
>>> htmls = '''
... <div class="main-content">
... <h1 class="student">
... <p>Name: <br />
... Alex</p>
... <p> </p>
... </h1>
... </div>
... <div class="department">
... ... more text
... </div>
... '''
>>> soup = BeautifulSoup(htmls, 'html5lib')
>>> h1 = soup.find('h1', 'student')
>>> print h1
<h1 class="student">
<p>Name: <br/>
Alex</p>
<p> </p>
</h1>
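On the "did parsing fail?" part of the question: one way to check what the raw markup contains, before any tree builder applies its nesting rules, is to trace tag events with the standard library. This is a minimal sketch, assuming Python 3 (the question uses Python 2.7) and a hypothetical helper class named NestingTracer. It is not BeautifulSoup's algorithm, only an event-level view of the same htmls string.

```python
from html.parser import HTMLParser

# Tags with no closing tag; they never stay on the open-tag stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class NestingTracer(HTMLParser):
    """Hypothetical helper: record which tags are open at each <p> start tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []    # stack of currently open tag names
        self.p_contexts = []   # snapshot of the stack at every <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.p_contexts.append(list(self.open_tags))
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray end tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass


htmls = '''
<div class="main-content">
<h1 class="student">
<p>Name: <br />
Alex</p>
<p> </p>
</h1>
</div>
<div class="department">
... more text
</div>
'''

tracer = NestingTracer()
tracer.feed(htmls)
tracer.close()
print(tracer.p_contexts)  # [['div', 'h1'], ['div', 'h1']]
```

Both p tags start while div and h1 are still open, so the markup is nested exactly as written. That suggests the missing children come from how the tree builder restructures p inside h1, not from malformed input.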