Python: skipping a div class inside a div when web scraping
Tags: python, web-scraping, beautifulsoup
I am trying to scrape a website. A sample of my HTML is shown below.
<div class="ism-true"><!-- message -->
<div id="post_message_5437898" data-spx-slot="1">
OK, although it's been several weeks since I installed the
<div><label>Quote:</label></div>
<div class="panel alt2" style="border:1px inset">
<div>
Originally Posted by <strong>DeltaNu1142</strong>
</div>
<div style="font-style:italic">The very first thing I did </div>
</div>
</div>When I got my grille back from the paint shop, I went to work on the
</div>
<!-- / message --></div>
<div class="ism-true"><!-- message -->
<div id="post_message_5125716">
<div style="margin:1rem; margin-top:0.3rem;">
<div><label>Quote:</label></div>
<div class="panel alt2" style="border:1px inset">
<div>
Originally Posted by <strong>HCFX2013</strong>
</div>
<div style="font-style:italic">I must be the minority that absolutely can't .</div>
</div>
</div>Hello World.
</div>
<!-- / message --></div>
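My current code only returns the text when "panel alt2" is the first class inside the div. If the position of that class changes, it stops working and fails with "list index out of range". Can you help me ignore these classes?
The expected result is: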
[OK, although it's been several weeks. When I got my grille back from the paint shop, I went to work on the],[Hello world]
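For the example site, a workable approach is to use the "panel alt2" class and the label tag to strip those elements out of each div before reading its text. The code below seems to work on the site as well as on the sample HTML.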
import requests
from bs4 import BeautifulSoup

URL = 'https://www.f150forum.com/f118/fab-fours-black-steel-elite-bumper-adaptive-cruise-relocation-bracket-387234/'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

text = []
for div in soup.find_all('div', class_="ism-true"):
    try:
        div.find('div', class_="panel alt2").extract()
    except AttributeError:
        pass  # sometimes there is no 'panel alt2'
    try:
        div.find('label').extract()
    except AttributeError:
        pass  # sometimes there is no 'Quote'
    text.append(div.text.strip())
print(text)
With your sample, this outputs:
["OK, although it's been several weeks since I installed the \n\n \n\nWhen I got my grille back from the paint shop, I went to work on the", 'Hello World.']
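The runs of newlines and spaces in that output are left behind where the quote block was removed. A minimal sketch of one way to collapse them, using only built-in string methods; the `text` list is copied from the output above and `cleaned` is a name chosen here for illustration:

```python
# The two strings below are copied from the output shown above.
text = [
    "OK, although it's been several weeks since I installed the \n\n \n\n"
    "When I got my grille back from the paint shop, I went to work on the",
    "Hello World.",
]

# str.split() with no arguments splits on any run of whitespace, so
# re-joining with a single space collapses newlines and repeated spaces.
cleaned = [" ".join(s.split()) for s in text]
print(cleaned)
```

This keeps the words in order and only touches whitespace, so it can be applied to each post right before appending it to the list.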
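If you don't need the newlines, you can strip them out.
I think your HTML is malformed: "Hello world" cannot be reached because it is surrounded by closing tags.
I have edited my HTML.
What do you actually want to get from that site?
@anonymous13 See my edit.
This yields: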
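["OK, although it's been several weeks since I installed the\n\n Quote:\n\n\nOriginally Posted by DeltaNu1142\n\nThe very first thing I did", "Quote:\n\n\nOriginally Posted by HCFX2013\n\nI must be the minority that absolutely can't .\n\n Hello World."]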
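Which is more than what was asked.
@JuanC Thanks for bringing it to my attention. I did not try it with the HTML he posted here; I went directly to the link.
I was answering this specifically: the text I want is only in the post message div, not in the "panel alt2" class. The position of that class inside div id="post_message_" keeps changing. How do I ignore the text in the panel alt2 class?
@BittoBennichan When the quote is in the middle of the text it works fine, but when the quote is at the beginning of the text it just ignores the whole comment, as with the comment in the link.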
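One possible way to make the removal independent of where the quote sits is to start from the div whose id begins with "post_message_" and delete every quote panel and label inside it before reading the text. The sketch below is not the code from the answer above: the inline `html` string is a shortened, well-formed stand-in for the sample in the question, the ids `post_message_1` and `post_message_2` and the helper name `drop_all` are made up for illustration, and it has not been run against the live site.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the question's sample HTML:
# one post with the quote in the middle, one with the quote at the start.
html = """
<div class="ism-true">
<div id="post_message_1">
Before the quote
<div style="margin:1rem">
<div><label>Quote:</label></div>
<div class="panel alt2"><div>Originally Posted by <strong>A</strong></div>
<div style="font-style:italic">quoted text</div></div>
</div>after the quote
</div>
</div>
<div class="ism-true">
<div id="post_message_2">
<div style="margin:1rem">
<div><label>Quote:</label></div>
<div class="panel alt2"><div>Originally Posted by <strong>B</strong></div>
<div style="font-style:italic">more quoted text</div></div>
</div>Hello World.
</div>
</div>
"""


def drop_all(post, selector):
    """Remove every element under `post` that matches the CSS selector."""
    # Re-query after each removal so a nested match is never touched twice.
    node = post.select_one(selector)
    while node is not None:
        node.decompose()
        node = post.select_one(selector)


soup = BeautifulSoup(html, "html.parser")
posts = []
for post in soup.select('div[id^="post_message_"]'):
    drop_all(post, "div.panel.alt2")  # the quoted message itself
    drop_all(post, "label")           # the "Quote:" caption
    # Collapse the whitespace left behind by the removed elements.
    posts.append(" ".join(post.get_text().split()))
print(posts)
```

Selecting on the id prefix rather than on the position of "panel alt2" means the order of the child divs does not matter, and a post with no quote at all simply passes through unchanged.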