Python: skipping a div class inside a div when web scraping

python web-scraping

I am trying to scrape a website. A sample of its HTML looks like this:

<div class="ism-true"><!-- message -->
                    <div id="post_message_5437898" data-spx-slot="1">

                        OK, although it's been several weeks since I installed the 

    <div><label>Quote:</label></div>
    <div class="panel alt2" style="border:1px inset">

        <div>
            Originally Posted by <strong>DeltaNu1142</strong>
        </div>
        <div style="font-style:italic">The very first thing I did </div>

    </div>
</div>When I got my grille back from the paint shop, I went to work on the
                    </div>
                    <!-- / message --></div>

<div class="ism-true"><!-- message -->
                    <div id="post_message_5125716">

                        <div style="margin:1rem; margin-top:0.3rem;">
    <div><label>Quote:</label></div>
    <div class="panel alt2" style="border:1px inset">

        <div>
            Originally Posted by <strong>HCFX2013</strong>
        </div>
        <div style="font-style:italic">I must be the minority that absolutely can't .</div>

    </div>
</div>Hello World.
                    </div>
                    <!-- / message --></div>

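My code only returns the text when the "panel alt2" div is the first div inside the post. If its position changes, the code stops working and raises "list index out of range". Can you help me ignore these classes? The expected result is: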

[OK, although it's been several weeks. When I got my grille back from the paint shop, I went to work on the],[Hello world]

Example website ()

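A workable approach is to find the div with class `panel alt2` and the `label` tag, and extract both from the enclosing div before reading its text. The code below appears to work on the live site as well as on the sample HTML: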

import requests
from bs4 import BeautifulSoup
URL = 'https://www.f150forum.com/f118/fab-fours-black-steel-elite-bumper-adaptive-cruise-relocation-bracket-387234/'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
text = []
for div in soup.find_all('div', class_="ism-true"):
    try:
        div.find('div', class_="panel alt2").extract()
    except AttributeError:
        pass  # sometimes there is no 'panel alt2'
    try:
        div.find('label').extract()
    except AttributeError:
        pass  # sometimes there is no 'Quote'
    text.append(div.text.strip())

print(text)

Output with your example:

["OK, although it's been several weeks since I installed the \n\n    \n\nWhen I got my grille back from the paint shop, I went to work on the", 'Hello World.']

If you don't want the newlines, you can strip them out.
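As a minimal sketch of that clean-up step, the snippet below collapses every run of whitespace (including the `\n` sequences left where the quote used to be) into a single space. The helper name `collapse_whitespace` is made up for illustration, and the `raw` list is simply the output shown above.

```python
import re

# Output of the scraping loop above, copied verbatim.
raw = [
    "OK, although it's been several weeks since I installed the \n\n    \n\n"
    "When I got my grille back from the paint shop, I went to work on the",
    "Hello World.",
]


def collapse_whitespace(s):
    """Replace each run of spaces, tabs and newlines with one space."""
    return re.sub(r"\s+", " ", s).strip()


cleaned = [collapse_whitespace(s) for s in raw]
print(cleaned)
```

The same effect can be had without `re` via `" ".join(s.split())`; either way the text on both sides of the removed quote ends up on one line.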

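I think your HTML is malformed: "Hello world" cannot be reached because it is surrounded by closing tags.

I have edited my HTML.

What do you actually want to get from that site?

@anonymous13 See my edit.

This returns: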

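["OK, although it's been several weeks since I installed the\n\n Quote:\n\n\nOriginally Posted by DeltaNu1142\n\nThe very first thing I did", "Quote:\n\n\nOriginally Posted by HCFX2013\n\nI must be the minority that absolutely can't .\n\n Hello World."]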

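Which is more than what was asked.

@JuanC Thanks for bringing that to my attention. I didn't try it with the HTML he posted here; I went straight to the link. I was specifically answering this: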

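The text I want is only in the post message div, not in the "panel alt2" class. The position of that class inside `div id="post_message_..."` keeps changing. How do I ignore the text in the panel alt2 class?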

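@BittoBennichan It works fine when the quote is in the middle of the text, but when the quote is at the beginning of the text it just ignores the whole comment, like the comment in the link ()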

["OK, although it's been several weeks since I installed the \n\n    \n\nWhen I got my grille back from the paint shop, I went to work on the", 'Hello World.']
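One way to make the result independent of where the quote sits is to select each post by its `id` prefix and remove every quote panel and label inside it, rather than relying on position. The following is only a sketch under stated assumptions: `SAMPLE_HTML` is a well-formed stand-in for the question's markup (the wrapper `<div style="margin:...">` is assumed to be present in both posts), `post_texts` is a hypothetical helper name, and the code has not been checked against the live forum pages.

```python
from bs4 import BeautifulSoup

# Assumed, well-formed version of the two posts from the question.
SAMPLE_HTML = """
<div class="ism-true"><div id="post_message_5437898" data-spx-slot="1">
OK, although it's been several weeks since I installed the
<div style="margin:1rem; margin-top:0.3rem;">
<div><label>Quote:</label></div>
<div class="panel alt2" style="border:1px inset">
<div>Originally Posted by <strong>DeltaNu1142</strong></div>
<div style="font-style:italic">The very first thing I did </div>
</div>
</div>When I got my grille back from the paint shop, I went to work on the
</div></div>
<div class="ism-true"><div id="post_message_5125716">
<div style="margin:1rem; margin-top:0.3rem;">
<div><label>Quote:</label></div>
<div class="panel alt2" style="border:1px inset">
<div>Originally Posted by <strong>HCFX2013</strong></div>
<div style="font-style:italic">I must be the minority that absolutely can't .</div>
</div>
</div>Hello World.
</div></div>
"""


def post_texts(html):
    """Return each post's own text, with quote panels and labels removed."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    # Attribute-prefix selector: every div whose id starts with "post_message_".
    for post in soup.select('div[id^="post_message_"]'):
        # Drop all quote panels and "Quote:" labels, wherever they appear.
        for unwanted in post.select("div.panel.alt2, label"):
            unwanted.decompose()
        # Join the remaining text nodes and collapse leftover whitespace.
        texts.append(" ".join(post.get_text(" ").split()))
    return texts


print(post_texts(SAMPLE_HTML))
```

Because the panels are matched by class rather than by index, a post that starts with a quote and a post that has a quote in the middle go through the same code path, and a post with no quote at all is simply left untouched.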