Python: Extracting text from a span tag without a class in BeautifulSoup


python web-scraping


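For a small data analysis project, I am trying to extract data from a website. Below is the HTML source I am working with (all the divs I want to extract data from have exactly the same structure).

However, when I try to extract the category of each article, which sits between the <span> and </span> tags (Oil Markets in my example), I get an error saying that a 'NoneType' object has no attribute 'text'. The code I am using is: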

for container in divs:
    topic = container.find('span').text
    topics.append(topic)

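The strange thing is that when I print(topics), I get a list with far more elements than there should be (almost 800, sometimes more), and the elements are mixed: the list contains both strings and bs4 Tag elements. Here is a snapshot of the list I get: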

</span>, <span> E&amp;P, Oil Markets, Supply Chain </span>, <span> Oil Markets, Gas Markets </span>, <span> Supply Chain </span>, <span> Gas Markets </span>, <span> E&amp;P </span>, <span> Shale </span>, <span> Corporate </span>, <span> E&amp;P </span>, <span> Oil Markets </span>, <span> Supply Chain, Other, Renewables </span>, <span> Gas Markets </span>, <span> Oil Markets </span>, <span> Gas Markets </span>, <span> Gas Markets </span>, <span> E&amp;P </span>, <span> Gas Markets </span>, <span> E&amp;P </span>, <span> Supply Chain </span>, <span> Shale </span>, None, <span> Corporate </span>, <span> Shale </span>, None, <span> Renewables </span>, <span> Renewables </span>, <span> Renewables </span>, <span> E&amp;P </span>, <span> E&amp;P </span>, <span> E&amp;P </span>, <span> E&amp;P </span>, <span> Oil Markets </span>, <span> E&amp;P </span>, <span> Supply Chain </span>, ' Oil Markets ', ' Oil Markets ', ' Supply Chain, Renewables ', ' Oil Markets ', ' Renewables ', ' E&P ', ' Renewables ', ' Supply Chain ', ' Shale ', ' E&P ', ' Shale ', ' Gas Markets ', ' Gas Markets ', ' Supply Chain ', ' Oil Markets ', ' Shale ', ' Oil Markets ', ' Corporate, Oil Markets, Other ', ' Shale ', ' Renewables ', ' Shale ', ' Supply Chain ',


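My goal is to extract the categories as a list of strings (there should be 207 entries, each a combination of categories), so that I can later put them in a data frame together with the dates and titles.

I have tried these solutions, without success. I would appreciate any help with this problem.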

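Your code is fine; you just need to add a try/except so that it does not crash on the few articles that have no category.

The following snippet shows which articles those are: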

from bs4 import BeautifulSoup
import requests

html = BeautifulSoup(requests.get('https://www.rystadenergy.com/newsevents/news/press-releases/').text, 'html.parser')

divs = html.find_all('div', class_='col-12 col-md-6 col-lg-4 mt-3 news-events-list__item')

for container in divs:
    topic = container.find('span')
    # find() returns None when the div contains no <span>
    if topic is None:
        print(container)

Output:

<div class="col-12 col-md-6 col-lg-4 mt-3 news-events-list__item" data-category="" data-month="1" data-year="2020"> <a class="d-block bg-light p-3 text-body text-decoration-none h-100" href="/newsevents/news/press-releases/winners-gullkronen-2020/"> <small class="mb-3 d-flex flex-wrap justify-content-between"> <time datetime="2020-01-28">January 28, 2020</time> </small> <h5 class="mb-0"> Rystad Energy announces winners for Gullkronen 2020 </h5> </a> </div>
<div class="col-12 col-md-6 col-lg-4 mt-3 news-events-list__item" data-category="" data-month="1" data-year="2020"> <a class="d-block bg-light p-3 text-body text-decoration-none h-100" href="/newsevents/news/press-releases/nominees-gullkronen-2020/"> <small class="mb-3 d-flex flex-wrap justify-content-between"> <time datetime="2020-01-23">January 23, 2020</time> </small> <h5 class="mb-0"> Rystad Energy announces nominees for Gullkronen 2020 </h5> </a> </div>

Note that this is only one way to do it :)
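The output above shows two items that contain no span at all. A minimal sketch of why that triggers the error follows. The markup and the `item` class name are hypothetical stand-ins for the real page, chosen so the example runs without a network request. `find()` returns `None` when nothing matches, and reading `.text` from `None` raises the `AttributeError` from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one item with a category <span>, one without.
sample = """
<div class="item"><small><span> Oil Markets </span></small></div>
<div class="item"><small><time>January 28, 2020</time></small></div>
"""

soup = BeautifulSoup(sample, "html.parser")
with_span, without_span = soup.find_all("div", class_="item")

# The first div has a <span>, so find() returns a Tag with a .text attribute.
category = with_span.find("span").text.strip()
print(category)

# The second div has no <span>, so find() returns None.
missing = without_span.find("span")
print(missing)

# Accessing .text on None is what produces the error in the question.
try:
    missing.text
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```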

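Using your HTML example (wrapped in … tags), I cannot reproduce this with soup = BeautifulSoup(html, 'html.parser') and the rest of your code. Please post the complete traceback. You should also show how you make soup.

I think he needs to provide a more complete DOM sample; probably when calling find_all('div') he gets the divs he wants plus other elements.

When you inspect/print the relevant data in the except suite, is it what you expected? Most likely one of the div tags in divs has no span tag.

Are you saying you cannot reproduce the OP's problem? If so, this is not an answer.

No, it is not.

If this is not an answer, you should delete it and post a comment instead, as I did, simply saying that you cannot reproduce the problem.

@wwii I have updated the post. I deleted my earlier comments to clean up the comment section :)

Instead of try/except, you can also continue when topic is None: topic = container.find(...); if topic is None: continue; topic = topic.text.strip().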
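With try/except, the extraction loop becomes:

topics = []
for container in divs:
    try:
        # find() returns None when the div has no <span>, so .text raises AttributeError
        topic = container.find('span').text.strip()
    except AttributeError:
        # no category for this article; keep a placeholder so the list stays aligned
        topic = ''
    topics.append(topic)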
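The same guard can be written with an explicit `None` check, as the last comment suggests. The sketch below also keeps each category next to its date and title, which is the stated goal of the question. The sample markup is a hypothetical approximation of the page structure (the first headline is invented), and `text_or_default` is an illustrative helper, not part of BeautifulSoup. The list is rebuilt from scratch on every run; if `topics` was being reused across runs in the question, that might account for the oversized mixed list, but this is an assumption since the question does not show where it is initialised.

```python
from bs4 import BeautifulSoup

# Hypothetical sample modelled on the output shown in the answer.
sample = """
<div class="news-events-list__item">
  <a href="/a/"><small><time datetime="2020-02-03">February 3, 2020</time>
  <span> Oil Markets </span></small><h5> Hypothetical headline A </h5></a>
</div>
<div class="news-events-list__item">
  <a href="/b/"><small><time datetime="2020-01-28">January 28, 2020</time></small>
  <h5> Rystad Energy announces winners for Gullkronen 2020 </h5></a>
</div>
"""


def text_or_default(tag, default=""):
    """Return the stripped text of a Tag, or `default` when find() returned None."""
    return tag.get_text(strip=True) if tag is not None else default


soup = BeautifulSoup(sample, "html.parser")

rows = []  # fresh list on every run, one dict per article
for container in soup.find_all("div", class_="news-events-list__item"):
    time_tag = container.find("time")
    rows.append({
        "date": time_tag.get("datetime", "") if time_tag is not None else "",
        "title": text_or_default(container.find("h5")),
        "topic": text_or_default(container.find("span")),  # "" when no category
    })

for row in rows:
    print(row)
```

Because every article yields exactly one row, the dates, titles and categories stay aligned even when a category is missing; a list of dicts like `rows` can then be passed to `pandas.DataFrame(rows)` if a data frame is wanted.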