Python: extracting text from a span tag that has no class with BeautifulSoup
For a small data analysis project, I am trying to extract data from a website. Below is the HTML source I am working with (all the divs I want to extract data from have exactly the same structure).

However, when I try to extract the category of each article, which sits between the `<span>` and `</span>` tags (in my case, Oil Markets), I get the error: 'NoneType' object has no attribute 'text'. The code I am using is:
for container in divs:
    topic = container.find('span').text
    topics.append(topic)
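The odd thing is that when I print(topics), I get a list with more elements than there should be (almost 800, sometimes even more), and the elements are mixed: the list contains both strings and bs4 element tags. Here is a snapshot of the list I get: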
</span>, <span> E&P, Oil Markets, Supply Chain </span>, <span> Oil Markets, Gas Markets </span>, <span> Supply Chain </span>, <span> Gas Markets </span>, <span> E&P </span>, <span> Shale </span>, <span> Corporate </span>, <span> E&P </span>, <span> Oil Markets </span>, <span> Supply Chain, Other, Renewables </span>, <span> Gas Markets </span>, <span> Oil Markets </span>, <span> Gas Markets </span>, <span> Gas Markets </span>, <span> E&P </span>, <span> Gas Markets </span>, <span> E&P </span>, <span> Supply Chain </span>, <span> Shale </span>, None, <span> Corporate </span>, <span> Shale </span>, None, <span> Renewables </span>, <span> Renewables </span>, <span> Renewables </span>, <span> E&P </span>, <span> E&P </span>, <span> E&P </span>, <span> E&P </span>, <span> Oil Markets </span>, <span> E&P </span>, <span> Supply Chain </span>, ' Oil Markets ', ' Oil Markets ', ' Supply Chain, Renewables ', ' Oil Markets ', ' Renewables ', ' E&P ', ' Renewables ', ' Supply Chain ', ' Shale ', ' E&P ', ' Shale ', ' Gas Markets ', ' Gas Markets ', ' Supply Chain ', ' Oil Markets ', ' Shale ', ' Oil Markets ', ' Corporate, Oil Markets, Other ', ' Shale ', ' Renewables ', ' Shale ', ' Supply Chain ',
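My goal is to extract the categories as a list of strings (there should be 207 category combinations) so that I can later put them into a data frame together with the dates and titles.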
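I have tried these solutions without success. I would appreciate it if someone could help me with this.

Your code is fine. You only need to add a try/except so that it does not crash on the few articles that have no category.

The following snippet shows which containers have no span: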
from bs4 import BeautifulSoup
import requests

html = BeautifulSoup(requests.get('https://www.rystadenergy.com/newsevents/news/press-releases/').text, 'html.parser')
divs = html.find_all('div', class_='col-12 col-md-6 col-lg-4 mt-3 news-events-list__item')

# Print every container in which find('span') returns None
for container in divs:
    topic = container.find('span')
    if not topic:
        print(container)
Output:
<div class="col-12 col-md-6 col-lg-4 mt-3 news-events-list__item" data-category="" data-month="1" data-year="2020"> <a class="d-block bg-light p-3 text-body text-decoration-none h-100" href="/newsevents/news/press-releases/winners-gullkronen-2020/"> <small class="mb-3 d-flex flex-wrap justify-content-between"> <time datetime="2020-01-28">January 28, 2020</time> </small> <h5 class="mb-0"> Rystad Energy announces winners for Gullkronen 2020 </h5> </a> </div>
<div class="col-12 col-md-6 col-lg-4 mt-3 news-events-list__item" data-category="" data-month="1" data-year="2020"> <a class="d-block bg-light p-3 text-body text-decoration-none h-100" href="/newsevents/news/press-releases/nominees-gullkronen-2020/"> <small class="mb-3 d-flex flex-wrap justify-content-between"> <time datetime="2020-01-23">January 23, 2020</time> </small> <h5 class="mb-0"> Rystad Energy announces nominees for Gullkronen 2020 </h5> </a> </div>
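Note that this is just one way to do it :)

Using the HTML example (enclosed in … tags), I cannot reproduce this with soup = BeautifulSoup(html, 'html.parser') and the rest of your code. Please post the full traceback. You should show how you make the soup.

I think he has to provide a more complete DOM example. Probably, when calling find_all('div'), he gets the desired div elements along with other elements.

When you inspect/print the relevant data in the except suite, is it what you expected? Most likely one of the div tags in divs has no span tag.

Are you saying you cannot reproduce the OP's problem? If so, this is not an answer.

No, it's not. If this is not an answer, you should delete it and post a comment instead, as I did, just saying that you cannot reproduce the problem.

@wwii I have updated the post. I deleted my previous comments to clean up the comment section :)

Instead of try/except, you can also continue if topic is None: topic = container.find(…); if topic is None: continue; topic = topic.text.strip().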
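The `continue` variant suggested in the last comment might look like the sketch below. The HTML string is a trimmed-down, hypothetical imitation of the page (the placement of the `<span>` inside `<small>` and the first item's date and title are assumptions, not taken from the real site), so that the snippet runs without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the first item has a category <span>,
# the second one does not (like the Gullkronen press releases).
SAMPLE_HTML = """
<div class="mt-3 news-events-list__item">
  <small><time datetime="2020-02-03">February 3, 2020</time><span> Oil Markets </span></small>
  <h5> Example title with a category </h5>
</div>
<div class="mt-3 news-events-list__item">
  <small><time datetime="2020-01-28">January 28, 2020</time></small>
  <h5> Rystad Energy announces winners for Gullkronen 2020 </h5>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
divs = soup.find_all("div", class_="news-events-list__item")

topics = []  # start from an empty list on every run
for container in divs:
    span = container.find("span")
    if span is None:
        # No category for this article: skip it instead of raising AttributeError
        continue
    topics.append(span.get_text(strip=True))

print(topics)
```

Keep in mind that skipping items makes `topics` shorter than `divs`. If the categories have to line up with dates and titles later, appending an empty string instead of using `continue` is probably the safer choice.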
topics = []
for container in divs:
    try:
        topic = container.find('span').text.strip()
    except AttributeError:
        # find('span') returned None: this article has no category
        topic = ''
    finally:
        topics.append(topic)
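Since the stated goal is a data frame with dates, titles, and categories, one possible extension is to collect all three per container so the columns cannot drift out of alignment. This is only a sketch: `extract_rows` is a hypothetical helper, the sample markup is a simplified assumption about the page, and the choice of `<time datetime=…>` and `<h5>` as sources is based solely on the output shown above.

```python
from bs4 import BeautifulSoup


def extract_rows(html):
    """Return one dict per news item; missing pieces become empty strings."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("div", class_="news-events-list__item"):
        time_tag = item.find("time")
        title_tag = item.find("h5")
        span = item.find("span")
        rows.append({
            "date": time_tag.get("datetime", "") if time_tag is not None else "",
            "title": title_tag.get_text(strip=True) if title_tag is not None else "",
            "topic": span.get_text(strip=True) if span is not None else "",
        })
    return rows


# Hypothetical sample: one item with a category, one without.
SAMPLE_HTML = """
<div class="mt-3 news-events-list__item">
  <small><time datetime="2020-02-03">February 3, 2020</time><span> E&amp;P, Shale </span></small>
  <h5> Example title with a category </h5>
</div>
<div class="mt-3 news-events-list__item">
  <small><time datetime="2020-01-23">January 23, 2020</time></small>
  <h5> Rystad Energy announces nominees for Gullkronen 2020 </h5>
</div>
"""

rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

The resulting list of dicts can be passed to `pandas.DataFrame(rows)` if pandas is installed; each key becomes a column.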