
Python: Extracting text from a span tag with no class in BeautifulSoup

Tags: python, web-scraping, beautifulsoup

For a small data analysis project, I am trying to extract data from a website. Below is the HTML source I am working with (all the divs I want to extract data from have exactly the same structure).

However, when I try to extract each article's category, which sits between <span> and </span> tags (Oil Markets in my case), I get an error saying 'NoneType' object has no attribute 'text'. The code I am using is:

for container in divs:
    topic = container.find('span').text
    topics.append(topic)  
The weird thing here is that when I print(topics) I get a list with more elements than there actually are (almost 800, sometimes even more), and the elements are mixed, containing both strings and bs4 element Tags. Below is a snapshot of the list I get:

</span>, <span> E&amp;P, Oil Markets, Supply Chain </span>, <span> Oil Markets, Gas Markets </span>, <span> Supply Chain </span>, <span> Gas Markets </span>, <span> E&amp;P </span>, <span> Shale </span>, <span> Corporate </span>, <span> E&amp;P </span>, <span> Oil Markets </span>, <span> Supply Chain, Other, Renewables </span>, <span> Gas Markets </span>, <span> Oil Markets </span>, <span> Gas Markets </span>, <span> Gas Markets </span>, <span> E&amp;P </span>, <span> Gas Markets </span>, <span> E&amp;P </span>, <span> Supply Chain </span>, <span> Shale </span>, None, <span> Corporate </span>, <span> Shale </span>, None, <span> Renewables </span>, <span> Renewables </span>, <span> Renewables </span>, <span> E&amp;P </span>, <span> E&amp;P </span>, <span> E&amp;P </span>, <span> E&amp;P </span>, <span> Oil Markets </span>, <span> E&amp;P </span>, <span> Supply Chain </span>, ' Oil Markets ', ' Oil Markets ', ' Supply Chain, Renewables ', ' Oil Markets ', ' Renewables ', ' E&P ', ' Renewables ', ' Supply Chain ', ' Shale ', ' E&P ', ' Shale ', ' Gas Markets ', ' Gas Markets ', ' Supply Chain ', ' Oil Markets ', ' Shale ', ' Oil Markets ', ' Corporate, Oil Markets, Other ', ' Shale ', ' Renewables ', ' Shale ', ' Supply Chain ',
My goal is to extract the categories as a list of strings (there should be 207 of them, some being combinations of several categories) so that I can later fill them into a dataframe along with the dates and titles.


I have tried these solutions without success. I was wondering if anyone could help me figure this out.

Your code is fine, you just need to add a try..except so that it doesn't crash on the few articles that have no category.

The snippet below illustrates this:

from bs4 import BeautifulSoup
import requests

html = BeautifulSoup(requests.get('https://www.rystadenergy.com/newsevents/news/press-releases/').text, 'html.parser')

divs = html.find_all('div', class_='col-12 col-md-6 col-lg-4 mt-3 news-events-list__item')

for container in divs:
    topic = container.find('span')
    if not topic:  # this container has no category span
        print(container)
Output:

<div class="col-12 col-md-6 col-lg-4 mt-3 news-events-list__item" data-category="" data-month="1" data-year="2020"> <a class="d-block bg-light p-3 text-body text-decoration-none h-100" href="/newsevents/news/press-releases/winners-gullkronen-2020/"> <small class="mb-3 d-flex flex-wrap justify-content-between"> <time datetime="2020-01-28">January 28, 2020</time> </small> <h5 class="mb-0"> Rystad Energy announces winners for Gullkronen 2020 </h5> </a> </div>
<div class="col-12 col-md-6 col-lg-4 mt-3 news-events-list__item" data-category="" data-month="1" data-year="2020"> <a class="d-block bg-light p-3 text-body text-decoration-none h-100" href="/newsevents/news/press-releases/nominees-gullkronen-2020/"> <small class="mb-3 d-flex flex-wrap justify-content-between"> <time datetime="2020-01-23">January 23, 2020</time> </small> <h5 class="mb-0"> Rystad Energy announces nominees for Gullkronen 2020 </h5> </a> </div>

Note that this is just one way of doing it :)
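As a quick sanity check (an editor's sketch, not part of the original answer), you can also count how many of the containers actually carry a category span before extracting anything, reusing the divs list from the snippet above:

spans = [container.find('span') for container in divs]
missing = sum(span is None for span in spans)  # True counts as 1
print(f'{len(divs)} containers found, {missing} of them without a category span')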

Comments:

Using your html sample (with the enclosing <div> tag attached) - cannot reproduce with soup = BeautifulSoup(html, 'html.parser') and the rest of your code. Please post the complete traceback, and show how you make divs.

I think he would have to provide a more complete sample of the DOM; probably when calling find_all('div') he gets the desired div elements plus other elements. Most likely one of the div tags in divs has no span tag inside it. When you inspect/print the relevant data in the except suite, is it what you expected?

Are you saying you cannot reproduce the OP's problem? If so, this is not an answer.

No it isn't - and if this is not an answer, you should delete it and post a comment instead, like I did, just saying you can't reproduce the problem.

@wwii I have updated the post, and deleted my previous comments to clean up the comment section :)

Instead of try/except you could also continue when the topic is None: topic = container.find(...); if topic is None: continue; topic = topic.text.strip() - see the formatted sketch below.
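A formatted version of that last suggestion (a sketch reusing divs from the answer above; note that skipped articles simply do not appear in topics, so the list may be shorter than the number of containers):

topics = []
for container in divs:
    topic = container.find('span')
    if topic is None:
        continue  # skip articles that have no category span
    topics.append(topic.text.strip())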
If you would rather keep one entry per article (an empty string where there is no category, so the list stays aligned with the dates and titles), the try/except version collects:
topics = []
for container in divs:
    try:
        topic = container.find('span').text.strip()
    except AttributeError:  # find('span') returned None: no category span
        topic = ''
    finally:
        topics.append(topic)
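Finally, a minimal sketch of the dataframe step the question mentions (not from the thread: pandas, the column names, and the splitting of combined categories are my assumptions; the time and h5 tags are taken from the div markup shown in the output above):

import pandas as pd

rows = []
for container in divs:
    span = container.find('span')
    # split combined categories such as ' E&P, Oil Markets ' into clean strings
    categories = [c.strip() for c in span.text.split(',')] if span else []
    time_tag = container.find('time')
    title_tag = container.find('h5')
    rows.append({
        'date': time_tag['datetime'] if time_tag else None,  # e.g. '2020-01-28'
        'title': title_tag.text.strip() if title_tag else None,
        'categories': categories,
    })

df = pd.DataFrame(rows)
print(df.head())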