Python: Unable to understand empty list output when using Beautiful Soup


python web-scraping


I wrote a very small Python script to scrape article headlines from CNN's website.

import requests
from bs4 import BeautifulSoup

url='https://edition.cnn.com/'
topics=['world','politics','business']
r=requests.get(url+topics[1])
soup=BeautifulSoup(r.content,'html.parser')
spans=soup.find_all('span',{'class':"cd__headline-text"})
print(spans)

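When I run this code, I just get an empty list as output. That is not what I expected, since I am trying to scrape the text inside the tag. A snippet of the HTML block I am trying to target is: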

<span class="cd__headline-text">
Bernie Sanders faces pivotal clash as Democratic establishment joins forces against him
</span>



Please help clarify what my code is doing wrong and/or any logical errors I may have made.

Your code works fine. It just doesn't produce any results for the politics page.

Try this:

import requests
from bs4 import BeautifulSoup

url='https://edition.cnn.com/'
topics = ['world','politics','business']

headlines = []

for topic in topics:

    r = requests.get(url+topic)
    soup=BeautifulSoup(r.content,'html.parser')

    for span in soup.find_all('span',{'class':"cd__headline-text"}):
        headlines.append(span.text)
        print(span.text)
        print()

headlines prints out to:

The bizarre ways that coronavirus is changing etiquette
Over half of all virus cases in one country are linked to this group
Trump's Middle East plan could jeopardize Jordan-Israel peace treaty, Jordan PM says
Irish duo's win marks rare victory for women in the 'Nobel of architecture'
After more than 240 days, Australia's New South Wales is finally free from bushfires
Child drowns off Greek coast after Turkey opens border with Europe 
A migration crisis and disagreement with Turkey is the last thing Europe needs right now
Vatican to open controversial WW2-era files on Pope Pius XII
Netanyahu projected to win Israeli election, but exit polls suggest bloc just short of majority
Adviser to Iran's Supreme Leader dies after contracting coronavirus
Israeli election exit polls project Netanyahu in lead
She became pregnant at the age of 12. Now, Kenya's Christine Ongare is an Olympic boxing qualifier
Nigeria says it is ready and more than capable of dealing with coronavirus
Kenya bans commercial slaughter of donkeys following a rise in animal theft 
Violence forces Haiti to cancel Carnival
....

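You won't get results for politics because that content is rendered dynamically in the browser with JavaScript (as explained in the comments). With requests, you only get the raw HTML.

Open the site in a browser and compare View Page Source with Inspect Element. The former shows the raw HTML, the latter shows the rendered HTML.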
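The raw-versus-rendered comparison can also be done from Python before any parsing. The following is only a minimal sketch: diagnose_response, static_page, and dynamic_page are hypothetical names and sample strings made up for illustration, and the check is a plain substring test rather than a Beautiful Soup search.

```python
def diagnose_response(status_code, html, marker):
    """Return a short message hinting at why a scrape might come back empty.

    status_code: HTTP status of the response (e.g. r.status_code)
    html:        the raw response body as text (e.g. r.text)
    marker:      a string expected in the markup, such as a class name
    """
    if status_code != 200:
        return f"request failed with status {status_code}"
    if marker not in html:
        # The server did not send this markup. Either the name is wrong,
        # or the browser adds it later by running JavaScript.
        return f"'{marker}' is missing from the raw HTML"
    return f"'{marker}' is present in the raw HTML"


# Hypothetical stand-ins for the text a server might return.
static_page = '<span class="cd__headline-text">Some headline</span>'
dynamic_page = '<div id="root"></div><script src="app.js"></script>'

print(diagnose_response(200, static_page, "cd__headline-text"))
print(diagnose_response(200, dynamic_page, "cd__headline-text"))
print(diagnose_response(404, "", "cd__headline-text"))
```

With a real response r from requests.get, the call would be diagnose_response(r.status_code, r.text, "cd__headline-text"). If the marker is missing from the raw HTML but visible under Inspect Element, the content is most likely being added by JavaScript.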

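There are a couple of things you should always check when using requests to fetch a site's content. Did you check the response from the site? Did it look like what you expected? And did you check the contents of the soup before trying to find anything in it? Those two checks tell you whether your get request succeeded, and whether the site loads fully as HTML or loads asynchronously on access (the latter is likely for CNN), in which case you would need a tool like Selenium browser automation.

Hi @G.Anderson! Thanks for the reply. I'm fairly new to web scraping, so I'm not sure what asynchronous loading means. Could you elaborate?

It's probably worth a quick Google search, but here is the high-level overview: techniques like Ajax (Asynchronous JavaScript and XML) load page content dynamically only when a web browser accesses the page. This both customizes the user experience and, unfortunately, gets in the way of things like web scraping. Check your soup, and I bet you'll only see a few HTML elements, because the rest of the page doesn't actually load unless a browser hits it.

Does this answer your question?

If your question is solved, please mark the answer as accepted so that others can see it has been answered.

Thank you, that did solve the problem. To clarify further, is it better to use Selenium to scrape dynamically rendered content, or should I stick with Beautiful Soup?

You're welcome.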
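On the Selenium question above: the two tools can be combined, since Selenium can hand the browser's current HTML to Beautiful Soup for parsing. The sketch below is only an illustration under stated assumptions: Selenium, Chrome, and a matching driver are installed, fetch_rendered_html and extract_headlines are hypothetical helper names, and the cd__headline-text class is taken from the question and may not match the live site.

```python
def fetch_rendered_html(url):
    """Load url in a real browser and return the HTML as the browser sees it."""
    # Imported here so this module still loads without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumes Chrome and a matching driver
    try:
        driver.get(url)
        # Content that arrives after the initial page load may need an
        # explicit wait before page_source is read.
        return driver.page_source
    finally:
        driver.quit()


def extract_headlines(html, css_class="cd__headline-text"):
    """Return the stripped text of every span carrying css_class."""
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    spans = soup.find_all("span", class_=css_class)
    return [span.get_text(strip=True) for span in spans]
```

A call such as extract_headlines(fetch_rendered_html('https://edition.cnn.com/politics')) would then parse the rendered page instead of the raw response. For pages whose headlines are already in the raw HTML, requests with Beautiful Soup alone is usually simpler and faster.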