Python: How to loop over links and scrape the content of news articles with BeautifulSoup

Tags: python, for-loop, web-scraping, beautifulsoup

I'm new to Python, and I want to get the headline and content of every news article on this page (https://www.nytimes.com/search?query=china+COVID-19).

However, my current code stores the paragraphs of all 10 articles in a single list. How can I store each paragraph in a dict for the article it belongs to, and collect all the dicts in one list?

Any help would be greatly appreciated!

import requests
from bs4 import BeautifulSoup
import json

response = requests.get('https://www.nytimes.com/search?query=china+COVID-19')
response.encoding = 'utf-8'
soupe = BeautifulSoup(response.text, 'html.parser')

# Each search result sits in its own container div; grab the article link in each one
links = soupe.find_all('div', class_='css-1i8vfl5')

pagelinks = []
for link in links:
    url = link.contents[0].find_all('a')[0]
    pagelinks.append('https://www.nytimes.com' + url.get('href'))


articles = []

# Visit each article page and collect its body paragraphs
for i in pagelinks:
    response = requests.get(i)
    response.encoding = 'utf-8'
    soupe = BeautifulSoup(response.text, 'html.parser')
    for p in soupe.select('section.meteredContent.css-1r7ky0e div.css-53u6y8'):
        articles.append(p.text.strip())  # every paragraph lands in one flat list

print('\n'.join(articles))
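
For what it's worth, here is a minimal sketch of the per-article dict structure the question asks for. It reuses the CSS classes from the code above and from the answer below (css-1i8vfl5 for the result container, h4 with css-2fgx4k for the headline); these are auto-generated, NYT-specific class names and may well have changed since this was written:

import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.nytimes.com/search?query=china+COVID-19')
response.encoding = 'utf-8'
soupe = BeautifulSoup(response.text, 'html.parser')

articles = []
for container in soupe.find_all('div', class_='css-1i8vfl5'):
    # Title and link both live inside the search-result container
    title = container.find('h4', class_='css-2fgx4k').text.strip()
    url = 'https://www.nytimes.com' + container.find('a').get('href')

    # Fetch the article page and keep its paragraphs together
    page = requests.get(url)
    page.encoding = 'utf-8'
    article_soup = BeautifulSoup(page.text, 'html.parser')
    paragraphs = [p.text.strip() for p in
                  article_soup.select('section.meteredContent.css-1r7ky0e div.css-53u6y8')]

    # One dict per article, all dicts collected in one list
    articles.append({'title': title, 'url': url, 'paragraphs': paragraphs})

print(articles[0])  # the first article as a dict

Note that requests only sees the results present in the initial HTML; the search page loads further results with JavaScript, which requests cannot execute, and that is presumably why the question sees exactly 10 articles.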

Call the scrape function defined below like this:

print(scrape('https://www.nytimes.com/search?query=china+COVID-19')[0]) # to show the first article (dict)

Thanks! I used the code, but it shows an error message: name 'get_soup_page' is not defined. May I ask how to fix this? Also, I want to scrape the full news content, and it seems your code only scrapes the description. I would be grateful if you could help me scrape the full content!

Sorry, I was using a function named get_soup_page while testing. Remove the second soup_page and you're set. I fetched the title and description for you; you can use container.find to get the rest.
import urllib3
from bs4 import BeautifulSoup as bs

def scrape(url):
    http = urllib3.PoolManager()
    response = http.request("GET", url)
    soup_page = bs(response.data, 'lxml')  # you have to install the lxml package
    # pip install lxml
    articles = []

    # Each search result lives in its own container div
    containers = soup_page.find_all("div", attrs={'class': "css-1i8vfl5"})

    for container in containers:
        title = container.find('h4', {'class': 'css-2fgx4k'}).text.strip()
        description = container.find('p', {'class': 'css-16nhkrn'})

        article = {
            'title': title,
            # the description can be missing, so guard before taking its text
            'description': description.text.strip() if description else None
        }

        articles.append(article)
    return articles
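
Following up on the comment thread above, here is one possible sketch of how scrape could be extended to pull the full article body, reusing the paragraph selector from the question (again, these class names are NYT-specific and may be stale):

import urllib3
from bs4 import BeautifulSoup as bs

def scrape_full(url):
    http = urllib3.PoolManager()
    response = http.request("GET", url)
    soup_page = bs(response.data, 'lxml')
    articles = []

    for container in soup_page.find_all("div", attrs={'class': "css-1i8vfl5"}):
        title = container.find('h4', {'class': 'css-2fgx4k'}).text.strip()
        article_url = 'https://www.nytimes.com' + container.find('a').get('href')

        # Follow the link and join the article's body paragraphs into one string
        article_page = bs(http.request("GET", article_url).data, 'lxml')
        content = '\n'.join(p.text.strip() for p in
                            article_page.select('section.meteredContent.css-1r7ky0e div.css-53u6y8'))

        articles.append({'title': title, 'url': article_url, 'content': content})
    return articles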