
Python 3.x BeautifulSoup4: find_all() overwrites the previous data set instead of showing all target data


I am scraping this web page:

Code:

import requests as r
from bs4 import BeautifulSoup as soup

webpages=['https://zh.wikisource.org/wiki/%E8%AE%80%E9%80%9A%E9%91%92%E8%AB%96/%E5%8D%B701']

for item in webpages:
    headers = {'User-Agent': 'Mozilla/5.0'}
    data = r.get(item, headers=headers)
    data.encoding = 'utf-8'
    page_soup = soup(data.text, 'html5lib')
    headline = page_soup.find_all(class_='mw-headline')
    for el in headline:
        headline_text = el.get_text()
    p = page_soup.find_all('p')
    for el in p:
        p_text = el.get_text()
    text = headline_text + p_text
    with open(r'sample_srape.txt', 'a', encoding='utf-8') as file:
        file.write(text)
        file.close()

The output txt file only shows the last set of `headline_text + p_text` data. It seems that whenever new data is retrieved, it overwrites the previous data set. How can I make it stop overwriting the previous data and show every set of the target data?
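The behaviour described here can be reproduced without any scraping at all: plain assignment inside a loop keeps only the last value, while accumulating into a list keeps them all. A minimal standalone sketch:

```python
items = ['first', 'second', 'third']

# Assigning inside a loop overwrites on every iteration,
# so only the final item survives.
last_only = ''
for item in items:
    last_only = item
# last_only is now 'third'

# Accumulating into a list keeps every item.
collected = []
for item in items:
    collected.append(item)
text = '\n'.join(collected)
# text is now 'first\nsecond\nthird'
```

This is the same pattern as `headline_text = el.get_text()` in the question's loop: by the time the loop ends, only the last match is left.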

You need the `'a'` argument to open the file in append mode.
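As that comment notes, `open()` with `'w'` truncates the file on every open, while `'a'` appends across opens. A small standalone sketch of the difference (the file name is arbitrary):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'modes.txt')

# 'w' truncates: each open discards the previous contents.
for chunk in ['one\n', 'two\n']:
    with open(path, 'w', encoding='utf-8') as f:
        f.write(chunk)
with open(path, encoding='utf-8') as f:
    assert f.read() == 'two\n'   # only the last write survives

# 'a' appends: contents accumulate across opens.
os.remove(path)
for chunk in ['one\n', 'two\n']:
    with open(path, 'a', encoding='utf-8') as f:
        f.write(chunk)
with open(path, encoding='utf-8') as f:
    assert f.read() == 'one\ntwo\n'   # both writes survive
```

Note that, as the follow-up comments show, append mode alone does not fix the question's bug; the loops must also accumulate their values rather than overwrite them.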

I suspect your indentation is off in the inner two for loops, so you only ever use the last item from each set of matches. If you are making multiple requests, you can also use a Session — re-using the connection is more efficient.

Additionally, I concatenate the paragraphs under a given headline, and use clearer variable naming in some places.

You don't need `close()`, as that is handled by the `with` statement. Perhaps something like this:

import requests
from bs4 import BeautifulSoup as soup

webpages=['https://zh.wikisource.org/wiki/%E8%AE%80%E9%80%9A%E9%91%92%E8%AB%96/%E5%8D%B701']
headers = {'User-Agent': 'Mozilla/5.0'}

with requests.Session() as s:

    for link in webpages:
        data = s.get(link, headers=headers)
        data.encoding = 'utf-8'
        page_soup = soup(data.text, 'html5lib')
        headlines = page_soup.find_all(class_='mw-headline')

        with open(r'sample_scrape.txt', 'a', encoding='utf-8') as file:

            for headline in headlines:
                headline_text = headline.get_text()
                paragraphs = page_soup.find_all('p')
                text = ''

                for paragraph in paragraphs:
                    paragraph_text = paragraph.get_text()
                    text+= paragraph_text

                text = headline_text + text
                file.write(text)
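One thing to be aware of: the answer's sketch calls `page_soup.find_all('p')` inside the headline loop, so every headline is paired with all `<p>` tags on the page, and the paragraph text repeats once per headline. If the goal is to pair each headline only with the paragraphs that follow it (up to the next headline), sibling traversal is one option. A hedged sketch against a small inline HTML fragment — the markup here is illustrative, not taken from the actual Wikisource page:

```python
from bs4 import BeautifulSoup

# Illustrative fragment: headlines wrapped in <h2>, as on MediaWiki pages.
html = '''
<h2><span class="mw-headline">One</span></h2>
<p>a</p><p>b</p>
<h2><span class="mw-headline">Two</span></h2>
<p>c</p>
'''

page = BeautifulSoup(html, 'html.parser')
sections = {}
for headline in page.find_all(class_='mw-headline'):
    texts = []
    # Walk forward through the siblings of the enclosing <h2>,
    # stopping at the next heading.
    for sibling in headline.parent.find_next_siblings():
        if sibling.find(class_='mw-headline'):
            break
        if sibling.name == 'p':
            texts.append(sibling.get_text())
    sections[headline.get_text()] = texts

# sections is {'One': ['a', 'b'], 'Two': ['c']}
```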

I changed open() from write mode to append mode; the same problem still persists.

I noticed you are doing two for loops, and because of the indentation, in both cases you only use the last value from the loop.

I think there is indeed a problem with my for loops. Could you be more specific about how to fix it? Thank you.

I think I need your feedback to adjust this appropriately.