Python 3.x BeautifulSoup4: find_all() overwrites the previous data set instead of showing all target data
Tags: python-3.x, beautifulsoup
I am scraping the web page listed in webpages below. Code:
import requests as r
from bs4 import BeautifulSoup as soup
webpages=['https://zh.wikisource.org/wiki/%E8%AE%80%E9%80%9A%E9%91%92%E8%AB%96/%E5%8D%B701']
for item in webpages:
    headers = {'User-Agent': 'Mozilla/5.0'}
    data = r.get(item, headers=headers)
    data.encoding = 'utf-8'
    page_soup = soup(data.text, 'html5lib')
    headline = page_soup.find_all(class_='mw-headline')
    for el in headline:
        headline_text = el.get_text()
    p = page_soup.find_all('p')
    for el in p:
        p_text = el.get_text()
    text = headline_text + p_text
    with open(r'sample_srape.txt', 'a', encoding='utf-8') as file:
        file.write(text)
        file.close()
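The output txt file shows only the last headline_text + p_text data set. It seems that whenever new data is retrieved, it overwrites the previous data set. How can I stop it from overwriting the previous data and show every set of target data?
You need the a argument to append.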
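To make the effect of the a argument concrete, here is a minimal sketch comparing 'w' and 'a' when the same file is opened twice. The helper write_twice and the file name demo.txt are hypothetical and exist only for this illustration. Note that the question's code already opens the file with 'a', so the mode alone may not explain the symptom.

```python
import os
import tempfile


def write_twice(mode):
    """Open the same file twice with `mode`, writing one line each time."""
    # A temporary directory keeps the demo away from real files.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.txt")  # hypothetical file name
        for line in ("first\n", "second\n"):
            with open(path, mode, encoding="utf-8") as f:
                f.write(line)
        with open(path, encoding="utf-8") as f:
            return f.read()


overwritten = write_twice("w")  # each open() truncates the file first
appended = write_twice("a")     # each open() adds to the end of the file
```

With 'w' only the second line survives; with 'a' both lines are kept.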
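I expect the indentation of your two inner for loops is off, so you only ever use the last item of each set of matches. If you make multiple requests, you can use a Session; re-using the connection is more efficient.
Also, the paragraphs are joined under a given heading, and the variable naming is clearer in places.
You don't need close, as that is handled by with. Perhaps something like this: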
import requests
from bs4 import BeautifulSoup as soup
webpages=['https://zh.wikisource.org/wiki/%E8%AE%80%E9%80%9A%E9%91%92%E8%AB%96/%E5%8D%B701']
headers = {'User-Agent': 'Mozilla/5.0'}
with requests.Session() as s:
    for link in webpages:
        data = s.get(link, headers=headers)
        data.encoding = 'utf-8'
        page_soup = soup(data.text, 'html5lib')
        headlines = page_soup.find_all(class_='mw-headline')
        with open(r'sample_scrape.txt', 'a', encoding='utf-8') as file:
            for headline in headlines:
                headline_text = headline.get_text()
                paragraphs = page_soup.find_all('p')
                text = ''
                for paragraph in paragraphs:
                    paragraph_text = paragraph.get_text()
                    text+= paragraph_text
                text = headline_text + text
                file.write(text)
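The "last item only" effect can be reproduced without any network access. The following is a minimal sketch in which a plain list of made-up strings stands in for the get_text() results; the names last_only, collected and all_text are introduced here purely for illustration.

```python
headlines = ["H1", "H2", "H3"]  # stand-ins for el.get_text() results

# Pattern from the question: the name is rebound on every pass,
# so only the final value is left once the loop has finished.
for el in headlines:
    headline_text = el
last_only = headline_text

# Accumulating pattern: every value is kept.
collected = []
for el in headlines:
    collected.append(el)
all_text = "".join(collected)
```

After the first loop, last_only holds only "H3", while all_text holds "H1H2H3". Any work that must happen for every item, such as writing to the file, has to sit inside the loop body or operate on the accumulated result.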
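I changed open() from write mode to append mode; the same problem persists.
I note you are doing two for loops and, because of the indentation, in both cases you only ever use the last value from the loop.
I think my for loops are a problem too. Could you be more specific about how to fix them? Thanks.
I think I need your feedback to adjust this appropriately.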
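One possible way to pair each heading with only the paragraphs that follow it is sketched below. This is an illustration under assumptions, not a confirmed fix for this page: group_by_heading is a hypothetical helper, and it assumes each mw-headline span sits inside an h2 element and that headings and paragraphs can be read in document order.

```python
def group_by_heading(items, heading_tag="h2"):
    """Group paragraph texts under the heading that precedes them.

    `items` is an iterable of (tag_name, text) pairs in document order.
    """
    sections = []
    for name, text in items:
        if name == heading_tag:
            # A heading starts a new section with no paragraphs yet.
            sections.append((text, []))
        elif sections:
            # A paragraph belongs to the most recent heading.
            sections[-1][1].append(text)
        # Paragraphs that appear before the first heading are skipped.
    return sections


# With BeautifulSoup the pairs could be built like this (assumption:
# the headings on the page are <h2> elements):
#   items = [(el.name, el.get_text())
#            for el in page_soup.find_all(["h2", "p"])]
items = [
    ("p", "intro"),
    ("h2", "A"), ("p", "a1"), ("p", "a2"),
    ("h2", "B"), ("p", "b1"),
]
sections = group_by_heading(items)
```

Each entry in sections can then be written inside a single with open(...) block, heading first and its joined paragraphs after it. Depending on the page markup, the heading text may include extra pieces such as edit links, so it may need trimming.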