Python: how do I convert my web-scraping code into a loop?
I'm able to loop the web-scraping process, but the data collected from later pages replaces the data from earlier pages, so the resulting file contains only the last page's data. What do I need to do?
from bs4 import BeautifulSoup
import requests
import pandas as pd

print('all imported successfully')

for x in range(1, 44):
    link = f'https://www.trustpilot.com/review/birchbox.com?page={x}'
    print(link)
    req = requests.get(link)
    content = req.content
    soup = BeautifulSoup(content, "lxml")
    names = soup.find_all('div', attrs={'class': 'consumer-information__name'})
    headers = soup.find_all('h2', attrs={'class': 'review-content__title'})
    bodies = soup.find_all('p', attrs={'class': 'review-content__text'})
    ratings = soup.find_all('div', attrs={'class': 'star-rating star-rating--medium'})
    dates = soup.find_all('div', attrs={'class': 'review-content-header__dates'})
    print('pass1')
    df = pd.DataFrame({'User Name': names, 'Header': headers, 'Body': bodies, 'Rating': ratings, 'Date': dates})
    df.to_csv('birchbox006.csv', index=False, encoding='utf-8')
    print('excel done')
If you know the number of pages, you can put the scraping into a loop like this:
for i in range(1, 44):
    req = requests.get("https://www.trustpilot.com/review/birchbox.com?page={}".format(i))
    ...
Since you only want to run this once and you know the total number of pages, all you have to do is change the URL being requested on each iteration and then concatenate the resulting DataFrames. One way to do that (assuming you have Python 3.6 or newer for f-strings) is shown below. If you're on an older version of Python, replace the f-string line with:

req = requests.get("https://www.trustpilot.com/review/birchbox.com?page={}".format(i))
from bs4 import BeautifulSoup
import requests
import pandas as pd

df = None
for i in range(1, 44):
    req = requests.get(f"https://www.trustpilot.com/review/birchbox.com?page={i}")
    content = req.content
    soup = BeautifulSoup(content, "lxml")
    names = soup.find_all('div', attrs={'class': 'consumer-information__name'})
    headers = soup.find_all('h2', attrs={'class': 'review-content__title'})
    bodies = soup.find_all('p', attrs={'class': 'review-content__text'})
    ratings = soup.find_all('div', attrs={'class': 'star-rating star-rating--medium'})
    dates = soup.find_all('div', attrs={'class': 'review-content-header__dates'})
    print('pass1')
    page_df = pd.DataFrame({'User Name': names, 'Header': headers, 'Body': bodies, 'Rating': ratings, 'Date': dates})
    if df is None:
        df = page_df
    else:
        df = pd.concat([df, page_df], ignore_index=True)

# Write the combined data once, after the loop has finished
df.to_csv('birchbox006.csv', index=False, encoding='utf-8')
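A slightly more idiomatic variant of the same idea is to build one small DataFrame per page, collect them in a list, and call pd.concat a single time after the loop, which avoids repeated concatenation. The sketch below uses a dummy scrape_page function with made-up data instead of the real Trustpilot pages, just to show the accumulation pattern; in the real script you would also usually extract tag.get_text() so the CSV contains clean strings rather than raw HTML tags:

```python
import pandas as pd

def scrape_page(page):
    # Dummy stand-in for one scraped page. In the real script each column
    # would be something like:
    #   [tag.get_text(strip=True) for tag in soup.find_all(...)]
    return pd.DataFrame({
        'User Name': [f'user{page}'],
        'Header': [f'header{page}'],
    })

# One DataFrame per page, concatenated once at the end
frames = [scrape_page(p) for p in range(1, 4)]
df = pd.concat(frames, ignore_index=True)
print(len(df))  # one row per page
```

With ignore_index=True the combined frame gets a fresh 0..n-1 index instead of repeating each page's local index.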