Python: iterating over pages only returns the results of the first page
I am scraping the links to news articles from this page:
I wrote code to get the links from page 1 and page 2, but it only returns the articles from page 1. I don't know how to fix this so that it returns the results from multiple pages.
# Imports assumed from the names used below (not shown in the original post)
import requests
from bs4 import BeautifulSoup as bs
from random import randint
from time import sleep, time
from IPython.display import clear_output

def scrape(url):
    user_agent = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; Touch; rv:11.0) like Gecko'}
    request = 0
    params = {
        'q': 'China%20COVID-19',
    }
    pagelinks = []
    myarticle = []
    for page_no in range(1, 3):
        params['page'] = page_no
        response = requests.get(url=url,
                                headers=user_agent,
                                params=params)
        # controlling the crawl-rate
        start_time = time()
        # pause the loop
        sleep(randint(8, 15))
        # monitor the requests
        request += 1
        elapsed_time = time() - start_time
        print('Request:{}; Frequency: {} request/s'.format(request, request/elapsed_time))
        clear_output(wait=True)
        # parse the content
        soup_page = bs(response.text, 'lxml')
        # select all the articles for a single page
        containers = soup_page.findAll("article", {'class': 'partial tile media image-top margin-16-right search-result'})
        # scrape the links of the articles
        for i in containers:
            url = i.find('a')['href']
            pagelinks.append(url)
    print(pagelinks)
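Any suggestions would be greatly appreciated.

Change the part of the code that appends to pagelinks so that it calls pagelinks.append(i.find('a')['href']) directly, and do not overwrite the url variable, which is used later in the request. In the original loop, url is reassigned to an article link, so the request for page 2 goes to that article instead of the search page and no further links are found.

After that change, the script prints: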
Request:1; Frequency: 838860.8 request/s
Request:2; Frequency: 1398101.3333333333 request/s
['https://time.com/5841895/global-coronavirus-battle/', 'https://time.com/5842256/world-health-organization-china-coronavirus-outbreak/', 'https://time.com/5826025/taiwan-who-trump-coronavirus-covid19/', 'https://time.com/5836611/china-superpower-reopening-coronavirus/', 'https://time.com/5783401/covid19-hubei-cases-classification/', 'https://time.com/5782633/covid-19-drug-remdesivir-china/', 'https://time.com/5778994/coronavirus-china-country-future/', 'https://time.com/5830420/trump-china-rivalry-coronavirus-intelligence/', 'https://time.com/5810493/coronavirus-china-united-states-governments/', 'https://time.com/5813628/china-coronavirus-statistics-wuhan/', 'https://time.com/5793363/china-coronavirus-covid19-abandoned-pets-wuhan/', 'https://time.com/5779678/li-wenliang-coronavirus-china-doctor-death/', 'https://time.com/5820389/africans-guangzhou-china-coronavirus-discrimination/', 'https://time.com/5824599/china-coronavirus-covid19-economy/', 'https://time.com/5784286/covid-19-china-plasma-treatment/', 'https://time.com/5796425/china-coronavirus-lockdown/', 'https://time.com/5825362/china-coronavirus-lawsuit-missouri/', 'https://time.com/5811222/wuhan-coronavirus-death-toll/']
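You can do url = i.find('a') and then url.get('href') on the next line.
You probably meant url = i.find('a')['href'].
I changed it as you suggested, and the error message is gone. But it still only returns the results of the first page. Do you know why? Thank you very much.
I posted an answer.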
# scrape the links of the articles
for i in containers:
    pagelinks.append(i.find('a')['href'])
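To make the cause easier to see, here is a minimal sketch that reproduces the effect of rebinding url without touching the network. It is only an illustration under stated assumptions: fake_fetch, SEARCH_URL, scrape_buggy and scrape_fixed are hypothetical names, and fake_fetch stands in for the requests.get plus BeautifulSoup step by returning article links only when it is called with the search URL.

```python
# Placeholder search URL; not the real site from the question.
SEARCH_URL = "https://example.com/search"


def fake_fetch(url, page_no):
    """Hypothetical stand-in for requests.get + parsing.

    Only the search URL yields article links; any other URL
    (for example an article page) yields none.
    """
    if url != SEARCH_URL:
        return []
    return [f"https://example.com/article-{page_no}-{n}" for n in range(1, 3)]


def scrape_buggy(url):
    pagelinks = []
    for page_no in range(1, 3):
        for href in fake_fetch(url, page_no):
            url = href  # rebinds url, so the next page is requested from an article link
            pagelinks.append(url)
    return pagelinks


def scrape_fixed(url):
    pagelinks = []
    for page_no in range(1, 3):
        for href in fake_fetch(url, page_no):
            pagelinks.append(href)  # url keeps pointing at the search page
    return pagelinks


print(len(scrape_buggy(SEARCH_URL)))  # 2: only page 1 contributes links
print(len(scrape_fixed(SEARCH_URL)))  # 4: both pages contribute links
```

In the buggy version the second iteration calls fake_fetch with the last article link from page 1, which is the same thing that happens to requests.get in the question. Using a separate name for the link, or appending it directly, keeps the request URL intact.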