For loop: how to continue when the query doesn't exist on the website
I'm scraping a news website to check whether there are articles matching the words in a list. When a word from the list does not occur on the news site, I get an error. What code do I need to add so the loop continues every time a word is not found?
import requests
from bs4 import BeautifulSoup
import json
import time

# load list json file
with open("words.json") as json_file:
    data_words = json.load(json_file)

for item in data_words:
    query = item
    print(query)
    # start timing
    start_time = time.time()
    # scrape news website
    base_url = "https://www.pbs.org/newshour/search-results?q=%22" + query + "%22&pnb="
    # scrape DutchNews
    #base_url = "https://www.dutchnews.nl/page/" + number + "/?s=" + query
    # there are 50 pages to iterate
    # get the page
    url = base_url + "1"
    page = requests.get(url)
    # convert to soup, the main object
    soup = BeautifulSoup(page.content, "html.parser")
    # grab number of paginations
    pagination_links = soup.find(class_="pagination__numbers")
    links = pagination_links.findAll("a")
    # get last item of list: take the length and grab the last item
    last_item = len(links) - 1
    total_pages = int(links[last_item].get_text())
    # keep data outside the page loop
    data = []
    # cast i to string
    for i in range(1, total_pages + 1):
        url = base_url + str(i)
        print("Retrieving", url)
        # make sure page and soup are inside the for loop
        # get page
        page = requests.get(url)
        # convert to soup object
        soup = BeautifulSoup(page.content, "html.parser")
        # get all the search result titles, use findAll
        results_list = soup.findAll(class_="search-result__text")
        # timesleep
        time.sleep(2)
        # search in list
        for item in results_list:
            title = item.find("h4").get_text()
            # timesleep
            time.sleep(2)
            datetime = item.find("span").get_text()
            snippet = item.find("p").get_text().strip()
            url = item.find("a")["href"]
            # create dictionary
            article = {
                "query": query,
                "title": title,
                "datetime": datetime,
                "snippet": snippet,
                "url": url
            }
            # has to be in line with article
            data.append(article)
    # strip white space: use .strip() after get_text()
    print(data)
    # print timing
    print(time.time() - start_time, 's')
    # save this with json
    with open("pbs_words_results.json", "w") as outfile:
        json.dump(data, outfile)
The AttributeError basically says that pagination_links is None. Since you declared it as pagination_links = soup.find(class_="pagination__numbers"), that means no element with the class pagination__numbers exists on the page. In that case, all you need to do is check whether it is None, and if so, continue to the next item of the loop:
for item in data_words:
    query = item
    print(query)
    # start timing
    start_time = time.time()
    # scrape news website
    base_url = "https://www.pbs.org/newshour/search-results?q=%22" + query + "%22&pnb="
    # scrape DutchNews
    #base_url = "https://www.dutchnews.nl/page/" + number + "/?s=" + query
    # there are 50 pages to iterate
    # get the page
    url = base_url + "1"
    page = requests.get(url)
    # convert to soup, the main object
    soup = BeautifulSoup(page.content, "html.parser")
    # grab number of paginations
    pagination_links = soup.find(class_="pagination__numbers")
    # check whether pagination_links is None
    if pagination_links is None:
        continue
    ....
You should also perform this check on the other variables, so you don't run into the same error again.
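For instance, item.find("h4"), item.find("span"), and item.find("p") inside the results loop can each return None as well. A minimal sketch of that guard, using a small helper (the name safe_text is my own, not from the original code):

```python
from bs4 import BeautifulSoup

def safe_text(tag):
    # return the stripped text when find() found a tag, else an empty string
    return tag.get_text().strip() if tag is not None else ""

# a stand-in search result with a title but no snippet paragraph
html = '<div class="search-result__text"><h4>Title</h4></div>'
item = BeautifulSoup(html, "html.parser").find(class_="search-result__text")

title = safe_text(item.find("h4"))    # "Title"
snippet = safe_text(item.find("p"))   # "" -- no <p> present, but no AttributeError
```

Whether an empty string is an acceptable placeholder or the article should be skipped entirely depends on what the downstream JSON is used for; skipping would mirror the continue pattern shown above.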