For loop: how to continue if the query doesn't exist on the website

I am scraping a news website to check whether there are articles matching the words in a list. When a word from the list does not appear on the news site, I get an error. What code do I need to add so the loop continues whenever a word does not exist?

import requests
from bs4 import BeautifulSoup
import json
import time

# load the word list from the JSON file
with open("words.json") as json_file:
    data_words = json.load(json_file)

for item in data_words:
    query = item 
    print(query)
    
    # start timing 
    start_time = time.time()
    
    # scrape newswebsite 
    base_url = "https://www.pbs.org/newshour/search-results?q=%22"+ query +"%22&pnb="
    
    # scrape DutchNews
    #base_url = "https://www.dutchnews.nl/page/"+ number +"/?s="+ query
    
    # there are 50 pages to iterate
    # get the page 
    url = base_url + "1"
    page = requests.get(url)

    # convert to soup, the main object
    soup = BeautifulSoup(page.content, "html.parser")
    
    # grab number of paginations
    pagination_links = soup.find(class_="pagination__numbers")

    links = pagination_links.findAll("a")
    
    # get the last item of the list: take its length and grab the last element
    last_item = len(links) - 1
    
    total_pages = int(links[last_item].get_text())
    
    # keep data outside the for loops
    data = []
    
    # cast i to string
    for i in range(1, total_pages + 1):
        url = base_url + str(i)
        print("Retrieving", url)

        # make sure page and soup are inside the for loop
        # get the page
        page = requests.get(url)

        # convert to soup object
        soup = BeautifulSoup(page.content, "html.parser")

        # get all the search result titles, use findAll
        results_list = soup.findAll(class_="search-result__text")

        # wait between requests
        time.sleep(2)

        # search in list
        for item in results_list:
            title = item.find("h4").get_text()
            # wait between requests
            time.sleep(2)
            datetime = item.find("span").get_text()
            snippet = item.find("p").get_text().strip()
            url = item.find("a")["href"]
            # create dictionary
            article = {
                "query": query,
                "title": title,
                "datetime": datetime,
                "snippet": snippet,
                "url": url
            }
            # append has to be aligned with article
            data.append(article)
        
# strip white space: use strip() after get_text()
print(data)

# print timing
print(time.time() - start_time, 's')

# save this with json
with open("pbs_words_results.json", "w") as outfile:
    json.dump(data, outfile)

The AttributeError basically says that pagination_links is None. Since you declared it as pagination_links = soup.find(class_="pagination__numbers"), this means that no element with the class pagination__numbers exists on the page. In that case, all you need to do is check whether it is None and, if it is, continue with the next item of the loop:

for item in data_words:
    query = item
    print(query)

    # start timing
    start_time = time.time()

    # scrape news website
    base_url = "https://www.pbs.org/newshour/search-results?q=%22" + query + "%22&pnb="

    # scrape DutchNews
    #base_url = "https://www.dutchnews.nl/page/" + number + "/?s=" + query

    # there are 50 pages to iterate
    # get the page
    url = base_url + "1"
    page = requests.get(url)

    # convert to soup, the main object
    soup = BeautifulSoup(page.content, "html.parser")

    # grab number of paginations
    pagination_links = soup.find(class_="pagination__numbers")

    # check if pagination_links is None
    if pagination_links is None:
        continue
    ....
You should also perform this check with the other variables, so that you don't run into the same error again.
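
For example, a minimal sketch of what that defensive check could look like in the inner loop of your code (the tag names follow the question's markup; this is one possible pattern, not the only way to do it):

    # search in list
    for item in results_list:
        title_tag = item.find("h4")
        datetime_tag = item.find("span")
        snippet_tag = item.find("p")
        link_tag = item.find("a")

        # skip this search result if any expected element is missing
        if None in (title_tag, datetime_tag, snippet_tag, link_tag):
            continue

        article = {
            "query": query,
            "title": title_tag.get_text(),
            "datetime": datetime_tag.get_text(),
            "snippet": snippet_tag.get_text().strip(),
            "url": link_tag["href"]
        }
        data.append(article)

This way each find() result is stored and tested before calling get_text() on it, so a missing tag skips the result instead of raising an AttributeError.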