For loop: how to continue when the query doesn't exist on the website
I'm scraping a news website to check whether there are articles matching the words in a list. When a word from the list does not occur on the news site, I get an error. What code do I need to add so the loop continues every time a word is not found?
import requests
from bs4 import BeautifulSoup
import json
import time

# load list json file
with open("words.json") as json_file:
    data_words = json.load(json_file)

for item in data_words:
    query = item
    print(query)
    # start timing
    start_time = time.time()
    # scrape news website
    base_url = "https://www.pbs.org/newshour/search-results?q=%22" + query + "%22&pnb="
    # scrape DutchNews
    #base_url = "https://www.dutchnews.nl/page/" + number + "/?s=" + query
    # there are 50 pages to iterate
    # get the page
    url = base_url + "1"
    page = requests.get(url)
    # convert to soup, the main object
    soup = BeautifulSoup(page.content, "html.parser")
    # grab number of paginations
    pagination_links = soup.find(class_="pagination__numbers")
    links = pagination_links.findAll("a")
    # get last item of list: take the length and grab the last item
    last_item = len(links) - 1
    total_pages = int(links[last_item].get_text())
    # keep data outside the page loop
    data = []
    # cast i to string
    for i in range(1, total_pages + 1):
        url = base_url + str(i)
        print("Retrieving", url)
        # make sure page and soup are inside the for loop
        # get page
        page = requests.get(url)
        # convert to soup object
        soup = BeautifulSoup(page.content, "html.parser")
        # get all the search result titles, use findAll
        results_list = soup.findAll(class_="search-result__text")
        # timesleep
        time.sleep(2)
        # search in list
        for item in results_list:
            title = item.find("h4").get_text()
            # timesleep
            time.sleep(2)
            datetime = item.find("span").get_text()
            snippet = item.find("p").get_text().strip()
            url = item.find("a")["href"]
            # create dictionary
            article = {
                "query": query,
                "title": title,
                "datetime": datetime,
                "snippet": snippet,
                "url": url
            }
            # has to be in line with article
            data.append(article)
    # strip white space: use .strip() after get_text()
    print(data)
    # print timing
    print(time.time() - start_time, 's')
    # save this with json
    with open("pbs_words_results.json", "w") as outfile:
        json.dump(data, outfile)
The AttributeError basically says that pagination_links is None. Since you declared it as pagination_links = soup.find(class_="pagination__numbers"), that means no element with the class pagination__numbers exists on the page. In that case, all you need to do is check whether it is None, and if so, continue to the next item of the loop:
for item in data_words:
    query = item
    print(query)
    # start timing
    start_time = time.time()
    # scrape news website
    base_url = "https://www.pbs.org/newshour/search-results?q=%22" + query + "%22&pnb="
    # scrape DutchNews
    #base_url = "https://www.dutchnews.nl/page/" + number + "/?s=" + query
    # there are 50 pages to iterate
    # get the page
    url = base_url + "1"
    page = requests.get(url)
    # convert to soup, the main object
    soup = BeautifulSoup(page.content, "html.parser")
    # grab number of paginations
    pagination_links = soup.find(class_="pagination__numbers")
    # check whether pagination_links is None
    if pagination_links is None:
        continue
    ....
You should also perform this check on the other variables, so you don't run into the same error again.
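For instance, item.find("h4"), item.find("span"), and item.find("p") inside the results loop can each return None as well. A minimal sketch of that guard, using a small helper (the name safe_text is my own, not from the original code):

```python
from bs4 import BeautifulSoup

def safe_text(tag):
    # return the stripped text when find() found a tag, else an empty string
    return tag.get_text().strip() if tag is not None else ""

# a stand-in search result with a title but no snippet paragraph
html = '<div class="search-result__text"><h4>Title</h4></div>'
item = BeautifulSoup(html, "html.parser").find(class_="search-result__text")

title = safe_text(item.find("h4"))    # "Title"
snippet = safe_text(item.find("p"))   # "" -- no <p> present, but no AttributeError
```

Whether an empty string is an acceptable placeholder or the article should be skipped entirely depends on what the downstream JSON is used for; skipping would mirror the continue pattern shown above.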