
Python: How do I make scrapy skip an error and continue?


Hi everyone. I want scrapy to ignore a "Spider error processing" error and carry on with the next iteration, but that does not seem to happen. Instead, for some reason it fails on this one page and then decides to shut the whole crawl down.

Here is my script:

import scrapy
import jsonlines
import re

class SgbdSpider(scrapy.Spider):
    name = "sgbd"
    start_urls = [
        "http://www.sante.gouv.sn/actualites/"
    ]
    def parse(self,response):
        try:
            base = "http://www.sante.gouv.sn"
            for link in response.css(".card-title a"):
                try:
                    title = link.css("a::text").get()
                    comNum = int(re.search("[N°][0-9]{1,3}",title,re.IGNORECASE).group(0).split("°")[-1])
                    href = link.css("a::attr(href)").extract()
                    pdfLink = base + href[0]
                    next_page = response.css("li.pager-next a::attr(href)").get()
                    if self.isInDatbase(comNum):
                        continue
                    else:
                        yield scrapy.Request(pdfLink,callback=self.extractPDF,meta = {
                            "title" : title,
                            "article" : pdfLink,
                            "page" : next_page,
                        })
                except:
                    continue
                ##  EXECUTE WITH COMMAND scrapy crawl sgbd -o pdf.jsonl -t jsonlines
                ##
                ##
                ##
                if next_page is not None and next_page.split("?page=")[-1] != "35":
                    next_page = response.urljoin(next_page)
                    yield scrapy.Request(next_page,callback=self.parse)
        except print(0):
            pass
        
    def extractPDF(self,response):
        link = response.css(".file a::attr(href)").get()
        title = response.meta.get("title")
        article = response.meta.get("article")
        comNum = re.search("[N°][0-9]{1,3}",title,re.IGNORECASE)
        page = response.meta.get("page")
        date = self.getDate(title)
        day = date.get("day")
        month = date.get("month")
        year = date.get("year")
        yield {
            "link" : link,
            "title" : title,
            "article" : article,
            "page" : page,
            "comNum" : int(comNum.group(0).split("°")[-1]),
            "day" : day,
            "month" : month,
            "year" : year
            
        }
    def isInDatbase(self,com):
        found = False
        with jsonlines.open('pdf.jsonl') as f:

            for line in f.iter():
                if com == line["comNum"]:
                    found = True
                    break
            return found
    def getDay(self,text):
        day = re.search("(lundi|mardi|mercredi|mércredi|jeudi|vendredi|samedi|dimanche)",text,re.IGNORECASE)
        full = re.search("(lundi|mardi|mercredi|mércredi|jeudi|vendredi|samedi|dimanche).\d+",text,re.IGNORECASE)
        num = int(full.group(0).split(day.group(0))[-1].strip())
        return num
    def getDate(self,text):
        month = re.search("(janvier|février|fevrier|mars|avril|mai|juin|juillet|août|aout|septembre|octobre|novembre|decembre|décembre)",text,re.IGNORECASE)
        full = re.search("(janvier|février|fevrier|mars|avril|mai|juin|juillet|août|aout|septembre|octobre|novembre|decembre|décembre).\d+",text,re.IGNORECASE)
        num = int(full.group(0).split(month.group(0))[-1].strip())
        return dict(day=self.getDay(text),month=month.group(0),year=num)
Here is the error I get; it is always the same page:

2021-04-01 16:21:44 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://sante.sec.gouv.sn/Actualites/coronavirus-communiqu%c3%a9-de-presse-n%c2%b0389-jeudi-mars-2021-du-minist%c3%a8re-de-la-sant%c3%a9-et-de> from <GET http://www.sante.gouv.sn/Actualites/coronavirus-communiqu%C3%A9-de-presse-n%C2%B0389-jeudi-mars-2021-du-minist%C3%A8re-de-la-sant%C3%A9-et-de>
2021-04-01 16:21:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://sante.sec.gouv.sn/actualites?page=1> (referer: https://sante.sec.gouv.sn/actualites/)
2021-04-01 16:22:20 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 1 pages/min), scraped 0 items (at 0 items/min)
2021-04-01 16:22:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://sante.sec.gouv.sn/Actualites/coronavirus-communiqu%c3%a9-de-presse-n%c2%b0389-jeudi-mars-2021-du-minist%c3%a8re-de-la-sant%c3%a9-et-de> (referer: None)        
2021-04-01 16:22:41 [scrapy.core.scraper] ERROR: Spider error processing <GET https://sante.sec.gouv.sn/Actualites/coronavirus-communiqu%c3%a9-de-presse-n%c2%b0389-jeudi-mars-2021-du-minist%c3%a8re-de-la-sant%c3%a9-et-de> (referer: None)
Traceback (most recent call last):
  File "c:\users\dems\appdata\local\programs\python\python39\lib\site-packages\scrapy\utils\defer.py", line 120, in iter_errback
    yield next(it)
  File "c:\users\dems\appdata\local\programs\python\python39\lib\site-packages\scrapy\utils\python.py", line 353, in 
__next__
    return next(self.data)
  File "c:\users\dems\appdata\local\programs\python\python39\lib\site-packages\scrapy\utils\python.py", line 353, in 
__next__
    return next(self.data)
  File "c:\users\dems\appdata\local\programs\python\python39\lib\site-packages\scrapy\core\spidermw.py", line 62, in 
_evaluate_iterable
    for r in iterable:
  File "c:\users\dems\appdata\local\programs\python\python39\lib\site-packages\scrapy\spidermiddlewares\offsite.py", 
line 29, in process_spider_output
    for x in result:
  File "c:\users\dems\appdata\local\programs\python\python39\lib\site-packages\scrapy\core\spidermw.py", line 62, in 
_evaluate_iterable
    for r in iterable:
  File "c:\users\dems\appdata\local\programs\python\python39\lib\site-packages\scrapy\spidermiddlewares\referer.py", 
line 340, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "c:\users\dems\appdata\local\programs\python\python39\lib\site-packages\scrapy\core\spidermw.py", line 62, in 
_evaluate_iterable
    for r in iterable:
  File "c:\users\dems\appdata\local\programs\python\python39\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\users\dems\appdata\local\programs\python\python39\lib\site-packages\scrapy\core\spidermw.py", line 62, in 
_evaluate_iterable
    for r in iterable:
  File "c:\users\dems\appdata\local\programs\python\python39\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\users\dems\appdata\local\programs\python\python39\lib\site-packages\scrapy\core\spidermw.py", line 62, in 
_evaluate_iterable
    for r in iterable:
  File "C:\Users\Dems\Desktop\test-scrapy\sgbd\sgbd\spiders\sgbd.py", line 46, in extractPDF
    date = self.getDate(title)
  File "C:\Users\Dems\Desktop\test-scrapy\sgbd\sgbd\spiders\sgbd.py", line 79, in getDate
    return dict(day=self.getDay(text),month=month.group(0),year=num)
  File "C:\Users\Dems\Desktop\test-scrapy\sgbd\sgbd\spiders\sgbd.py", line 73, in getDay
    num = int(full.group(0).split(day.group(0))[-1].strip())
AttributeError: 'NoneType' object has no attribute 'group'
2021-04-01 16:22:41 [scrapy.core.engine] INFO: Closing spider (finished)
2021-04-01 16:22:41 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1404,
 'downloader/request_count': 5,
 'downloader/request_method_count/GET': 5,
 'downloader/response_bytes': 22472,
 'downloader/response_count': 5,
 'downloader/response_status_count/200': 3,
 'downloader/response_status_count/301': 2,
 'elapsed_time_seconds': 140.663172,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 4, 1, 16, 22, 41, 148946),
 'log_count/DEBUG': 5,
 'log_count/ERROR': 1,
 'log_count/INFO': 12,
 'request_depth_max': 1,
 'response_received_count': 3,
 'scheduler/dequeued': 5,
 'scheduler/dequeued/memory': 5,
 'scheduler/enqueued': 5,
 'scheduler/enqueued/memory': 5,
 'spider_exceptions/AttributeError': 1,
 'start_time': datetime.datetime(2021, 4, 1, 16, 20, 20, 485774)}
2021-04-01 16:22:41 [scrapy.core.engine] INFO: Spider closed (finished)
    

Thanks everyone!

Use it with suppress.
Sorry, I'm new to scrapy. How and where would I do this? Thanks
Wrap the relevant part in with suppress(AttributeError):, as sketched below:
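A minimal sketch of what that might look like in the question's spider. It reuses the question's imports (scrapy, re) and field names, leaves parse() and the helpers unchanged, and silently skipping the whole item when the date cannot be parsed is an assumption, not necessarily what you want:

from contextlib import suppress  # standard library

class SgbdSpider(scrapy.Spider):
    # name, start_urls and parse() unchanged from the question

    def extractPDF(self, response):
        title = response.meta.get("title")
        # getDate()/getDay() raise AttributeError when re.search() finds no
        # match in the title; suppress() swallows that error, so the item is
        # simply not yielded and the crawl keeps going.
        with suppress(AttributeError):
            date = self.getDate(title)
            comNum = re.search("[N°][0-9]{1,3}", title, re.IGNORECASE)
            yield {
                "link": response.css(".file a::attr(href)").get(),
                "title": title,
                "article": response.meta.get("article"),
                "page": response.meta.get("page"),
                "comNum": int(comNum.group(0).split("°")[-1]),
                "day": date.get("day"),
                "month": date.get("month"),
                "year": date.get("year"),
            }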
Thanks for your help. I also found the source of the error. Thank you!
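For reference, the traceback points at getDay(): the failing page's title ("... n°389 ... jeudi mars 2021 ...") most likely carries no day number after the weekday, so the second re.search() returns None and .group(0) raises the AttributeError. A hedged sketch of a more defensive helper, assuming you prefer to return None rather than raise (this is not necessarily the fix the asker settled on):

    def getDay(self, text):
        # Return None instead of raising when the title has no day number.
        days = "(lundi|mardi|mercredi|mércredi|jeudi|vendredi|samedi|dimanche)"
        full = re.search(days + r".\d+", text, re.IGNORECASE)
        day = re.search(days, text, re.IGNORECASE)
        if full is None or day is None:
            return None
        return int(full.group(0).split(day.group(0))[-1].strip())

getDate() (and the yield in extractPDF) would then need to check for None before using the result, otherwise the same crash just moves one frame up.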