Why is scrapy suddenly giving me an "unpredictable" AttributeError, saying there is no attribute "css"?


In my work, I built a scrapy spider that quickly checks about 200-500 website landing pages for clues that the pages are not functioning, beyond 400-style HTTP errors (e.g. checking whether the page says something is out of stock). This check runs across roughly 30 different websites under my purview, all of which use the same page structure.

This worked perfectly, every day, for about 4 months.

Then, about 4 weeks ago and without any change to the code, I suddenly started getting unpredictable errors:

url_title = response.css("title::text").extract_first()
AttributeError: 'Response' object has no attribute 'css'

If I run this spider, the error occurs on, say... 3 out of 400 pages. If the spider is then run again immediately, those same 3 pages are scraped correctly, and 4 completely different pages return the same error.

Furthermore, if I run the exact same spider as below, but replace the mapping with only those 7 erroneous landing pages, they scrape perfectly.

Is there something in my code that isn't quite right?

I'm attaching the entire code - apologies in advance!! - because I fear that something I consider superfluous may in fact be the cause. So this is the whole thing, with sensitive data replaced.

I've checked all of the affected pages, and of course the CSS is valid and the title is always present.

I've run sudo apt-get update and sudo apt-get dist-upgrade on the server running scrapy, hoping it would help. No luck.

import scrapy
from scrapy import signals
from sqlalchemy.orm import sessionmaker
from datetime import date, datetime, timedelta
from scrapy.http.request import Request
from w3lib.url import safe_download_url
from sqlalchemy import and_, or_, not_


import smtplib
from email.MIMEMultipart import MIMEMultipart
from email.MIMEText import MIMEText
from sqlalchemy.engine import create_engine
engine = create_engine('mysql://######:######@localhost/LandingPages', pool_recycle=3600, echo=False)
#conn = engine.connect()

from LandingPageVerifier.models import LandingPagesFacebook, LandingPagesGoogle, LandingPagesSimplifi, LandingPagesScrapeLog, LandingPagesScrapeResults

Session = sessionmaker(bind=engine)
session = Session()

# today = datetime.now().strftime("%Y-%m-%d")

# thisyear = datetime.now().strftime("%Y")
# thismonth = datetime.now().strftime("%m")
# thisday = datetime.now().strftime("%d")
# start = date(year=2019,month=04,day=09)

todays_datetime = datetime(datetime.today().year, datetime.today().month, datetime.today().day)
print todays_datetime

landingpages_today_fb = session.query(LandingPagesFacebook).filter(LandingPagesFacebook.created_on >= todays_datetime).all()
landingpages_today_google = session.query(LandingPagesGoogle).filter(LandingPagesGoogle.created_on >= todays_datetime).all()
landingpages_today_simplifi = session.query(LandingPagesSimplifi).filter(LandingPagesSimplifi.created_on >= todays_datetime).all()

session.close()
#Mix 'em together!
landingpages_today = landingpages_today_fb + landingpages_today_google + landingpages_today_simplifi
#landingpages_today = landingpages_today_fb

#Do some iterating and formatting work
landingpages_today = [(u.ad_url_full, u.client_id) for u in landingpages_today]
#print landingpages_today

landingpages_today = list(set(landingpages_today))

#print 'Unique pages: '
#print landingpages_today
# unique_landingpages = [(u[0]) for u in landingpages_today]
# unique_landingpage_client = [(u[1]) for u in landingpages_today]
# print 'Pages----->', len(unique_landingpages)

class LandingPage004Spider(scrapy.Spider):
    name='LandingPage004Spider'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(LandingPage004Spider, cls).from_crawler(crawler, *args, **kwargs)
        #crawler.signals.connect(spider.spider_opened, signals.spider_opened)
        crawler.signals.connect(spider.spider_closed, signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        #stats = spider.crawler.stats.get_stats() 
        stats = spider.crawler.stats.get_value('item_scraped_count'),
        Session = sessionmaker(bind=engine)
        session = Session()
        logitem = LandingPagesScrapeLog(scrape_count = spider.crawler.stats.get_value('item_scraped_count'),
                                        is200 = spider.crawler.stats.get_value('downloader/response_status_count/200'),
                                        is400 = spider.crawler.stats.get_value('downloader/response_status_count/400'),
                                        is403 = spider.crawler.stats.get_value('downloader/response_status_count/403'),
                                        is404 = spider.crawler.stats.get_value('downloader/response_status_count/404'),
                                        is500 = spider.crawler.stats.get_value('downloader/response_status_count/500'),
                                        scrapy_errors = spider.crawler.stats.get_value('log_count/ERROR'),
                                        scrapy_criticals = spider.crawler.stats.get_value('log_count/CRITICAL'),
                                        )
        session.add(logitem)
        session.commit()
        session.close()



    # list of (url, client_id) tuples consumed by start_requests below
    mapping = landingpages_today
    handle_httpstatus_list = [200, 302, 404, 400, 500]

    start_urls = []

    def start_requests(self):
        for url, client_id in self.mapping:
            yield Request(url, callback=self.parse, meta={'client_id': client_id})


    def parse(self, response):

        ##DEBUG - return all scraped data
        #wholepage = response.body.lower()

        url = response.url
        if 'redirect_urls' in response.request.meta:
            redirecturl = response.request.meta['redirect_urls'][0]
            if 'utm.pag.ca' in redirecturl:
                url_shortener = response.request.meta['redirect_urls'][0]
            else:
                url_shortener = 'None'
        else:
            url_shortener = 'None'

        client_id = response.meta['client_id']
        url_title = response.css("title::text").extract_first()
        # pagesize = len(response.xpath('//*[not(descendant-or-self::script)]'))
        pagesize = len(response.body)
        HTTP_code = response.status

        ####ERROR CHECK: Small page size
        if 'instapage' in response.body.lower():
            if pagesize <= 20000:
                err_small = 1
            else:
                err_small = 0
        else:
            if pagesize <= 35000:
                err_small = 1
            else:
                err_small = 0

        ####ERROR CHECK: Page contains the phrase 'not found'
        if 'not found' in response.xpath('//*[not(descendant-or-self::script)]').extract_first().lower():
            #their sites are full of HTML errors, making scrapy unable to notice what is and is not inside a script element
            if 'dealerinspire' in response.body.lower():
                err_has_not_found = 0
            else:
                err_has_not_found = 1
        else:
            err_has_not_found = 0

        ####ERROR CHECK: Page contains the phrase 'can't be found'
        if "can't be found" in response.xpath('//*[not(self::script)]').extract_first().lower():
            err_has_cantbefound = 1
        else:
            err_has_cantbefound = 0

        ####ERROR CHECK: Page contains the phrase 'unable to locate'
        if 'unable to locate' in response.body.lower():
            err_has_unabletolocate = 1
        else:
            err_has_unabletolocate = 0

        ####ERROR CHECK: Page contains phrase 'no longer available'
        if 'no longer available' in response.body.lower():
            err_has_nolongeravailable = 1
        else:
            err_has_nolongeravailable = 0

        ####ERROR CHECK: Page contains phrase 'no service specials'
        if 'no service specials' in response.body.lower():
            err_has_noservicespecials = 1
        else:
            err_has_noservicespecials = 0

        ####ERROR CHECK: Page contains phrase 'Sorry, no' to match zero inventory for a search, which normally says "Sorry, no items matching your request were found."
        if 'sorry, no ' in response.body.lower():
            err_has_sorryno = 1
        else:
            err_has_sorryno = 0

        yield {'client_id': client_id, 'url': url, 'url_shortener': url_shortener, 'url_title': url_title, "pagesize": pagesize, "HTTP_code": HTTP_code, "err_small": err_small, 'err_has_not_found': err_has_not_found, 'err_has_cantbefound': err_has_cantbefound, 'err_has_unabletolocate': err_has_unabletolocate, 'err_has_nolongeravailable': err_has_nolongeravailable, 'err_has_noservicespecials': err_has_noservicespecials, 'err_has_sorryno': err_has_sorryno}



#E-mail settings

def sendmail(recipients,subject,body):

            fromaddr = "#######"
            toaddr = recipients
            msg = MIMEMultipart()
            msg['From'] = fromaddr
            msg['Subject'] = subject 

            body = body
            msg.attach(MIMEText(body, 'html'))

            server = smtplib.SMTP('########')
            server.starttls()
            server.login(fromaddr, "##########")
            text = msg.as_string()
            server.sendmail(fromaddr, recipients, text)
            server.quit()
The expected result is a clean scrape with no errors.
The actual result is unpredictable AttributeErrors claiming that the attribute 'css' cannot be found on certain pages. However, if I scrape those pages individually with the same script, they scrape just fine.

Sometimes Scrapy cannot parse the HTML because of markup errors, which is why response.css cannot be called. You can catch these events in your code and analyze the broken HTML:

def parse(self, response):
    try:
        ....
        your code
        .....
    except:
        with open("Error.htm", "w") as f:
            f.write(response.body)

UPDATE: You can try checking for an empty response:

def parse(self, response):
    if not response.body:
        yield scrapy.Request(url=response.url, callback=self.parse, meta={'client_id': response.meta["client_id"]})

    # your original code
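One caveat with re-yielding the same URL: by default Scrapy's duplicate filter drops a request it has already seen, so a retry like this usually needs dont_filter=True, and it is safer to stop parsing the empty body and to cap the number of attempts so a persistently empty page cannot loop forever. A minimal sketch under those assumptions - the meta key 'empty_retries' and the limit of 2 attempts are just illustrative choices, not part of Scrapy or of the original spider:

def parse(self, response):
    # If the body came back empty, retry a couple of times before giving up.
    if not response.body:
        retries = response.meta.get('empty_retries', 0)  # illustrative counter name
        if retries < 2:
            yield scrapy.Request(
                url=response.url,
                callback=self.parse,
                dont_filter=True,  # bypass the dupefilter so the retry is not dropped
                meta={'client_id': response.meta['client_id'],
                      'empty_retries': retries + 1},
            )
        return  # skip the normal parsing when the body is empty

    # ... original parsing code continues here ...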

Comments on the question and answer:

MatsLindh: Are you sure you're not being rate limited by those services? They're known to rate limit certain bots and to show a different page in those cases, for example a captcha or an access-denied page. Output the response from the service and the HTTP response code when you get the AttributeError and you should be able to debug it further.

OP: Hi MatsLindh. It's possible, yes. This scraper only runs once daily, and checks 1-30 pages on any given site within about a minute. I'm certain there's no captcha, but there could certainly be some blocking. Is there a way to make scrapy pause between scrapes? I don't mind if the scraper takes a long time to run, since it only runs once a day. Hmmm.... thank you, you've got me thinking. But if I run the spider again immediately, those previously erroneous pages come through fine, and the other 1-29 pages from the same site come through fine too. Adding download_delay = 5 to create a 5-second delay between scrapes still resulted in 3 errors.

MatsLindh: As mentioned: when the error happens, output the actual error and the HTTP error code - that should tell you where to look for the failure.

OP: Thank you. Just above the f.write line I added a line: print 'whiterabbitobject'. With that, the scrape completes without errors, but 'whiterabbitobject' is printed 3-7 times, and in the line that follows there is absolutely no sign of the response. Odd, isn't it?? Hmmm.... thanks again for your help with this.

Answerer: With the code above you won't get scrapy errors. Did you check Error.htm?

OP: Sorry, I wasn't clear about that. When I run it with the code you added, it does create an Error.htm file, but afterwards it's a 0-byte file.

Answerer: That's why you get the error when you try to use the response. You can catch this and send the request again.

OP: Oh dear, I'm sorry - I can tell you're trying to help me, but I think I'm missing some basic knowledge. Can you take me a bit further? You said I can catch this and send the request again. To do that, should I copy everything from the try: block and paste it into the except: block, giving the scrape a chance to send the request again? When I tried that, it did seem to work! But it's a lot of duplicated code. Or can I add something to the code here that tells the spider to keep trying until it gets a response, since it succeeds if it tries again? Thanks again.
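Since rate limiting came up in the comments, and the spider can afford a long runtime, it may also be worth slowing the crawl down and letting Scrapy's built-in retry middleware handle failed downloads. A minimal per-spider settings sketch, assuming the defaults are otherwise acceptable - the delay and retry values are only examples, and note that the built-in retry covers network errors and certain HTTP status codes, not a 200 response with an empty body:

class LandingPage004Spider(scrapy.Spider):
    name = 'LandingPage004Spider'

    # Slow the crawl down and let Scrapy retry failed downloads on its own.
    custom_settings = {
        'DOWNLOAD_DELAY': 5,           # seconds between requests to the same domain
        'AUTOTHROTTLE_ENABLED': True,  # back off automatically if the server slows down
        'AUTOTHROTTLE_START_DELAY': 5,
        'RETRY_ENABLED': True,
        'RETRY_TIMES': 3,              # retry failed downloads (network errors, 5xx) up to 3 times
    }

    # ... rest of the spider unchanged ...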