Why is Scrapy suddenly giving me "unpredictable" AttributeErrors, claiming no attribute 'css'?
For my job I built a Scrapy spider that quickly checks roughly 200-500 website landing pages to confirm the pages are functioning, aside from 400-style errors — e.g. checking that a page isn't showing "out of stock". The check runs against roughly 30 different websites under my purview, all of which use the same page structure.

This worked fine, every day, for four months.

Then, about four weeks ago, without any code changes, I suddenly started getting unpredictable errors:

url_title = response.css("title::text").extract_first()
AttributeError: 'Response' object has no attribute 'css'

If I run this spider, the error happens on, say, 3 out of 400 pages. If I then immediately run the spider again, the same 3 pages scrape correctly, and 4 completely different pages return the same error. Furthermore, if I run the exact same spider below, but with the mapping replaced by only those 7 failing landing pages, they scrape perfectly.

Is there something in my code that isn't quite right?

I'm attaching the entire code — apologies in advance!! — because I'm worried that something I consider extraneous may actually be the cause. This is all of it, with sensitive data replaced by ######.

I have checked every affected page, and of course the CSS is valid and the title is always present.

I have run sudo apt-get update and sudo apt-get dist-upgrade on the server running Scrapy, hoping that would help. No luck.
import scrapy
from scrapy import signals
from sqlalchemy.orm import sessionmaker
from datetime import date, datetime, timedelta
from scrapy.http.request import Request
from w3lib.url import safe_download_url
from sqlalchemy import and_, or_, not_
import smtplib
from email.MIMEMultipart import MIMEMultipart
from email.MIMEText import MIMEText
from sqlalchemy.engine import create_engine
engine = create_engine('mysql://######:######@localhost/LandingPages', pool_recycle=3600, echo=False)
#conn = engine.connect()
from LandingPageVerifier.models import LandingPagesFacebook, LandingPagesGoogle, LandingPagesSimplifi, LandingPagesScrapeLog, LandingPagesScrapeResults
Session = sessionmaker(bind=engine)
session = Session()
# today = datetime.now().strftime("%Y-%m-%d")
# thisyear = datetime.now().strftime("%Y")
# thismonth = datetime.now().strftime("%m")
# thisday = datetime.now().strftime("%d")
# start = date(year=2019,month=04,day=09)
todays_datetime = datetime(datetime.today().year, datetime.today().month, datetime.today().day)
print todays_datetime
landingpages_today_fb = session.query(LandingPagesFacebook).filter(LandingPagesFacebook.created_on >= todays_datetime).all()
landingpages_today_google = session.query(LandingPagesGoogle).filter(LandingPagesGoogle.created_on >= todays_datetime).all()
landingpages_today_simplifi = session.query(LandingPagesSimplifi).filter(LandingPagesSimplifi.created_on >= todays_datetime).all()
session.close()
#Mix 'em together!
landingpages_today = landingpages_today_fb + landingpages_today_google + landingpages_today_simplifi
#landingpages_today = landingpages_today_fb
#Do some iterating and formatting work
landingpages_today = [(u.ad_url_full, u.client_id) for u in landingpages_today]
#print landingpages_today
landingpages_today = list(set(landingpages_today))
#print 'Unique pages: '
#print landingpages_today
# unique_landingpages = [(u[0]) for u in landingpages_today]
# unique_landingpage_client = [(u[1]) for u in landingpages_today]
# print 'Pages----->', len(unique_landingpages)
class LandingPage004Spider(scrapy.Spider):
    name = 'LandingPage004Spider'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(LandingPage004Spider, cls).from_crawler(crawler, *args, **kwargs)
        #crawler.signals.connect(spider.spider_opened, signals.spider_opened)
        crawler.signals.connect(spider.spider_closed, signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        #stats = spider.crawler.stats.get_stats()
        stats = spider.crawler.stats.get_value('item_scraped_count')
        Session = sessionmaker(bind=engine)
        session = Session()
        logitem = LandingPagesScrapeLog(
            scrape_count=spider.crawler.stats.get_value('item_scraped_count'),
            is200=spider.crawler.stats.get_value('downloader/response_status_count/200'),
            is400=spider.crawler.stats.get_value('downloader/response_status_count/400'),
            is403=spider.crawler.stats.get_value('downloader/response_status_count/403'),
            is404=spider.crawler.stats.get_value('downloader/response_status_count/404'),
            is500=spider.crawler.stats.get_value('downloader/response_status_count/500'),
            scrapy_errors=spider.crawler.stats.get_value('log_count/ERROR'),
            scrapy_criticals=spider.crawler.stats.get_value('log_count/CRITICAL'),
        )
        session.add(logitem)
        session.commit()
        session.close()

    #mapping = landingpages_today
    handle_httpstatus_list = [200, 302, 404, 400, 500]
    start_urls = []

    def start_requests(self):
        for url, client_id in self.mapping:
            yield Request(url, callback=self.parse, meta={'client_id': client_id})
    def parse(self, response):
        ##DEBUG - return all scraped data
        #wholepage = response.body.lower()
        url = response.url
        if 'redirect_urls' in response.request.meta:
            redirecturl = response.request.meta['redirect_urls'][0]
            if 'utm.pag.ca' in redirecturl:
                url_shortener = response.request.meta['redirect_urls'][0]
            else:
                url_shortener = 'None'
        else:
            url_shortener = 'None'
        client_id = response.meta['client_id']
        url_title = response.css("title::text").extract_first()
        # pagesize = len(response.xpath('//*[not(descendant-or-self::script)]'))
        pagesize = len(response.body)
        HTTP_code = response.status

        ####ERROR CHECK: Small page size
        if 'instapage' in response.body.lower():
            if pagesize <= 20000:
                err_small = 1
            else:
                err_small = 0
        else:
            if pagesize <= 35000:
                err_small = 1
            else:
                err_small = 0

        ####ERROR CHECK: Page contains the phrase 'not found'
        if 'not found' in response.xpath('//*[not(descendant-or-self::script)]').extract_first().lower():
            #their sites are full of HTML errors, making scrapy unable to notice what is and is not inside a script element
            if 'dealerinspire' in response.body.lower():
                err_has_not_found = 0
            else:
                err_has_not_found = 1
        else:
            err_has_not_found = 0

        ####ERROR CHECK: Page contains the phrase 'can't be found'
        if "can't be found" in response.xpath('//*[not(self::script)]').extract_first().lower():
            err_has_cantbefound = 1
        else:
            err_has_cantbefound = 0

        ####ERROR CHECK: Page contains the phrase 'unable to locate'
        if 'unable to locate' in response.body.lower():
            err_has_unabletolocate = 1
        else:
            err_has_unabletolocate = 0

        ####ERROR CHECK: Page contains phrase 'no longer available'
        if 'no longer available' in response.body.lower():
            err_has_nolongeravailable = 1
        else:
            err_has_nolongeravailable = 0

        ####ERROR CHECK: Page contains phrase 'no service specials'
        if 'no service specials' in response.body.lower():
            err_has_noservicespecials = 1
        else:
            err_has_noservicespecials = 0

        ####ERROR CHECK: Page contains phrase 'Sorry, no' to match zero inventory for a search, which normally says "Sorry, no items matching your request were found."
        if 'sorry, no ' in response.body.lower():
            err_has_sorryno = 1
        else:
            err_has_sorryno = 0

        yield {
            'client_id': client_id,
            'url': url,
            'url_shortener': url_shortener,
            'url_title': url_title,
            'pagesize': pagesize,
            'HTTP_code': HTTP_code,
            'err_small': err_small,
            'err_has_not_found': err_has_not_found,
            'err_has_cantbefound': err_has_cantbefound,
            'err_has_unabletolocate': err_has_unabletolocate,
            'err_has_nolongeravailable': err_has_nolongeravailable,
            'err_has_noservicespecials': err_has_noservicespecials,
            'err_has_sorryno': err_has_sorryno,
        }
#E-mail settings
def sendmail(recipients, subject, body):
    fromaddr = "#######"
    toaddr = recipients
    msg = MIMEMultipart()
    msg['From'] = fromaddr
    msg['Subject'] = subject
    msg.attach(MIMEText(body, 'html'))
    server = smtplib.SMTP('########')
    server.starttls()
    server.login(fromaddr, "##########")
    text = msg.as_string()
    server.sendmail(fromaddr, recipients, text)
    server.quit()
The expected result is a perfect run with no errors.

The actual result is unpredictable AttributeErrors, claiming the attribute 'css' cannot be found on certain pages. However, if I scrape those same pages individually with the same script, they scrape just fine.

Sometimes Scrapy is unable to parse the HTML because of markup errors, and that is why you cannot call response.css(). You can catch these events in your code and analyze the broken HTML:
def parse(self, response):
    try:
        ....
        your code
        .....
    except:
        with open("Error.htm", "w") as f:
            f.write(response.body)
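A note on what actually raises the error: Scrapy only attaches `.css()`/`.xpath()` to `TextResponse`/`HtmlResponse`; when the body comes back empty or the Content-Type doesn't look like text, the callback receives a plain `Response` without those methods, which matches the 0-byte Error.htm reported later in the thread. A duck-typed guard can detect this before selecting (a minimal sketch — the helper name and the fake response classes here are illustrative stand-ins, not Scrapy API):

```python
def can_parse(response):
    """Return True when the response supports CSS selectors.

    Scrapy only provides .css()/.xpath() on TextResponse/HtmlResponse;
    an empty body or a non-text Content-Type yields a plain Response
    that lacks them.
    """
    return hasattr(response, "css")


# Minimal stand-ins to demonstrate the guard without a live crawl:
class FakeHtmlResponse(object):
    def css(self, query):
        return []

class FakePlainResponse(object):
    pass

# can_parse(FakeHtmlResponse())  -> True
# can_parse(FakePlainResponse()) -> False
```

Inside `parse`, `if not can_parse(response): ...` lets the spider log or retry instead of crashing on the attribute lookup.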
Update: you can also try checking for an empty response:

def parse(self, response):
    if not response.body:
        # dont_filter=True keeps Scrapy's duplicate filter from silently
        # dropping the re-request for a URL it has already seen
        yield scrapy.Request(url=response.url, callback=self.parse, dont_filter=True,
                             meta={'client_id': response.meta["client_id"]})
    # your original code
Are you sure you're not being rate-limited by these services? They are known to rate-limit certain bots and to serve a different page in those cases, e.g. a captcha or an access-denied page. When the AttributeError is thrown, print the response and the HTTP response code from the service — you should be able to debug it further from there.

Hi MatsLindh. It's possible, yes. This scraper only runs once a day, and checks 1-30 pages on any given site within a minute. I'm sure there's no captcha, but there could certainly be some blocking. Is there a way to make Scrapy pause between scrapes? I don't care if the scraper takes a long time to run, since it only runs once a day.

Hmm... thank you, you've got me thinking. But if I immediately run the spider again, the previously failing pages go through fine, and the other 1-29 pages on that site also go through fine. / I added DOWNLOAD_DELAY = 5 to create a 5-second delay between scrapes, and still got 3 errors.

As mentioned: when the error occurs, output the actual error and the HTTP status code — that should tell you why the lookup fails.

Thanks. Above the f.write line I added a line: print 'whiterabbitobject'. With that, the scrape finished without errors, but 'whiterabbitobject' appeared 3-7 times. Yet on the very next line there was absolutely no sign of a response. Strange, isn't it??

Hmmmm.... thanks again for your help with this. With the code above you won't get Scrapy errors. Have you checked Error.htm?

Sorry, I wasn't clear about that. When I run it with the code you added, it does create an Error.htm file, but after the run it's a 0-byte file.

That's why you get the error when you try to use the response. You can catch this and send the request again.

Oh my, I apologize — I can tell you're trying to help me, but I think I'm missing some fundamentals. Can you walk me a bit further? You said I can catch this and send the request again. To do that, should I copy everything in the try: and paste it into the except:, giving the scrape a chance to send the request again? When I tried that, it did seem to work! But it duplicates a lot of code. Or is there something I can add to the code here that tells the spider to keep trying until it gets a response? Thanks again.
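To answer the final question in the thread without copying the whole try block into the except: track an attempt counter in the request's meta and re-yield the request, bounded so a permanently broken page cannot loop forever. This is a sketch of the idea — `MAX_RETRIES`, `should_retry`, and the `retry_attempts` meta key are names of my own, not Scrapy API:

```python
MAX_RETRIES = 3  # assumption: give each failing page three extra chances

def should_retry(body, attempts, max_retries=MAX_RETRIES):
    """Retry only when the body came back empty and attempts remain."""
    return (not body) and attempts < max_retries

# With that helper, the original parse body stays in one place:
#
#     def parse(self, response):
#         attempts = response.meta.get('retry_attempts', 0)
#         if should_retry(response.body, attempts):
#             yield Request(response.url, callback=self.parse,
#                           dont_filter=True,  # bypass the duplicate filter
#                           meta={'client_id': response.meta['client_id'],
#                                 'retry_attempts': attempts + 1})
#             return
#         # ... the existing error checks run only on good responses ...

# should_retry(b'', 0)        -> True  (empty body, first attempt)
# should_retry(b'<html>', 0)  -> False (got content)
# should_retry(b'', 3)        -> False (out of retries)
```

Scrapy also has built-in retry behaviour via its RetryMiddleware settings, but that keys off HTTP status codes, whereas this guard catches the empty-body case that returned a 200.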