
Python scrapy SitemapSpider callbacks not being called

Tags: python, xml, web-scraping, scrapy, web-crawler

I read the documentation for the SitemapSpider class here:

Here is my code:

import scrapy
from scrapy import Request


class CurrentHarvestSpider(scrapy.spiders.SitemapSpider):
    name = "newegg"
    allowed_domains = ["newegg.com"]
    sitemap_urls = ['http://www.newegg.com/Siteindex_USA.xml']
    # if I comment this out, then the parse function should be called by default for every link, but it doesn't
    sitemap_rules = [('/Product', 'parse_product_url'), ('product', 'parse_product_url')]
    sitemap_follow = ['/newegg_sitemap_product', '/Product']

    def parse(self, response):
        with open("/home/dan/debug/newegg_crawler.log", "a") as log:
            log.write("logging from parse " + response.url)
        self.this_function_does_not_exist()
        yield Request(response.url, callback=self.some_callback)

    def some_callback(self, response):
        with open("/home/dan/debug/newegg_crawler.log", "a") as log:
            log.write("logging from some_callback " + response.url)
        self.this_function_does_not_exist()

    def parse_product_url(self, response):
        with open("/home/dan/debug/newegg_crawler.log", "a") as log:
            log.write("logging from parse_product_url " + response.url)
        self.this_function_does_not_exist()
This runs successfully once scrapy is installed. Run

pip install scrapy

to get scrapy, and from the working directory execute

scrapy crawl newegg


My question is: why are none of these callbacks being called? The documentation claims that the callbacks defined in
sitemap_rules should be called. If I comment sitemap_rules out, then
parse()
should be called by default, but it still never runs. Are the docs just 100% wrong? I am checking the log file I set up, and nothing gets written to it, even though I set its permissions to 777. I am also calling a function that does not exist, which would raise an error if the callbacks ever ran, but no error occurs. What am I doing wrong?
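
For reference, here is a minimal sketch of how sitemap_rules is normally expected to dispatch callbacks when the sitemap itself is well-formed; example.com and the names below are placeholders, not part of the original question:

import scrapy


class MinimalSitemapSpider(scrapy.spiders.SitemapSpider):
    # minimal sketch against a hypothetical, well-formed sitemap;
    # example.com is a placeholder domain
    name = "minimal"
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    # each rule pairs a URL regex with the name of the callback
    # that should handle responses for matching <loc> entries
    sitemap_rules = [('/product/', 'parse_product')]

    def parse_product(self, response):
        self.logger.info("parsing %s", response.url)

If callbacks do not fire even in a setup like this, the problem usually lies in the sitemap fetching or parsing stage, which is what the answer below diagnoses.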

Here is what I see on the console when I run your spider:

$ scrapy runspider op.py 
2016-11-09 21:34:51 [scrapy] INFO: Scrapy 1.2.1 started (bot: scrapybot)
(...)
2016-11-09 21:34:51 [scrapy] INFO: Spider opened
2016-11-09 21:34:51 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-11-09 21:34:51 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-11-09 21:34:51 [scrapy] DEBUG: Crawled (200) <GET http://www.newegg.com/Siteindex_USA.xml> (referer: None)
2016-11-09 21:34:53 [scrapy] DEBUG: Crawled (200) <GET http://www.newegg.com/Sitemap/USA/newegg_sitemap_product01.xml.gz> (referer: http://www.newegg.com/Siteindex_USA.xml)
2016-11-09 21:34:53 [scrapy] ERROR: Spider error processing <GET http://www.newegg.com/Sitemap/USA/newegg_sitemap_product01.xml.gz> (referer: http://www.newegg.com/Siteindex_USA.xml)
Traceback (most recent call last):
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/spiders/sitemap.py", line 44, in _parse_sitemap
    s = Sitemap(body)
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/utils/sitemap.py", line 17, in __init__
    rt = self._root.tag
AttributeError: 'NoneType' object has no attribute 'tag'
The traceback shows Sitemap(body) failing because the body it receives is not valid XML: Scrapy's compression handling has already stripped one gzip layer, yet the result is still a gzip stream, so the XML parser finds no root element and self._root.tag raises AttributeError on None. In other words, newegg's .xml.gz sitemaps are compressed twice. Logging the raw bodies confirms that they need to be gunzipped a second time:

$ scrapy runspider spider.py 
2016-11-09 13:10:56 [scrapy] INFO: Scrapy 1.2.1 started (bot: scrapybot)
(...)
2016-11-09 13:10:56 [scrapy] INFO: Spider opened
2016-11-09 13:10:56 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-11-09 13:10:56 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-11-09 13:10:57 [scrapy] DEBUG: Crawled (200) <GET http://www.newegg.com/Siteindex_USA.xml> (referer: None)
2016-11-09 13:10:57 [newegg] DEBUG: body[:32]: '\xef\xbb\xbf<?xml version="1.0" encoding='
2016-11-09 13:10:57 [scrapy] DEBUG: Crawled (200) <GET http://www.newegg.com/Sitemap/USA/newegg_sitemap_store01.xml.gz> (referer: http://www.newegg.com/Siteindex_USA.xml)
2016-11-09 13:10:57 [newegg] DEBUG: body[:32]: '\x1f\x8b\x08\x08\xda\xef\x1eX\x00\x0bnewegg_sitemap_store01'
2016-11-09 13:10:57 [newegg] DEBUG: body_unzipped_again[:32]: '\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
2016-11-09 13:10:57 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.newegg.com/Hubs/SubCategory/ID-26> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2016-11-09 13:10:59 [scrapy] DEBUG: Crawled (200) <GET http://www.newegg.com/Sitemap/USA/newegg_sitemap_product15.xml.gz> (referer: http://www.newegg.com/Siteindex_USA.xml)
2016-11-09 13:10:59 [newegg] DEBUG: body[:32]: '\x1f\x8b\x08\x08\xe3\xfa\x1eX\x00\x0bnewegg_sitemap_product'
2016-11-09 13:10:59 [newegg] DEBUG: body_unzipped_again[:32]: '\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
(...)
2016-11-09 13:11:02 [scrapy] DEBUG: Crawled (200) <GET http://www.newegg.com/Product/Product.aspx?Item=9SIA04Y0766512> (referer: http://www.newegg.com/Sitemap/USA/newegg_sitemap_product15.xml.gz)
(...)
2016-11-09 13:11:02 [newegg] INFO: parsing 'http://www.newegg.com/Product/Product.aspx?Item=9SIA04Y0766512'
(...)
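
A possible workaround, not part of the original answer, is to gunzip the sitemap body a second time whenever it still starts with the gzip magic number. This sketch overrides SitemapSpider's internal _get_sitemap_body() hook (a private method in scrapy/spiders/sitemap.py, so it may change between Scrapy versions):

import gzip
import io

import scrapy


class DoubleGzipSitemapSpider(scrapy.spiders.SitemapSpider):
    # hypothetical spider name; rules trimmed to the product pages
    name = "newegg_workaround"
    sitemap_urls = ['http://www.newegg.com/Siteindex_USA.xml']
    sitemap_rules = [('/Product', 'parse_product_url')]

    def _get_sitemap_body(self, response):
        # let Scrapy perform its normal decompression first
        body = super(DoubleGzipSitemapSpider, self)._get_sitemap_body(response)
        # if the result still begins with the gzip magic number,
        # the sitemap was compressed twice: gunzip it once more
        if body is not None and body[:2] == b'\x1f\x8b':
            body = gzip.GzipFile(fileobj=io.BytesIO(body)).read()
        return body

    def parse_product_url(self, response):
        self.logger.info("parsing %r", response.url)

With the body decompressed to actual XML, Sitemap(body) can parse it, and the callbacks declared in sitemap_rules should then fire.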
The loc entries in newegg's sitemap index seem to point to compressed .gz files. Could you update your question with the command-line logs? FYI, I have opened an issue about this on scrapy.

Interesting. But what does gunzipping twice have to do with the callbacks not being called? Does it fail somewhere, and where exactly? Shouldn't it still call the callbacks, or do something? I don't understand; is it failing silently? That's ironic, because I find it far too loud, which is why I tried logging debug output manually. I can't cope with sifting and filtering through the mass of unnecessary junk printed to stdout; it makes my eyes bleed. Nor will I bother fiddling with the poorly documented and convoluted logger module; I tried, and it took too long. Scrapy needs some work.

See my updated answer. For me, when I run your spider, scrapy says very loudly that there is an error. I don't know what you mean by "sifting and filtering through the mass of unnecessary junk printed to stdout". If you mean the scrapy logs, that "junk" is exactly what shows there is a problem. If you mean something else, you don't need to be disparaging to make your point. "Scrapy needs some work": that's for sure, and it needs the user community to feed in their different use cases so the code can improve. Thanks for the report; you could also help fix the bug if you like.

It's really more of a me problem than a scrapy problem. You have an excerpt, and I can read it, but next to the rest of the cluttered output it's a drop in the bucket. For me it's like hunting for a needle in a haystack: I keep rereading lines, losing my place, and getting overwhelmed. I was lucky you found what you did in the console output. My complaint is that usually, when a program hits a runtime error, it stops execution and returns the error, right? Well, not scrapy: it throws a needle into the haystack and keeps going. Actually, I take all of that back; I wasn't thorough. I had modified this code slightly, and yesterday it neither threw an error nor called the callbacks (probably for the reasons you outline in your answer). I tweaked the code for this question without testing that it still reproduced the problem, and the tweaked code now throws errors I can read just fine. Sorry, scrapy, I didn't mean to disparage you. And thanks for the help and for expanding the answer; it turned out to be very helpful and things make much more sense now.