
Python scrapy SitemapSpider callbacks not being called

Tags: python, xml, web-scraping, scrapy, web-crawler

I read the documentation for the SitemapSpider class here:

Here is my code:

import scrapy
from scrapy import Request


class CurrentHarvestSpider(scrapy.spiders.SitemapSpider):
    name = "newegg"
    allowed_domains = ["newegg.com"]
    sitemap_urls = ['http://www.newegg.com/Siteindex_USA.xml']
    # if I comment this out, then the parse function should be called by default for every link, but it doesn't
    sitemap_rules = [('/Product', 'parse_product_url'), ('product', 'parse_product_url')]
    sitemap_follow = ['/newegg_sitemap_product', '/Product']

    def parse(self, response):
        with open("/home/dan/debug/newegg_crawler.log", "a") as log:
            log.write("logging from parse " + response.url)
        self.this_function_does_not_exist()
        yield Request(response.url, callback=self.some_callback)

    def some_callback(self, response):
        with open("/home/dan/debug/newegg_crawler.log", "a") as log:
            log.write("logging from some_callback " + response.url)
        self.this_function_does_not_exist()

    def parse_product_url(self, response):
        with open("/home/dan/debug/newegg_crawler.log", "a") as log:
            log.write("logging from parse_product_url " + response.url)
        self.this_function_does_not_exist()
This runs successfully once scrapy is installed. Run

pip install scrapy

to get scrapy, and from the working directory execute

scrapy crawl newegg


My question is: why are none of these callbacks being called? The documentation claims that the callbacks defined in
sitemap_rules should be called. If I comment sitemap_rules out, then
parse()
should be called by default, but it still never runs. Are the docs just 100% wrong? I am checking the log file I set up, and nothing gets written to it, even though I set its permissions to 777. I am also calling a function that does not exist, which would raise an error if the callbacks ever ran, but no error occurs. What am I doing wrong?
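
For reference, here is a minimal sketch of how sitemap_rules is normally expected to dispatch callbacks when the sitemap itself is well-formed; example.com and the names below are placeholders, not part of the original question:

import scrapy


class MinimalSitemapSpider(scrapy.spiders.SitemapSpider):
    # minimal sketch against a hypothetical, well-formed sitemap;
    # example.com is a placeholder domain
    name = "minimal"
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    # each rule pairs a URL regex with the name of the callback
    # that should handle responses for matching <loc> entries
    sitemap_rules = [('/product/', 'parse_product')]

    def parse_product(self, response):
        self.logger.info("parsing %s", response.url)

If callbacks do not fire even in a setup like this, the problem usually lies in the sitemap fetching or parsing stage, which is what the answer below diagnoses.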

Here is what I see on the console when I run your spider:

$ scrapy runspider op.py 
2016-11-09 21:34:51 [scrapy] INFO: Scrapy 1.2.1 started (bot: scrapybot)
(...)
2016-11-09 21:34:51 [scrapy] INFO: Spider opened
2016-11-09 21:34:51 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-11-09 21:34:51 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-11-09 21:34:51 [scrapy] DEBUG: Crawled (200) <GET http://www.newegg.com/Siteindex_USA.xml> (referer: None)
2016-11-09 21:34:53 [scrapy] DEBUG: Crawled (200) <GET http://www.newegg.com/Sitemap/USA/newegg_sitemap_product01.xml.gz> (referer: http://www.newegg.com/Siteindex_USA.xml)
2016-11-09 21:34:53 [scrapy] ERROR: Spider error processing <GET http://www.newegg.com/Sitemap/USA/newegg_sitemap_product01.xml.gz> (referer: http://www.newegg.com/Siteindex_USA.xml)
Traceback (most recent call last):
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/spiders/sitemap.py", line 44, in _parse_sitemap
    s = Sitemap(body)
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/utils/sitemap.py", line 17, in __init__
    rt = self._root.tag
AttributeError: 'NoneType' object has no attribute 'tag'
The traceback shows Sitemap(body) failing because the body it receives is not valid XML: Scrapy's compression handling has already stripped one gzip layer, yet the result is still a gzip stream, so the XML parser finds no root element and self._root.tag raises AttributeError on None. In other words, newegg's .xml.gz sitemaps are compressed twice. Logging the raw bodies confirms that they need to be gunzipped a second time:

$ scrapy runspider spider.py 
2016-11-09 13:10:56 [scrapy] INFO: Scrapy 1.2.1 started (bot: scrapybot)
(...)
2016-11-09 13:10:56 [scrapy] INFO: Spider opened
2016-11-09 13:10:56 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-11-09 13:10:56 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-11-09 13:10:57 [scrapy] DEBUG: Crawled (200) <GET http://www.newegg.com/Siteindex_USA.xml> (referer: None)
2016-11-09 13:10:57 [newegg] DEBUG: body[:32]: '\xef\xbb\xbf<?xml version="1.0" encoding='
2016-11-09 13:10:57 [scrapy] DEBUG: Crawled (200) <GET http://www.newegg.com/Sitemap/USA/newegg_sitemap_store01.xml.gz> (referer: http://www.newegg.com/Siteindex_USA.xml)
2016-11-09 13:10:57 [newegg] DEBUG: body[:32]: '\x1f\x8b\x08\x08\xda\xef\x1eX\x00\x0bnewegg_sitemap_store01'
2016-11-09 13:10:57 [newegg] DEBUG: body_unzipped_again[:32]: '\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
2016-11-09 13:10:57 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.newegg.com/Hubs/SubCategory/ID-26> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2016-11-09 13:10:59 [scrapy] DEBUG: Crawled (200) <GET http://www.newegg.com/Sitemap/USA/newegg_sitemap_product15.xml.gz> (referer: http://www.newegg.com/Siteindex_USA.xml)
2016-11-09 13:10:59 [newegg] DEBUG: body[:32]: '\x1f\x8b\x08\x08\xe3\xfa\x1eX\x00\x0bnewegg_sitemap_product'
2016-11-09 13:10:59 [newegg] DEBUG: body_unzipped_again[:32]: '\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
(...)
2016-11-09 13:11:02 [scrapy] DEBUG: Crawled (200) <GET http://www.newegg.com/Product/Product.aspx?Item=9SIA04Y0766512> (referer: http://www.newegg.com/Sitemap/USA/newegg_sitemap_product15.xml.gz)
(...)
2016-11-09 13:11:02 [newegg] INFO: parsing 'http://www.newegg.com/Product/Product.aspx?Item=9SIA04Y0766512'
(...)
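
A possible workaround, not part of the original answer, is to gunzip the sitemap body a second time whenever it still starts with the gzip magic number. This sketch overrides SitemapSpider's internal _get_sitemap_body() hook (a private method in scrapy/spiders/sitemap.py, so it may change between Scrapy versions):

import gzip
import io

import scrapy


class DoubleGzipSitemapSpider(scrapy.spiders.SitemapSpider):
    # hypothetical spider name; rules trimmed to the product pages
    name = "newegg_workaround"
    sitemap_urls = ['http://www.newegg.com/Siteindex_USA.xml']
    sitemap_rules = [('/Product', 'parse_product_url')]

    def _get_sitemap_body(self, response):
        # let Scrapy perform its normal decompression first
        body = super(DoubleGzipSitemapSpider, self)._get_sitemap_body(response)
        # if the result still begins with the gzip magic number,
        # the sitemap was compressed twice: gunzip it once more
        if body is not None and body[:2] == b'\x1f\x8b':
            body = gzip.GzipFile(fileobj=io.BytesIO(body)).read()
        return body

    def parse_product_url(self, response):
        self.logger.info("parsing %r", response.url)

With the body decompressed to actual XML, Sitemap(body) can parse it, and the callbacks declared in sitemap_rules should then fire.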
The loc entries in newegg's sitemap index seem to point to compressed .gz files. Could you update your question with the command-line logs? FYI, I have opened an issue about this on scrapy.

Interesting. But what does gunzipping twice have to do with the callbacks not being called? Does it fail somewhere, and where exactly? Shouldn't it still call the callbacks, or do something? I don't understand; is it failing silently? That's ironic, because I find it far too loud, which is why I tried logging debug output manually. I can't cope with sifting and filtering through the mass of unnecessary junk printed to stdout; it makes my eyes bleed. Nor will I bother fiddling with the poorly documented and convoluted logger module; I tried, and it took too long. Scrapy needs some work.

See my updated answer. For me, when I run your spider, scrapy says very loudly that there is an error. I don't know what you mean by "sifting and filtering through the mass of unnecessary junk printed to stdout". If you mean the scrapy logs, that "junk" is exactly what shows there is a problem. If you mean something else, you don't need to be disparaging to make your point. "Scrapy needs some work": that's for sure, and it needs the user community to feed in their different use cases so the code can improve. Thanks for the report; you could also help fix the bug if you like.

It's really more of a me problem than a scrapy problem. You have an excerpt, and I can read it, but next to the rest of the cluttered output it's a drop in the bucket. For me it's like hunting for a needle in a haystack: I keep rereading lines, losing my place, and getting overwhelmed. I was lucky you found what you did in the console output. My complaint is that usually, when a program hits a runtime error, it stops execution and returns the error, right? Well, not scrapy: it throws a needle into the haystack and keeps going. Actually, I take all of that back; I wasn't thorough. I had modified this code slightly, and yesterday it neither threw an error nor called the callbacks (probably for the reasons you outline in your answer). I tweaked the code for this question without testing that it still reproduced the problem, and the tweaked code now throws errors I can read just fine. Sorry, scrapy, I didn't mean to disparage you. And thanks for the help and for expanding the answer; it turned out to be very helpful and things make much more sense now.