Python Scrapy: Can't understand this log about robots.txt


My question is whether this log means the website can't be scraped. I changed my user agent to a browser one, but it didn't help. I also tried omitting the 's' in start_requests, but that didn't help either. I even set ROBOTSTXT_OBEY = False in settings.py, and nothing helped.
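For reference, this is roughly what those two changes in settings.py look like (a sketch; the user-agent string below is only an example, not the exact one I used):

# settings.py (sketch)
ROBOTSTXT_OBEY = False  # do not fetch or honor robots.txt
# Example browser user agent; any recent browser UA string works the same way.
USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/87.0.4280.88 Safari/537.36")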

This is the log I get:

2020-11-17 18:06:41 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-11-17 18:06:41 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-11-17 18:06:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://podcasts.apple.com/robots.txt> (referer: None)
2020-11-17 18:06:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://podcasts.apple.com/us/genre/podcasts-arts/id1301> (referer: None)
2020-11-17 18:06:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://podcasts.apple.com/us/genre/podcasts-arts/id1301> (referer: https://podcasts.apple.com/us/genre/podcasts-arts/id1301)
2020-11-17 18:06:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://podcasts.apple.com/us/genre/podcasts-arts-books/id1482> (referer: https://podcasts.apple.com/us/genre/podcasts-arts/id1301)
2020-11-18 17:29:49 [scrapy.core.engine] INFO: Closing spider (finished)
2020-11-18 17:29:49 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1342,
 'downloader/request_count': 4,
 'downloader/request_method_count/GET': 4,
 'downloader/response_bytes': 67297,
 'downloader/response_count': 4,
 'downloader/response_status_count/200': 4,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 11, 18, 13, 59, 49, 133234),
 'httpcache/hit': 4,
 'log_count/DEBUG': 5,
 'log_count/INFO': 9,
 'request_depth_max': 2,
 'response_received_count': 4,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2020, 11, 18, 13, 59, 48, 758371)}
2020-11-18 17:29:49 [scrapy.core.engine] INFO: Spider closed (finished)
Can someone help me understand what the problem is and how I can solve it?

Thank you.

--- EDIT 1 ---

The log I get after changing the allowed_domains part:

2020-11-18 13:49:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-11-18 13:49:18 [scrapy.extensions.httpcache] DEBUG: Using filesystem cache storage in C:\Users\shima\projects\apple_podcasts\.scrapy\httpcache
2020-11-18 13:49:18 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-11-18 13:49:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://podcasts.apple.com/robots.txt> (referer: None) ['cached']
2020-11-18 13:49:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://podcasts.apple.com/us/genre/podcasts-arts/id1301> (referer: None) ['cached']
2020-11-18 13:49:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://podcasts.apple.com/us/genre/podcasts-arts/id1301> (referer: https://podcasts.apple.com/us/genre/podcasts-arts/id1301) ['cached']
2020-11-18 13:49:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://podcasts.apple.com/us/genre/podcasts-arts-books/id1482> (referer: https://podcasts.apple.com/us/genre/podcasts-arts/id1301) ['cached']
--- EDIT 2 ---

The log I get after removing the "try and except" statements:

2020-11-18 13:53:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-11-18 13:53:07 [scrapy.extensions.httpcache] DEBUG: Using filesystem cache storage in C:\Users\shima\projects\apple_podcasts\.scrapy\httpcache
2020-11-18 13:53:07 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-11-18 13:53:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://podcasts.apple.com/robots.txt> (referer: None) ['cached']
2020-11-18 13:53:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://podcasts.apple.com/us/genre/podcasts-arts/id1301> (referer: None) ['cached']
2020-11-18 13:53:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://podcasts.apple.com/us/genre/podcasts-arts/id1301> (referer: https://podcasts.apple.com/us/genre/podcasts-arts/id1301) ['cached']
2020-11-18 13:53:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://podcasts.apple.com/us/genre/podcasts-arts-books/id1482> (referer: https://podcasts.apple.com/us/genre/podcasts-arts/id1301) ['cached']

There are no errors in the execution log; the page is crawled successfully:

2020-11-17 18:06:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://podcasts.apple.com/us/genre/podcasts-arts/id1301> (referer: None)
For 'content' and 'selectedgenre' you used @class instead of @id. Your selector was:

alphabets = response.xpath("//div[@class='content']/div[@class='padder']/div[@class='selectedgenre']")

The XPath should be:

//div[@id='content']/div[@class='padder']/div[@id='selectedgenre']

Note that this will return only one selector, so you should not iterate over it. What you probably want to iterate over are the links inside it (see the sketch below).
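A minimal sketch of how the corrected selector could be used inside your parse method (the inner ul/li/a link path is an assumption about the page markup, not taken from your spider):

# Select the single genre <div> by id, then iterate over its links.
selectedgenre = response.xpath(
    "//div[@id='content']/div[@class='padder']/div[@id='selectedgenre']"
)
# The ".//ul[@class='list alpha']/li/a/@href" path below is assumed;
# adjust it to the real markup.
for href in selectedgenre.xpath(".//ul[@class='list alpha']/li/a/@href").getall():
    yield response.follow(href, callback=self.parse)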

This will also return empty:

alphabet_link = alphabet.xpath(".//ul[@class='list alpha']/li/a[@class='selected']/@href").get()

because, by default, there is no a element with the class 'selected' at the location your XPath points to.
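If the goal is to collect all of the letter links rather than a "selected" one (an assumption about the intent), one possible fix is to drop the class predicate:

# Possible fix (assumed intent): take every letter link, not a "selected" one.
alphabet_links = alphabet.xpath(".//ul[@class='list alpha']/li/a/@href").getall()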


I suspect more of your XPaths are broken, so you should inspect the selectors to make sure each XPath is correct and returns what you expect. There is no robots.txt-related problem here, by the way.
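A convenient way to verify selectors is the Scrapy shell, for example:

scrapy shell "https://podcasts.apple.com/us/genre/podcasts-arts/id1301"
>>> response.xpath("//div[@id='content']/div[@class='padder']/div[@id='selectedgenre']")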

You don't have to add www in allowed_domains. Try replacing www.podcasts.apple.com with podcasts.apple.com in the allowed_domains variable.

I guess the problem is that you wrapped all of the class methods in try/except blocks, which prevents the spider from executing properly. You should remove them and fix the indentation.

Thank you, @ShubhamKadam. I did what you said and edited the post with the log I got (EDIT 1).

Thanks, @Patrick Klein, I removed the try/except and edited the post with the log I got in EDIT 2. Does this mean I am banned or something?

Thanks, @renatodvc. I edited the post to include all of the spider code. I had guessed it might not be necessary, so I had only posted part of it; in fact I really am asking it to scrape that many elements, and you can now see everything above. The problem is that the output.xml file gets created, but when I try to open it I get an "XML parsing error". According to the log, no items were scraped.
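Putting the two comment suggestions together, a minimal spider skeleton could look like this (the class name, spider name, and callback logic are assumptions; only the domain, start URL, and the "no try/except" advice come from the thread):

import scrapy

class ApplePodcastsSpider(scrapy.Spider):  # class/name are assumptions
    name = "apple_podcasts"
    allowed_domains = ["podcasts.apple.com"]  # no "www" prefix
    start_urls = ["https://podcasts.apple.com/us/genre/podcasts-arts/id1301"]

    def parse(self, response):
        # No try/except around the whole method: let failures show up
        # in the log instead of being silently swallowed.
        for href in response.xpath(
            "//div[@id='content']/div[@class='padder']"
            "/div[@id='selectedgenre']//a/@href"
        ).getall():
            yield response.follow(href, callback=self.parse)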