Scrapy Python - I keep getting 0 pages crawled
I have tried multiple tutorials, but no matter what I try I always get the same result: "Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)". My code is very simple:
import scrapy

class SpiderSpider(scrapy.Spider):
    name = 'spider'
    allowed_domains = ['books.toscrape.com/']
    start_urls = ['http://books.toscrape.com//']

    def parse(self, response):
        print(response.url)
The output is:
2020-11-03 22:11:52 [scrapy.utils.log] INFO: Scrapy 2.4.0 started (bot: books)
2020-11-03 22:11:52 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.3 (default, Jul 2 2020, 11:26:31) - [Clang 10.0.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform macOS-10.15.7-x86_64-i386-64bit
2020-11-03 22:11:52 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-11-03 22:11:52 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'books', 'NEWSPIDER_MODULE': 'books.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['books.spiders']}
2020-11-03 22:11:52 [scrapy.extensions.telnet] INFO: Telnet Password: ae1669f089ac9e66
2020-11-03 22:11:52 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2020-11-03 22:11:52 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-11-03 22:11:52 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-11-03 22:11:52 [scrapy.middleware] INFO: Enabled item pipelines: []
2020-11-03 22:11:52 [scrapy.core.engine] INFO: Spider opened
2020-11-03 22:11:52 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-11-03 22:11:52 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-11-03 22:11:53 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://books.toscrape.com/robots.txt> (referer: None)
2020-11-03 22:11:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com//> (referer: None)

It looks like the site you are scraping does not have a robots.txt.
You can disable robots.txt handling by going into your Scrapy project's settings.py, finding ROBOTSTXT_OBEY, and setting it to False. Your output shows that you have in fact crawled two pages:
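A minimal sketch of that change, assuming the default settings.py that `scrapy startproject` generates:

```python
# settings.py (generated by `scrapy startproject`)
# When True (the default in new projects), Scrapy requests /robots.txt
# before anything else and obeys it; False skips that request entirely.
ROBOTSTXT_OBEY = False
```

Note that the 404 for robots.txt does not block the crawl either way; it just shows up as an extra request in the log.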
http://books.toscrape.com/robots.txt (HTTP status 404 error)
http://books.toscrape.com// (HTTP status 200)
Everything looks like it is working (although I don't see your print statement in the output).

Comments: I tried this, but it did not solve the problem... I am still getting the same thing. Thank you! I expected the crawl log to say I had crawled at least one page, so when I saw the output say 0, I just assumed it hadn't worked.