Python: Crawled 0 pages (at 0 pages/min), scraped 0 items
Tags: python, python-2.7, web-scraping, scrapy, screen-scraping

Hello, beautiful programmers! I am facing a problem that I cannot solve, so please help me. I am trying to scrape olx.com.pk, but I am not getting any results. I have tried different approaches, but none of them work. I would really appreciate your help.

P.S. I have already checked my selectors in the Scrapy shell.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from olx.items import OlxItem


class ElectronicsSpider(CrawlSpider):
    name = "electronics"
    allowed_domains = ["www.olx.com.pk"]
    start_urls = [
        'https://www.olx.com.pk/computers-accessories/'
    ]

    rules = (
        Rule(LinkExtractor(allow=(), restrict_css=('.pageNextPrev',)),
             callback="parse_item",
             follow=False),
    )

    def parse_item(self, response):
        item_links = response.css('.large > .detailsLink::attr(href)').extract()
        for a in item_links:
            yield scrapy.Request(a, callback=self.parse_detail_page)

    def parse_detail_page(self, response):
        title = response.css('h1::text').extract()[0].strip()
        price = response.css('.pricelabel > strong::text').extract()[0]

        item = OlxItem()
        item['title'] = title
        item['price'] = price
        item['url'] = response.url
        yield item
This is the output:
scrapy crawl electronics
2018-07-10 14:29:33 [scrapy] INFO: Scrapy 1.0.3 started (bot: olx)
2018-07-10 14:29:33 [scrapy] INFO: Optional features available: ssl, http11, boto
2018-07-10 14:29:33 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'olx.spiders', 'FEED_FORMAT': 'csv', 'SPIDER_MODULES': ['olx.spiders'], 'FEED_URI': 'logs/%(name)s/%(time)s.csv', 'BOT_NAME': 'olx'}
2018-07-10 14:29:34 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2018-07-10 14:29:34 [boto] DEBUG: Retrieving credentials from metadata server.
2018-07-10 14:29:35 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url
r = opener.open(req, timeout=timeout)
File "/usr/lib/python2.7/urllib2.py", line 429, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 447, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open
raise URLError(err)
URLError: <urlopen error timed out>
2018-07-10 14:29:35 [boto] ERROR: Unable to read instance data, giving up
2018-07-10 14:29:35 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2018-07-10 14:29:35 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2018-07-10 14:29:35 [scrapy] INFO: Enabled item pipelines:
2018-07-10 14:29:35 [scrapy] INFO: Spider opened
2018-07-10 14:29:35 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-07-10 14:29:35 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6028
2018-07-10 14:29:37 [scrapy] DEBUG: Crawled (200) <GET https://www.olx.com.pk/computers-accessories/> (referer: None)
2018-07-10 14:29:38 [scrapy] DEBUG: Crawled (200) <GET https://www.olx.com.pk/computers-accessories/?page=2> (referer: https://www.olx.com.pk/computers-accessories/)
2018-07-10 14:29:38 [scrapy] INFO: Closing spider (finished)
2018-07-10 14:29:38 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 601,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 54431,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 7, 10, 9, 29, 38, 323590),
'log_count/DEBUG': 4,
'log_count/ERROR': 2,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2018, 7, 10, 9, 29, 35, 178414)}
2018-07-10 14:29:38 [scrapy] INFO: Spider closed (finished)
Your CSS selector in parse_item() does not seem to match anything.

Looking at the page, I can see links with the class detailsLinkPromoted, but no detailsLink.

Also, since you are already using a CrawlSpider, why are you writing manual link-extraction code instead of simply creating another Rule?

As stranac said, the CSS selector seems to be wrong. Here is a non-generic one:
item_links = response.css('li[class*=lpv-item\ offer\ onclick] > .lpv-item-link::attr(href)').extract()
This will give you the URLs of the products.
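For reference, here is a minimal sketch of what "simply creating another Rule" could look like, combining the two suggestions above. The detailsLinkPromoted class comes from stranac's observation, and OLX's markup changes over time, so treat both selectors as assumptions to verify in the Scrapy shell before relying on them:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from olx.items import OlxItem


class ElectronicsSpider(CrawlSpider):
    name = "electronics"
    allowed_domains = ["www.olx.com.pk"]
    start_urls = ['https://www.olx.com.pk/computers-accessories/']

    rules = (
        # Keep following the pagination links instead of stopping after one hop.
        Rule(LinkExtractor(restrict_css=('.pageNextPrev',)), follow=True),
        # Let the CrawlSpider extract the ad links itself, so no manual
        # parse_item() is needed. The class name is the one observed above
        # and may differ on the live site.
        Rule(LinkExtractor(restrict_css=('.detailsLinkPromoted',)),
             callback='parse_detail_page', follow=False),
    )

    def parse_detail_page(self, response):
        item = OlxItem()
        item['title'] = response.css('h1::text').extract()[0].strip()
        item['price'] = response.css('.pricelabel > strong::text').extract()[0]
        item['url'] = response.url
        yield item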
Why not parse the site directly in this step? You don't need to issue a new request.

Welcome to Stack Overflow. Please take a deep breath and edit your post. With all the caps and the cries for help you put in there, it is hard to understand what you did. Please describe your problem and tell us what you have done so far. For guidance, take a look at that page. Thank you.

Actually, I am a noob. I have just started learning and was simply following this.

It only gets 42 results. How can I get 1000 results?

I only checked the part with the item links. The Rule for pageNext has to be changed as well.
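Since the question and the last comment both mention checking selectors in the Scrapy shell, here is a minimal session sketch for comparing the selectors discussed above. The comments describe the results one would expect from this discussion, not captured output:

scrapy shell 'https://www.olx.com.pk/computers-accessories/'

# The original selector; per the answer it matches nothing,
# which would explain the 0 scraped items:
response.css('.large > .detailsLink::attr(href)').extract()

# The class stranac spotted on the listing page (an assumption
# about the current markup):
response.css('.detailsLinkPromoted::attr(href)').extract()

# The pagination region the Rule restricts link extraction to:
response.css('.pageNextPrev').extract()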