Python Scrapy spider doesn't crawl
I'm trying to test this Scrapy CrawlSpider, but I can't figure out why it doesn't crawl. What it should do is crawl one depth level from Wikipedia's Mathematics page and then return the title of each crawled page. What am I missing? Any help is much appreciated.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from Beurs.items import WikiItem

class WikiSpider(CrawlSpider):
    name = 'WikiSpider'
    allowed_domains = ['wikipedia.org']
    start_urls = ["http://en.wikipedia.org/wiki/Mathematics"]
    Rules = (
        Rule(LinkExtractor(restrict_xpaths=('//div[@class="mw-body"]//a/@href'))),
        Rule(LinkExtractor(allow=("http://en.wikipedia.org/wiki/",)), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        sel = Selector(response)
        rows = sel.xpath('//span[@class="innhold"]/table/tr')
        items = []
        for row in rows[1:]:
            item = WikiItem()
            item['agent'] = row.xpath('./td[1]/a/text()|./td[1]/text()').extract()
            item['org'] = row.xpath('./td[2]/text()').extract()
            item['link'] = row.xpath('./td[1]/a/@href').extract()
            item['produkt'] = row.xpath('./td[3]/text()').extract()
            items.append(item)
        return items
Settings:
BOT_NAME = 'Beurs'
SPIDER_MODULES = ['Beurs.spiders']
NEWSPIDER_MODULE = 'Beurs.spiders'
DOWNLOAD_HANDLERS = {
    's3': None,
}
DEPTH_LIMIT = 1
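
As a sanity check, before touching the spider you can confirm interactively that a link extractor matches anything on the start page at all. A minimal sketch in scrapy shell (the allow pattern here is my assumption, widened to https because the log below shows a 301 redirect to https):

scrapy shell "https://en.wikipedia.org/wiki/Mathematics"
>>> from scrapy.linkextractors import LinkExtractor
>>> links = LinkExtractor(allow=(r'https?://en\.wikipedia\.org/wiki/',)).extract_links(response)
>>> len(links), [l.url for l in links[:3]]

If extract_links returns an empty list, no Rule built on that extractor will ever schedule a request.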
And the log:
C:\Users\Jan Willem\Anaconda\Beurs>scrapy crawl BeursSpider
2015-11-07 15:14:36 [scrapy] INFO: Scrapy 1.0.3 started (bot: Beurs)
2015-11-07 15:14:36 [scrapy] INFO: Optional features available: ssl, http11, boto
2015-11-07 15:14:36 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'Beurs.spiders', 'SPIDER_MODULES': ['Beurs.spiders'], 'DEPTH_LIMIT': 1, 'BOT_NAME': 'Beurs'}
2015-11-07 15:14:36 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-11-07 15:14:36 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-11-07 15:14:36 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-11-07 15:14:36 [scrapy] INFO: Enabled item pipelines:
2015-11-07 15:14:36 [scrapy] INFO: Spider opened
2015-11-07 15:14:36 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-11-07 15:14:36 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-11-07 15:14:36 [scrapy] DEBUG: Redirecting (301) to <GET https://en.wikipedia.org/wiki/Mathematics> from <GET http://en.wikipedia.org/wiki/Mathematics>
2015-11-07 15:14:37 [scrapy] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/Mathematics> (referer: None)
2015-11-07 15:14:37 [scrapy] INFO: Closing spider (finished)
2015-11-07 15:14:37 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 530,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 60393,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/301': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 11, 7, 14, 14, 37, 274000),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2015, 11, 7, 14, 14, 36, 852000)}
2015-11-07 15:14:37 [scrapy] INFO: Spider closed (finished)
So basically your regex isn't quite right and your XPath needs some tweaking. I think the code below does what you want; give it a try and let me know if you need more help:
def parse_item(self, response):
    sel = Selector(response)
    rows = sel.xpath('//span[@class="innhold"]/table/tr')
    items = []
    for row in rows[1:]:
        item = WikiItem()
        item['agent'] = row.xpath('./td[1]/a/text()|./td[1]/text()').extract()
        item['org'] = row.xpath('./td[2]/text()').extract()
        item['link'] = row.xpath('./td[1]/a/@href').extract()
        item['produkt'] = row.xpath('./td[3]/text()').extract()
        items.append(item)
    return items
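
For what it's worth, the fix that usually matters here is not the XPath: CrawlSpider only reads a lowercase rules attribute, so the capitalised Rules in the question is silently ignored and nothing beyond the start URL is ever scheduled. In addition, restrict_xpaths should select elements rather than @href attributes, and the allow pattern has to survive the http-to-https redirect visible in the log. A minimal sketch along those lines (the firstHeading XPath is an assumption about Wikipedia's markup):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class WikiSpider(CrawlSpider):
    name = 'WikiSpider'
    allowed_domains = ['wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Mathematics']

    # CrawlSpider picks up `rules`, lowercase; `Rules` is ignored without any error.
    rules = (
        Rule(
            LinkExtractor(
                allow=(r'https?://en\.wikipedia\.org/wiki/',),  # tolerate the 301 to https
                restrict_xpaths=('//div[@class="mw-body"]',),   # an element, not an @href attribute
            ),
            callback='parse_item',
            follow=True,
        ),
    )

    def parse_item(self, response):
        # Assumption: Wikipedia article titles live in <h1 id="firstHeading">.
        return {'title': response.xpath('//h1[@id="firstHeading"]/text()').extract()}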
Thanks for the quick reply! I tried your adjustments but unfortunately couldn't get it to run. The problem remains: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min). It looks like the crawling itself simply isn't happening, and I don't understand why. Any suggestions?

So I changed the parsing part of the code as in one of the steps from the duplicate (see below), but I still get the same log: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min). Does anyone know what I can do?
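
One way to narrow this down from the command line is scrapy parse, which fetches a single URL and, with --rules, uses the CrawlSpider rules to pick the callback; if no rule matches (for example because the attribute is still named Rules instead of rules), it says so instead of silently crawling nothing. A sketch, assuming the spider is registered under the name WikiSpider:

scrapy parse --spider=WikiSpider --rules "https://en.wikipedia.org/wiki/Mathematics"

Note also that the log above was produced by scrapy crawl BeursSpider, while the spider in the question sets name = 'WikiSpider'; whichever name is correct, it must match what is passed to scrapy crawl.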