
Python: Scrapy spider not crawling


I'm trying to test this Scrapy CrawlSpider, but I can't figure out why it won't crawl. It is supposed to crawl one depth level from Wikipedia's Mathematics page and then return the title of every crawled page. What am I missing? Any help is much appreciated.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from Beurs.items import WikiItem

class WikiSpider(CrawlSpider):
    name = 'WikiSpider'
    allowed_domains = ['wikipedia.org']
    start_urls = ["http://en.wikipedia.org/wiki/Mathematics"]

    Rules = (
        Rule(LinkExtractor(restrict_xpaths=('//div[@class="mw-body"]//a/@href'))),
        Rule(LinkExtractor( allow=("http://en.wikipedia.org/wiki/",)),     callback='parse_item', follow=True),        
        )


def parse_item(self, response):
    sel = Selector(response)  
    rows = sel.xpath('//span[@class="innhold"]/table/tr')
    items = []

        for row in rows[1:]:
            item = WikiItem()
            item['agent'] = row.xpath('./td[1]/a/text()|./td[1]/text()').extract()
            item['org'] = row.xpath('./td[2]/text()').extract()
            item['link'] = row.xpath('./td[1]/a/@href').extract()
            item['produkt'] = row.xpath('./td[3]/text()').extract()
        items.append(item)
        return items
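As an aside on the spider definition above: CrawlSpider only picks up link-following rules from a class attribute named rules (lowercase), restrict_xpaths is meant to select the elements that contain the links rather than @href attributes, and allow patterns are treated as regular expressions. A minimal sketch of how such a spider is conventionally written (the XPath, regex and item handling here are illustrative assumptions, not the original code):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ExampleWikiSpider(CrawlSpider):
    name = 'examplewiki'
    allowed_domains = ['wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Mathematics']

    # CrawlSpider reads the lowercase `rules` attribute; a capitalised
    # `Rules` attribute is just an unused class variable, which would
    # leave the spider with nothing to follow after the start URL.
    rules = (
        Rule(
            LinkExtractor(
                # restrict_xpaths should match elements, not @href attributes.
                restrict_xpaths='//div[@id="mw-content-text"]',
                # allow is a regular expression, hence the escaped dots.
                allow=r'https?://en\.wikipedia\.org/wiki/',
            ),
            callback='parse_item',
            follow=True,
        ),
    )

    def parse_item(self, response):
        # Return the title of each crawled page, as described in the question.
        yield {'title': response.xpath('//title/text()').extract_first()}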
Settings:

BOT_NAME = 'Beurs'

SPIDER_MODULES = ['Beurs.spiders']
NEWSPIDER_MODULE = 'Beurs.spiders'
DOWNLOAD_HANDLERS = {
  's3': None,
}
DEPTH_LIMIT = 1
And the log:

C:\Users\Jan Willem\Anaconda\Beurs>scrapy crawl BeursSpider
2015-11-07 15:14:36 [scrapy] INFO: Scrapy 1.0.3 started (bot: Beurs)
2015-11-07 15:14:36 [scrapy] INFO: Optional features available: ssl, http11,    boto
2015-11-07 15:14:36 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'Beurs.spiders', 'SPIDER_MODULES': ['Beurs.spiders'], 'DEPTH_LIMIT': 1,    'BOT_NAME': 'Beurs'}
2015-11-07 15:14:36 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-11-07 15:14:36 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-11-07 15:14:36 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-11-07 15:14:36 [scrapy] INFO: Enabled item pipelines:
2015-11-07 15:14:36 [scrapy] INFO: Spider opened
2015-11-07 15:14:36 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-11-07 15:14:36 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-11-07 15:14:36 [scrapy] DEBUG: Redirecting (301) to <GET https://en.wikipedia.org/wiki/Mathematics> from <GET http://en.wikipedia.org/wiki/Mathematics>
2015-11-07 15:14:37 [scrapy] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/Mathematics> (referer: None)
2015-11-07 15:14:37 [scrapy] INFO: Closing spider (finished)
2015-11-07 15:14:37 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 530,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 60393,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/301': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 11, 7, 14, 14, 37, 274000),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2015, 11, 7, 14, 14, 36, 852000)}
2015-11-07 15:14:37 [scrapy] INFO: Spider closed (finished)

So basically your regular expression isn't quite right and your XPath needs some tweaking. I think the code below does what you want; give it a try and let me know if you need more help:

def parse_item(self, response):
    sel = Selector(response)
    rows = sel.xpath('//span[@class="innhold"]/table/tr')
    items = []

    for row in rows[1:]:
        item = WikiItem()  # WikiItem matches the import in the spider module
        item['agent'] = row.xpath('./td[1]/a/text()|./td[1]/text()').extract()
        item['org'] = row.xpath('./td[2]/text()').extract()
        item['link'] = row.xpath('./td[1]/a/@href').extract()
        item['produkt'] = row.xpath('./td[3]/text()').extract()
        items.append(item)  # appending inside the loop keeps every row, not just the last one
    return items
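The important change above is that items.append(item) now runs inside the loop, so every table row is collected rather than only the last one. The same callback is often written with yield, which avoids building the intermediate list; a sketch, assuming the same WikiItem fields:

def parse_item(self, response):
    # Yield one WikiItem per table row instead of accumulating a list.
    for row in response.xpath('//span[@class="innhold"]/table/tr')[1:]:
        item = WikiItem()
        item['agent'] = row.xpath('./td[1]/a/text()|./td[1]/text()').extract()
        item['org'] = row.xpath('./td[2]/text()').extract()
        item['link'] = row.xpath('./td[1]/a/@href').extract()
        item['produkt'] = row.xpath('./td[3]/text()').extract()
        yield item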


Thanks for the quick reply! I tried your adjustments but unfortunately couldn't get it running. The problem remains: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min). It looks like the crawl itself never happens at all, and I don't understand why. Any suggestions?

So I changed the parsing part of my code as in one of the dup's steps (see below), but I still get the same log: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min). Does anyone know what I can do?
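One way to narrow this down (a suggestion, not something from the original thread) is to test the link extractor against the live page in scrapy shell; if it returns no links, the rules rather than the parsing are the problem. The XPath and regex below are illustrative assumptions:

# Run: scrapy shell "https://en.wikipedia.org/wiki/Mathematics"
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(restrict_xpaths='//div[@id="mw-content-text"]',
                   allow=r'https?://en\.wikipedia\.org/wiki/')
links = le.extract_links(response)  # `response` is provided by the shell after the fetch
print(len(links), links[:5])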