Python Scrapy spider doesn't crawl
I'm trying to test this Scrapy CrawlSpider, but I can't figure out why it doesn't crawl. What it should do is crawl one depth level from Wikipedia's Mathematics page and then return the title of each crawled page. What am I missing? Any help is much appreciated.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from Beurs.items import WikiItem

class WikiSpider(CrawlSpider):
    name = 'WikiSpider'
    allowed_domains = ['wikipedia.org']
    start_urls = ["http://en.wikipedia.org/wiki/Mathematics"]
    Rules = (
        Rule(LinkExtractor(restrict_xpaths=('//div[@class="mw-body"]//a/@href'))),
        Rule(LinkExtractor(allow=("http://en.wikipedia.org/wiki/",)), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        sel = Selector(response)
        rows = sel.xpath('//span[@class="innhold"]/table/tr')
        items = []
        for row in rows[1:]:
            item = WikiItem()
            item['agent'] = row.xpath('./td[1]/a/text()|./td[1]/text()').extract()
            item['org'] = row.xpath('./td[2]/text()').extract()
            item['link'] = row.xpath('./td[1]/a/@href').extract()
            item['produkt'] = row.xpath('./td[3]/text()').extract()
            items.append(item)
        return items
Settings:
BOT_NAME = 'Beurs'
SPIDER_MODULES = ['Beurs.spiders']
NEWSPIDER_MODULE = 'Beurs.spiders'
DOWNLOAD_HANDLERS = {
    's3': None,
}
DEPTH_LIMIT = 1
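
As a sanity check, before touching the spider you can confirm interactively that a link extractor matches anything on the start page at all. A minimal sketch in scrapy shell (the allow pattern here is my assumption, widened to https because the log below shows a 301 redirect to https):

scrapy shell "https://en.wikipedia.org/wiki/Mathematics"
>>> from scrapy.linkextractors import LinkExtractor
>>> links = LinkExtractor(allow=(r'https?://en\.wikipedia\.org/wiki/',)).extract_links(response)
>>> len(links), [l.url for l in links[:3]]

If extract_links returns an empty list, no Rule built on that extractor will ever schedule a request.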
And the log:
C:\Users\Jan Willem\Anaconda\Beurs>scrapy crawl BeursSpider
2015-11-07 15:14:36 [scrapy] INFO: Scrapy 1.0.3 started (bot: Beurs)
2015-11-07 15:14:36 [scrapy] INFO: Optional features available: ssl, http11, boto
2015-11-07 15:14:36 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'Beurs.spiders', 'SPIDER_MODULES': ['Beurs.spiders'], 'DEPTH_LIMIT': 1, 'BOT_NAME': 'Beurs'}
2015-11-07 15:14:36 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-11-07 15:14:36 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-11-07 15:14:36 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-11-07 15:14:36 [scrapy] INFO: Enabled item pipelines:
2015-11-07 15:14:36 [scrapy] INFO: Spider opened
2015-11-07 15:14:36 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-11-07 15:14:36 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-11-07 15:14:36 [scrapy] DEBUG: Redirecting (301) to <GET https://en.wikipedia.org/wiki/Mathematics> from <GET http://en.wikipedia.org/wiki/Mathematics>
2015-11-07 15:14:37 [scrapy] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/Mathematics> (referer: None)
2015-11-07 15:14:37 [scrapy] INFO: Closing spider (finished)
2015-11-07 15:14:37 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 530,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 60393,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/301': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 11, 7, 14, 14, 37, 274000),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2015, 11, 7, 14, 14, 36, 852000)}
2015-11-07 15:14:37 [scrapy] INFO: Spider closed (finished)
So basically your regex isn't quite right and your XPath needs some tweaking. I think the code below does what you want; give it a try and let me know if you need more help:
def parse_item(self, response):
    sel = Selector(response)
    rows = sel.xpath('//span[@class="innhold"]/table/tr')
    items = []
    for row in rows[1:]:
        item = WikiItem()
        item['agent'] = row.xpath('./td[1]/a/text()|./td[1]/text()').extract()
        item['org'] = row.xpath('./td[2]/text()').extract()
        item['link'] = row.xpath('./td[1]/a/@href').extract()
        item['produkt'] = row.xpath('./td[3]/text()').extract()
        items.append(item)
    return items
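
For what it's worth, the fix that usually matters here is not the XPath: CrawlSpider only reads a lowercase rules attribute, so the capitalised Rules in the question is silently ignored and nothing beyond the start URL is ever scheduled. In addition, restrict_xpaths should select elements rather than @href attributes, and the allow pattern has to survive the http-to-https redirect visible in the log. A minimal sketch along those lines (the firstHeading XPath is an assumption about Wikipedia's markup):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class WikiSpider(CrawlSpider):
    name = 'WikiSpider'
    allowed_domains = ['wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Mathematics']

    # CrawlSpider picks up `rules`, lowercase; `Rules` is ignored without any error.
    rules = (
        Rule(
            LinkExtractor(
                allow=(r'https?://en\.wikipedia\.org/wiki/',),  # tolerate the 301 to https
                restrict_xpaths=('//div[@class="mw-body"]',),   # an element, not an @href attribute
            ),
            callback='parse_item',
            follow=True,
        ),
    )

    def parse_item(self, response):
        # Assumption: Wikipedia article titles live in <h1 id="firstHeading">.
        return {'title': response.xpath('//h1[@id="firstHeading"]/text()').extract()}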
Thanks for the quick reply! I tried your adjustments but unfortunately couldn't get it to run. The problem remains: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min). It looks like the crawling itself simply isn't happening, and I don't understand why. Any suggestions?

So I changed the parsing part of the code as in one of the steps from the duplicate (see below), but I still get the same log: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min). Does anyone know what I can do?
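
One way to narrow this down from the command line is scrapy parse, which fetches a single URL and, with --rules, uses the CrawlSpider rules to pick the callback; if no rule matches (for example because the attribute is still named Rules instead of rules), it says so instead of silently crawling nothing. A sketch, assuming the spider is registered under the name WikiSpider:

scrapy parse --spider=WikiSpider --rules "https://en.wikipedia.org/wiki/Mathematics"

Note also that the log above was produced by scrapy crawl BeursSpider, while the spider in the question sets name = 'WikiSpider'; whichever name is correct, it must match what is passed to scrapy crawl.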