Scrapy spider gets stuck in the middle of a crawl
I'm new to Scrapy, and I'm trying to build a spider that crawls a website and pulls out all of its phone numbers, email addresses, PDF links, and so on. I want it to follow every link from the home page so that it searches the whole domain. There are similar questions to this one, but none of them were resolved. Here is the code for my spider:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from mobilesuites.items import MobilesuitesItem
import re

class ExampleSpider(CrawlSpider):
    name = "hyatt"
    allowed_domains = ["hyatt.com"]
    start_urls = (
        'http://www.hyatt.com/',
    )

    # follow only non-javascript links
    rules = (
        Rule(SgmlLinkExtractor(deny=(r'.*\.jsp.*',)), follow=True, callback='parse_item'),
    )

    def parse_item(self, response):
        #self.log('The current url is %s' % response.url)
        selector = Selector(response)
        item = MobilesuitesItem()
        # get url
        item['url'] = response.url
        # get page title
        titles = selector.select("//title")
        for t in titles:
            item['title'] = t.select("./text()").extract()
        # get all phone numbers, emails, and pdf links
        text = response.body
        item['phone'] = '|'.join(re.findall(r'\d{3}[-\.\s]\d{3}[-\.\s]\d{4}|\(?\d{3}\)?[-\.\s]?\d{3}[-\.\s]\d{4}|\d{3}[-\.\s]\d{4}', text))
        item['email'] = '|'.join(re.findall(r"[^\s@]+@[^\s@]+\.[^\s@]+", text))
        item['pdfs'] = '|'.join(re.findall(r"[^\s\"<]*\.pdf[^\s\">]*", text))
        # check to see if dining is mentioned on the page
        item['dining'] = bool(re.findall(r"\s[dD]ining\s|\s[mM]enu\s|\s[bB]everage\s", text))
        return item
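The extraction patterns can be exercised on their own, outside Scrapy, which makes them easier to debug. A minimal sketch against a made-up snippet of page text (the sample string and URLs below are illustrative, not from the actual crawl):

```python
import re

# Same patterns the spider uses in parse_item, compiled once
PHONE_RE = re.compile(r'\d{3}[-\.\s]\d{3}[-\.\s]\d{4}'
                      r'|\(?\d{3}\)?[-\.\s]?\d{3}[-\.\s]\d{4}'
                      r'|\d{3}[-\.\s]\d{4}')
EMAIL_RE = re.compile(r'[^\s@]+@[^\s@]+\.[^\s@]+')
PDF_RE = re.compile(r'[^\s"<]*\.pdf[^\s">]*')

# Made-up sample of page text
sample = ('Call 312-555-1234 or write to info@example.com '
          '<a href="/files/menu.pdf">menu</a>')

phones = PHONE_RE.findall(sample)   # ['312-555-1234']
emails = EMAIL_RE.findall(sample)   # ['info@example.com']
pdfs = PDF_RE.findall(sample)       # ['/files/menu.pdf']

# The Rule's deny pattern skips any URL containing ".jsp"
DENY_RE = re.compile(r'.*\.jsp.*')
is_denied = bool(DENY_RE.search('http://hyatt.com/foo.jsp?x=1'))  # True
```

One caveat with the email pattern: it is greedy about trailing punctuation, so text like `info@example.com;` is captured with the semicolon attached.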
Here is the last part of the crawl log before it hangs:
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Scraped from <200 http://www.place.hyatt.com/en/hyattplace/eat-and-drink/24-7-gallery-menu.html>
{'email': '',
'phone': '',
'title': [u'24/7 Gallery Menu'],
'url': 'http://www.place.hyatt.com/en/hyattplace/eat-and-drink/24-7-gallery-menu.html'}
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Ignoring response <404 http://hyatt.com/gallery/thrive/siteMap.html>: HTTP status code is not handled or not allowed
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.hyatt.com/hyatt/pure/contact/> (referer: http://www.hyatt.com/hyatt/pure/?icamp=HY_HyattPure_HPLS)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.house.hyatt.com/en/hyatthouse/aboutus.html> (referer: http://www.house.hyatt.com/en/hyatthouse.html)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.place.hyatt.com/en/hyattplace/eat-and-drink/eat-and-drink.html> (referer: http://www.place.hyatt.com/en/hyattplace/eat-and-drink.html)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.park.hyatt.com/en/parkhyatt/newsandannouncements.html?icamp=park_hppsa_new_hotels> (referer: http://www.park.hyatt.com/en/parkhyatt.html)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.regency.hyatt.com/en/hyattregency/meetingsandevents.html> (referer: http://www.regency.hyatt.com/en/hyattregency.html)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.house.hyatt.com/en/hyatthouse/specialoffers.html> (referer: http://www.house.hyatt.com/en/hyatthouse.html)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.house.hyatt.com/en/hyatthouse/locations.html> (referer: http://www.house.hyatt.com/en/hyatthouse.html)
Comment: Show your crawl log.
Reply: It looks like it crawls a bunch of pages just fine and then hangs. The last part of the crawl log is shown above; the rest of it looks much the same.
Comment: The most important part is the stats summary logged at the end of the crawl. Please post that.
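To capture the closing stats summary the commenter is asking for, and to inspect the engine while it appears stuck, something along these lines may help. This is a sketch: the log file name is arbitrary, and it assumes Scrapy's telnet console extension is enabled, which it is by default:

```shell
# Write the full crawl log to a file so the final stats summary is captured
scrapy crawl hyatt -s LOG_FILE=hyatt.log

# While the spider appears hung, attach to the telnet console
# (enabled by default on port 6023) and dump the engine state:
telnet localhost 6023
# at the console prompt, run:
#   est()    # prints engine status: pending/active requests, open spiders, etc.
```

If `est()` shows requests sitting in the downloader for a long time, the crawl is likely stalled on slow or unresponsive responses rather than finished.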