Scrapy spider gets stuck in the middle of a crawl
I'm new to Scrapy, and I'm trying to build a spider that crawls a website and pulls out all of its phone numbers, email addresses, PDF links, and so on. I want it to follow every link from the home page so that it searches the whole domain. There are similar questions to this one, but none of them were resolved. Here is the code for my spider:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from mobilesuites.items import MobilesuitesItem
import re

class ExampleSpider(CrawlSpider):
    name = "hyatt"
    allowed_domains = ["hyatt.com"]
    start_urls = (
        'http://www.hyatt.com/',
    )

    # follow only non-javascript links
    rules = (
        Rule(SgmlLinkExtractor(deny=(r'.*\.jsp.*',)), follow=True, callback='parse_item'),
    )

    def parse_item(self, response):
        #self.log('The current url is %s' % response.url)
        selector = Selector(response)
        item = MobilesuitesItem()
        # get url
        item['url'] = response.url
        # get page title
        titles = selector.select("//title")
        for t in titles:
            item['title'] = t.select("./text()").extract()
        # get all phone numbers, emails, and pdf links
        text = response.body
        item['phone'] = '|'.join(re.findall(r'\d{3}[-\.\s]\d{3}[-\.\s]\d{4}|\(?\d{3}\)?[-\.\s]?\d{3}[-\.\s]\d{4}|\d{3}[-\.\s]\d{4}', text))
        item['email'] = '|'.join(re.findall(r"[^\s@]+@[^\s@]+\.[^\s@]+", text))
        item['pdfs'] = '|'.join(re.findall(r"[^\s\"<]*\.pdf[^\s\">]*", text))
        # check to see if dining is mentioned on the page
        item['dining'] = bool(re.findall(r"\s[dD]ining\s|\s[mM]enu\s|\s[bB]everage\s", text))
        return item
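The extraction patterns can be exercised on their own, outside Scrapy, which makes them easier to debug. A minimal sketch against a made-up snippet of page text (the sample string and URLs below are illustrative, not from the actual crawl):

```python
import re

# Same patterns the spider uses in parse_item, compiled once
PHONE_RE = re.compile(r'\d{3}[-\.\s]\d{3}[-\.\s]\d{4}'
                      r'|\(?\d{3}\)?[-\.\s]?\d{3}[-\.\s]\d{4}'
                      r'|\d{3}[-\.\s]\d{4}')
EMAIL_RE = re.compile(r'[^\s@]+@[^\s@]+\.[^\s@]+')
PDF_RE = re.compile(r'[^\s"<]*\.pdf[^\s">]*')

# Made-up sample of page text
sample = ('Call 312-555-1234 or write to info@example.com '
          '<a href="/files/menu.pdf">menu</a>')

phones = PHONE_RE.findall(sample)   # ['312-555-1234']
emails = EMAIL_RE.findall(sample)   # ['info@example.com']
pdfs = PDF_RE.findall(sample)       # ['/files/menu.pdf']

# The Rule's deny pattern skips any URL containing ".jsp"
DENY_RE = re.compile(r'.*\.jsp.*')
is_denied = bool(DENY_RE.search('http://hyatt.com/foo.jsp?x=1'))  # True
```

One caveat with the email pattern: it is greedy about trailing punctuation, so text like `info@example.com;` is captured with the semicolon attached.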
Here is the last part of the crawl log before it hangs:
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Scraped from <200 http://www.place.hyatt.com/en/hyattplace/eat-and-drink/24-7-gallery-menu.html>
{'email': '',
'phone': '',
'title': [u'24/7 Gallery Menu'],
'url': 'http://www.place.hyatt.com/en/hyattplace/eat-and-drink/24-7-gallery-menu.html'}
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Ignoring response <404 http://hyatt.com/gallery/thrive/siteMap.html>: HTTP status code is not handled or not allowed
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.hyatt.com/hyatt/pure/contact/> (referer: http://www.hyatt.com/hyatt/pure/?icamp=HY_HyattPure_HPLS)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.house.hyatt.com/en/hyatthouse/aboutus.html> (referer: http://www.house.hyatt.com/en/hyatthouse.html)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.place.hyatt.com/en/hyattplace/eat-and-drink/eat-and-drink.html> (referer: http://www.place.hyatt.com/en/hyattplace/eat-and-drink.html)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.park.hyatt.com/en/parkhyatt/newsandannouncements.html?icamp=park_hppsa_new_hotels> (referer: http://www.park.hyatt.com/en/parkhyatt.html)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.regency.hyatt.com/en/hyattregency/meetingsandevents.html> (referer: http://www.regency.hyatt.com/en/hyattregency.html)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.house.hyatt.com/en/hyatthouse/specialoffers.html> (referer: http://www.house.hyatt.com/en/hyatthouse.html)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.house.hyatt.com/en/hyatthouse/locations.html> (referer: http://www.house.hyatt.com/en/hyatthouse.html)
Comment: Show your crawl log.
Reply: It looks like it crawls a bunch of pages just fine and then hangs. The last part of the crawl log is shown above; the rest of it looks much the same.
Comment: The most important part is the stats summary logged at the end of the crawl. Please post that.
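To capture the closing stats summary the commenter is asking for, and to inspect the engine while it appears stuck, something along these lines may help. This is a sketch: the log file name is arbitrary, and it assumes Scrapy's telnet console extension is enabled, which it is by default:

```shell
# Write the full crawl log to a file so the final stats summary is captured
scrapy crawl hyatt -s LOG_FILE=hyatt.log

# While the spider appears hung, attach to the telnet console
# (enabled by default on port 6023) and dump the engine state:
telnet localhost 6023
# at the console prompt, run:
#   est()    # prints engine status: pending/active requests, open spiders, etc.
```

If `est()` shows requests sitting in the downloader for a long time, the crawl is likely stalled on slow or unresponsive responses rather than finished.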