Web scraping: unable to proceed with scraping or crawling

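I am trying to get data from this (a sample page), but it isn't working. I don't know why it keeps telling me "Filtered offsite request" to another website, with the referer shown as None. I just want to get the job title, location, and link. Anyway, here is my code: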

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem

class MySpider(CrawlSpider):
    name = "meridian"
    allowed_domains = ["careers-meridianhealth.icims.com"]
    start_urls = ["https://careers-meridianhealth.icims.com"]

    # rules and parse_items must be indented inside the class,
    # otherwise CrawlSpider never sees them.
    # NOTE: path_deny_base is not defined in the snippet as posted.
    rules = (
        Rule(SgmlLinkExtractor(deny=path_deny_base, allow=(r'\d+',),
                               restrict_xpaths=('*',)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//div[2]/h1')
        linker = hxs.select('//div[2]/div[8]/a[1]')
        # The three spans that together make up the location
        loc_Con = hxs.select('//div[2]/span/span/span[1]')
        loc_Reg = hxs.select('//div[2]/span/span/span[2]')
        loc_Loc = hxs.select('//div[2]/span/span/span[3]')
        items = []
        for title in titles:  # avoid shadowing the titles list
            item = CraigslistSampleItem()
            #item["job_id"] = id.select('text()').extract()[0].strip()
            item["title"] = map(unicode.strip, title.select('text()').extract())  # ok
            item["link"] = linker.select('@href').extract()  # ok
            item["info"] = response.url
            temp1 = loc_Con.select('text()').extract()
            temp2 = loc_Reg.select('text()').extract()
            temp3 = loc_Loc.select('text()').extract()
            # Fall back to an empty string when a location part is missing
            temp1 = temp1[0] if temp1 else ""
            temp2 = temp2[0] if temp2 else ""
            temp3 = temp3[0] if temp3 else ""
            item["code"] = "{0}-{1}-{2}".format(temp1, temp2, temp3)
            items.append(item)
        return items

If you check your link extractor in the scrapy shell, you'll find that your start URL only contains links to websites that are not under "careers-meridianhealth.icims.com".

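You can change your rules and add more domains to the allowed_domains attribute, or not define allowed_domains at all (so all domains will be crawled, which may mean crawling a very large number of pages).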
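
To make the "Filtered offsite request" message more concrete, here is a rough sketch of the kind of host-name check that allowed_domains implies. It is plain Python 3 using only the standard library, not Scrapy's actual implementation, and `is_offsite` is a hypothetical helper name introduced for illustration.

```python
from urllib.parse import urlparse


def is_offsite(url, allowed_domains):
    """Return True when url's host is outside every allowed domain.

    Hypothetical helper: it only approximates the idea behind
    allowed_domains and is not taken from Scrapy's source.
    """
    # With no allowed domains configured, nothing is treated as offsite.
    if not allowed_domains:
        return False
    host = (urlparse(url).hostname or "").lower()
    for domain in allowed_domains:
        domain = domain.lower()
        # Accept the domain itself and any of its subdomains.
        if host == domain or host.endswith("." + domain):
            return False
    return True


allowed = ["careers-meridianhealth.icims.com"]
print(is_offsite("https://careers-meridianhealth.icims.com/jobs/intro?in_iframe=1", allowed))  # False
print(is_offsite("https://media.icims.com/training/candidatefaq/faq.html", allowed))  # True
```

Under this reading, a link to media.icims.com is dropped because its host is neither the allowed domain nor one of its subdomains, while listing "icims.com" instead would let both hosts through.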

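But if you look closely at the page source, you'll notice that it contains an iframe, and if you follow the links, you'll find https://careers-meridianhealth.icims.com/jobs/search?hashed=0&in_iframe=1&searchCategory=&searchLocation=&ss=1 which contains the individual job postings: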

paul@paul:~/tmp/stackoverflow$ scrapy shell https://careers-meridianhealth.icims.com

In [1]: sel.xpath('.//iframe/@src')
Out[1]: [<Selector xpath='.//iframe/@src' data=u'https://careers-meridianhealth.icims.com'>]

In [2]: sel.xpath('.//iframe/@src').extract()
Out[2]: [u'https://careers-meridianhealth.icims.com/?in_iframe=1']

In [3]: fetch('https://careers-meridianhealth.icims.com/?in_iframe=1')
2014-05-21 11:53:14+0200 [default] DEBUG: Redirecting (302) to <GET https://careers-meridianhealth.icims.com/jobs?in_iframe=1> from <GET https://careers-meridianhealth.icims.com/?in_iframe=1>
2014-05-21 11:53:14+0200 [default] DEBUG: Redirecting (302) to <GET https://careers-meridianhealth.icims.com/jobs/intro?in_iframe=1&amp;hashed=0&in_iframe=1> from <GET https://careers-meridianhealth.icims.com/jobs?in_iframe=1>
2014-05-21 11:53:14+0200 [default] DEBUG: Crawled (200) <GET https://careers-meridianhealth.icims.com/jobs/intro?in_iframe=1&amp;hashed=0&in_iframe=1> (referer: None)

In [4]: from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

In [5]: lx = SgmlLinkExtractor()

In [6]: lx.extract_links(response)
Out[6]: 
[Link(url='https://careers-meridianhealth.icims.com/jobs/login?back=intro&hashed=0&in_iframe=1', text=u'submit your resume', fragment='', nofollow=False),
 Link(url='https://careers-meridianhealth.icims.com/jobs/search?hashed=0&in_iframe=1&searchCategory=&searchLocation=&ss=1', text=u'view all open job positions', fragment='', nofollow=False),
 Link(url='https://careers-meridianhealth.icims.com/jobs/reminder?hashed=0&in_iframe=1', text=u'Reset Password', fragment='', nofollow=False),
 Link(url='https://media.icims.com/training/candidatefaq/faq.html', text=u'Need further assistance?', fragment='', nofollow=False),
 Link(url='http://www.icims.com/platform_help?utm_campaign=platform+help&utm_content=page1&utm_medium=link&utm_source=platform', text=u'Applicant Tracking Software', fragment='', nofollow=False)]

In [7]: fetch('https://careers-meridianhealth.icims.com/jobs/search?hashed=0&in_iframe=1&searchCategory=&searchLocation=&ss=1')
2014-05-21 11:54:24+0200 [default] DEBUG: Crawled (200) <GET https://careers-meridianhealth.icims.com/jobs/search?hashed=0&in_iframe=1&searchCategory=&searchLocation=&ss=1> (referer: None)

In [8]: lx.extract_links(response)
Out[8]: 
[Link(url='https://careers-meridianhealth.icims.com/jobs/search?in_iframe=1&pr=1', text=u'', fragment='', nofollow=False),
 Link(url='https://careers-meridianhealth.icims.com/jobs/5196/licensed-practical-nurse/job?in_iframe=1', text=u'LICENSED PRACTICAL NURSE', fragment='', nofollow=False),
 Link(url='https://careers-meridianhealth.icims.com/jobs/5192/certified-nursing-assistant/job?in_iframe=1', text=u'CERTIFIED NURSING ASSISTANT', fragment='', nofollow=False),
 Link(url='https://careers-meridianhealth.icims.com/jobs/5191/receptionist/job?in_iframe=1', text=u'RECEPTIONIST', fragment='', nofollow=False),
 Link(url='https://careers-meridianhealth.icims.com/jobs/5190/rehabilitation-aide/job?in_iframe=1', text=u'REHABILITATION AIDE', fragment='', nofollow=False),
 Link(url='https://careers-meridianhealth.icims.com/jobs/5188/nurse-supervisor/job?in_iframe=1', text=u'NURSE SUPERVISOR', fragment='', nofollow=False),
 Link(url='https://careers-meridianhealth.icims.com/jobs/5164/lpn/job?in_iframe=1', text=u'LPN', fragment='', nofollow=False),
 Link(url='https://careers-meridianhealth.icims.com/jobs/5161/speech-pathologist-per-diem/job?in_iframe=1', text=u'SPEECH PATHOLOGIST PER DIEM', fragment='', nofollow=False),
 Link(url='https://careers-meridianhealth.icims.com/jobs/5160/social-worker-part-time/job?in_iframe=1', text=u'SOCIAL WORKER PART TIME', fragment='', nofollow=False),
 Link(url='https://careers-meridianhealth.icims.com/jobs/5154/client-care-coordinator-nights/job?in_iframe=1', text=u'CLIENT CARE COORDINATOR NIGHTS', fragment='', nofollow=False),
 Link(url='https://careers-meridianhealth.icims.com/jobs/5153/greeter/job?in_iframe=1', text=u'GREETER', fragment='', nofollow=False),
 Link(url='https://careers-meridianhealth.icims.com/jobs/5152/welcome-ambassador/job?in_iframe=1', text=u'WELCOME AMBASSADOR', fragment='', nofollow=False),
 Link(url='https://careers-meridianhealth.icims.com/jobs/5146/certified-medical-assistant-i/job?in_iframe=1', text=u'CERTIFIED MEDICAL ASSISTANT I', fragment='', nofollow=False),
 Link(url='https://careers-meridianhealth.icims.com/jobs/5142/registered-nurse-full-time/job?in_iframe=1', text=u'REGISTERED NURSE FULL TIME', fragment='', nofollow=False),
 Link(url='https://careers-meridianhealth.icims.com/jobs/5139/part-time-home-health-aide/job?in_iframe=1', text=u'PART TIME HOME HEALTH AIDE', fragment='', nofollow=False),
 Link(url='https://careers-meridianhealth.icims.com/jobs/5136/rehabilitation-tech/job?in_iframe=1', text=u'REHABILITATION TECH', fragment='', nofollow=False),
 Link(url='https://careers-meridianhealth.icims.com/jobs/5127/registered-nurse/job?in_iframe=1', text=u'REGISTERED NURSE', fragment='', nofollow=False),
 Link(url='https://careers-meridianhealth.icims.com/jobs/5123/dietary-aide/job?in_iframe=1', text=u'DIETARY AIDE', fragment='', nofollow=False),
 Link(url='https://careers-meridianhealth.icims.com/jobs/5121/tcu-administrator-%5Btransitional-care-unit%5D/job?in_iframe=1', text=u'TCU ADMINISTRATOR [TRANSITIONAL CARE UNIT]', fragment='', nofollow=False),
 Link(url='https://careers-meridianhealth.icims.com/jobs/5119/mds-coordinator/job?in_iframe=1', text=u'MDS Coordinator', fragment='', nofollow=False),
 Link(url='https://careers-meridianhealth.icims.com/jobs/5108/per-diem-patient-service-tech/job?in_iframe=1', text=u'Per Diem PATIENT SERVICE TECH', fragment='', nofollow=False),
 Link(url='https://careers-meridianhealth.icims.com/jobs/intro?in_iframe=1', text=u'Go back to the welcome page', fragment='', nofollow=False),
 Link(url='https://media.icims.com/training/candidatefaq/faq.html', text=u'Need further assistance?', fragment='', nofollow=False),
 Link(url='http://www.icims.com/platform_help?utm_campaign=platform+help&utm_content=page1&utm_medium=link&utm_source=platform', text=u'Applicant Tracking Software', fragment='', nofollow=False)]

In [9]: 
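
The links returned in Out[8] mix individual job postings with navigation and help pages. As a minimal sketch, assuming the postings always follow the /jobs/<id>/<slug>/job path seen above, a stricter pattern than the question's `\d+` could separate them. This is plain Python 3 with the standard library only; `JOB_PATH` and `parse_job_url` are hypothetical names, and the sample URLs are copied from the shell output.

```python
import re
from urllib.parse import urlparse

# Assumed shape of a job posting path: /jobs/<numeric id>/<slug>/job
JOB_PATH = re.compile(r"^/jobs/(\d+)/([^/]+)/job$")


def parse_job_url(url):
    """Return job id, slug and link for a posting URL, or None otherwise."""
    match = JOB_PATH.match(urlparse(url).path)
    if match is None:
        return None
    job_id, slug = match.groups()
    return {"job_id": int(job_id), "slug": slug, "link": url}


sample_urls = [
    "https://careers-meridianhealth.icims.com/jobs/search?in_iframe=1&pr=1",
    "https://careers-meridianhealth.icims.com/jobs/5196/licensed-practical-nurse/job?in_iframe=1",
    "https://careers-meridianhealth.icims.com/jobs/5121/tcu-administrator-%5Btransitional-care-unit%5D/job?in_iframe=1",
    "https://careers-meridianhealth.icims.com/jobs/intro?in_iframe=1",
    "https://media.icims.com/training/candidatefaq/faq.html",
]

# Keep only the URLs that look like individual postings.
jobs = [job for job in map(parse_job_url, sample_urls) if job is not None]
for job in jobs:
    print(job["job_id"], job["slug"])
# 5196 licensed-practical-nurse
# 5121 tcu-administrator-%5Btransitional-care-unit%5D
```

The same regular expression could be passed as the allow argument of a link extractor, so that only posting pages reach the parse_items callback; whether that suits the site's pagination links is left as an open question here.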