Web scraping: unable to proceed with scraping or crawling
I am trying to get data from this (an example page), but it is not working. I don't know why it keeps telling me that an offsite request to another website was filtered and that the referer is None. I just want to get the job title, location, and link. Anyway, here is my code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem


class MySpider(CrawlSpider):
    name = "meridian"
    allowed_domains = ["careers-meridianhealth.icims.com"]
    start_urls = ["https://careers-meridianhealth.icims.com"]
    rules = (Rule (SgmlLinkExtractor(deny = path_deny_base, allow=('\d+'),restrict_xpaths=('*'))
    , callback="parse_items", follow= True),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//div[2]/h1')
        linker = hxs.select('//div[2]/div[8]/a[1]')
        loc_Con = hxs.select('//div[2]/span/span/span[1]')
        loc_Reg = hxs.select('//div[2]/span/span/span[2]')
        loc_Loc = hxs.select('//div[2]/span/span/span[3]')
        items = []
        for titles in titles:
            item = CraigslistSampleItem()
            #item ["job_id"] = id.select('text()').extract()[0].strip()
            item ["title"] = map(unicode.strip, titles.select('text()').extract()) #ok
            item ["link"] = linker.select('@href').extract() #ok
            item ["info"] = (response.url)
            temp1 = loc_Con.select('text()').extract()
            temp2 = loc_Reg.select('text()').extract()
            temp3 = loc_Loc.select('text()').extract()
            temp1 = temp1[0] if temp1 else ""
            temp2 = temp2[0] if temp2 else ""
            temp3 = temp3[0] if temp3 else ""
            item["code"] = "{0}-{1}-{2}".format(temp1, temp2, temp3)
            items.append(item)
        return(items)
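If you check your link extractor with scrapy shell, you will see that your start URL only contains links to websites that are not under "careers-meridianhealth.icims.com". You can change your rules, add more domains to the allowed_domains attribute, or not define allowed_domains at all (so every domain is crawled, which may mean crawling a very large number of pages).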
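To make the "filtered offsite request" message more concrete, here is a simplified sketch of the kind of host check an offsite filter performs. It is not Scrapy's actual code: the function name `is_offsite` is hypothetical, it uses only the Python 3 standard library, and it assumes a host counts as allowed when it equals an entry in allowed_domains or is a subdomain of one.

```python
from urllib.parse import urlparse


def is_offsite(url, allowed_domains):
    """Return True if the URL's host is neither an allowed domain nor a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return not any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )


allowed = ["careers-meridianhealth.icims.com"]

# Same host as the allowed domain: the request would be kept.
print(is_offsite("https://careers-meridianhealth.icims.com/jobs/intro?in_iframe=1", allowed))

# Different host under icims.com: the request would be filtered as offsite.
print(is_offsite("https://media.icims.com/training/candidatefaq/faq.html", allowed))
```

Under this assumption, adding "icims.com" to the list would let the media.icims.com and www.icims.com links through as well, which is the trade-off mentioned above.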
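But if you look closely at the page source, you will notice that it contains an iframe, and if you follow the links from there, you will find https://careers-meridianhealth.icims.com/jobs/search?hashed=0&in_iframe=1&searchCategory=&searchLocation=&ss=1
which contains the individual job postings: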
paul@paul:~/tmp/stackoverflow$ scrapy shell https://careers-meridianhealth.icims.com
In [1]: sel.xpath('.//iframe/@src')
Out[1]: [<Selector xpath='.//iframe/@src' data=u'https://careers-meridianhealth.icims.com'>]
In [2]: sel.xpath('.//iframe/@src').extract()
Out[2]: [u'https://careers-meridianhealth.icims.com/?in_iframe=1']
In [3]: fetch('https://careers-meridianhealth.icims.com/?in_iframe=1')
2014-05-21 11:53:14+0200 [default] DEBUG: Redirecting (302) to <GET https://careers-meridianhealth.icims.com/jobs?in_iframe=1> from <GET https://careers-meridianhealth.icims.com/?in_iframe=1>
2014-05-21 11:53:14+0200 [default] DEBUG: Redirecting (302) to <GET https://careers-meridianhealth.icims.com/jobs/intro?in_iframe=1&hashed=0&in_iframe=1> from <GET https://careers-meridianhealth.icims.com/jobs?in_iframe=1>
2014-05-21 11:53:14+0200 [default] DEBUG: Crawled (200) <GET https://careers-meridianhealth.icims.com/jobs/intro?in_iframe=1&hashed=0&in_iframe=1> (referer: None)
In [4]: from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
In [5]: lx = SgmlLinkExtractor()
In [6]: lx.extract_links(response)
Out[6]:
[Link(url='https://careers-meridianhealth.icims.com/jobs/login?back=intro&hashed=0&in_iframe=1', text=u'submit your resume', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/search?hashed=0&in_iframe=1&searchCategory=&searchLocation=&ss=1', text=u'view all open job positions', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/reminder?hashed=0&in_iframe=1', text=u'Reset Password', fragment='', nofollow=False),
Link(url='https://media.icims.com/training/candidatefaq/faq.html', text=u'Need further assistance?', fragment='', nofollow=False),
Link(url='http://www.icims.com/platform_help?utm_campaign=platform+help&utm_content=page1&utm_medium=link&utm_source=platform', text=u'Applicant Tracking Software', fragment='', nofollow=False)]
In [7]: fetch('https://careers-meridianhealth.icims.com/jobs/search?hashed=0&in_iframe=1&searchCategory=&searchLocation=&ss=1')
2014-05-21 11:54:24+0200 [default] DEBUG: Crawled (200) <GET https://careers-meridianhealth.icims.com/jobs/search?hashed=0&in_iframe=1&searchCategory=&searchLocation=&ss=1> (referer: None)
In [8]: lx.extract_links(response)
Out[8]:
[Link(url='https://careers-meridianhealth.icims.com/jobs/search?in_iframe=1&pr=1', text=u'', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5196/licensed-practical-nurse/job?in_iframe=1', text=u'LICENSED PRACTICAL NURSE', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5192/certified-nursing-assistant/job?in_iframe=1', text=u'CERTIFIED NURSING ASSISTANT', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5191/receptionist/job?in_iframe=1', text=u'RECEPTIONIST', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5190/rehabilitation-aide/job?in_iframe=1', text=u'REHABILITATION AIDE', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5188/nurse-supervisor/job?in_iframe=1', text=u'NURSE SUPERVISOR', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5164/lpn/job?in_iframe=1', text=u'LPN', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5161/speech-pathologist-per-diem/job?in_iframe=1', text=u'SPEECH PATHOLOGIST PER DIEM', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5160/social-worker-part-time/job?in_iframe=1', text=u'SOCIAL WORKER PART TIME', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5154/client-care-coordinator-nights/job?in_iframe=1', text=u'CLIENT CARE COORDINATOR NIGHTS', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5153/greeter/job?in_iframe=1', text=u'GREETER', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5152/welcome-ambassador/job?in_iframe=1', text=u'WELCOME AMBASSADOR', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5146/certified-medical-assistant-i/job?in_iframe=1', text=u'CERTIFIED MEDICAL ASSISTANT I', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5142/registered-nurse-full-time/job?in_iframe=1', text=u'REGISTERED NURSE FULL TIME', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5139/part-time-home-health-aide/job?in_iframe=1', text=u'PART TIME HOME HEALTH AIDE', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5136/rehabilitation-tech/job?in_iframe=1', text=u'REHABILITATION TECH', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5127/registered-nurse/job?in_iframe=1', text=u'REGISTERED NURSE', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5123/dietary-aide/job?in_iframe=1', text=u'DIETARY AIDE', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5121/tcu-administrator-%5Btransitional-care-unit%5D/job?in_iframe=1', text=u'TCU ADMINISTRATOR [TRANSITIONAL CARE UNIT]', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5119/mds-coordinator/job?in_iframe=1', text=u'MDS Coordinator', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5108/per-diem-patient-service-tech/job?in_iframe=1', text=u'Per Diem PATIENT SERVICE TECH', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/intro?in_iframe=1', text=u'Go back to the welcome page', fragment='', nofollow=False),
Link(url='https://media.icims.com/training/candidatefaq/faq.html', text=u'Need further assistance?', fragment='', nofollow=False),
Link(url='http://www.icims.com/platform_help?utm_campaign=platform+help&utm_content=page1&utm_medium=link&utm_source=platform', text=u'Applicant Tracking Software', fragment='', nofollow=False)]
In [9]:
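The job-posting links in Out[8] all follow the shape /jobs/<number>/<slug>/job, while the search, intro, and help links do not. As a rough sketch (Python 3, standard library only), the snippet below picks out the posting URLs from a list of links and splits them into an ID and a slug. The names `JOB_URL` and `parse_job_links` are hypothetical, and the pattern is an assumption based only on the URLs shown in this session.

```python
import re

# Assumed pattern: /jobs/<numeric id>/<slug>/job, as seen in the shell output.
JOB_URL = re.compile(r"/jobs/(\d+)/([^/]+)/job")


def parse_job_links(urls):
    """Keep only URLs that look like individual job postings."""
    jobs = []
    for url in urls:
        match = JOB_URL.search(url)
        if match:
            jobs.append({
                "job_id": match.group(1),
                "slug": match.group(2),
                "link": url,
            })
    return jobs


sample = [
    "https://careers-meridianhealth.icims.com/jobs/search?in_iframe=1&pr=1",
    "https://careers-meridianhealth.icims.com/jobs/5196/licensed-practical-nurse/job?in_iframe=1",
    "https://careers-meridianhealth.icims.com/jobs/intro?in_iframe=1",
    "https://media.icims.com/training/candidatefaq/faq.html",
]

for job in parse_job_links(sample):
    print(job["job_id"], job["slug"])  # prints: 5196 licensed-practical-nurse
```

A regular expression of this shape could probably also be passed as the `allow` argument of the link extractor in the spider's rule, in place of the much broader '\d+', so that only posting pages reach the callback.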