
Python Scrapy unable to scrape all the pages, scrapes only a few of them

python, web-scraping, scrapy, web-crawler

My goal is to scrape all the job titles, descriptions and company names for data science. When I wrote the code I was able to run it and extract information for 390 job postings, but then it stopped working.

# -*- coding: utf-8 -*-
import scrapy


class DataNewSpider(scrapy.Spider):
    name = 'data_new'
    allowed_domains = ['www.indeed.com']
    start_urls = ['https://www.indeed.com/jobs?q=data+science&l=united+states']

    def parse(self, response):
        # Grab the job cards of all 15 jobs on the page
        all_jobs = response.xpath('//div[@class="jobsearch-SerpJobCard unifiedRow row result"]')
        # Loop over the job cards to get each job description's specific URL
        for job in all_jobs:
            # Grab each href and join it with the domain to get a working URL for the individual job
            new_one = "https://www.indeed.com" + job.xpath('.//h2[@class="title"]//a/@href').extract_first()
            # Yield a request for each of the 15 jobs; parse_job extracts the company, JD and title
            yield scrapy.Request(new_one, callback=self.parse_job)
        # Locate the URL of the > (next page) button
        next_page_part_url = response.xpath('//ul[@class="pagination-list"]//a/@href')[4].extract()
        next_page_url = "https://www.indeed.com" + next_page_part_url
        # Request the next page and send it back to parse to iterate over that page's job cards
        yield scrapy.Request(next_page_url, callback=self.parse)

    def parse_job(self, response):
        # Grab the title, JD and company name on each individual job page
        title = response.xpath('//div/h1/text()').extract_first()
        description = response.xpath('//div[@class="jobsearch-jobDescriptionText"]/div').extract_first()
        company = response.xpath('normalize-space(//div[@class="icl-u-lg-mr--sm icl-u-xs-mr--xs"]//a/text())').extract_first()

        yield {"title": title,
               "description": description,
               "company": company}
{'title': None, 'description': None, 'company': ''}
2020-10-08 19:34:46 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.indeed.com/jobs?q=data+science&l=united+states&start=760> (referer: https://www.indeed.com/jobs?q=data+science&l=united+states&start=740)
Traceback (most recent call last):
  File "c:\users\naman jogani\anaconda3\lib\site-packages\scrapy\utils\defer.py", line 117, in iter_errback
    yield next(it)
  File "c:\users\naman jogani\anaconda3\lib\site-packages\scrapy\utils\python.py", line 345, in __next__
    return next(self.data)
  File "c:\users\naman jogani\anaconda3\lib\site-packages\scrapy\utils\python.py", line 345, in __next__
    return next(self.data)
  File "c:\users\naman jogani\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "c:\users\naman jogani\anaconda3\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "c:\users\naman jogani\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "c:\users\naman jogani\anaconda3\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 338, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "c:\users\naman jogani\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "c:\users\naman jogani\anaconda3\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\users\naman jogani\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "c:\users\naman jogani\anaconda3\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\users\naman jogani\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\Naman Jogani\Desktop\Scraping Web\new_project\new_project\spiders\data_new.py", line 20, in parse
    next_page_part_url = response.xpath('//ul[@class="pagination-list"]//a/@href')[4].extract()
  File "c:\users\naman jogani\anaconda3\lib\site-packages\parsel\selector.py", line 61, in __getitem__
    o = super(SelectorList, self).__getitem__(pos)
IndexError: list index out of range
2020-10-08 19:34:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.indeed.com/viewjob?jk=20ff275fd0002ac7&tk=1ek5926t6s7bo800&from=serp&vjs=3&advn=2500590964327171&adid=119415745&ad=-6NYlbfkN0AC5S5KfpcrE62cRuYLg6qW_HWiPjKHP06qk-AGfbwYtGlr3wcSMURH9oqKq1q2FCfIFI88nD74GAnmcodtEx0ly0z-i6QYTR6rnxSwencFYBNRiEDaNsFgEsSbsxf6sbxiCvlo2JDu2DQluXNkeZ-PtwhVU50dPVfZqnxskNE6uyHp49kYSDfdBILNIIzyuNAvnYcGV41jAsALU8ZEb3Xj6Oa_aI8BlDKAMjqctbq6BKf_qXsVk1VcE9q8XKvy2THBn6KWOaiNru0zoZUBgW5brqdOhWvxa2nqDXGzMYBl1g==&sjdu=9XwHWtDrCBG1A5A-9uyGom44uoQy3i9t-65IIlj9h2Pvd9afrAiOVKuIWGgsOlV0gEbaxIb2_2CXWfl8teiMCg>
{'title': None, 'description': None, 'company': ''}
2020-10-08 19:34:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://us.conv.indeed.com/pagead/clk?mo=f&ad=-6NYlbfkN0DYofz-HQ3EEUWGs4-6JIL8FK8s56G6XwZsvFVhglGWaM6Ni6eeMEGmGEz_ZHPmGR2KL5rTyonrCQSmvySIObJPy6udzOqTmpTIb37B7qTGi-tqxBaiwtZ8YXsGZvueB_Fc_72G3tvHyzqF46x8vi4WJOi5FX6_9ET9-C9tZ0ddalaKjCGjySbTsIoTXf5lvR66RIJtfVARAyvj8HSBmrC5vaoLHOA9XfYyUeli0rGi7Khg7jAR-Ye8hQrYWvS9P9sGi6KSCoNSqHUQe6Hla-L67zeqtGZ1ctNFAGBIi1RY6b3uSY-KbJ1p4Ha6kJbR6D67YQ8R8457Fhds6eWkP3fUUUJy3CiOy6lOB-9XES9iT3u_m865IDaPp7v0Mt1_QC3zJ6CyMsK1jvFPn1SFkCuVJLyoOe88ocLP-Skm0Qd4RHDwxiSBj-CPtIZrd02s3rIHgjUG_Z9Zp7FLrFNREvQm5HY2T0dtj8IfISBE77GtE5lDOlQmhi5E-gqmP9ZHL0RDfDrHUHtJBucLwDcPsX22WLoIIkBlRDBEDdOnzsmMpIRIY9NNoI-xqaQ1UFOMO6tgTRGOobyU6Jt-ZwR6-PWW_6oXHAAWgeFxu_oaeik8ya-gAvJL11Jrml8gaVROpya690Yx9IhfCcx-B9k7OyaB3XWZzRQqGa4LxQkolRmUj7isaetSKYdgTapmvkOGiSLMBW-2PvW8Pl1ng_BfaqhhHXr98ZY-uTRhw4p_sOUTsIzcY5wvG7i9BGtFvr28m66YF37n7cJjqaB0lst1wStua29h5avW3rOKcEoM-djuBA==&p=6&fvj=0&vjs=3&ctk=1ek58uqom4f4i800&ctkRcv=1&pcid=&wwwho=4m_xAU4HGQbAlJdZTkhlA8hEG_rFObn1&iast=1&dupclk=0&vjs=3>
{'title': None, 'description': None, 'company': ''}
2020-10-08 19:34:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.indeed.com/viewjob?cmp=U.S.-Xpress&t=Senior+Data+Scientist&jk=adcdb45c011a0c70&vjs=3>
{'title': None, 'description': None, 'company': ''}
2020-10-08 19:34:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.indeed.com/viewjob?cmp=Jayes-Tech-LLC&t=Data+Engineer&jk=7793bab0b0a95738&vjs=3>
{'title': None, 'description': None, 'company': ''}
2020-10-08 19:34:50 [scrapy.core.engine] INFO: Closing spider (finished)
2020-10-08 19:34:50 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1441718,
 'downloader/request_count': 1282,
 'downloader/request_method_count/GET': 1282,
 'downloader/response_bytes': 31114526,
 'downloader/response_count': 1282,
 'downloader/response_status_count/200': 553,
 'downloader/response_status_count/301': 42,
 'downloader/response_status_count/302': 687,
 'dupefilter/filtered': 56,
 'elapsed_time_seconds': 118.909014,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 10, 8, 23, 34, 50, 584878),
 'item_scraped_count': 514,
 'log_count/DEBUG': 1797,
 'log_count/ERROR': 1,
 'log_count/INFO': 11,
 'request_depth_max': 38,
 'response_received_count': 553,
 'scheduler/dequeued': 1282,
 'scheduler/dequeued/memory': 1282,
 'scheduler/enqueued': 1282,
 'scheduler/enqueued/memory': 1282,
 'spider_exceptions/IndexError': 1,
 'start_time': datetime.datetime(2020, 10, 8, 23, 32, 51, 675864)}
2020-10-08 19:34:50 [scrapy.core.engine] INFO: Spider closed (finished)
The log above shows its last run, where it starts throwing None, None and None for each of the fields and then abruptly stops. Any guidance or links would also help me understand how to solve the problem. Any help or suggestion is appreciated.

A … can get you more feedback and even help you solve the problem yourself.
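
For what it's worth, the IndexError in the traceback comes from the hard-coded [4] in the pagination XPath: on the last results page there are fewer than five pagination links, so the lookup raises and the crawl ends early. Below is a minimal sketch (not a verified fix) of the same spider with a guarded next-page lookup, using Scrapy's response.follow and .get(); the a[@aria-label="Next"] selector is an assumption about Indeed's markup at the time and may need adjusting against the live site:

import scrapy


class DataNewGuardedSpider(scrapy.Spider):
    # Hypothetical variant of the question's spider, for illustration only
    name = 'data_new_guarded'
    allowed_domains = ['www.indeed.com']
    start_urls = ['https://www.indeed.com/jobs?q=data+science&l=united+states']

    def parse(self, response):
        for job in response.xpath('//div[@class="jobsearch-SerpJobCard unifiedRow row result"]'):
            href = job.xpath('.//h2[@class="title"]//a/@href').get()
            if href:
                # response.follow resolves relative hrefs against the page URL,
                # so no manual string concatenation is needed
                yield response.follow(href, callback=self.parse_job)
        # .get() returns None instead of raising IndexError when nothing matches,
        # so the spider stops cleanly on the last page
        next_href = response.xpath('//ul[@class="pagination-list"]//a[@aria-label="Next"]/@href').get()
        if next_href:
            yield response.follow(next_href, callback=self.parse)

    def parse_job(self, response):
        yield {
            "title": response.xpath('//div/h1/text()').get(),
            "description": response.xpath('//div[@class="jobsearch-jobDescriptionText"]/div').get(),
            "company": response.xpath('normalize-space(//div[@class="icl-u-lg-mr--sm icl-u-xs-mr--xs"]//a/text())').get(),
        }

Note this only removes the crash. The {'title': None, 'description': None, 'company': ''} items come from responses that are not plain job pages: the log shows sponsored us.conv.indeed.com/pagead/clk?... redirects being scraped, and the stats report 687 HTTP 302 responses, so those URLs would still need to be filtered out or the item XPaths adjusted for them.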