Python: scraping ASP.NET pages with multiple sub-pages / no sub-pages: yield inside an if-else statement

Here is the spyder.py file:

import scrapy
from scrapy_spider.items import JobsItem


class JobSpider(scrapy.Spider): 
    
    name = 'burzarada' 
    start_urls = ['https://burzarada.hzz.hr/Posloprimac_RadnaMjesta.aspx'] 
    download_delay = 1.5 

    def parse(self, response): 

        for href in response.css('div.NKZbox > div.KategorijeBox > a ::attr(href)').extract(): 

            eventTarget = href.replace("javascript:__doPostBack('", "").replace("','')", "")
            eventArgument = response.css('#__EVENTARGUMENT::attr(value)').extract()
            lastFocus = response.css('#__LASTFOCUS::attr(value)').extract()
            viewState = response.css('#__VIEWSTATE::attr(value)').extract()
            viewStateGenerator = response.css('#__VIEWSTATEGENERATOR::attr(value)').extract()
            viewStateEncrypted = response.css('#__VIEWSTATEENCRYPTED::attr(value)').extract()

            yield scrapy.FormRequest( 

                'https://burzarada.hzz.hr/Posloprimac_RadnaMjesta.aspx', 

                formdata = { 
                    '__EVENTTARGET': eventTarget, 
                    '__EVENTARGUMENT': eventArgument, 
                    '__LASTFOCUS': lastFocus, 
                    '__VIEWSTATE': viewState, 
                    '__VIEWSTATEGENERATOR': viewStateGenerator,
                    '__VIEWSTATEENCRYPTED': viewStateEncrypted,
                },

                callback=self.parse_category 
            )
            
            
    def parse_category(self, response): 
        
        href = response.xpath('//select[@id="ctl00_MainContent_ddlPageSize"]').extract()
       
        eventTarget = "ctl00$MainContent$ddlPageSize"
        eventArgument = response.css('#__EVENTARGUMENT::attr(value)').extract()
        lastFocus = response.css('#__LASTFOCUS::attr(value)').extract()
        viewState = response.css('#__VIEWSTATE::attr(value)').extract()
        viewStateGenerator = response.css('#__VIEWSTATEGENERATOR::attr(value)').extract()
        viewStateEncrypted = response.css('#__VIEWSTATEENCRYPTED::attr(value)').extract()
        pageSize = '75'
        sort = '0'

        yield scrapy.FormRequest( 

            'https://burzarada.hzz.hr/Posloprimac_RadnaMjesta.aspx', 

            formdata = { 
                '__EVENTTARGET': eventTarget, 
                '__EVENTARGUMENT': eventArgument, 
                '__LASTFOCUS': lastFocus, 
                '__VIEWSTATE': viewState, 
                '__VIEWSTATEGENERATOR': viewStateGenerator,
                '__VIEWSTATEENCRYPTED': viewStateEncrypted,
                'ctl00$MainContent$ddlPageSize': pageSize,
                'ctl00$MainContent$ddlSort': sort,
            },

            callback=self.parse_multiple_pages 
        )
        
    
    def parse_multiple_pages(self, response):

        hrefs = response.xpath('//*[@id="ctl00_MainContent_gwSearch"]//tr[last()]//li/a/@href').extract()
        
        ##################################
        # Here is the part of problem

        if len(hrefs) != 0: # yield statement
            
            for href in hrefs:

                eventTarget = href.replace("javascript:__doPostBack('", "").replace("','')", "")
                eventArgument = response.css('#__EVENTARGUMENT::attr(value)').extract()
                lastFocus = response.css('#__LASTFOCUS::attr(value)').extract()
                viewState = response.css('#__VIEWSTATE::attr(value)').extract()
                viewStateGenerator = response.css('#__VIEWSTATEGENERATOR::attr(value)').extract()
                viewStateEncrypted = response.css('#__VIEWSTATEENCRYPTED::attr(value)').extract()
                pageSize = '75'
                sort = '0'

                print(eventTarget)

                yield scrapy.FormRequest( 

                    'https://burzarada.hzz.hr/Posloprimac_RadnaMjesta.aspx', 

                    formdata = { 
                        '__EVENTTARGET': eventTarget, 
                        '__EVENTARGUMENT': eventArgument, 
                        '__LASTFOCUS': lastFocus, 
                        '__VIEWSTATE': viewState, 
                        '__VIEWSTATEGENERATOR': viewStateGenerator,
                        '__VIEWSTATEENCRYPTED': viewStateEncrypted,
                        'ctl00$MainContent$ddlPageSize': pageSize,
                        'ctl00$MainContent$ddlSort': sort,
                    },

                    callback=self.parse_links 
                    
                )
        
        else: # another yield

            for link in links:

                link = 'https://burzarada.hzz.hr/' + link

                yield scrapy.Request(url=link, callback=self.parse_job)
        ##########################################
    def parse_links(self, response):

        links = response.xpath('//a[@class="TitleLink"]/@href').extract()

        for link in links:

            link = 'https://burzarada.hzz.hr/' + link
            yield scrapy.Request(url=link, callback=self.parse_job)

    def parse_job(self, response):

        item = JobsItem()

        item['url'] = ''
        item['title'] = ''
        item['workplace'] = ''
        item['required_workers'] = ''
        item['type_of_employment'] = ''
        item['working_hours'] = ''
        item['mode_of_operation'] = ''
        item['accomodation'] = ''
        item['transportation_fee'] = ''
        item['start_date'] = ''
        item['end_date'] = ''
        item['education_level'] = ''
        item['work_experience'] = ''
        item['other_information'] = ''
        item['employer'] = ''
        item['contact'] = ''
        item['driving_test'] = ''

        yield item
As you can see, the page structure is not very complicated.

Here is the link to the page I want to scrape.

There are 16 hyperlinks on the page, and each of them posts a request to get a different number of jobs in a list.
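Each of those 16 links is not a normal URL but a javascript:__doPostBack('…','') pseudo-link, so the spider has to turn it into an __EVENTTARGET form value before it can post the request. A minimal sketch of that extraction, as done in the spider above (the helper name and the sample control id here are made up for illustration):

```python
def event_target_from_href(href):
    # Strip the javascript:__doPostBack('...','') wrapper so only the
    # control name remains; that name is posted back as __EVENTTARGET.
    return href.replace("javascript:__doPostBack('", "").replace("','')", "")

# A category link roughly as it appears in the page source (sample id):
href = "javascript:__doPostBack('ctl00$MainContent$lbtnKategorija','')"
print(event_target_from_href(href))  # → ctl00$MainContent$lbtnKategorija
```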

The first link has 1000 of them. The page size of the job list is set to 25 by default, so the first link has no sub-pages, while the second link has more than 10 sub-pages.

I managed to change the page size to 75 so that I don't have to deal with as many sub-pages. The problem comes in the next part.

The problem is that I can't get any items from the first link (the one without sub-pages). Scraping only works from the second link onwards (the ones with 10+ sub-pages). I tried to follow the flow with several print() calls in the code (removed here for brevity), and I found that it never reaches the else: branch.

If I try only the first page (limiting the for loop in parse() to run just once), it works fine.

I've been struggling with this for hours and can't find any helpful answer.

My guess is that it's because the first link has no sub-pages. If it had some, I wouldn't have to add the if-else at all.


Can anyone help me?

I've run the code and it seems to work correctly overall.

Scraping just starts from the second link.

Actually, it does try to work with the first category. The problem is that links is not defined there, and the spider fails with the exception NameError: name 'links' is not defined. Scrapy can fail to parse one page without stopping the whole crawler, so it simply continues with the pages that do have pagination.
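The fix is to extract the job links before branching, so the else: arm has something to iterate over. Here is the branching logic pulled out of Scrapy into a plain function for clarity (the function name and the list format are illustrative, not part of the original spider):

```python
def plan_requests(pagination_hrefs, job_links, base='https://burzarada.hzz.hr/'):
    """Return what to fetch next: postback targets when the category has
    sub-pages, otherwise the job links found on the single page."""
    if pagination_hrefs:
        # Sub-pages exist: turn each __doPostBack href into an __EVENTTARGET.
        return [h.replace("javascript:__doPostBack('", "").replace("','')", "")
                for h in pagination_hrefs]
    # No sub-pages: job_links must already be extracted at this point --
    # in the original code it never is, hence the NameError on 'links'.
    return [base + link for link in job_links]
```

In the spider itself this means moving links = response.xpath('//a[@class="TitleLink"]/@href').extract() above the if len(hrefs) != 0: check, so the else: branch can use it.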

You can also include pagination and sorting in the very first request in your spider. In that case you can simplify the spider by removing parse_category entirely.
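Concretely, the hidden ASP.NET state fields can be collected once and merged with the page-size and sort values in the first FormRequest, which is what makes parse_category unnecessary. A sketch of that merge as a plain function (the helper name is made up; the field names match the spider above):

```python
# The hidden state fields every ASP.NET postback must echo back.
HIDDEN_FIELDS = ('__EVENTARGUMENT', '__LASTFOCUS', '__VIEWSTATE',
                 '__VIEWSTATEGENERATOR', '__VIEWSTATEENCRYPTED')

def build_formdata(hidden, event_target, page_size='75', sort='0'):
    """Combine the page's hidden state fields with the postback target
    and the dropdown values, so a single request both selects the
    category and asks for 75 results per page."""
    formdata = {name: hidden.get(name, '') for name in HIDDEN_FIELDS}
    formdata['__EVENTTARGET'] = event_target
    formdata['ctl00$MainContent$ddlPageSize'] = page_size
    formdata['ctl00$MainContent$ddlSort'] = sort
    return formdata
```

In parse() the hidden dict would be filled from the same response.css('#__VIEWSTATE::attr(value)')-style selectors the spider already uses.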

Also, this selector:

hrefs = response.xpath('//*[@id="ctl00_MainContent_gwSearch"]//tr[last()]//li/a/@href').extract()
can be much simpler:

hrefs = response.xpath('//ul[contains(@class, "pagination")]//a/@href').extract()

Taking everything mentioned above into account, the spider can be quite a bit simpler.

Thank you, but I'm not getting the result I want. It still refuses to read the other pages; it just goes round and round on a single page. Let me show you the code I'm debugging. @Nikita

If you print the page title, e.g. response.xpath('//span[@id="ctl00_MainContent_lblResults"]//text()').extract(), at the beginning of parse_multiple_pages, you will see that you are reading the same page every time. You most likely have some error in parse_category. The code in the pastebin link in my post is simplified, so it has neither parse_category nor that error.

Here is the slightly updated code with a few comments.

It works! Thank you so much. You saved the day!!!