
Python: yielding items and callback requests in Scrapy


Disclaimer: I'm new to both Python and Scrapy.

I'm trying to have my spider collect URLs from a start URL, follow those collected URLs, and do both of the following:

  • search the next page for specific items (and eventually return them)
  • collect more specific URLs from that next page and follow those as well

I'd like to be able to keep this process of yielding items and callback requests going, but I'm not quite sure how to do it. Currently my code returns only URLs and no items, so I'm clearly doing something wrong. Any feedback would be appreciated.
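The process described above, a single callback that both emits items and schedules further requests, can be sketched without a live crawl. In this scrapy-free sketch, `Request` is a hypothetical stand-in for `scrapy.Request`, used only to show the control flow:

```python
from collections import namedtuple

# Stand-in for scrapy.Request so the flow can be shown without a crawler;
# in a real spider you would yield scrapy.Request objects instead.
Request = namedtuple('Request', ['url', 'callback', 'meta'])

def parse(links):
    """Schematic Scrapy callback: one generator may yield finished items
    (dicts) and follow-up Requests side by side; Scrapy routes each
    yielded object to the right place (pipeline vs. scheduler)."""
    for url in links:
        if url.endswith('/biography'):
            # follow deeper, carrying partial data along in meta
            yield Request(url, callback='parse_bio', meta={'item': {'from': url}})
        else:
            # emit a finished item right away
            yield {'page': url}

results = list(parse(['/candidate/1', '/candidate/1/biography']))
# results[0] is an item; results[1] is a follow-up Request
```

In a real spider the same shape applies: `yield item` and `yield scrapy.Request(...)` can be freely mixed inside one `parse` method.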

    import scrapy
    from scrapy.selector import Selector
    from myproject.items import LegislatorsItems  # placeholder: import from your project's items module

    class VSSpider(scrapy.Spider):
        name = "vs5"
        allowed_domains = ["votesmart.org"]
        start_urls = [
                      "https://votesmart.org/officials/WA/L/washington-state-legislative#.V8M4p5MrKRv",
                      ]
    
        def parse(self, response):
            sel = Selector(response)
            #this gathers links to the individual legislator pages, it works
            for href in response.xpath('//h5/a/@href'): 
                url = response.urljoin(href.extract())
                yield scrapy.Request(url, callback=self.parse1)
    
        def parse1(self, response):
            sel = Selector(response)
            items = []
            #these xpaths are on the next page that the spider should follow, when it first visits an individual legislator page
            for sel in response.xpath('//*[@id="main"]/section/div/div/div'):
                item = LegislatorsItems()
                item['current_office'] = sel.xpath('//tr[1]/td/text()').extract()
                item['running_for'] = sel.xpath('//tr[2]/td/text()').extract()
                items.append(item)
            #this is the xpath to the biography of the legislator, which it should follow and scrape next
            for href in response.xpath('//*[@id="folder-bio"]/@href'):
                url = response.urljoin(href.extract())
                yield scrapy.Request(url, callback=self.parse2, meta={'items': items})
    
        def parse2(self, response):
            sel = Selector(response)
            items = response.meta['items']
            #this is an xpath on the biography page
            for sel in response.xpath('//*[@id="main"]/section/div[2]/div/div[3]/div/'):
                item = LegislatorsItems()
                item['tester'] = sel.xpath('//div[2]/div[2]/ul/li[3]').extract()
                items.append(item)
                return items
    

Thanks!

There are two levels to your problem.

1. The bio URL is not available when JS is disabled. Turn JS off in your browser and inspect the page:

You should see that the href in the tag is empty, and that the correct URL is hidden inside a comment:

    <a href="#" class="folder" id="folder-bio">
    <!--<a href='/candidate/biography/126288/derek-stanford' itemprop="url" class='more'>
               See Full Biographical and Contact Information</a>-->
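One way to recover the hidden URL is to run a regular expression over the raw HTML (a sketch using the snippet above; in a Scrapy callback the same pattern could be applied to `response.text`):

```python
import re

# The real bio link only exists inside an HTML comment when JS is off,
# so pull it out of the raw markup with a regex.
html = """<a href="#" class="folder" id="folder-bio">
<!--<a href='/candidate/biography/126288/derek-stanford' itemprop="url" class='more'>
           See Full Biographical and Contact Information</a>-->"""

match = re.search(r"<!--<a href='([^']+)'", html)
bio_path = match.group(1) if match else None
# bio_path == '/candidate/biography/126288/derek-stanford'
```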
    

After a quick look at your code, I think the `return items` on the last line should be at a different indentation level. Besides what starrify mentioned, is `parse2` even reached? Can you post the crawl log?

Thanks for the tip about JS, that made things much easier. Everything in the code works fine for me except `parse_bio`. It returns the correct URLs (for the bios), but otherwise only `{'tester': []}` instead of the result of the specified xpath, which makes me think there is a simple problem with the xpath selector, yet when I test it in Google Chrome's console it works. Also, is there a particular reason you switched to css? Thanks!

Never mind! My xpath selector was definitely wrong.
    from scrapy import Spider, Request
    from myproject.items import LegislatorsItems  # placeholder: your project's items module
    
    class VSSpider(Spider):
        name = "vs5"
        allowed_domains = ["votesmart.org"]
        start_urls = ["https://votesmart.org/officials/WA/L/washington-state-legislative"]
    
        def parse(self, response):
            for href in response.css('h5 a::attr(href)').extract():
                person_url = response.urljoin(href)
                yield Request(person_url, callback=self.parse_person)
    
        def parse_person(self, response):  # former "parse1"
            # define item, one for both parse_person and bio function
            item = LegislatorsItems()
    
            # extract text from left menu table and populate to item
            desc_rows = response.css('.span-abbreviated td::text').extract()
            if desc_rows:
                item['current_office'] = desc_rows[0]
                item['running_for'] = desc_rows[1] if len(desc_rows) > 1 else None
    
            # create right bio url and pass item to it
            bio_url = response.url.replace('votesmart.org/candidate/', 
                                           'votesmart.org/candidate/biography/')
            return Request(bio_url, callback=self.parse_bio, meta={'item': item})
    
        def parse_bio(self, response):  # former "parse2"
            # get item from meta, add "tester" data and return
            item = response.meta['item']
            item['tester'] = response.css('.item.first').xpath('//li[3]').extract()
            print(item)   # for python 2: print item 
            return item
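The `bio_url` construction in `parse_person` sidesteps the hidden link entirely: on this site the biography URL is derived from the candidate URL by plain string substitution, which can be checked in isolation (the candidate URL below is illustrative):

```python
# Derive the bio page URL from a candidate URL by inserting 'biography/',
# exactly as parse_person does above. Sample URL is illustrative.
candidate_url = 'https://votesmart.org/candidate/126288/derek-stanford'
bio_url = candidate_url.replace('votesmart.org/candidate/',
                                'votesmart.org/candidate/biography/')
# bio_url == 'https://votesmart.org/candidate/biography/126288/derek-stanford'
```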