Python Scrapy: cycle causes variables not to update


Below is the part of the code I want to run 4 times. Without a counter it works as intended: the link to the next page is retrieved and the relevant data is scraped:

def parse_commits_page(self, response):
    yield {
        'author': response.xpath('//a[@rel="author"]/text()').extract(),
        'name': response.xpath('//strong/a/text()').extract(),
        'last_commits': response.xpath('//relative-time/text()').extract()
    }
    next_page = response.xpath('//a[@rel="nofollow"]/@href')[-1].extract()
    yield response.follow(next_page, callback=self.parse_commits_page)
Here are the variations of the cycle I have tried:

Adding a simple global counter:

count = 0
def parse_commits_page(self, response):
    global count
    while (count < 4):
        yield {
            'author': response.xpath('//a[@rel="author"]/text()').extract(),
            'name': response.xpath('//strong/a/text()').extract(),
            'last_commits': response.xpath('//relative-time/text()').extract()
        }
        count = count + 1
        next_page = response.xpath('//a[@rel="nofollow"]/@href')[-1].extract()
        yield response.follow(next_page, callback=self.parse_commits_page)  
Adding a sub-function:

def parse_commits_page(self, response):
    def grabber( response ):
        return {
            'author': response.xpath('//a[@rel="author"]/text()').extract(),
            'name': response.xpath('//strong/a/text()').extract(),
            'last_commits': response.xpath('//relative-time/text()').extract()
        }

    yield grabber( response )

    for i in range(3):
        yield response.follow(
            response.xpath('//a[@rel="nofollow"]/@href')[-1].extract(),
            callback=grabber
        )

With the counter, the response value is updated only once (with count = count + 1 placed as in the counter variant above); if it is moved to the end of the loop, it is not updated at all.

With the sub-function, the response is only updated on the last iteration, so 2 pages end up scraped instead of 4.

What is the correct way to implement this cycle so that the variables update as expected?

Here is the full code, in case it helps (for now I use 4 defs instead of a cycle):

# -*- coding: utf-8 -*-
import scrapy
from random import randint
from time import sleep

BASE_URL = 'https://github.com'

class DiscoverSpider(scrapy.Spider):
    name = 'discover_commits_new'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/search?utf8=%E2%9C%93&q=stars%3E100&ref=simplesearch']

    def parse(self, response):
        # Select all the project urls on page
        project = BASE_URL + response.xpath('//h3/a[@class="v-align-middle"]/@href').extract_first()
        yield response.follow(project, self.parse_project)
        # Random wait, so GitHub doesn't ban me right away
        sleep(randint(5,20))

        # Follow to the next page when every project on this one is scraped
        next_page = response.xpath('//a[@rel="next"]/@href').extract_first()
        if next_page is not None:
            next_page = BASE_URL + next_page
            yield response.follow(next_page, callback=self.parse)

    # Parse the main page of the project
    def parse_project(self, response):
        yield {
            'author': response.xpath('//a[@rel="author"]/text()').extract(),
            'name': response.xpath('//strong/a/text()').extract(),
            'tags': [x.strip() for x in response.css('.topic-tag::text').extract()],
            'lang_name': response.css('.lang::text').extract(),
            'lang_perc' : response.css('.percent::text').extract(),
            'stars': response.css('.social-count::text').extract()[1].strip(),
            'forks': response.css('.social-count::text').extract()[2].strip(),
            'commits': response.css('.text-emphasized::text').extract()[0].strip(),
            'contributors': response.css('.text-emphasized::text').extract()[3].strip()
        }

        commits_page = BASE_URL + response.xpath('//*[@class="commits"]//@href').extract_first()
        yield response.follow(commits_page, self.parse_commits_page)

    # Get last commits
    def parse_commits_page(self, response):
        yield {
            'author': response.xpath('//a[@rel="author"]/text()').extract(),
            'name': response.xpath('//strong/a/text()').extract(),
            'last_commits': response.xpath('//relative-time/text()').extract()
        }
        next_page = response.xpath('//a[@rel="nofollow"]/@href')[-1].extract()
        yield response.follow(next_page, callback=self.parse_commits_page1)

    def parse_commits_page1(self, response):
        yield {
            'author': response.xpath('//a[@rel="author"]/text()').extract(),
            'name': response.xpath('//strong/a/text()').extract(),
            'last_commits': response.xpath('//relative-time/text()').extract()
        }
        next_page = response.xpath('//a[@rel="nofollow"]/@href')[-1].extract()
        yield response.follow(next_page, callback=self.parse_commits_page2)

    def parse_commits_page2(self, response):
        yield {
            'author': response.xpath('//a[@rel="author"]/text()').extract(),
            'name': response.xpath('//strong/a/text()').extract(),
            'last_commits': response.xpath('//relative-time/text()').extract()
        }
        next_page = response.xpath('//a[@rel="nofollow"]/@href')[-1].extract()
        yield response.follow(next_page, callback=self.parse_commits_page3)

    def parse_commits_page3(self, response):
        yield {
            'author': response.xpath('//a[@rel="author"]/text()').extract(),
            'name': response.xpath('//strong/a/text()').extract(),
            'last_commits': response.xpath('//relative-time/text()').extract()
        }
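
For reference: Scrapy runs each callback once per downloaded response and schedules follow-up requests asynchronously, so a while loop or a module-level counter inside the callback never sees the later responses, and repeated identical follow-up URLs are dropped by the duplicate filter. One common way around this is to carry the remaining page count on the request itself, which would also collapse the four near-identical callbacks above into a single one. A minimal sketch of that idea (the meta key pages_left is an arbitrary name, not something from the code above):

    # Get last commits: a single callback that follows at most 3 more pages.
    # The remaining-page count travels with each request via meta instead of
    # living in a global variable.
    def parse_commits_page(self, response):
        yield {
            'author': response.xpath('//a[@rel="author"]/text()').extract(),
            'name': response.xpath('//strong/a/text()').extract(),
            'last_commits': response.xpath('//relative-time/text()').extract()
        }
        # The first call defaults to 4 pages in total; each follow-up carries one less.
        pages_left = response.meta.get('pages_left', 4) - 1
        if pages_left > 0:
            next_page = response.xpath('//a[@rel="nofollow"]/@href')[-1].extract()
            yield response.follow(next_page, callback=self.parse_commits_page,
                                  meta={'pages_left': pages_left})

As a side note, time.sleep() inside parse blocks Scrapy's event loop; the DOWNLOAD_DELAY and RANDOMIZE_DOWNLOAD_DELAY settings are the usual non-blocking way to space requests out.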