How to set priority in Scrapy (Python)

python scrapy

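I am trying to scrape web pages and need to set priorities so that they are scraped in a particular order. Right now the spider scrapes page 1 of every URL, then page 2 of every URL, and so on. What I need is for it to scrape all pages of URL 1, then all pages of URL 2, and so on. I have been trying to use priority by giving the first URL the highest priority (that is, the number of URLs in the csv file). It does not work, mainly because I cannot decrement the priority value: it is set inside the for loop, so every time the loop runs the priority is reset to the original number and all requests end up with the same priority. How can I get priority to work so that the URLs are scraped in the order I want?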

SplashSpider.py

from scrapy import Spider
from scrapy_splash import SplashRequest


class MySpider(Spider):

    # Name of the spider
    name = 'splash_spider'

    # Read all url + ip address + user agent entries, then request them
    def start_requests(self):
        # The path of the csv file that contains the entries comes from settings.py
        with open(self.settings["PROXY_CSV_FILE"], mode="r") as csv_file:
            # requests is a list of dictionaries like this -> {url: str, ua: str, ip: str}
            # (process_csv is the asker's own helper and is not shown here)
            requests = process_csv(csv_file)
            for i, req in enumerate(requests):
                x = len(requests) - i  # <- check here
                # Request the url with a wait of 3 seconds
                yield SplashRequest(url=req["url"], callback=self.parse, args={"wait": 3},
                        # Pair with the user agent specified in the csv file
                        headers={"User-Agent": req["ua"]},
                        # Set splash_url to the proxy that goes with the current URL instead of the actual splash url
                        splash_url=req["ip"],
                        priority=x,
                        meta={'priority': x}
                        )

Update 2

2019-06-13 15:16:23 [scrapy.core.scraper] ERROR: Spider error processing <GET http://www.starcitygames.com/catalog/category/1014?&start=50> (referer: http://www.starcitygames.com/catalog/category/Visions)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/local/lib/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/home/north/scrapy_splash/scrapy_javascript/scrapy_javascript/spiders/SplashSpider.py", line 104, in parse
    priority = response.meta['priority']
KeyError: 'priority'
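
The traceback shows that parse ran for a response whose meta has no 'priority' key, which suggests that the request for that page was created without it. Below is a minimal sketch of reading the value defensively. `priority_from_meta` is a hypothetical helper written for illustration, and the plain dictionaries stand in for `response.meta`.

```python
def priority_from_meta(meta, default=0):
    """Return the priority stored in a request's meta dict, or `default` if the key is absent."""
    return meta.get("priority", default)


# meta as set in start_requests: the key is present
print(priority_from_meta({"priority": 3}))

# meta of a request that was created without the key: falls back to the default
print(priority_from_meta({"depth": 1}))
```

A fallback only avoids the exception. For the ordering to hold, every follow-up request still has to carry the priority forward in its own meta.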

To vary the priority across the list, it is better to do the following:

    for i, req in enumerate(requests):
        x = len(requests) - i  # <- check here

        # Return needed url with set delay of 3 seconds
        yield SplashRequest(url=req["url"], callback=self.parse, args={"wait": 3},
                # Pair with user agent specified in csv file
                headers={"User-Agent": req["ua"]},
                # Sets splash_url to whatever the current proxy that goes with current URL is instead of actual splash url
                splash_url=req["ip"],
                priority=x,
                meta={'priority': x}  # <- check here!!
                )

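This fixed my decrement problem, but the data still does not come out in order. The behavior changed slightly: it scrapes the first link completely (it has only one page). Then, for the second and third links, it scrapes page 1 of one, then page 1 of the other, and keeps alternating until the one with fewer pages (link 2, for example) is finished. After that it continues with link 3 and the last link until both are done. So it is different, but it still does not work correctly. It basically seems to want to process two links at a time instead of one. Could it be related to my settings.py file? I have updated my question to include it.

Are you also setting the same priority in the parse function? You need to do that, just as in the parent request.

I have been trying to do that, but I am new to Python and have been struggling to access that data because it is in a separate function. Is there a way to access the priority data from my start_requests function inside my parse function?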

    def start_requests(self):
        # ... open the csv file and build `requests` as in the question ...
        for i, req in enumerate(requests):
            x = len(requests) - i  # <- check here

            # Return needed url with set delay of 3 seconds
            yield SplashRequest(url=req["url"], callback=self.parse, args={"wait": 3},
                    # Pair with user agent specified in csv file
                    headers={"User-Agent": req["ua"]},
                    # Sets splash_url to whatever the current proxy that goes with current URL is instead of actual splash url
                    splash_url=req["ip"],
                    priority=x,
                    meta={'priority': x}  # <- check here!!
                    )

    def parse(self, response):
        # Your own parsing logic is omitted here
        # Read back the priority that start_requests stored in meta
        priority = response.meta['priority']
        next_page = response.xpath('//a[contains(., "- Next>>")]/@href').get()
        # If there is a next page, enter the if statement
        if next_page is not None:
            # Go to the next page, carrying the same priority forward
            yield response.follow(next_page, self.parse, priority=priority, meta={'priority': priority})
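
To see why the priority has to be carried into the follow-up requests, here is a small simulation. It is a simplified model, not Scrapy's actual scheduler: it assumes one request in flight at a time and first-in-first-out order among equal priorities. `crawl_order` and the URL names "A" and "B" are hypothetical and exist only for illustration.

```python
import heapq
from itertools import count


def crawl_order(urls, pages_per_url, propagate_priority):
    """Return the order in which (url, page) pairs are processed.

    Simplified model: the pending request with the highest priority is
    always processed next, one request at a time.
    """
    heap = []
    tie = count()  # insertion counter: FIFO among equal priorities (an assumption of this model)

    def push(priority, url, page):
        # heapq is a min-heap, so negate the priority to pop the highest first
        heapq.heappush(heap, (-priority, next(tie), url, page, priority))

    # start_requests: the first URL gets the highest priority
    for i, url in enumerate(urls):
        push(len(urls) - i, url, 1)

    order = []
    while heap:
        _, _, url, page, priority = heapq.heappop(heap)
        order.append((url, page))
        if page < pages_per_url[url]:
            # parse: follow the "next page" link, with or without the parent's priority
            next_priority = priority if propagate_priority else 0
            push(next_priority, url, page + 1)
    return order


pages = {"A": 3, "B": 2}
print(crawl_order(["A", "B"], pages, propagate_priority=True))
# → [('A', 1), ('A', 2), ('A', 3), ('B', 1), ('B', 2)]
print(crawl_order(["A", "B"], pages, propagate_priority=False))
# → [('A', 1), ('B', 1), ('A', 2), ('B', 2), ('A', 3)]
```

In this model, follow-up requests that fall back to priority 0 are overtaken by the remaining start requests, which reproduces the "all first pages, then all second pages" pattern from the question. Scrapy also downloads several requests concurrently (the `CONCURRENT_REQUESTS` setting defaults to 16), so in a real crawl requests for different URLs can still overlap even with correct priorities. Lowering that setting to 1 in settings.py is one option to try if strict ordering matters, at the cost of speed.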