
Python 3.x: Passing a list of URLs as an argument to a Scrapy spider

Tags: python-3.x, scrapy, arguments, parameter-passing

I want to be able to supply a list of URLs as an argument to my Scrapy scraper, so that I can re-run it over them periodically and avoid 403 errors. Currently I don't think Scrapy lets me do this:

scrapy crawl nosetime -o results.jl ['/pinpai/10036120-yuguoboshi-hugo-boss.html', '/pinpai/10094164-kedi-coty.html', '/pinpai/10021965-gaotiye-jean-paul-gaultier.html', '/pinpai/10088596-laerfu-laolun-ralph-lauren.html']
or, alternatively, pass a file of URLs.
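In fact, Scrapy's -a flag passes per-run spider arguments, and a list can be encoded as a single comma-separated string that the spider splits itself. A minimal sketch of that approach (the urls argument name is my own choice, not part of the original code):

# run with:
# scrapy crawl nosetime -a urls='/pinpai/10036120-yuguoboshi-hugo-boss.html,/pinpai/10094164-kedi-coty.html' -o results.jl
import scrapy

class NosetimeScraper(scrapy.Spider):
    name = "nosetime"
    base_url = 'https://www.nosetime.com'

    def __init__(self, urls=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # -a arguments always arrive as strings, so split the list here
        self.start_urls = [self.base_url + u for u in (urls or '').split(',') if u]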

Currently, the URLs are hard-coded in my spider:

import scrapy
from ..pipelines import NosetimeScraperPipeline
import time

headers = {'User-Agent':'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; TencentTraveler 4.0; Trident/4.0; SLCC1; Media Center PC 5.0; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30618)'}
base_url = 'https://www.nosetime.com'

class NosetimeScraper(scrapy.Spider):
    name = "nosetime"

    urls = ['/pinpai/10036120-yuguoboshi-hugo-boss.html', # I want to get rid of this
            '/pinpai/10094164-kedi-coty.html',            # unless I can use something like time.sleep(12*60*60)
            '/pinpai/10021965-gaotiye-jean-paul-gaultier.html', # for each before being taken as argument
            '/pinpai/10088596-laerfu-laolun-ralph-lauren.html']

    start_urls = ['https://www.nosetime.com' + url for url in urls]
    base_url = 'https://www.nosetime.com'

    def parse(self, response):
        # proceed to other pages of the listings
        urls = response.css('a.imgborder::attr(href)').getall()
        for url in urls:
            print("url: ", url)
            yield scrapy.Request(url=self.base_url + url, callback=self.parse)

        # now that we have the page, check whether it holds a perfume we can scrape
        pipeline = NosetimeScraperPipeline()
        perfume = pipeline.process_response(response)
        try:
            if perfume['enname']:
                print("Finally are going to store: ", perfume['enname'])
                pipeline.save_in_mongo(perfume)
        except KeyError:
            pass
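A side note on the time.sleep(12*60*60) idea in the comments above: a blocking sleep inside a spider would stall Scrapy's event loop. Request pacing is normally configured through settings instead; a sketch using standard Scrapy settings (the 5-second delay is an arbitrary example):

class NosetimeScraper(scrapy.Spider):
    name = "nosetime"
    # throttle requests instead of sleeping; helps avoid 403 responses
    custom_settings = {
        'DOWNLOAD_DELAY': 5,           # seconds between requests to the same site
        'AUTOTHROTTLE_ENABLED': True,  # adapt the delay to server load
        'USER_AGENT': headers['User-Agent'],  # reuse the headers dict defined above
    }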

There is a very simple example that you can adapt so the spider takes the name of a file containing the URL list:

scrapy crawl myspider -a urls_file=URLs.txt

import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"

    def __init__(self, urls_file=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.urls_file = urls_file

    def start_requests(self):
        with open(self.urls_file) as f:
            for line in f:                     # one URL per line
                if line.strip():
                    yield scrapy.Request(line.strip())
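For example, assuming URLs.txt holds one full URL per line (relative paths like the ones above would need base_url prepended first):

https://www.nosetime.com/pinpai/10036120-yuguoboshi-hugo-boss.html
https://www.nosetime.com/pinpai/10094164-kedi-coty.html

scrapy crawl myspider -a urls_file=URLs.txt -o results.jl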