
Python: modifying a Spider's CSV input file when launching CrawlerProcess / Scrapy


I am launching several spiders in parallel with CrawlerProcess, like this:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# FirstSpider, SecondSpider, etc. are imported from the project's spider modules

def main():

    # ----- This part launches all the given spiders ----- #

    process = CrawlerProcess(get_project_settings())

    process.crawl(FirstSpider)
    process.crawl(SecondSpider)
    process.crawl(ThirdSpider)
    process.crawl(EtcSpider)

    process.start()  # the script will block here until the crawling is finished
All the spiders work from a CSV input file that contains the information to look up on the website. Here is an example:

import csv
import os.path as osp

import scrapy


class FirstSpider(scrapy.Spider):
    name = "first_bot"

    def start_requests(self):
        base_url = "https://example.fr/catalogsearch/result/?q="
        script_dir = osp.dirname(osp.realpath(__file__))
        file_path = osp.join(script_dir, 'files', 'to_collect_firstbot.csv')
        input_file = open(file_path, 'r', encoding="utf-8", errors="ignore")
        reader = csv.reader(input_file)
        for row in reader:
            if row:
                url = row[0]
                absolute_url = base_url + url
                print(absolute_url)
                yield scrapy.Request(
                    absolute_url,
                    meta={
                        "handle_httpstatus_list": [302, 301, 502],
                    },
                    callback=self.parse
                )
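
For reference, each row of the input file is just a term that gets appended to base_url as a search query (the code only reads row[0]). A hypothetical to_collect_firstbot.csv could therefore look like this; the values below are placeholders, not real data:

    first search term
    second search term
    another product reference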
It works, but I may have to change the input file name, which is hard-coded separately in each spider.


Is it possible to keep a single default custom file in every spider script, and then, from the core.py file that launches all the spiders, override the CSV input file when needed (in that case the file and its name would be the same for all spiders)?

Spiders can take arguments, and you can use those arguments from inside your spiders.

You can pass the arguments when you call crawl on your spiders, which I think is exactly what you need.

Change your code to:

class FirstSpider(scrapy.Spider):
    name = "first_bot"

    file_name = 'to_collect_firstbot.csv'  # <- we are going to override this attribute later

    def start_requests(self):
        base_url = "https://example.fr/catalogsearch/result/?q="
        script_dir = osp.dirname(osp.realpath(__file__))
        file_path = osp.join(script_dir, 'files', self.file_name) # here we use the argument
        input_file = open(file_path, 'r', encoding="utf-8", errors="ignore")
        reader = csv.reader(input_file)
        for row in reader:
            if row:
                url = row[0]
                absolute_url = base_url + url
                print(absolute_url)
                yield scrapy.Request(
                    absolute_url,
                    meta={
                        "handle_httpstatus_list": [302, 301, 502],
                    },
                    callback=self.parse
                )
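
Then, when launching, pass file_name as a keyword argument to process.crawl; Scrapy forwards crawl keyword arguments to the spider constructor, which sets them as attributes, so self.file_name picks them up. A minimal sketch of the launcher (the shared file name custom_input.csv is a placeholder, not from the original post):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def main():
    process = CrawlerProcess(get_project_settings())

    # the same CSV name can be handed to every spider from this one place
    process.crawl(FirstSpider, file_name='custom_input.csv')
    process.crawl(SecondSpider, file_name='custom_input.csv')
    process.crawl(ThirdSpider)  # no file_name given, see the note below

    process.start()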
Note that the third call does not set the file_name argument, which means that spider will use the default specified in the spider code:

file_name = 'to_collect_firstbot.csv'

That's exactly what I needed! Thanks.