Passing a list of URLs to crawl to a Scrapy Spider via a .txt file
I'm a bit new to Python and very new to Scrapy. I've set up a spider that crawls and extracts all the information I need. However, I need to pass a .txt file of URLs to the start_urls variable. For example:
class LinkChecker(BaseSpider):
    name = 'linkchecker'
    start_urls = []  # Here I want the list of URLs to crawl to come from a text file passed via the command line.
I've done some research and keep coming up empty-handed. I've seen examples of this type (), but I don't think they apply to passing a text file.

You can simply read the .txt file in:
with open('your_file.txt') as f:
    start_urls = f.readlines()
If that leaves trailing newline characters, try:
with open('your_file.txt') as f:
    start_urls = [url.strip() for url in f.readlines()]
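To illustrate why the stripping step matters, here is a quick sketch using an in-memory stand-in for the file (the example.com URLs are hypothetical):

```python
import io

# Hypothetical in-memory stand-in for your_file.txt, one URL per line.
fake_file = io.StringIO("http://example.com/a\nhttp://example.com/b\n")

# readlines() keeps the trailing newline on each entry...
raw = fake_file.readlines()

# ...so strip each line to get clean URLs.
start_urls = [url.strip() for url in raw]
```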
Hope this helps.

Run the spider with the -a option, like:

scrapy crawl myspider -a filename=text.txt

Then read the file in the spider's __init__ method and define start_urls:
class MySpider(BaseSpider):
    name = 'myspider'

    def __init__(self, filename=None):
        if filename:
            with open(filename, 'r') as f:
                self.start_urls = f.readlines()
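A minimal sketch of how the -a argument reaches __init__, written as plain Python (no Scrapy dependency) so it is easy to try out; the class name, file contents, and URLs are all hypothetical:

```python
import os
import tempfile

# Stand-in for the Scrapy spider above: `-a filename=...` on the command
# line arrives as the `filename` keyword argument when Scrapy instantiates
# the spider class.
class MySpiderSketch:
    def __init__(self, filename=None):
        self.start_urls = []
        if filename:
            with open(filename) as f:
                self.start_urls = [url.strip() for url in f]

# Write a temporary URL list and pass its path in, mimicking
# `scrapy crawl myspider -a filename=<path>`.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("http://example.com/1\nhttp://example.com/2\n")
    path = tmp.name

spider = MySpiderSketch(filename=path)
os.remove(path)
```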
Hope that helps.

If your URLs are line-separated:
def get_urls(filename):
    # split() breaks on any whitespace, so one URL per line works.
    with open(filename) as f:
        return f.read().split()
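A quick self-contained demo of get_urls (restated here so the snippet runs on its own; the file contents are hypothetical):

```python
import os
import tempfile

# Same helper as above: read the whole file and split on whitespace,
# which handles one-URL-per-line files.
def get_urls(filename):
    with open(filename) as f:
        return f.read().split()

# Write a hypothetical line-separated URL file and read it back.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("http://example.com/x\nhttp://example.com/y\n")
    path = tmp.name

urls = get_urls(path)
os.remove(path)
```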
Then, this code will give you the URLs:
class MySpider(scrapy.Spider):
    name = 'nameofspider'

    def __init__(self, filename=None):
        if filename:
            with open(filename) as f:
                self.start_urls = [url.strip() for url in f.readlines()]
This would be your code. It will pull the URLs from the .txt file as long as they are line-separated, for example:

url1
url2

and so on.
After this, assuming your file is named 'file.txt', run the command:

scrapy crawl nameofspider -a filename=file.txt