Passing a list of URLs to crawl to a Scrapy Spider via a .txt file
I'm a bit new to Python and very new to Scrapy. I've set up a spider that crawls and extracts all the information I need. However, I need to pass a .txt file of URLs to the start_urls variable. For example:
class LinkChecker(BaseSpider):
    name = 'linkchecker'
    start_urls = []  # Here I want the list of URLs to crawl to come from a text file passed via the command line.
I've done some research and keep coming up empty-handed. I've seen examples of this type (), but I don't think they apply to passing a text file.

You can simply read the .txt file in:
with open('your_file.txt') as f:
    start_urls = f.readlines()
If that leaves trailing newline characters, try:
with open('your_file.txt') as f:
    start_urls = [url.strip() for url in f.readlines()]
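To illustrate why the stripping step matters, here is a quick sketch using an in-memory stand-in for the file (the example.com URLs are hypothetical):

```python
import io

# Hypothetical in-memory stand-in for your_file.txt, one URL per line.
fake_file = io.StringIO("http://example.com/a\nhttp://example.com/b\n")

# readlines() keeps the trailing newline on each entry...
raw = fake_file.readlines()

# ...so strip each line to get clean URLs.
start_urls = [url.strip() for url in raw]
```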
Hope this helps.

Run the spider with the -a option, like:

scrapy crawl myspider -a filename=text.txt

Then read the file in the spider's __init__ method and define start_urls:
class MySpider(BaseSpider):
    name = 'myspider'

    def __init__(self, filename=None):
        if filename:
            with open(filename, 'r') as f:
                self.start_urls = f.readlines()
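A minimal sketch of how the -a argument reaches __init__, written as plain Python (no Scrapy dependency) so it is easy to try out; the class name, file contents, and URLs are all hypothetical:

```python
import os
import tempfile

# Stand-in for the Scrapy spider above: `-a filename=...` on the command
# line arrives as the `filename` keyword argument when Scrapy instantiates
# the spider class.
class MySpiderSketch:
    def __init__(self, filename=None):
        self.start_urls = []
        if filename:
            with open(filename) as f:
                self.start_urls = [url.strip() for url in f]

# Write a temporary URL list and pass its path in, mimicking
# `scrapy crawl myspider -a filename=<path>`.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("http://example.com/1\nhttp://example.com/2\n")
    path = tmp.name

spider = MySpiderSketch(filename=path)
os.remove(path)
```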
Hope that helps.

If your URLs are line-separated:
def get_urls(filename):
    # split() breaks on any whitespace, so one URL per line works.
    with open(filename) as f:
        return f.read().split()
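A quick self-contained demo of get_urls (restated here so the snippet runs on its own; the file contents are hypothetical):

```python
import os
import tempfile

# Same helper as above: read the whole file and split on whitespace,
# which handles one-URL-per-line files.
def get_urls(filename):
    with open(filename) as f:
        return f.read().split()

# Write a hypothetical line-separated URL file and read it back.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("http://example.com/x\nhttp://example.com/y\n")
    path = tmp.name

urls = get_urls(path)
os.remove(path)
```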
Then, this code will give you the URLs:
class MySpider(scrapy.Spider):
    name = 'nameofspider'

    def __init__(self, filename=None):
        if filename:
            with open(filename) as f:
                self.start_urls = [url.strip() for url in f.readlines()]
This would be your code. It will pull the URLs from the .txt file as long as they are line-separated, for example:

url1
url2

and so on.
After this, assuming your file is named 'file.txt', run the command:

scrapy crawl nameofspider -a filename=file.txt