Python 3.x: Pass a list as an argument to a Scrapy spider
I'd like to be able to pass a list of URLs as an argument to my Scrapy scraper, so that I can iterate over them periodically and avoid 403 errors. At the moment I don't think Scrapy lets me do this:
scrapy crawl nosetime -o results.jl ['/pinpai/10036120-yuguoboshi-hugo-boss.html', '/pinpai/10094164-kedi-coty.html', '/pinpai/10021965-gaotiye-jean-paul-gaultier.html', '/pinpai/10088596-laerfu-laolun-ralph-lauren.html']
Or, alternatively, a file of URLs.
At the moment, the URLs are hard-coded in my spider:
import scrapy
from ..pipelines import NosetimeScraperPipeline
import time

headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; TencentTraveler 4.0; Trident/4.0; SLCC1; Media Center PC 5.0; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30618)'}
base_url = 'https://www.nosetime.com'

class NosetimeScraper(scrapy.Spider):
    name = "nosetime"
    urls = ['/pinpai/10036120-yuguoboshi-hugo-boss.html', # I want to get rid of this
            '/pinpai/10094164-kedi-coty.html', # unless I can use something like time.sleep(12*60*60)
            '/pinpai/10021965-gaotiye-jean-paul-gaultier.html', # for each before being taken as argument
            '/pinpai/10088596-laerfu-laolun-ralph-lauren.html']
    start_urls = ['https://www.nosetime.com' + url for url in urls]
    base_url = 'https://www.nosetime.com'

    def parse(self, response):
        # proceed to other pages of the listings
        urls = response.css('a.imgborder::attr(href)').getall()
        for url in urls:
            print("url: ", url)
            yield scrapy.Request(url=base_url + url, callback=self.parse)
        # now that we have the urls, check whether these are things we can scrape
        pipeline = NosetimeScraperPipeline()
        perfume = pipeline.process_response(response)
        try:
            if perfume['enname']:
                print("Finally are going to store: ", perfume['enname'])
                pipeline.save_in_mongo(perfume)
        except KeyError:
            pass
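Scrapy passes every -a key=value pair to the spider's __init__ as a plain string, so a Python list literal on the command line won't be parsed as a list. One way around that is to pass the paths as a single comma-separated string and split it yourself. A minimal sketch (the -a urls argument name and the splitting logic are an assumption, not part of the original spider):

scrapy crawl nosetime -o results.jl -a urls="/pinpai/10036120-yuguoboshi-hugo-boss.html,/pinpai/10094164-kedi-coty.html"

import scrapy

class NosetimeScraper(scrapy.Spider):
    name = "nosetime"
    base_url = 'https://www.nosetime.com'

    def __init__(self, urls=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Split the comma-separated -a argument back into path strings
        paths = urls.split(',') if urls else []
        self.start_urls = [self.base_url + path for path in paths]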
Alternatively, here is a very simple example you can adapt so the spider takes the name of a file containing the URL list:
scrapy crawl myspider -a urls_file=URLs.txt
def __init__(self, urls_file=None, *args, **kwargs):
    super(MySpider, self).__init__(*args, **kwargs)
    self.urls_file = urls_file

# ...

def start_requests(self):
    # Read one URL per line from the file and schedule a request for each
    with open(self.urls_file, 'r') as f:
        for line in f:
            url = line.strip()
            if url:
                yield scrapy.Request(url=url, callback=self.parse)
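You would then run it with the filename as a spider argument, e.g. (urls.txt is a placeholder name):

scrapy crawl nosetime -o results.jl -a urls_file=urls.txt

As for iterating slowly enough to avoid 403 errors: instead of time.sleep(12*60*60) inside the spider, Scrapy's built-in throttling settings are the usual tool. A minimal sketch, with illustrative values only:

class NosetimeScraper(scrapy.Spider):
    name = "nosetime"
    custom_settings = {
        'DOWNLOAD_DELAY': 5,           # pause (seconds) between consecutive requests
        'AUTOTHROTTLE_ENABLED': True,  # let Scrapy adapt the delay to server responsiveness
    }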