
Python: dynamic start URL list when crawling with Scrapy


test.py

import scrapy

# somewebsiteItem is assumed to be imported from the project's items module


class SomewebsiteProductSpider(scrapy.Spider):
    name = "somewebsite"
    allowed_domains = ["somewebsite.com"]

    # should be filled dynamically before the crawl starts (see the question below)
    start_urls = [
    ]

    def parse(self, response):
        items = somewebsiteItem()

        title = response.xpath('//h1[@id="title"]/span/text()').extract()
        sale_price = response.xpath('//span[contains(@id,"ourprice") or contains(@id,"saleprice")]/text()').extract()
        category = response.xpath('//a[@class="a-link-normal a-color-tertiary"]/text()').extract()
        availability = response.xpath('//div[@id="availability"]//text()').extract()
        items['product_name'] = ''.join(title).strip()
        items['product_sale_price'] = ''.join(sale_price).strip()
        items['product_category'] = ','.join(map(lambda x: x.strip(), category)).strip()
        items['product_availability'] = ''.join(availability).strip()
        fo = open("C:\\Users\\user1\\PycharmProjects\\test.txt", "w")
        fo.write("%s \n%s \n%s" % (items['product_name'], items['product_sale_price'], self.start_urls))
        fo.close()
        print(items)
        yield items
How can I pass a dynamic list of start URLs from test.py to the SomewebsiteProductSpider object before starting the crawl? Any help would be appreciated. Thanks.

process.crawl accepts optional keyword arguments that are passed on to the spider constructor, so you can either populate start_urls from the spider's __init__ or use a custom start_requests method. For example:

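In test.py, the URL list is handed over when the crawl is scheduled. A minimal sketch, assuming the spider class is importable (the somewebsite.spiders module path and the example URLs are placeholders) and that the spider pops a url_list keyword in its __init__ as shown below:

from scrapy.crawler import CrawlerProcess

# hypothetical import path; adjust to wherever SomewebsiteProductSpider actually lives
from somewebsite.spiders import SomewebsiteProductSpider

# build the start URL list dynamically, e.g. from a file or a database query
url_list = [
    "https://www.somewebsite.com/product-1",
    "https://www.somewebsite.com/product-2",
]

process = CrawlerProcess()
process.crawl(SomewebsiteProductSpider, url_list=url_list)  # keyword arguments are forwarded to the spider constructor
process.start()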
And something like this in the spider:


class SomewebsiteProductSpider(scrapy.Spider):
    ...
    def __init__(self, *args, **kwargs):
        # take the start URLs from the 'url_list' keyword passed to process.crawl()
        self.start_urls = kwargs.pop('url_list', [])
        super(SomewebsiteProductSpider, self).__init__(*args, **kwargs)

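The custom start_requests route mentioned in the answer is a small variation on the same idea; a sketch, again assuming a url_list keyword supplied via process.crawl:

import scrapy

class SomewebsiteProductSpider(scrapy.Spider):
    name = "somewebsite"

    def __init__(self, url_list=None, *args, **kwargs):
        super(SomewebsiteProductSpider, self).__init__(*args, **kwargs)
        self.url_list = url_list or []

    def start_requests(self):
        # yield one request per dynamically supplied URL instead of relying on start_urls
        for url in self.url_list:
            yield scrapy.Request(url, callback=self.parse)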
Comments:

Maybe this is a slightly different question, but when I pass the list of URLs as a single string, every URL comes through as individual letters. Here is a statement from the Scrapy site: if you want to set the start_urls attribute from the command line, you have to parse it into a list with something like ast.literal_eval or json.loads and then set it as an attribute; otherwise you end up iterating over the start_urls string, a very common Python pitfall, and every character is treated as a separate URL.

How do you set that attribute? I think you mean passing the list of URLs as a CLI argument; then just pass it as a comma-delimited string and split it, e.g. self.start_urls = comma_delimited_string.split(',').
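A sketch of the comma-splitting idea from the last comment, for the case where the spider is started from the scrapy CLI and the argument therefore arrives as a plain string (the url_list argument name and the example invocation are illustrative):

import scrapy

# e.g.  scrapy crawl somewebsite -a url_list="https://somewebsite.com/p1,https://somewebsite.com/p2"
class SomewebsiteProductSpider(scrapy.Spider):
    name = "somewebsite"

    def __init__(self, url_list="", *args, **kwargs):
        super(SomewebsiteProductSpider, self).__init__(*args, **kwargs)
        # -a passes plain strings, so split on commas; iterating the raw string
        # would treat every character as a separate URL (the pitfall described above)
        self.start_urls = url_list.split(",") if url_list else []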
You can avoid the extra kwargs parsing in @mizghun's answer by simply passing start_urls as an argument to process.crawl:
import scrapy
from scrapy.crawler import CrawlerProcess

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def parse(self, response):
        print(response.url)

process = CrawlerProcess()
# start_urls passed here ends up as an attribute on the spider instance
process.crawl(QuotesSpider, start_urls=["http://example.com", "http://example.org"])
process.start()
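This works because Scrapy's base Spider.__init__ copies keyword arguments onto the spider instance, so the start_urls passed to process.crawl simply becomes an ordinary instance attribute. Just make sure to pass a real list rather than a comma-joined string, for the reason discussed in the comments above.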