Scrapy: Create a new item for each argument


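I'm new to Scrapy. I'm currently trying to extend my CrawlSpider so that it can take multiple arguments from a text document, instead of my typing each argument on the command line by hand like this:

scrapy crawl crawl5 -a start_url="argument"

Currently, I can pass in one argument and it generates several items. But I'd like some guidance on two problems:

How can I create an item for each argument?

How can I use that item as a container for the items generated from each argument?

My goal here is to imitate running the spider multiple times while keeping the items returned for each argument separate.

EDIT: Here is my code. As you can see, it is a scraper for thesaurus.com.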

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from thesaurus.items import ThesaurusItem

class MySpider(CrawlSpider):
    name = 'crawl5'
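    def __init__(self, *args, **kwargs):
        # The search word arrives from the command line via -a start_url=<word>.
        word = kwargs.get('start_url')
        self.start_urls = ["http://www.thesaurus.com/browse/%s" % word]
        self.allowed_domains = ["thesaurus.com"]
        # Rules are assigned before CrawlSpider.__init__ runs, because it compiles them.
        self.rules = (
            # Follow pagination links. restrict_xpaths selects regions (elements),
            # so it points at the <a> elements rather than at their @href attributes.
            Rule(LinkExtractor(restrict_xpaths=("//div[@id='paginator']//a",))),
            # Scrape result pages such as /browse/<word>/2 or /browse/<word>/10.
            Rule(LinkExtractor(allow=(r'http://www\.thesaurus\.com/browse/%s/.$' % word, r'http://www\.thesaurus\.com/browse/%s/..$' % word)), callback='parse_item', follow=True),
        )
        super(MySpider, self).__init__(*args, **kwargs)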

    def parse_start_url(self, response):
        # The first page has the same layout as the paginated ones, so reuse parse_item.
        return self.parse_item(response)

    def parse_item(self, response):
        # Each 'syn_of_syns' block holds one main synonym, its definition,
        # and a list of secondary synonyms.
        for sel in response.xpath("//div[contains(@class, 'syn_of_syns')]"):
            item = ThesaurusItem()
            item['mainsynonym'] = sel.xpath("div/div/div/a/text()").extract()
            item['definition'] = sel.xpath("div/div/div[@class='def']/text()").extract()
            item['secondarysynonym'] = sel.xpath('div/div/ul/li/a/text()').extract()
            yield item
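
One possible way to feed the spider from a text document is sketched below. It is not Scrapy API, only plain helper functions: `load_words`, `build_start_urls`, `word_from_url`, and the one-word-per-line file layout are all assumptions made for illustration. The idea would be to call `load_words` in `__init__` (with the file path passed through a hypothetical `-a words_file=...` argument), assign the result of `build_start_urls` to `self.start_urls`, and use `word_from_url(response.url)` in the callbacks to find out which argument a page belongs to. The `allow` patterns in the rules, which are currently built from a single word, would also need to cover every word in the list.

```python
from urllib.parse import quote, unquote, urlparse

# Same base URL as in the spider above.
BASE_URL = "http://www.thesaurus.com/browse/"


def load_words(path):
    """Return the non-empty lines of a text file, assuming one word per line."""
    with open(path, encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]


def build_start_urls(words):
    """Build one browse URL per word, percent-encoding spaces and the like."""
    return [BASE_URL + quote(word) for word in words]


def word_from_url(url):
    """Recover the word from a browse URL, including paginated ones
    such as /browse/<word>/2."""
    parts = urlparse(url).path.split("/")
    # The path looks like /browse/<word>[/<page>].
    return unquote(parts[parts.index("browse") + 1])
```

With a file containing `happy` and `sad` on separate lines, `build_start_urls(load_words(path))` would give the two browse URLs for those words, and every response reached from them maps back to its word through `word_from_url`.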

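I suggest you add the code you use to generate one item, so that we can guide you further. Adding more entries to the start URLs is easy: just add them to the array. Perhaps you have something more complex in mind, so please post it here.

Actually, I think it makes more sense to create one item for each argument. Let me edit the question.

I probably can't help you, because I don't use link extractors. I only use basic spiders, which in my opinion are easier to work with. It looks like your code is duplicated between the two methods. I looked at Thesaurus.com, and it seems they have no directory, so you need to use the search to find links. Combining Selenium with Scrapy might help to simulate user behavior on the page: entering text and clicking search. Maybe you need to supply a predefined list of words to look up in an array.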
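
As for keeping the results of each argument separate, a minimal sketch of the "container" idea follows. It assumes every scraped item carries a `word` field naming the argument it came from; that field is hypothetical and would have to be added to `ThesaurusItem` and filled in by the callbacks. `group_items_by_word` is likewise an illustrative helper, not part of Scrapy. Plain dicts stand in for items here, since both support `item['field']` access.

```python
from collections import defaultdict


def group_items_by_word(items, key="word"):
    """Collect items into one list per argument.

    'key' names the (assumed) field that records which argument
    produced the item.
    """
    containers = defaultdict(list)
    for item in items:
        containers[item[key]].append(item)
    # Return a plain dict so that missing words raise KeyError as usual.
    return dict(containers)


if __name__ == "__main__":
    scraped = [
        {"word": "happy", "mainsynonym": ["glad"]},
        {"word": "sad", "mainsynonym": ["unhappy"]},
        {"word": "happy", "mainsynonym": ["cheerful"]},
    ]
    for word, group in group_items_by_word(scraped).items():
        print(word, len(group))
```

Inside a Scrapy project, the same grouping could probably live in an item pipeline: append to the per-word list in `process_item` and write each list out in `close_spider`. That way a single crawl behaves much like several separate runs, one per argument.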