Python: How do I pass arguments to a Scrapy spider from within a program?


I'm new to Python and Scrapy. I used the method from this blog post to run spiders inside a Flask application. Here is the code:

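# imports needed by this snippet (Scrapy 0.24-era API); the spider classes
# listed in TO_CRAWL come from your own project
from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings

# list of crawlers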
TO_CRAWL = [DmozSpider, EPGDspider, GDSpider]

# crawlers that are running 
RUNNING_CRAWLERS = []

def spider_closing(spider):
    """
    Activates on spider closed signal
    """
    log.msg("Spider closed: %s" % spider, level=log.INFO)
    RUNNING_CRAWLERS.remove(spider)
    if not RUNNING_CRAWLERS:
        reactor.stop()

# start logger
log.start(loglevel=log.DEBUG)

# set up the crawler and start to crawl one spider at a time
for spider in TO_CRAWL:
    settings = Settings()

    # crawl responsibly
    settings.set("USER_AGENT", "Kiran Koduru (+http://kirankoduru.github.io)")
    crawler = Crawler(settings)
    crawler_obj = spider()
    RUNNING_CRAWLERS.append(crawler_obj)

    # stop reactor when spider closes
    crawler.signals.connect(spider_closing, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(crawler_obj)
    crawler.start()

# blocks process; so always keep as the last statement
reactor.run()

And this is my spider code:

class EPGDspider(scrapy.Spider):
    name = "EPGD"
    allowed_domains = ["epgd.biosino.org"]
    term = "man"
    start_urls = ["http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery="+term+"&submit=Feeling+Lucky"]
    MONGODB_DB = name + "_" + term
    MONGODB_COLLECTION = name + "_" + term

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//tr[@class="odd"]|//tr[@class="even"]')
        url_list = []
        base_url = "http://epgd.biosino.org/EPGD"

        for site in sites:
            item = EPGD()
            item['genID'] = map(unicode.strip, site.xpath('td[1]/a/text()').extract())
            item['genID_url'] = base_url+map(unicode.strip, site.xpath('td[1]/a/@href').extract())[0][2:]
            item['taxID'] = map(unicode.strip, site.xpath('td[2]/a/text()').extract())
            item['taxID_url'] = map(unicode.strip, site.xpath('td[2]/a/@href').extract())
            item['familyID'] = map(unicode.strip, site.xpath('td[3]/a/text()').extract())
            item['familyID_url'] = base_url+map(unicode.strip, site.xpath('td[3]/a/@href').extract())[0][2:]
            item['chromosome'] = map(unicode.strip, site.xpath('td[4]/text()').extract())
            item['symbol'] = map(unicode.strip, site.xpath('td[5]/text()').extract())
            item['description'] = map(unicode.strip, site.xpath('td[6]/text()').extract())
            yield item

        sel_tmp = Selector(response)
        link = sel_tmp.xpath('//span[@id="quickPage"]')

        for site in link:
            url_list.append(site.xpath('a/@href').extract())

        for i in range(len(url_list[0])):
            if cmp(url_list[0][i], "#") == 0:
                if i+1 < len(url_list[0]):
                    print url_list[0][i+1]
                    actual_url = "http://epgd.biosino.org/EPGD/search/"+ url_list[0][i+1]
                    yield Request(actual_url, callback=self.parse)
                    break
                else:
                    print "The index is out of range!"

Can anyone tell me how to do this?

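First of all, to run multiple spiders in one script, the recommended approach is to use CrawlerProcess, passing it spider classes rather than spider instances.

To pass arguments to your spider with CrawlerProcess, just add them to the .crawl() call, after the spider subclass, e.g.

    process.crawl(DmozSpider, term='someterm', someotherterm='anotherterm')

Arguments passed this way are then available as spider attributes (the same as with -a term=someterm on the command line).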
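To illustrate why the keyword arguments show up as attributes, here is a minimal sketch that does not import Scrapy at all. It assumes that the base spider's __init__ copies any extra keyword arguments onto the instance; SpiderStandIn, EPGDSpiderSketch and search_url are hypothetical names used only for this illustration, and the default term "man" is taken from the question's spider.

```python
class SpiderStandIn:
    """Simplified stand-in for scrapy.Spider (NOT the real class)."""

    name = "EPGD"

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        # Assumption: extra keyword arguments are copied onto the instance,
        # which is what makes them readable later as self.<argument>.
        self.__dict__.update(kwargs)


class EPGDSpiderSketch(SpiderStandIn):
    def search_url(self):
        # Fall back to a default when the caller passed no term.
        term = getattr(self, "term", "man")
        return ("http://epgd.biosino.org/EPGD/search/textsearch.jsp"
                "?textquery={}&submit=Feeling+Lucky".format(term))


# Mirrors process.crawl(SpiderClass, term=..., someotherterm=...): the keyword
# arguments are handed to the spider's constructor.
spider = EPGDSpiderSketch(term="someterm", someotherterm="anotherterm")
print(spider.term)
print(spider.search_url())
```

Reading the argument with getattr and a default keeps the spider usable both with and without the extra argument.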

Finally, instead of building start_urls in __init__, you can achieve the same thing by overriding start_requests() and building the initial request from self.term, like this:

def start_requests(self):
    yield Request("http://epgd.biosino.org/"
                  "EPGD/search/textsearch.jsp?"
                  "textquery={}"
                  "&submit=Feeling+Lucky".format(self.term))
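If the search term can contain spaces or characters such as "&", formatting it straight into the URL may produce a broken query string. A possible variation, sketched here with only the standard library, is to let urlencode build the query; build_search_url and SEARCH_URL are hypothetical names, and the parameter names are copied from the URL in the question.

```python
from urllib.parse import urlencode

# Search endpoint taken from the question's start_urls.
SEARCH_URL = "http://epgd.biosino.org/EPGD/search/textsearch.jsp"


def build_search_url(term):
    """Return the search URL for `term`, with the query string percent-encoded."""
    query = urlencode({"textquery": term, "submit": "Feeling Lucky"})
    return "{}?{}".format(SEARCH_URL, query)


print(build_search_url("man"))
print(build_search_url("heat shock"))
```

Inside the spider, start_requests() could then yield a Request for build_search_url(self.term) instead of formatting the string by hand.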

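Yes, scrapy crawl my_spider -a start_url="http://google.com" works fine, but I don't want to run the spider from the command line; I want to start it from within my program.

First of all, thanks for the detailed answer! I tried CrawlerProcess, but I can't use it inside a Flask application: when I do, I get an error saying that signal only works in main thread. I asked about this in another question, but got no working solution. Do you know of another way?

If you want to use scrapy.crawler.Crawler, you need to pass it the spider class, not just the settings, e.g. crawler = Crawler(DmozSpider, settings) and then crawler.crawl(term="someterm").

The problem is that I'm running these spiders inside a Flask application, so should I try scrapy.crawler.Crawler rather than CrawlerProcess?

I don't know how to run Scrapy spiders inside a Flask application. I'll ask around.

I found out that I'm using scrapy-0.24.0 rather than scrapy-1.0, and in scrapy-0.24.0 Crawler takes only one parameter, settings, which is a bit different from the latest version.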
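The thread leaves the Flask problem open. One possible workaround, offered only as a sketch and not something the answerer proposed, is to launch the crawl in a child process so that it gets its own main thread, while still passing spider arguments from the program. It assumes the scrapy command is on PATH and that project_dir is the directory containing the Scrapy project; build_crawl_command and run_spider are hypothetical helper names.

```python
import subprocess


def build_crawl_command(spider_name, **spider_args):
    """Build the argv list for: scrapy crawl <spider> -a key=value ..."""
    cmd = ["scrapy", "crawl", spider_name]
    # Sort the arguments so the resulting command is deterministic.
    for key, value in sorted(spider_args.items()):
        cmd += ["-a", "{}={}".format(key, value)]
    return cmd


def run_spider(spider_name, project_dir, **spider_args):
    """Run the spider in a child process and return its exit code."""
    cmd = build_crawl_command(spider_name, **spider_args)
    # cwd is assumed to be the Scrapy project directory.
    return subprocess.run(cmd, cwd=project_dir).returncode


print(build_crawl_command("EPGD", term="man"))
```

A Flask view could call run_spider("EPGD", project_dir, term="man"); that call blocks until the crawl finishes, so a long crawl would normally be handed to a background worker instead.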
