Python: How do I pass arguments to a Scrapy spider programmatically?
I am new to Python and Scrapy. I used the method from this blog to run spiders inside a Flask application. Here is the code:
# list of crawlers
TO_CRAWL = [DmozSpider, EPGDspider, GDSpider]

# crawlers that are running
RUNNING_CRAWLERS = []

def spider_closing(spider):
    """
    Activates on spider closed signal
    """
    log.msg("Spider closed: %s" % spider, level=log.INFO)
    RUNNING_CRAWLERS.remove(spider)
    if not RUNNING_CRAWLERS:
        reactor.stop()

# start logger
log.start(loglevel=log.DEBUG)

# set up the crawler and start to crawl one spider at a time
for spider in TO_CRAWL:
    settings = Settings()

    # crawl responsibly
    settings.set("USER_AGENT", "Kiran Koduru (+http://kirankoduru.github.io)")
    crawler = Crawler(settings)
    crawler_obj = spider()
    RUNNING_CRAWLERS.append(crawler_obj)

    # stop reactor when spider closes
    crawler.signals.connect(spider_closing, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(crawler_obj)
    crawler.start()

# blocks process; so always keep as the last statement
reactor.run()
This is my spider code:
class EPGDspider(scrapy.Spider):
    name = "EPGD"
    allowed_domains = ["epgd.biosino.org"]
    term = "man"
    start_urls = ["http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery="+term+"&submit=Feeling+Lucky"]
    MONGODB_DB = name + "_" + term
    MONGODB_COLLECTION = name + "_" + term

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//tr[@class="odd"]|//tr[@class="even"]')
        url_list = []
        base_url = "http://epgd.biosino.org/EPGD"

        for site in sites:
            item = EPGD()
            item['genID'] = map(unicode.strip, site.xpath('td[1]/a/text()').extract())
            item['genID_url'] = base_url+map(unicode.strip, site.xpath('td[1]/a/@href').extract())[0][2:]
            item['taxID'] = map(unicode.strip, site.xpath('td[2]/a/text()').extract())
            item['taxID_url'] = map(unicode.strip, site.xpath('td[2]/a/@href').extract())
            item['familyID'] = map(unicode.strip, site.xpath('td[3]/a/text()').extract())
            item['familyID_url'] = base_url+map(unicode.strip, site.xpath('td[3]/a/@href').extract())[0][2:]
            item['chromosome'] = map(unicode.strip, site.xpath('td[4]/text()').extract())
            item['symbol'] = map(unicode.strip, site.xpath('td[5]/text()').extract())
            item['description'] = map(unicode.strip, site.xpath('td[6]/text()').extract())
            yield item

        sel_tmp = Selector(response)
        link = sel_tmp.xpath('//span[@id="quickPage"]')

        for site in link:
            url_list.append(site.xpath('a/@href').extract())

        for i in range(len(url_list[0])):
            if cmp(url_list[0][i], "#") == 0:
                if i+1 < len(url_list[0]):
                    print url_list[0][i+1]
                    actual_url = "http://epgd.biosino.org/EPGD/search/"+ url_list[0][i+1]
                    yield Request(actual_url, callback=self.parse)
                    break
                else:
                    print "The index is out of range!"
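Can anyone tell me how to handle this?

First, to run multiple spiders in one script, the recommended approach is to use CrawlerProcess, passing it spider classes rather than spider instances.
To pass arguments to a spider with CrawlerProcess, just add them to the .crawl() call after the spider subclass, e.g.
process.crawl(DmozSpider, term='someterm', someotherterm='anotherterm')
Arguments passed this way are then available as spider attributes (the same as with -a term=someterm on the command line).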
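To make the "available as spider attributes" point concrete, here is a minimal sketch in Python 3. It deliberately does not import Scrapy: `StandInSpider` is a hypothetical stand-in that only mimics the way `scrapy.Spider` copies keyword arguments onto the instance, and `collection_name` is an illustrative helper that is not part of the question's code. It also shows why values such as `MONGODB_DB` in the question, which are built in the class body, would not follow an argument passed at run time.

```python
class StandInSpider:
    """Hypothetical stand-in for scrapy.Spider, for illustration only."""

    def __init__(self, **kwargs):
        # scrapy.Spider stores keyword arguments on the instance in a similar way
        self.__dict__.update(kwargs)


class EPGDSpider(StandInSpider):
    name = "EPGD"
    term = "man"  # class-level default, used when no term argument is passed

    def collection_name(self):
        # Computed at call time, so it reflects the term that was actually passed.
        # A class-body expression like name + "_" + term is evaluated once,
        # at class definition, and would always use the default term.
        return "{}_{}".format(self.name, self.term)


default_spider = EPGDSpider()
custom_spider = EPGDSpider(term="mouse")

print(default_spider.collection_name())  # EPGD_man
print(custom_spider.collection_name())   # EPGD_mouse
```

With a real Scrapy spider the same idea applies: read `self.term` inside methods rather than in the class body.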
Finally, instead of building start_urls in __init__, you can achieve the same thing with start_requests, using self.term to build the initial requests like this:
def start_requests(self):
    yield Request("http://epgd.biosino.org/"
                  "EPGD/search/textsearch.jsp?"
                  "textquery={}"
                  "&submit=Feeling+Lucky".format(self.term))
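Yes, scrapy crawl my_spider -a start_url="http://google.com" works fine, but I don't want to invoke the spider from the command line; I want to invoke it from within a program.
First of all, thank you for the detailed answer! I tried CrawlerProcess, but I can't use it in a Flask application: when I do, I get an error saying that signal only works in the main thread. I asked a question about this, but got no working solution. Do you have another approach?
If you want to use scrapy.crawler.Crawler, you need to pass it the spider class, not just the settings, e.g. crawler = Crawler(DmozSpider, settings) and then crawler.crawl(term="someterm").
The problem is that I am running these spiders in a Flask application, so should I try scrapy.crawler.Crawler instead of CrawlerProcess?
I don't know how to run Scrapy spiders in a Flask application. I'll ask around.
I found that I am using scrapy-0.24.0 rather than scrapy-1.0, and in scrapy-0.24.0 Crawler takes only one argument, settings, which is a bit different from the latest version.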
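The thread leaves the Flask problem open. One workaround that is sometimes used, sketched here in Python 3 and not proposed by anyone above, is to launch the crawl in a separate process, so that Scrapy gets its own main thread and the "signal only works in main thread" error does not arise. The sketch reuses the `scrapy crawl ... -a key=value` command line mentioned in the comments; `build_crawl_command`, `run_spider` and the `project_dir` parameter are hypothetical names, and it assumes the `scrapy` executable is on the PATH and that `project_dir` is the Scrapy project directory.

```python
import subprocess


def build_crawl_command(spider_name, **spider_args):
    """Build the argument list for `scrapy crawl <name> -a key=value ...`."""
    command = ["scrapy", "crawl", spider_name]
    # Sort the arguments so the resulting command is deterministic
    for key, value in sorted(spider_args.items()):
        command += ["-a", "{}={}".format(key, value)]
    return command


def run_spider(spider_name, project_dir, **spider_args):
    """Run the spider in a child process and return its exit code.

    Blocks until the crawl finishes. project_dir is assumed to be the
    directory that contains the project's scrapy.cfg.
    """
    completed = subprocess.run(
        build_crawl_command(spider_name, **spider_args),
        cwd=project_dir,
    )
    return completed.returncode


print(build_crawl_command("EPGD", term="man"))
# ['scrapy', 'crawl', 'EPGD', '-a', 'term=man']
```

Because `run_spider` blocks, a Flask view would normally call it from a background worker so the request is not held open for the whole crawl.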