Python: How do I save data to MongoDB through a pipeline when using multiple spiders in Scrapy?
Tags: python, mongodb, python-2.7, scrapy, pipeline

I use two spiders to fetch data from web pages, and I run them at the same time with CrawlerProcess().
The spiders' code:
class GDSpider(Spider):
    name = "GenDis"
    allowed_domains = ["gold.jgi.doe.gov"]
    base_url = "https://gold.jgi.doe.gov/projects"
    stmp = []
    term = "man"
    for i in range(1, 1000):
        url = "https://gold.jgi.doe.gov/projects?page=" + str(i) + "&Project.Project+Name=" + term + "&count=25"
        stmp.append(url)
    start_urls = stmp

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//tr[@class="odd"]|//tr[@class="even"]')
        for site in sites:
            item = GenDis()
            item['Id'] = site.xpath('td/a/text()').extract()
            item['Link'] = site.xpath('td/a/@href').extract()
            item['Name'] = map(unicode.strip, site.xpath('td[2]/text()').extract())
            item['Status'] = map(unicode.strip, site.xpath('td[3]/text()').extract())
            item['Add_Date'] = map(unicode.strip, site.xpath('td[4]/text()').extract())
            yield item

class EPGD_spider(Spider):
    name = "EPGD"
    allowed_domains = ["epgd.biosino.org"]
    term = "man"
    start_urls = ["http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery=" + term + "&submit=Feeling+Lucky"]
    MONGODB_DB = name + "_" + term
    MONGODB_COLLECTION = name + "_" + term

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//tr[@class="odd"]|//tr[@class="even"]')
        url_list = []
        base_url = "http://epgd.biosino.org/EPGD"
        for site in sites:
            item = EPGD()
            item['genID'] = map(unicode.strip, site.xpath('td[1]/a/text()').extract())
            item['genID_url'] = base_url + map(unicode.strip, site.xpath('td[1]/a/@href').extract())[0][2:]
            item['taxID'] = map(unicode.strip, site.xpath('td[2]/a/text()').extract())
            item['taxID_url'] = map(unicode.strip, site.xpath('td[2]/a/@href').extract())
            item['familyID'] = map(unicode.strip, site.xpath('td[3]/a/text()').extract())
            item['familyID_url'] = base_url + map(unicode.strip, site.xpath('td[3]/a/@href').extract())[0][2:]
            item['chromosome'] = map(unicode.strip, site.xpath('td[4]/text()').extract())
            item['symbol'] = map(unicode.strip, site.xpath('td[5]/text()').extract())
            item['description'] = map(unicode.strip, site.xpath('td[6]/text()').extract())
            yield item

        sel_tmp = Selector(response)
        link = sel_tmp.xpath('//span[@id="quickPage"]')
        for site in link:
            url_list.append(site.xpath('a/@href').extract())
        for i in range(len(url_list[0])):
            if cmp(url_list[0][i], "#") == 0:
                if i+1 < len(url_list[0]):
                    print url_list[0][i+1]
                    actual_url = "http://epgd.biosino.org/EPGD/search/" + url_list[0][i+1]
                    yield Request(actual_url, callback=self.parse)
                    break
                else:
                    print "The index is out of range!"

process = CrawlerProcess()
process.crawl(EPGD_spider)
process.crawl(GDSpider)
process.start() # the script will block here until all crawling jobs are finished
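When I use one spider at a time, everything works fine. But when I run both at once, the pipeline seems to stop working: neither the database nor the collection gets created.
I have read the CrawlerProcess() section of the Scrapy documentation many times, but it says nothing about pipelines. Can anybody tell me what is wrong with my code?

This should fix the problem. CrawlerProcess() called with no arguments does not load the project settings, so the ITEM_PIPELINES from settings.py never take effect: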
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(get_project_settings())
process.crawl(EPGD_spider)
process.crawl(GDSpider)
process.start()
You may also need to refactor your spider code so that a connection is opened for each spider.
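A minimal sketch of what opening one connection per spider could look like, not a definitive implementation. The class name MongoPerSpiderPipeline, the MONGODB_URI setting, and the client_factory argument are hypothetical names chosen for illustration. MONGODB_DB and MONGODB_COLLECTION are the spider attributes from the question, with the spider's name as an assumed fallback. The sketch relies on pymongo's MongoClient and insert_one.

```python
class MongoPerSpiderPipeline(object):
    """Opens one MongoDB connection for the spider this pipeline serves."""

    def __init__(self, mongo_uri, client_factory=None):
        self.mongo_uri = mongo_uri
        # client_factory is hypothetical; it only lets the sketch run without a server.
        self._client_factory = client_factory
        self.client = None
        self.collection = None

    @classmethod
    def from_crawler(cls, crawler):
        # MONGODB_URI is an assumed setting name, not a built-in Scrapy setting.
        return cls(crawler.settings.get("MONGODB_URI", "mongodb://localhost:27017"))

    def open_spider(self, spider):
        factory = self._client_factory
        if factory is None:
            import pymongo  # imported lazily; only needed for a real connection
            factory = pymongo.MongoClient
        self.client = factory(self.mongo_uri)
        # Prefer the names the spider declares; fall back to the spider's name.
        db_name = getattr(spider, "MONGODB_DB", spider.name)
        collection_name = getattr(spider, "MONGODB_COLLECTION", spider.name)
        self.collection = self.client[db_name][collection_name]

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()
```

It would be enabled through ITEM_PIPELINES in settings.py. Each crawl started with process.crawl() builds its own pipeline instance, so each spider ends up with its own connection.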
Bonus tip 1: pymongo's calls are blocking, so be careful with it inside Scrapy; otherwise you may get very poor performance.
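Bonus tip 2: All these settings get hard to manage. Consider "packing" them into a single URL, which makes them easier to manage (and cleaner).
Bonus tip 3: You are probably doing too many writes/transactions. If your use case allows it, save the results to a .jl file and use mongoimport to bulk import it when the crawl finishes. Here is how to do that in more detail.
Assuming a project named tutorial and a spider named example that creates 100 items, you would create an extension in tutorial/extensions.py: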
import logging
import subprocess

from scrapy import signals
from scrapy.exceptions import NotConfigured

logger = logging.getLogger(__name__)


class MyBulkExtension(object):

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def __init__(self, crawler):
        settings = crawler.settings

        # FEED_URI is the feed output path (the file passed with -o in Scrapy 1.x).
        self._feed_uri = settings.get('FEED_URI', None)
        if self._feed_uri is None:
            raise NotConfigured('Missing FEED_URI')
        self._db = settings.get('BULK_MONGO_DB', None)
        if self._db is None:
            raise NotConfigured('Missing BULK_MONGO_DB')
        self._collection = settings.get('BULK_MONGO_COLLECTION', None)
        if self._collection is None:
            raise NotConfigured('Missing BULK_MONGO_COLLECTION')

        # Run the import once the spider has finished.
        crawler.signals.connect(self._closed, signal=signals.spider_closed)

    def _closed(self, spider, reason, signal, sender):
        logger.info("writting file %s to db %s, colleciton %s" %
                    (self._feed_uri, self._db, self._collection))
        # --drop replaces any existing contents of the collection.
        command = ("mongoimport --db %s --collection %s --drop --file %s" %
                   (self._db, self._collection, self._feed_uri))

        # command.split() assumes none of the names contain spaces.
        p = subprocess.Popen(command.split())
        p.communicate()

        logger.info('Import done')
In tutorial/settings.py, activate the extension and set the two settings:
EXTENSIONS = {
    'tutorial.extensions.MyBulkExtension': 500
}
BULK_MONGO_DB = "test"
BULK_MONGO_COLLECTION = "foobar"
You can then run the crawl as follows:
$ scrapy crawl -L INFO example -o foobar.jl
...
[tutorial.extensions] INFO: writting file foobar.jl to db test, colleciton foobar
connected to: 127.0.0.1
dropping: test.foobar
check 9 100
imported 100 objects
[tutorial.extensions] INFO: Import done
...
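
If a bulk import at the end of the crawl is not an option, a middle ground (not part of the answer above) is to buffer items and write them in batches. A minimal sketch, assuming a pymongo collection object is handed in; BufferedMongoWriter and batch_size are hypothetical names, and insert_many is pymongo's multi-document insert call.

```python
class BufferedMongoWriter(object):
    """Collects items and writes them to a collection in batches."""

    def __init__(self, collection, batch_size=100):
        self.collection = collection
        self.batch_size = batch_size
        self._buffer = []

    def add(self, item):
        # Copy the item into a plain dict and flush once the batch is full.
        self._buffer.append(dict(item))
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # One insert_many call per batch instead of one insert per item.
        if self._buffer:
            self.collection.insert_many(self._buffer)
            self._buffer = []
```

A pipeline would call add() from process_item and flush() from close_spider, so that the last partial batch is not lost.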
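
Do you want to store the items in the same database and collection, or in different ones?

Different databases and collections. Do you know how to do that?

Do you mean that I am writing too much data and should look at mongoimport to write the data to MongoDB in one go?

Every time you write a single item, mongo (and every other database) goes through acquiring some locks. That makes the write more expensive. With a bulk import the locks are acquired only once, several db-level optimizations are typically enabled, and the import is more efficient. You can keep using your original code to write each item. If your use case allows a bulk import at the end of the crawl, i.e. the crawl takes a relatively short time, then you should prefer the bulk import over individual inserts.

Thank you! I will read up more on bulk imports.

My pleasure. I have updated the answer to include the proper way to do it.

Thanks for the detailed answer. I have one more question: what if the names of the database and collection I want to insert into are not set before Scrapy runs? For example, I want the database and collection names to be name + term; when name is EPGD and term is man, I want both to be named EPGD_man. How can I do that?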
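
One possible way to do this, sketched under assumptions: derive both names from the spider's own attributes when the spider is opened, rather than from settings.py. mongo_names_for is a hypothetical helper. It assumes the spider exposes name and term as EPGD_spider does in the question, and it conservatively replaces any character other than letters, digits, "_" and "-".

```python
import re


def mongo_names_for(spider):
    """Return (db_name, collection_name) built from spider.name and spider.term."""
    term = getattr(spider, "term", "")
    raw = spider.name + "_" + term if term else spider.name
    # Conservative clean-up: keep letters, digits, "_" and "-" only.
    safe = re.sub(r"[^A-Za-z0-9_-]", "_", raw)
    return safe, safe
```

A pipeline's open_spider(self, spider) hook could call this helper and then select client[db_name][collection_name]. For name = "EPGD" and term = "man" it returns EPGD_man for both names.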