How do I save data to MongoDB through a pipeline when running multiple spiders in Scrapy?

python mongodb python-2.7 scrapy

I use two spiders to scrape data from web pages, and I run them at the same time with CrawlerProcess(). The spider code:

from scrapy import Request, Spider
from scrapy.crawler import CrawlerProcess
from scrapy.selector import Selector

# GenDis and EPGD are the project's Item classes; their import is not shown in the question.


class GDSpider(Spider):
    name = "GenDis"
    allowed_domains = ["gold.jgi.doe.gov"]
    base_url = "https://gold.jgi.doe.gov/projects"
    stmp = []
    term = "man"
    # Build the paginated search URLs up front.
    for i in range(1, 1000):
        url = "https://gold.jgi.doe.gov/projects?page=" + str(i) + "&Project.Project+Name=" + term + "&count=25"
        stmp.append(url)

    start_urls = stmp

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//tr[@class="odd"]|//tr[@class="even"]')

        for site in sites:
            item = GenDis()
            item['Id'] = site.xpath('td/a/text()').extract()
            item['Link'] = site.xpath('td/a/@href').extract()
            item['Name'] = map(unicode.strip, site.xpath('td[2]/text()').extract())
            item['Status'] = map(unicode.strip, site.xpath('td[3]/text()').extract())
            item['Add_Date'] = map(unicode.strip, site.xpath('td[4]/text()').extract())
            yield item



class EPGD_spider(Spider):
    name = "EPGD"
    allowed_domains = ["epgd.biosino.org"]
    term = "man"
    start_urls = ["http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery="+term+"&submit=Feeling+Lucky"]
    MONGODB_DB = name + "_" + term
    MONGODB_COLLECTION = name + "_" + term

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//tr[@class="odd"]|//tr[@class="even"]')
        url_list = []
        base_url = "http://epgd.biosino.org/EPGD"

        for site in sites:
            item = EPGD()
            item['genID'] = map(unicode.strip, site.xpath('td[1]/a/text()').extract())
            item['genID_url'] = base_url+map(unicode.strip, site.xpath('td[1]/a/@href').extract())[0][2:]
            item['taxID'] = map(unicode.strip, site.xpath('td[2]/a/text()').extract())
            item['taxID_url'] = map(unicode.strip, site.xpath('td[2]/a/@href').extract())
            item['familyID'] = map(unicode.strip, site.xpath('td[3]/a/text()').extract())
            item['familyID_url'] = base_url+map(unicode.strip, site.xpath('td[3]/a/@href').extract())[0][2:]
            item['chromosome'] = map(unicode.strip, site.xpath('td[4]/text()').extract())
            item['symbol'] = map(unicode.strip, site.xpath('td[5]/text()').extract())
            item['description'] = map(unicode.strip, site.xpath('td[6]/text()').extract())
            yield item

        sel_tmp = Selector(response)
        link = sel_tmp.xpath('//span[@id="quickPage"]')

        for site in link:
            url_list.append(site.xpath('a/@href').extract())

        # The "#" entry marks the current page; the link after it is the next page.
        for i in range(len(url_list[0])):
            if url_list[0][i] == "#":
                if i+1 < len(url_list[0]):
                    print url_list[0][i+1]
                    actual_url = "http://epgd.biosino.org/EPGD/search/"+ url_list[0][i+1]
                    yield Request(actual_url, callback=self.parse)
                    break
                else:
                    print "The index is out of range!"

process = CrawlerProcess()
process.crawl(EPGD_spider)
process.crawl(GDSpider)
process.start() # the script will block here until all crawling jobs are finished

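When I run one spider at a time, everything works. But when I run both together, the pipeline no longer seems to work: neither the database nor the collection gets created.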

I have read the CrawlerProcess() section of the Scrapy documentation several times, but it says nothing about pipelines. Can anyone tell me what is wrong with my code?

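This should fix it. CrawlerProcess() called without arguments runs with Scrapy's default settings, so the ITEM_PIPELINES configured in the project's settings.py are never loaded. Pass the project settings explicitly: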

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(get_project_settings())
process.crawl(EPGD_spider)
process.crawl(GDSpider)
process.start()

You may also need to refactor your spider code so that a connection is opened for each spider (the example for this used "Bonus tip 2" below).
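A minimal sketch of what opening one connection per spider could look like, assuming pymongo is installed and MongoDB runs locally. The class name `PerSpiderMongoPipeline`, the `MONGO_URI` setting, and the `client_factory` parameter are hypothetical names introduced for illustration; `MONGODB_DB` and `MONGODB_COLLECTION` are the attributes the EPGD spider in the question already defines, with the spider name used as a fallback.

```python
class PerSpiderMongoPipeline(object):
    """Item pipeline that opens its own MongoDB connection for each spider."""

    def __init__(self, mongo_uri="mongodb://localhost:27017", client_factory=None):
        self.mongo_uri = mongo_uri
        # client_factory is a hypothetical hook so the class can be tried
        # without a running MongoDB server.
        self._client_factory = client_factory

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI is an assumed setting name, not a built-in Scrapy setting.
        uri = crawler.settings.get("MONGO_URI", "mongodb://localhost:27017")
        return cls(mongo_uri=uri)

    def open_spider(self, spider):
        if self._client_factory is not None:
            self.client = self._client_factory(self.mongo_uri)
        else:
            import pymongo  # imported lazily; only needed for a real connection
            self.client = pymongo.MongoClient(self.mongo_uri)
        # Take the target names from the spider, falling back to its name.
        db_name = getattr(spider, "MONGODB_DB", spider.name)
        coll_name = getattr(spider, "MONGODB_COLLECTION", spider.name)
        self.collection = self.client[db_name][coll_name]

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
```

To try it, the class would be listed under ITEM_PIPELINES in settings.py (for example under a path such as tutorial.pipelines.PerSpiderMongoPipeline, which is an assumed module location). Scrapy calls open_spider and close_spider once per spider, so each of the two spiders writes through its own connection to its own database and collection.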

Bonus tip 1: Use pymongo, otherwise you may get very poor performance.

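Bonus tip 2: All of these settings become hard to manage. Consider "packing" them into a single URL to keep them manageable (and cleaner).

Bonus tip 3: You are probably doing too many writes/transactions. If your use case allows it, save the results to a .jl file and use mongoimport to bulk-import them once the crawl finishes. Here is how to do that in more detail.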

Assuming a project named tutorial and a spider named example that produces 100 items, you would create an extension in tutorial/extensions.py:

import logging
import subprocess

from scrapy import signals
from scrapy.exceptions import NotConfigured

logger = logging.getLogger(__name__)


class MyBulkExtension(object):

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def __init__(self, crawler):
        settings = crawler.settings

        self._feed_uri = settings.get('FEED_URI', None)
        if self._feed_uri is None:
            raise NotConfigured('Missing FEED_URI')
        self._db = settings.get('BULK_MONGO_DB', None)
        if self._db is None:
            raise NotConfigured('Missing BULK_MONGO_DB')
        self._collection = settings.get('BULK_MONGO_COLLECTION', None)
        if self._collection is None:
            raise NotConfigured('Missing BULK_MONGO_COLLECTION')

        crawler.signals.connect(self._closed, signal=signals.spider_closed)

    def _closed(self, spider, reason, signal, sender):
        logger.info("writing file %s to db %s, collection %s" %
                    (self._feed_uri, self._db, self._collection))
        # Build the argument list directly so values containing spaces survive.
        command = ["mongoimport", "--db", self._db,
                   "--collection", self._collection,
                   "--drop", "--file", self._feed_uri]

        p = subprocess.Popen(command)
        p.communicate()

        logger.info('Import done')

In tutorial/settings.py, activate the extension and set the two settings:

EXTENSIONS = {
    'tutorial.extensions.MyBulkExtension': 500
}

BULK_MONGO_DB = "test"
BULK_MONGO_COLLECTION = "foobar"

You can then run the crawl as follows:

$ scrapy crawl -L INFO example -o foobar.jl
...
[tutorial.extensions] INFO: writing file foobar.jl to db test, collection foobar
connected to: 127.0.0.1
dropping: test.foobar
check 9 100
imported 100 objects
[tutorial.extensions] INFO: Import done
...
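Returning to bonus tip 2, here is a minimal sketch of packing the connection settings into a single URL. The `mongodb://host:port/db/collection` layout and the helper name `parse_mongo_url` are assumptions made for illustration; a standard MongoDB connection URI carries only a database name, so the trailing collection segment is a convention of this sketch.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2.7, as used in the question


def parse_mongo_url(url):
    """Split 'mongodb://host[:port]/db/collection' into its parts."""
    parts = urlparse(url)
    path = [segment for segment in parts.path.split("/") if segment]
    if parts.scheme != "mongodb" or len(path) != 2:
        raise ValueError("expected mongodb://host[:port]/db/collection, got %r" % url)
    return {
        "host": parts.hostname or "localhost",
        "port": parts.port or 27017,  # MongoDB's default port
        "db": path[0],
        "collection": path[1],
    }
```

With this, one setting such as `mongodb://localhost:27017/test/foobar` could replace separate host, port, database, and collection settings.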

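Do you want to store the items in the same database and collection, or in different ones?

Different databases and collections. Do you know how to do that?

Do you mean that I am writing too much data, and that I should look at mongoimport to write the data to MongoDB in one go?

Every time you write a single record, Mongo (like every other database) goes through the process of acquiring some locks. That makes writes more expensive. With a bulk import, the locks are acquired only once, several database-level optimisations are typically enabled, and the import is far more efficient. You can keep your original code that writes each item. But if your use case allows a bulk import at the end of the crawl, that is, if the crawl takes a relatively short time, you would prefer a bulk import over individual inserts.

Thanks! I will read up on bulk imports.

Glad to help. I have updated the answer to include the proper way to do it.

Thanks for the detailed answer. I have one more question: what if the names of the database and collection I want to insert into are not known before Scrapy starts? For example, I want the database and collection names to be name+term, so that when the name is EPGD and the term is man, the database and collection are both named EPGD man. How can I do that?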
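One possible way to combine the batching point from the comments with names chosen at run time is sketched below. It assumes pymongo and a local MongoDB server; the class name `BatchedMongoPipeline`, the `batch_size` and `client_factory` parameters, and the "default" fallback term are hypothetical. The name is built as name + "_" + term, the same way the EPGD spider in the question builds MONGODB_DB.

```python
class BatchedMongoPipeline(object):
    """Buffers items and writes them with insert_many() instead of one by one."""

    def __init__(self, batch_size=500, client_factory=None):
        self.batch_size = batch_size
        # client_factory is a hypothetical hook for trying this without a server.
        self._client_factory = client_factory
        self._buffer = []

    def open_spider(self, spider):
        if self._client_factory is not None:
            self.client = self._client_factory()
        else:
            import pymongo  # imported lazily; only needed for a real connection
            self.client = pymongo.MongoClient()  # defaults to localhost:27017
        # Derive the database and collection name from the running spider.
        target = "%s_%s" % (spider.name, getattr(spider, "term", "default"))
        self.collection = self.client[target][target]

    def process_item(self, item, spider):
        self._buffer.append(dict(item))
        if len(self._buffer) >= self.batch_size:
            self._flush()
        return item

    def close_spider(self, spider):
        self._flush()  # write whatever is left in the buffer
        self.client.close()

    def _flush(self):
        if self._buffer:
            self.collection.insert_many(self._buffer)
            self._buffer = []
```

This sits between per-item inserts and a single mongoimport at the end: fewer round trips than the former, without having to wait for the crawl to finish as with the latter.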
