Python: Why is importing one million HTML pages into MongoDB so slow?


python mongodb performance


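I am new to MongoDB.

Recently I crawled 2.5 million web pages for my own practice. Each page is about 0.1 MB and is saved as an HTML file on my hard disk.

I plan to import all of the HTML files into MongoDB for further processing. To speed up the import, I used a multiprocessing-style pool in Python. Here is my code: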



    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    import re, sys, os, time, subprocess
    from pymongo import MongoClient
    from multiprocessing.dummy import Pool


    ## Connect to the default MongoDB server (localhost:27017)
    client = MongoClient()
    ## Select a database; it is created on first write if it does not exist.
    db = client['mongoF']
    ## Select a collection (similar to a table)
    collection = db['HTMLs']

    shellMSG = subprocess.check_output("find /home/pages_8 -type f", shell=True)
    ## splitlines() avoids the empty trailing entry that split('\n') leaves behind
    html_ls = shellMSG.splitlines()

    def html2mongo(html):
        item = {
            "_id"  : html[-13:-5],
            "category"  : re.search("pages_.*?/(.*)/", html).group(1).split('/'),
            "bs"        : open(html).read()
        }
        collection.insert(item)
        msg = "\rImported " + str(collection.count()) + " files!"
        sys.stdout.write(msg); sys.stdout.flush()  

    t1 = time.time()

    pool = Pool(16)
    pool.map(html2mongo, html_ls) 
    pool.close()  
    pool.join()

    print "\nImporting used " + str(int(time.time() - t1)) + " seconds in total!\n"

My computer is a modern gaming PC and is quite fast. The CPU has 8 cores. Going by that, I set the pool size to 16.
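One detail worth checking here: `multiprocessing.dummy.Pool`, which the script imports, has the same API as `multiprocessing.Pool` but is backed by threads, so all 16 workers run inside a single Python process. A minimal sketch that makes this visible follows; the helper names `worker_identity` and `pool_identities` are made up for illustration, and it is written for Python 3.

```python
import os
import threading
from multiprocessing.dummy import Pool  # thread-backed pool, same API as multiprocessing.Pool


def worker_identity(_):
    """Return the process ID and thread name that handled this task."""
    return os.getpid(), threading.current_thread().name


def pool_identities(tasks=64, workers=16):
    """Run `tasks` trivial jobs on a pool of `workers` and collect who ran them."""
    with Pool(workers) as pool:
        return pool.map(worker_identity, range(tasks))


identities = pool_identities()
pids = {pid for pid, _ in identities}
threads = {name for _, name in identities}

# Every task reports the same PID, because the "pool" is made of threads.
print("distinct processes:", len(pids))
print("distinct threads:", len(threads))
```

Threads can still help with I/O-bound work such as reading files and waiting on the database, but the CPU-bound parts (the regular expression, building each document) share one interpreter lock, so the core count is a weaker guide to pool size than it would be with real processes.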

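When I ran the script above, only 1.3 million files were imported, because the disk holding the MongoDB database ran out of space. Even so, the run took more than 7 hours, which seems very slow to me.

I also noticed that the virtual memory usage was huge: 372 GB.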

My questions are:

1. What is a normal speed for importing a million files like these?
2. What approach should I use to speed up the import?

Regards
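On question 2, two costs in the script above stand out: it calls `collection.count()` after every insert, which adds an extra round trip to the server each time, and it sends one document per insert. A hedged sketch of an alternative follows, assuming Python 3 and PyMongo 3 or later (where `insert_many` is available). The helper names `batched`, `build_document`, `import_files` and `list_files`, the batch size of 500, and the UTF-8 decoding are illustrative assumptions, not measured optima. Whether this closes the gap also depends on the storage engine and the disk, which the post does not specify.

```python
import os
import re
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def build_document(path):
    """Build one document with the same fields as the original script."""
    match = re.search(r"pages_.*?/(.*)/", path)
    # Assumption: pages are UTF-8; undecodable bytes are replaced, not fatal.
    with open(path, encoding="utf-8", errors="replace") as fh:
        body = fh.read()
    return {
        "_id": path[-13:-5],
        "category": match.group(1).split("/") if match else [],
        "bs": body,
    }


def import_files(paths, collection, batch_size=500):
    """Insert files in batches; `collection` must provide insert_many()."""
    total = 0
    for chunk in batched(paths, batch_size):
        docs = [build_document(p) for p in chunk]
        # One call per batch; ordered=False lets the server continue past
        # an individual failure such as a duplicate _id.
        collection.insert_many(docs, ordered=False)
        total += len(docs)
        # Count locally instead of asking the server after every insert.
        print("\rImported %d files" % total, end="", flush=True)
    print()
    return total


def list_files(root):
    """Walk `root` in Python instead of shelling out to `find`."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            yield os.path.join(dirpath, name)


# Usage against a real server (requires pymongo and a running mongod):
#   from pymongo import MongoClient
#   collection = MongoClient()["mongoF"]["HTMLs"]
#   import_files(list_files("/home/pages_8"), collection)
```

The function only relies on `insert_many`, so it can be tried against a stand-in object before pointing it at a real collection. Batches of 500 pages at roughly 0.1 MB each hold about 50 MB in memory at a time, so a smaller batch size may suit machines with less RAM.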

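Sorry, I rechecked and each file is around 0.09 MB. I imported 1.32 million files in 26738 seconds, so the speed is 1320000 / 26738 * 0.09 MB = 4.4 MB/s. As far as I know, the write speed of a modern hard disk should be greater than 80 MB/s, right?

Disk speed is one thing. Another is that, AFAIK, mongo locks the entire collection on every write. That is obviously inefficient. In my opinion, you shouldn't really use that pseudo-database.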
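The throughput estimate quoted above can be reproduced with a few lines of arithmetic, using only the figures given in the thread (1.32 million files, 26738 seconds, about 0.09 MB per file, which is the size that yields the 4.4 MB/s result). The 80 MB/s value is the poster's own rough expectation for disk writes, not a measurement.

```python
files_imported = 1320000
elapsed_seconds = 26738
file_size_mb = 0.09            # per-file size after the poster's recheck
expected_disk_mb_per_s = 80    # poster's rough expectation, not measured

files_per_second = files_imported / elapsed_seconds
mb_per_second = files_per_second * file_size_mb
fraction_of_expected = mb_per_second / expected_disk_mb_per_s

print("files/s: %.1f" % files_per_second)                    # files/s: 49.4
print("MB/s: %.1f" % mb_per_second)                          # MB/s: 4.4
print("share of 80 MB/s: %.1f%%" % (100 * fraction_of_expected))
```

At roughly 49 files per second, each insert takes on the order of 20 ms of wall-clock time, which suggests per-operation overhead rather than raw disk bandwidth is worth investigating first.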