Accessing views from the couchdb module in Python

Tags: python, python-3.x, couchdb, cloudant

While working with views through the couchdb module, I noticed that Python loads the entire result set into memory before it starts processing it. So even when the limit and skip parameters are specified, it first loads the whole result set into memory, then applies limit and skip, and only then returns the results.

For example, here is my code:

import requests
import json
import couchdb
import time

couch = couchdb.Server(url)

def dumpidtofile(dbname,view):
    db=couch[dbname]
    func_total_time=0
    count=db.info()['doc_count'] # Get a count of total number of documents
    batch=count // 10000 # Divide the total count into batches of 10000 and save the quotient
    f=open(dbname, 'w')
    if batch == 0 :
        print ("Number of documents less that 10000. continuing !!")
        start_time = time.monotonic()
        for item in db.view(view):
            # print (item.key)
            f.write(item.key)
            f.write('\n')
        elapsed_time = time.monotonic() - start_time
        func_total_time=elapsed_time
        print ("Loop finished. Time spent in this loop was {0}".format(elapsed_time))
        print ("Total Function Time :", func_total_time)
    else:
        print ("Number of documents greater that 10000. Breaking into batches !!")
        batch=batch + 1 # This is the number of times that we would have to iterate to retrieve all documents
        for i in range(batch):
            start_time = time.monotonic()
            for item in db.view(view,limit=10000,skip=i*10000):
                # print (item.key)
                f.write(item.key)
                f.write('\n')
            elapsed_time = time.monotonic() - start_time
            func_total_time = func_total_time + elapsed_time
            print ("Loop {0} finished. Time spent in this loop was {1}".format(i,elapsed_time))
        print ("Total Function Time :", func_total_time)
    f.close()

prog_start_time = time.monotonic()
dumpidtofile("mydb","myindex/myview")
prog_end_time = time.monotonic() - prog_start_time
print ("Total Program Time :", prog_end_time)
Here is my sample output (screenshot):

The program waits for roughly 90 seconds at the point highlighted in the screenshot before continuing. At that point I suspected the whole view was being loaded before the loops even started processing it. That may be fine for small databases, but it does not seem so good for large ones (some of the databases I work with are ~15-20 GB).
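To get a first idea of where that time goes, the run can be profiled with the standard library's cProfile. A minimal sketch, reusing the placeholder database and view names from the code above:

import cProfile
import pstats

# Profile the whole dump and write the raw stats to a file
cProfile.run('dumpidtofile("mydb","myindex/myview")', 'dump.prof')

# Show the ten entries with the largest cumulative time
stats = pstats.Stats('dump.prof')
stats.sort_stats('cumulative').print_stats(10)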

So I guess my questions are:

  • Is there a better way to iterate over the documents, especially in large databases, loading only a portion of the documents at a time?

  • How can I find out where this program spends most of its time, and how can I optimize it?

  • Apologies for the length of the question. I didn't realize it would take this long to type. :)

    Thanks!

    You can try the cloudant library. It can fetch results in batches.

    For example:

    from cloudant import couchdb_admin_party
    from cloudant.result import Result
    
    db_name = 'animaldb'
    ddoc_id = 'views101'
    view_id = 'diet'
    
    with couchdb_admin_party(url='http://localhost:5984') as client:
        db = client.get(db_name, remote=True)
        view = db.get_design_document(ddoc_id).get_view(view_id)
    
        with open('/tmp/results.txt', 'w') as f:
            for result in Result(view, page_size=1000):  # rows are fetched lazily, page_size per request
                f.write(result.get('key') + '\n')
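
    If you would rather stick with the couchdb module from the question, it also has a batching iterator of its own: Database.iterview fetches a view in pages of batch rows instead of materializing the whole result set. A minimal sketch, assuming the same placeholder url, database, and view names as in the question:

    import couchdb

    couch = couchdb.Server(url)  # same placeholder url as in the question
    db = couch['mydb']

    # iterview pulls the view in pages of `batch` rows, so only one
    # page is held in memory at a time
    with open('mydb', 'w') as f:
        for row in db.iterview('myindex/myview', batch=10000):
            f.write(row.key + '\n')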