
Python: How do I read an entire collection in chunks of 1000?


I need to read an entire collection (the collection name is "test") from MongoDB in Python. I tried:

    from pymongo import Connection  # legacy pymongo (< 3.0) API

    self.__connection__ = Connection('localhost', 27017)
    dbh = self.__connection__['test_db']
    collection = dbh['test']

How can I read the collection in chunks of 1000 (to avoid memory exhaustion, because the collection can be very large)?

Use a cursor. Cursors have a 'batchSize' variable that controls how many documents are actually sent to the client per batch after the query executes. You don't have to touch this setting, though, since the default is fine and the complexity of invoking the 'getmore' command is hidden from you in most drivers. I'm not familiar with pymongo, but it works like this:

cursor = db.col.find() // Get everything!

while(cursor.hasNext()) {
    /* This will use the documents already fetched; if it runs out of
       documents in its local batch, it will fetch another X of them from
       the server (where X is batchSize). */
    document = cursor.next();

    // Do your magic here
}
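In pymongo the equivalent is plain iteration over the cursor; the driver issues the 'getmore' calls behind the scenes. A minimal sketch, assuming the test_db/test names from the question and the current MongoClient API:

from pymongo import MongoClient

client = MongoClient('localhost', 27017)
collection = client['test_db']['test']

cursor = collection.find()  # Get everything!
for document in cursor:
    # pymongo serves documents from the current batch and transparently
    # fetches the next batch from the server when the local one runs out.
    pass  # Do your magic here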

I agree with Remon, but you mention batches of 1000, which his answer doesn't really cover. You can set a batch size on the cursor:

cursor.batch_size(1000)

You can also skip records, e.g.:

cursor.skip(4000)

Is that what you're looking for? This is effectively a pagination pattern. However, if you're just trying to avoid memory exhaustion, you don't need to set a batch size or skip at all.
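For completeness, a hedged sketch of that pagination pattern (the page size and collection are illustrative; note that skip() gets slower as the offset grows):

page_size = 1000
page = 0
while True:
    docs = list(collection.find().skip(page * page_size).limit(page_size))
    if not docs:
        break
    for doc in docs:
        pass  # process one page of at most 1000 documents
    page += 1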

To create the initial connection (currently in Python 2, using PyMongo):

host = 'localhost'
port = 27017
db_name = 'test_db'
collection_name = 'test'
Connect using MongoClient:

from pymongo import MongoClient

# Connect to MongoDB
client = MongoClient(host=host, port=port)
# Make a query to the specific DB and Collection
dbh = client[db_name]
collection = dbh[collection_name]
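As a quick sanity check that the connection works (collection.count() matches the old pymongo API used throughout this page; newer versions use count_documents() instead):

# Total number of documents in the 'test' collection
print(collection.count())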
So from here, the proper answer: I want to read using chunks (in this case, of size 1000).

For example, we can decide how big a chunk we want (chunksize),

where the query in this case is:

query = {}
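A minimal sketch of that chunked read, assuming the collection and query names above and a chunksize of 1000 (cursor.count() is the old API this page uses; it is deprecated in newer pymongo):

chunksize = 1000
cursor = collection.find(query)
# Chunk boundaries: 0, 1000, 2000, ... up to the number of matching documents
skips = list(range(0, cursor.count(), chunksize))

for skip_n in skips:
    # Slicing a pymongo cursor applies skip/limit under the hood
    chunk = collection.find(query)[skip_n:skip_n + chunksize]
    for doc in chunk:
        pass  # process one document from the current chunk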

I used a similar idea to create dataframes from MongoDB, and a similar approach to write to MongoDB in chunks.
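For instance, a sketch (assuming pandas is available) that builds a dataframe chunk by chunk using the skips computed above:

import pandas as pd

frames = []
for skip_n in skips:
    docs = collection.find(query)[skip_n:skip_n + chunksize]
    frames.append(pd.DataFrame(list(docs)))
df = pd.concat(frames, ignore_index=True)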


I hope it helps.

Inspired by @Rafael Valero, fixing the last-chunk bug in his code and making it more general, I created a generator function to iterate through a mongo collection with a query and projection:

def iterate_by_chunks(collection, chunksize=1, start_from=0, query={}, projection={}):
    chunks = range(start_from, collection.find(query).count(), int(chunksize))
    num_chunks = len(chunks)
    for i in range(1, num_chunks + 1):
        if i < num_chunks:
            yield collection.find(query, projection=projection)[chunks[i-1]:chunks[i]]
        else:
            yield collection.find(query, projection=projection)[chunks[i-1]:chunks.stop]
Then create the iterator and loop over the chunks:

# e.g. chunks of 400 documents, matching the output below
mess_chunk_iter = iterate_by_chunks(collection, chunksize=400, start_from=0, query={}, projection={})

chunk_n = 0
total_docs = 0
for docs in mess_chunk_iter:
    chunk_n = chunk_n + 1
    chunk_len = 0
    for d in docs:
        chunk_len = chunk_len + 1
        total_docs = total_docs + 1
    print(f'chunk #: {chunk_n}, chunk_len: {chunk_len}')
print("total docs iterated: ", total_docs)

chunk #: 1, chunk_len: 400
chunk #: 2, chunk_len: 400
chunk #: 3, chunk_len: 400
chunk #: 4, chunk_len: 400
chunk #: 5, chunk_len: 400
chunk #: 6, chunk_len: 400
chunk #: 7, chunk_len: 281
total docs iterated:  2681

Here is a generic solution to iterate over any iterator or generator by batch:

def _as_batch(cursor, batch_size=50):
    # Iterate over something (pymongo cursor, generator, ...) by batch.
    # Note: the last batch may contain fewer than batch_size elements.
    batch = []
    try:
        while True:
            for _ in range(batch_size):
                batch.append(next(cursor))
            yield batch
            batch = []
    except StopIteration:
        if len(batch):
            yield batch
This will work as long as cursor defines a __next__ method (i.e., we can call next(cursor)), so we can use it either on a raw cursor or on transformed records.

Examples

Simple usage:

for batch in _as_batch(db['coll_name'].find()):
    # do stuff
More complex usage (useful for bulk updates, for example):

def update_func(doc):
    # dummy transform function
    doc['y'] = doc['x'] + 1
    return doc

query = (update_func(doc) for doc in db['coll_name'].find())
for batch in _as_batch(query):
    # do stuff

Re-implementing the count() function:

sum(map(len, _as_batch(db['coll_name'].find())))

Comments: How is this done in Python? Sorry for sending you a url, but I believe this is nicely covered there. That's not what I need in my case, though.

Your code has a bug: you miss the last chunk. For example, if 98 documents match the query with a chunk size of 10, you get 10 chunk boundaries, but the loop only retrieves the first 90 documents and misses the last 8. Predefining the chunks like this ensures you get everything, and the cursor is not reused:

chunk_iter = [[i, i + chunk_size] for i in range(0, cursor.count(), chunk_size)]