Python: A better way to load MongoDB data into a DataFrame using Pandas and PyMongo?


I have a 0.7 GB MongoDB database containing tweets that I am trying to load into a DataFrame. However, I get an error:

MemoryError:

My code looks like this:

cursor = tweets.find() #Where tweets is my collection
tweet_fields = ['id']
result = DataFrame(list(cursor), columns = tweet_fields)

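I have tried the methods in the answers below, which at some point build a list of every element in the database before loading it.

However, in another answer that discusses list(), the author says it is only suitable for small data sets, because everything is loaded into memory.

In my case, I think that is the source of the error: there is too much data to load into memory. What other method can I use?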

I have modified my code to the following:

# PyMongo 2.x signature; in PyMongo 3.0 and later the parameter is named projection
cursor = tweets.find(fields=['id'])
tweet_fields = ['id']
result = DataFrame(list(cursor), columns = tweet_fields)

By adding the fields parameter to find(), I restricted the output. Instead of loading every field into the DataFrame, I load only the selected fields. Everything works now.
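For newer PyMongo versions, the same idea can be sketched with the projection argument. This is a minimal sketch, not the poster's code: the helper name load_fields is hypothetical, and it assumes the collection object exposes find(filter, projection) as PyMongo 3.0 and later do.

```python
import pandas as pd

def load_fields(collection, fields):
    """Load only the named fields of every document into a DataFrame.

    `collection` is assumed to behave like a pymongo Collection,
    i.e. to provide find(filter, projection).
    """
    # Ask the server for the requested fields only.
    projection = {name: 1 for name in fields}
    # MongoDB returns _id unless told otherwise; drop it if not requested.
    projection.setdefault('_id', 0)
    cursor = collection.find({}, projection)
    # Only the projected fields are materialised in memory here.
    return pd.DataFrame(list(cursor), columns=fields)
```

With the collection from the question, this would be called as load_fields(tweets, ['id']).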

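The fastest, and probably the most memory-efficient, way to create a DataFrame from a MongoDB query, as in your case, is to use

There is a concise, clear explanation.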

An elegant way to do it:

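import pandas as pd

# db (a pymongo Database) and mongo_collection (a pymongo Collection)
# are assumed to exist already.

def my_transform_logic(x):
    # Placeholder: put the real per-value transformation here
    if x:
        result = x
        return result
    return None

def process(cursor):
    df = pd.DataFrame(list(cursor))
    df['result_col'] = df['col_to_be_processed'].apply(my_transform_logic)

    # make a list of dictionaries and insert it
    db.collection_name.insert_many(df.to_dict('records'))

    # or upsert each record instead (update_many needs a filter and an update document)
    # for record in df.to_dict('records'):
    #     db.collection_name.replace_one({'_id': record['_id']}, record, upsert=True)


# make a list of cursors; see the parallel_scan API of pymongo
# (available in PyMongo 3.x and earlier, removed in PyMongo 4.0)
cursors = mongo_collection.parallel_scan(6)
for cursor in cursors:
    process(cursor)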

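I tried the above process on 2.6 million records in a MongoDB collection, using Joblib on top of the code above. My code did not throw any memory errors, and processing finished in 2 hours.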
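Where parallel_scan is not available, the underlying idea of handling the collection one piece at a time can be sketched sequentially. The helper iter_frames and the default batch size below are hypothetical, and the sketch assumes only that the cursor is iterable and yields dictionaries, as a PyMongo cursor does.

```python
from itertools import islice

import pandas as pd

def iter_frames(cursor, batch_size=50_000):
    """Yield DataFrames of at most batch_size documents each.

    Only one batch of documents is held in memory at a time,
    instead of list(cursor) holding the whole collection.
    """
    documents = iter(cursor)
    while True:
        batch = list(islice(documents, batch_size))
        if not batch:
            return
        yield pd.DataFrame(batch)
```

Each frame can then be transformed and written back before the next one is read, for example: for frame in iter_frames(tweets.find(), 100_000): ... Peak memory is then governed by the batch size rather than by the size of the collection.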

Using the from_records classmethod is probably the best way:

import pandas as pd
import pymongo

client = pymongo.MongoClient()
# mydb and mycollection are placeholder database and collection names
data = client.mydb.mycollection.find() # or client.mydb.mycollection.aggregate(pipeline)

df = pd.DataFrame.from_records(data)
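Because from_records accepts an iterator together with an nrows argument, a small sample can be loaded first to get a rough idea of how much memory the full load would need. The sketch below illustrates that idea and is not part of the answer above; the helper names sample_frame and estimate_total_bytes are hypothetical, and the estimate is only a linear extrapolation from the sample.

```python
import pandas as pd

def sample_frame(cursor, nrows=1000):
    """Build a DataFrame from at most nrows documents of an iterable."""
    # from_records stops reading the iterator after nrows items.
    return pd.DataFrame.from_records(iter(cursor), nrows=nrows)

def estimate_total_bytes(sample, total_docs):
    """Scale the sample's memory footprint up to total_docs rows."""
    # deep=True also counts the Python objects (such as strings) in the columns.
    per_row = sample.memory_usage(deep=True).sum() / max(len(sample), 1)
    return int(per_row * total_docs)
```

With PyMongo 3.7 or later, total_docs could come from count_documents({}) on the collection. If the estimate is far above the available RAM, restricting the fields or loading in batches is probably the safer route.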

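I have now seen monary pushed on Stack Overflow several times, but it is almost always misapplied. If you look at the API, it seems clear that only numeric data types are allowed, not text. In this case I assume (because of the term "tweet") that this is not the right use case for monary.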