What is the best way to analyse a database table with millions of records in Python 3.x?


I have a CSV file with 50 million records that I want to process with pandas. When I load it into a DataFrame, my system hangs. Any ideas would be helpful.

Read the CSV in chunks and store it in an on-disk SQLite database:

import pandas as pd
from sqlalchemy import create_engine # database connection
import datetime as dt
disk_engine = create_engine('sqlite:///311_8M.db') # Initializes database with filename 311_8M.db in current directory
start = dt.datetime.now()
chunksize = 20000
j = 0
index_start = 1

for df in pd.read_csv('big.csv', chunksize=chunksize, iterator=True, encoding='utf-8'):

    df = df.rename(columns={c: c.replace(' ', '') for c in df.columns}) # Remove spaces from columns

    df['CreatedDate'] = pd.to_datetime(df['CreatedDate']) # Convert to datetimes
    df['ClosedDate'] = pd.to_datetime(df['ClosedDate'])

    df.index += index_start

    # Keep only the columns of interest
    columns = ['Agency', 'CreatedDate', 'ClosedDate', 'ComplaintType',
               'Descriptor', 'TimeToCompletion', 'City']

    df = df[[c for c in df.columns if c in columns]]


    j+=1
    print('{} seconds: completed {} rows'.format((dt.datetime.now() - start).seconds, j*chunksize))

    df.to_sql('data', disk_engine, if_exists='append')
    index_start = df.index[-1] + 1

df = pd.read_sql_query('SELECT * FROM data LIMIT 3', disk_engine)

You can then run whatever queries you like against it.
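For example, aggregations can be pushed down into SQLite so that only the small result set ever reaches pandas. This sketch uses a tiny in-memory demo table standing in for the `data` table built above (with the real file you would point the engine at `sqlite:///311_8M.db` instead):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite://')  # in-memory stand-in for 311_8M.db

# Hypothetical sample rows mimicking the columns kept during loading.
demo = pd.DataFrame({
    'ComplaintType': ['Noise', 'Noise', 'Heating', 'Noise', 'Heating'],
    'City': ['BRONX', 'QUEENS', 'BRONX', 'BRONX', 'QUEENS'],
})
demo.to_sql('data', engine, index=False)

# The GROUP BY runs inside SQLite, so the full table never has to
# fit in memory -- only the aggregated result is loaded into pandas.
top = pd.read_sql_query(
    'SELECT ComplaintType, COUNT(*) AS n '
    'FROM data GROUP BY ComplaintType ORDER BY n DESC',
    engine)
print(top)
```

The same pattern works for any SQL you need: filter, join, or aggregate in the database, then pull only the reduced result into a DataFrame.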

Process the file chunk by chunk; you can also run your analysis inside each chunk rather than storing the data first.
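That chunk-wise approach can skip the database entirely: compute a partial result per chunk and combine them, so at most `chunksize` rows are in memory at once. A minimal sketch, using a small in-memory CSV as a stand-in for `big.csv` (the column names are hypothetical):

```python
import io
import pandas as pd

# Tiny CSV standing in for big.csv.
csv = io.StringIO(
    'ComplaintType,City\n'
    'Noise,BRONX\n'
    'Heating,QUEENS\n'
    'Noise,BRONX\n'
    'Noise,QUEENS\n')

# Accumulate per-chunk value counts into one running total.
total = pd.Series(dtype='int64')
for chunk in pd.read_csv(csv, chunksize=2):
    total = total.add(chunk['ComplaintType'].value_counts(), fill_value=0)

print(total.sort_values(ascending=False))
```

This works for any aggregation that can be merged across chunks (counts, sums, min/max); for the real 50-million-row file you would pass the filename and a much larger `chunksize`.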