Python: pulling a large amount of data from a remote server into a DataFrame

python postgresql pandas

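To give as much context as I can: I'm trying to pull some data stored on a remote postgres server (heroku) into a pandas DataFrame, using psycopg2 to connect.

I'm interested in two specific tables, users and events, and the connection works fine, because when pulling down the user data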

import pandas.io.sql as sql 
# [...]
users = sql.read_sql("SELECT * FROM users", conn)

and waiting a few seconds, the DataFrame is returned as expected:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 67458 entries, 0 to 67457
Data columns (total 35 columns): [...]

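When I try the same with the events table in an IPython notebook, I get the dead kernel error:

The kernel has died, would you like to restart it? If you do not restart the kernel, you will be able to save the notebook, but running code will not function until the notebook is reopened.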

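Update #1:

To give a better idea of the size of the events table I'm trying to pull in, here are the number of records and the number of attributes per record: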

In [11]: sql.read_sql("SELECT count(*) FROM events", conn)
Out[11]:
     count
0  2711453

In [12]: len(sql.read_sql("SELECT * FROM events LIMIT 1", conn).columns)
Out[12]: 18

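Update #2:

Memory is definitely the bottleneck for the current read_sql approach: while pulling down events, trying to start another IPython instance results in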

vagrant@data-science-toolbox:~$ sudo ipython
-bash: fork: Cannot allocate memory

Update #3:

I first tried a read_sql_chunked implementation that just returns the array of partial DataFrames:

import pandas as pd

def read_sql_chunked(query, conn, nrows, chunksize=1000):
    # Fetch the result in LIMIT/OFFSET windows and return the partial DataFrames
    start = 0
    dfs = []
    while start < nrows:
        df = pd.read_sql("%s LIMIT %s OFFSET %s" % (query, chunksize, start), conn)
        start += chunksize
        dfs.append(df)
        print("Events added: %s to %s of %s" % (start - chunksize, start, nrows))
    # print("concatenating dfs")
    return dfs

event_dfs = read_sql_chunked("SELECT * FROM events", conn, events_count, 100000)

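Again, the write to CSV completes successfully (a 657MB file), but the read from the CSV never completes.

Since 2GB seems not to be enough, how can one estimate how much RAM is sufficient to read a 657MB CSV file?

It feels like I'm missing some fundamental understanding of either DataFrames or psycopg2, but I'm stuck; I can't even pinpoint the bottleneck or where to optimize.

What is the proper strategy for pulling large amounts of data from a remote (postgres) server?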

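I suspect there are a couple of (related) things at play here causing the slowness:

read_sql is written in python, so it's a little slow (especially compared to read_csv, which is written in cython and carefully implemented for speed!), and it relies on sqlalchemy rather than some (potentially much faster) C-DBAPI. The impetus for moving to sqlalchemy was to make that move easier in the future (as well as cross-sql-platform support).

You may be running out of memory because too many python objects are held in memory (this is related to not using a C-DBAPI), but this could potentially be addressed.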
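One way to get a feel for the memory side is to load a small sample and extrapolate. The sketch below is an illustration under stated assumptions, not part of the original answer: estimate_frame_bytes is a hypothetical helper and the sample frame is made up to stand in for a LIMIT query. It relies on DataFrame.memory_usage(deep=True), which also counts the Python string objects behind object columns.

```python
import pandas as pd


def estimate_frame_bytes(sample, total_rows):
    """Extrapolate the in-memory size of a full table from a sample of its rows."""
    if len(sample) == 0:
        raise ValueError("sample must contain at least one row")
    # deep=True measures the real size of object (e.g. string) columns
    sample_bytes = int(sample.memory_usage(index=True, deep=True).sum())
    # Integer arithmetic keeps the extrapolation exact for whole multiples
    return sample_bytes * total_rows // len(sample)


# Made-up sample standing in for "SELECT * FROM events LIMIT 1000"
sample = pd.DataFrame({
    "user_id": range(1000),
    "name": ["event-%d" % i for i in range(1000)],
})
estimated = estimate_frame_bytes(sample, total_rows=2711453)
print("roughly %.0f MB for the full table" % (estimated / 1e6))
```

The figure is best read as a lower bound: while the frame is being built, the driver's row objects and the finished columns can coexist in memory, so the peak is likely to be noticeably higher.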

I think the immediate solution is a chunk-based approach (having this work natively in pandas read_sql and read_sql_table would be the way to go).

EDIT: As of Pandas v0.16.2, this chunk-based approach is natively implemented in read_sql.
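A minimal sketch of the native approach, assuming the chunksize parameter of read_sql: when it is set, read_sql returns an iterator of DataFrames instead of one large frame. The in-memory SQLite database and the events columns below are stand-ins invented for the demo so that it runs without a postgres server.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the remote postgres server (demo assumption)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, kind TEXT)")
conn.executemany(
    "INSERT INTO events (id, kind) VALUES (?, ?)",
    [(i, "click" if i % 2 else "view") for i in range(1, 2501)],
)
conn.commit()

# With chunksize set, read_sql yields DataFrames of at most 1000 rows each
chunks = pd.read_sql("SELECT * FROM events ORDER BY id", conn, chunksize=1000)
sizes = []
for chunk in chunks:
    sizes.append(len(chunk))  # replace with real per-chunk work

print(sizes)  # [1000, 1000, 500]
conn.close()
```

One caveat: with a default (client-side) psycopg2 cursor, the driver transfers the whole result set to the client when the query executes, so chunksize alone may not cap memory use; a server-side cursor addresses that part.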

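Since you're using postgres, you have access to the LIMIT and OFFSET clauses, which make chunking quite easy. (Am I right in thinking these aren't available in all sql languages?)

First, get the number of rows (or an estimate) in your table: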

nrows = con.execute('SELECT count(*) FROM users').fetchone()[0]  # also works with an sqlalchemy engine

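Use this to iterate through the table (for debugging, you could add some print statements to confirm that it is working and has not crashed!) and then combine the results: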

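import pandas as pd

def read_sql_chunked(query, con, nrows, chunksize=1000):
    start = 0  # OFFSET is zero-based; starting at 1 would skip the first row
    dfs = []  # Note: could probably make this neater with a generator/for loop
    while start < nrows:
        df = pd.read_sql("%s LIMIT %s OFFSET %s" % (query, chunksize, start), con)
        dfs.append(df)
        start += chunksize  # advance the window, otherwise the loop never ends
    return pd.concat(dfs, ignore_index=True)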
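Two things are worth keeping in mind with LIMIT/OFFSET: without an ORDER BY the row order is not guaranteed between queries, and large offsets get slower because the server still has to step over the skipped rows. A possible alternative, sketched here as an assumption rather than as part of the original answer, is keyset pagination on a unique column. iter_keyset_chunks, the users table and its id column are hypothetical, and SQLite is used so the example runs locally; with psycopg2 the placeholder would be %s instead of ?.

```python
import sqlite3


def iter_keyset_chunks(conn, table, key, chunksize=1000):
    """Yield lists of rows from `table`, ordered by the unique column `key`.

    `table` and `key` are interpolated into the SQL text, so they must be
    trusted identifiers, never user input.
    """
    last_key = None
    while True:
        if last_key is None:
            sql = "SELECT * FROM %s ORDER BY %s LIMIT ?" % (table, key)
            params = (chunksize,)
        else:
            sql = "SELECT * FROM %s WHERE %s > ? ORDER BY %s LIMIT ?" % (table, key, key)
            params = (last_key, chunksize)
        cursor = conn.execute(sql, params)
        key_index = [col[0] for col in cursor.description].index(key)
        rows = cursor.fetchall()
        if not rows:
            return
        yield rows
        # Resume after the last key seen instead of counting skipped rows
        last_key = rows[-1][key_index]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users (id, name) VALUES (?, ?)",
    [(i, "user%d" % i) for i in range(1, 26)],
)
chunk_sizes = [len(rows) for rows in iter_keyset_chunks(conn, "users", "id", chunksize=10)]
print(chunk_sizes)  # [10, 10, 5]
```

Each chunk can be turned into a DataFrame and processed or appended as in the function above.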

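Note: this assumes that the whole result fits in memory! If it doesn't, you'll need to work on each chunk separately (mapreduce style)... or invest in more memory!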
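The per-chunk idea can be sketched as follows: reduce every batch to a small summary and merge the summaries, so that only one batch of rows is in memory at any time. count_by_kind and the events/kind names are hypothetical, and SQLite again stands in for postgres so the example is self-contained.

```python
import sqlite3
from collections import Counter


def count_by_kind(conn, batch_size=1000):
    """Count events per kind while holding at most one batch of rows in memory."""
    totals = Counter()
    cursor = conn.cursor()
    cursor.execute("SELECT kind FROM events")
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        # "map": look only at this batch; "reduce": fold it into the running totals
        totals.update(kind for (kind,) in batch)
    cursor.close()
    return totals


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, kind TEXT)")
conn.executemany(
    "INSERT INTO events (id, kind) VALUES (?, ?)",
    [(i, "purchase" if i % 3 == 0 else "view") for i in range(1, 3001)],
)
totals = count_by_kind(conn, batch_size=500)
print(dict(totals))
conn.close()
```

The same shape works with chunked DataFrames: compute a small aggregate per chunk and combine the aggregates at the end instead of concatenating the raw chunks.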
Try using pandas:
import mysql.connector
import pandas as pd

mysql_cn = mysql.connector.connect(host='localhost', port=123, user='xyz', passwd='****', db='xy_db')

data = pd.read_sql('SELECT * FROM table;', con=mysql_cn)

mysql_cn.close()

This worked for me.

Here is a basic cursor example that might help:
import psycopg2
# note that we have to import the Psycopg2 extras library!
import psycopg2.extras

def main():
    conn_string = "host='localhost' dbname='my_database' user='postgres' password='secret'"
    # connect using the connection string above
    conn = psycopg2.connect(conn_string)

    ### HERE IS THE IMPORTANT PART, by specifying a name for the cursor
    ### psycopg2 creates a server-side cursor, which prevents all of the
    ### records from being downloaded at once from the server.
    cursor = conn.cursor('cursor_unique_name', cursor_factory=psycopg2.extras.DictCursor)
    cursor.execute('SELECT * FROM my_table LIMIT 1000')

    ### Because cursor objects are iterable we can just call 'for - in' on
    ### the cursor object and the cursor will automatically advance itself
    ### each iteration.
    ### This loop should run 1000 times, assuming there are at least 1000
    ### records in 'my_table'
    row_count = 0
    for row in cursor:
        row_count += 1
        print("row: %s    %s\n" % (row_count, row))

if __name__ == "__main__":
    main()
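To connect the cursor example back to DataFrames, the rows can be pulled in batches with fetchmany and wrapped one batch at a time. This is a sketch under assumptions: iter_frames is a hypothetical helper written against the generic DB-API cursor interface and exercised here with SQLite only. With psycopg2, one would pass an executed named cursor like the one above (a plain named cursor without DictCursor is the simpler fit); that combination is not tested here.

```python
import sqlite3

import pandas as pd


def iter_frames(cursor, batch_size=10000):
    """Yield DataFrames of at most `batch_size` rows from an executed DB-API cursor."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        # Read the column names after fetching: some drivers only fill in
        # cursor.description once rows have actually been fetched
        columns = [col[0] for col in cursor.description]
        yield pd.DataFrame.from_records(rows, columns=columns)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, value REAL)")
conn.executemany(
    "INSERT INTO my_table (id, value) VALUES (?, ?)",
    [(i, i * 0.5) for i in range(1, 26)],
)
cursor = conn.cursor()
cursor.execute("SELECT id, value FROM my_table ORDER BY id")
frames = list(iter_frames(cursor, batch_size=10))
print([len(frame) for frame in frames])  # [10, 10, 5]
conn.close()
```

Collecting the frames into a list, as the demo does, brings everything back into memory; for a table that does not fit, each frame would be processed or written to disk before the next one is fetched.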
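As an experience, this is terrible! Hopefully we can make this work for you in the future. Out of interest, how big is your table / how many rows does it have?

@AndyHayden updated to add the number of records and the number of attributes per record in the events table.

Do you need all the data in memory at the same time? Or would it be enough to have only part of the data (e.g. certain columns) in one DataFrame at a time? (But apart from that, your question of how big a DataFrame can be is of course legitimate.)

@joris at this point I'm working with both: a couple of DataFrames with a very small subset of those 18 columns, and the entire dataset broken into 28 partial DataFrames. Having all the data seems ideal, at least for the initial exploration.

Using HDF5 (with pandas read_hdf / HDFStore) can be handy for quickly querying a subset of the data as you need it, if the data is too large to hold in memory all at once (much faster than sql, and, in contrast to csv, subsets can be queried).

Memory could very well be the bottleneck: the VM I'm running on has only the default 512M. Quickly bumping that up to 1024M, and if that doesn't do it I'll try reading in chunks.

@MariusButuc let me know how this solution fares / if you have any problems!

Added Update #3 with my new attempts (still not quite successful).

@MariusButuc definitely get/allocate more RAM, 2GB is not very much in my opinion; it will almost certainly be swapping here! You can do the concat on disk using pytables/HDF5, but that still may not be enough.

It did work on an AWS m3.large... 7GB of RAM did the trick.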