Selecting records incrementally in MySQL and saving them to CSV in Python
I need to query a database for some data analysis, and I have more than 20 million records. My access to the database is limited, and my queries time out after 8 minutes. So I am trying to break the query into smaller parts and save the results to Excel for later processing.

This is what I have so far. How can I make Python loop the query over every x records (for example 1,000,000) and store the results in the same CSV until all 20 million+ records have been searched?
import MySQLdb
import csv

db_main = MySQLdb.connect(host="localhost",
                          port=1234,
                          user="user1",
                          passwd="test123",
                          db="mainDB")
cur = db_main.cursor()
# Triple quotes let the SQL statement span several lines
cur.execute("""SELECT a.user_id, b.last_name, b.first_name,
    FLOOR(DATEDIFF(CURRENT_DATE(), c.birth_date) / 365) age,
    DATEDIFF(b.left_date, b.join_date) workDays
    FROM users a
    INNER JOIN users_signup b ON a.user_id = b.user_id
    INNER JOIN users_personal c ON a.user_id = c.user_id
    INNER JOIN
    (
        SELECT DISTINCT d.user_id FROM users_signup d
        WHERE (d.user_id >= 1 AND d.user_id < 1000000)
        AND d.join_date >= '2013-01-01' AND d.join_date < '2014-01-01'
    )
    AS t ON a.user_id = t.user_id""")
result = cur.fetchall()
c = csv.writer(open("temp.csv", "w", newline=""))
for row in result:
    c.writerow(row)
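Your code should look like the following. You can tune its performance through the per_query variable.
Here is an example implementation that may help you: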
from contextlib import contextmanager
import MySQLdb
import csv

connection_args = {"host": "localhost", "port": 1234, "user": "user1",
                   "passwd": "test123", "db": "mainDB"}

@contextmanager
def get_cursor(**kwargs):
    """Yield a cursor, then close it and its connection on exit."""
    db = MySQLdb.connect(**kwargs)
    cursor = db.cursor()
    try:
        yield cursor
    finally:
        cursor.close()
        db.close()

# Note the placeholders for the limits. MySQL expects LIMIT before OFFSET,
# and ORDER BY keeps the pages stable from one query to the next.
query = """ SELECT a.user_id, b.last_name, b.first_name,
    FLOOR(DATEDIFF(CURRENT_DATE(), c.birth_date) / 365) age,
    DATEDIFF(b.left_date, b.join_date) workDays
    FROM users a
    INNER JOIN users_signup b ON a.user_id = b.user_id
    INNER JOIN users_personal c ON a.user_id = c.user_id
    INNER JOIN
    (
        SELECT DISTINCT d.user_id FROM users_signup d
        WHERE (d.user_id >= 1 AND d.user_id < 1000000)
        AND d.join_date >= '2013-01-01' AND d.join_date < '2014-01-01'
    ) AS t ON a.user_id = t.user_id
    ORDER BY a.user_id
    LIMIT %s OFFSET %s """

# One million rows at a time
STEP = 1000000

with open("temp.csv", "w", newline="") as f:
    csv_file = csv.writer(f)
    for step_nb in range(0, 20):
        with get_cursor(**connection_args) as cursor:
            # Parameters follow the placeholder order: row count, then rows to skip
            cursor.execute(query, (STEP, step_nb * STEP))
            for row in cursor:  # iterate over the cursor rather than calling fetchall()
                csv_file.writerow(row)
Edit: I misunderstood the batching and thought it was on user_id. This code is untested, but it should get you started.
import csv

SQL = """
    SELECT a.user_id, b.last_name, b.first_name,
    FLOOR(DATEDIFF(CURRENT_DATE(), c.birth_date) / 365) age,
    DATEDIFF(b.left_date, b.join_date) workDays
    FROM users a
    INNER JOIN users_signup b ON a.user_id = b.user_id
    INNER JOIN users_personal c ON a.user_id = c.user_id
    INNER JOIN
    (
        SELECT DISTINCT d.user_id FROM users_signup d
        WHERE (d.user_id >= 1 AND d.user_id < 1000000)
        AND d.join_date >= '2013-01-01' AND d.join_date < '2014-01-01'
    )
    AS t ON a.user_id = t.user_id
    ORDER BY a.user_id
    LIMIT %s OFFSET %s
"""
BATCH_SIZE = 100000

# db_main is the MySQLdb connection opened in the question
with open("temp.csv", "w", newline="") as f:
    writer = csv.writer(f)
    cursor = db_main.cursor()
    offset = 0
    limit = BATCH_SIZE
    while True:
        # MySQL wants LIMIT first, so the row count comes before the offset
        cursor.execute(SQL, (limit, offset))
        rows = cursor.fetchall()
        if not rows:
            # no more rows, we're done
            break
        writer.writerows(rows)
        offset += BATCH_SIZE
    cursor.close()
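The question's own query already filters on a user_id window (user_id >= 1 AND user_id < 1000000), so another way to batch is to move that window forward on each pass instead of using an offset. Below is a minimal sketch of that idea. The helper names id_ranges and export_in_ranges are hypothetical, and it assumes the query carries two placeholders for the lower and upper user_id bounds and that the cursor follows the Python DB-API.

```python
import csv


def id_ranges(start, stop, step):
    """Yield half-open (low, high) user_id windows covering [start, stop)."""
    low = start
    while low < stop:
        high = min(low + step, stop)
        yield low, high
        low = high


def export_in_ranges(cursor, sql, out_path, start, stop, step):
    """Run sql once per user_id window and write every row to one CSV file.

    cursor is any DB-API cursor (hypothetical helper, not part of MySQLdb).
    sql must contain two placeholders: the lower and the upper user_id bound.
    Returns the number of rows written.
    """
    total = 0
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for low, high in id_ranges(start, stop, step):
            cursor.execute(sql, (low, high))
            for row in cursor:
                writer.writerow(row)
                total += 1
    return total
```

With MySQLdb, the subquery filter would become WHERE (d.user_id >= %s AND d.user_id < %s), and the call might look like export_in_ranges(db_main.cursor(), SQL, "temp.csv", 1, 20000001, 1000000), where the upper bound is an assumed maximum user_id. Each query then touches only its own window, which may help keep a single pass under the 8-minute timeout.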
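Maybe try using LIMIT and OFFSET in the SQL query?
When I tried to run it, it gave me an SQL error. I don't have the exact error, but it is complaining about OFFSET and LIMIT.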
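A likely cause of that complaint is clause order: MySQL accepts LIMIT row_count OFFSET offset (or LIMIT offset, row_count), but not OFFSET before LIMIT. The sketch below shows one way to build the clause and its parameters in matching order. The helper names add_paging and paging_params are hypothetical, and the default %s placeholder assumes the MySQLdb parameter style.

```python
def add_paging(base_sql, placeholder="%s"):
    """Append a paging clause in the order MySQL expects: LIMIT, then OFFSET."""
    cleaned = base_sql.rstrip().rstrip(";")
    return "{} LIMIT {} OFFSET {}".format(cleaned, placeholder, placeholder)


def paging_params(batch_size, batch_index):
    """Return (row count, rows to skip), matching the placeholder order above."""
    return (batch_size, batch_index * batch_size)
```

A call such as cursor.execute(add_paging(query), paging_params(100000, 3)) would then request rows 300,000 through 399,999 of the result. Adding an ORDER BY to the base query is advisable, since without one the row order between pages is not guaranteed.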