Python, Google Cloud: BigQuery needs a large amount of memory


python google-bigquery


TL;DR: Querying 12.9 MB in BigQuery takes about 540 MB of memory in Python, and the usage grows linearly with the result size.

I am querying some BigQuery tables. Running the query in BigQuery itself gives the following result:

Query complete (5.2s elapsed, 12.9 MB processed)

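That is about 150k rows of data. When I run the same query from Python, it uses up to 540 MB of RAM. If I query 300k rows, RAM usage doubles. When I run the same query several times, RAM usage does not change, so my best guess is that it uses some buffer that is never freed. I tested whether gc.collect() helps, but it does not. I also dumped the data to JSON, and that file is about 25 MB. So my question is: why is the memory usage so large, and is there a way to change it?

My code: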

from apiclient.discovery import build
from oauth2client.file import Storage
from oauth2client.client import OAuth2WebServerFlow
from oauth2client.tools import run
import httplib2
import sys

projectId = '....'
bqCredentialsFile = 'bigquery_credentials.dat'
clientId = '....'  # production
secret = '.......apps.googleusercontent.com '  # production

storage = Storage(bqCredentialsFile)
credentials = storage.get()
if credentials is None or credentials.invalid:
    flow = OAuth2WebServerFlow(client_id=clientId, client_secret=secret, scope='https://www.googleapis.com/auth/bigquery')
    credentials = run(flow, storage)

http = httplib2.Http()
http = credentials.authorize(http)
svc = build('bigquery', 'v2', http=http)


def getQueryResults(jobId, pageToken):
    req = svc.jobs()
    return req.getQueryResults(projectId=projectId, jobId=jobId, pageToken=pageToken).execute()


def query(queryString, priority='BATCH'):
    req = svc.jobs()
    body = {'query': queryString, 'maxResults': 100000, 'configuration': {'priority': priority}}
    res = req.query(projectId=projectId, body=body).execute()
    if 'rows' in res:
        for row in res['rows']:
            yield row
        for _ in range(int(res['totalRows']) / 100000):
            pageToken = res['pageToken']
            res = getQueryResults(res['jobReference']['jobId'], pageToken=pageToken)
            for row in res['rows']:
                yield row


def querySome(tableKeys):
    queryString = '''SELECT * FROM {0} '''.format(','.join(tableKeys))
    if len(tableKeys) > 0:
        return query(queryString, priority='BATCH')


if __name__ == '__main__':
    import simplejson as json
    tableNames = ['dataset1.table1', 'dataset1.table2']
    # list() drains the generator and keeps every row in memory at once
    output = list(querySome(tableNames))
    fl = open('output.json', 'w')
    fl.write(json.dumps(output))
    fl.close()
    print input('done')

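It looks to me like the problem is in the line

output = list(querySome(tableNames))

I am not a Python expert, but as far as I know this converts the generator into a concrete list, which requires the entire result to be held in memory. If you iterate row by row and write one row at a time, you may find that memory usage behaves better.

For example: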

output = querySome(tableNames)
fl = open('output.json', 'w')
for line in output:
    # serialize the current row, not the generator object itself
    fl.write(json.dumps(line))
    fl.write('\n')
fl.close()
print input('done')
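As a rough illustration of why a materialized list can take far more memory than the 25 MB JSON dump, the sketch below compares the serialized length of a single row with an estimate of its in-memory size. It targets Python 3. The sample row and the `deep_size` helper are hypothetical; the row is only shaped like the nested `f`/`v` structure that the BigQuery REST API returns, and the size estimate is approximate (shared strings are counted more than once).

```python
import json
import sys


def deep_size(obj):
    """Rough recursive size, in bytes, of nested dicts, lists, and scalars."""
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_size(k) + deep_size(v) for k, v in obj.items())
    elif isinstance(obj, (list, tuple)):
        size += sum(deep_size(item) for item in obj)
    return size


# Hypothetical row with three columns, in the nested f/v layout
row = {'f': [{'v': '12345'}, {'v': 'example'}, {'v': '2014-01-01'}]}

json_bytes = len(json.dumps(row))   # size of the row as serialized text
memory_bytes = deep_size(row)       # approximate size of the row as Python objects

print('serialized: %d bytes, in memory: about %d bytes' % (json_bytes, memory_bytes))
```

Every column value is wrapped in its own small dict, so the container overhead per row can easily exceed the data itself. This does not account for everything in the question's numbers, but it suggests that holding all rows at once is expensive.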

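Also...

When you fetch query results, a response may contain fewer than 100,000 rows, because BigQuery limits the size of each response. Instead of computing the number of pages up front, you should keep iterating until the response no longer contains a pageToken.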
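A minimal sketch of that loop, assuming the page-fetching call is passed in as a function. `iter_rows`, `fake_fetch`, and `FAKE_PAGES` are hypothetical names used for illustration; the fake pages stand in for real API responses so the example runs without credentials.

```python
def iter_rows(fetch_page):
    """Yield rows page by page until a response carries no pageToken."""
    page_token = None
    while True:
        res = fetch_page(page_token)
        # A page may legitimately have no 'rows' key, so default to empty
        for row in res.get('rows', []):
            yield row
        page_token = res.get('pageToken')
        if not page_token:
            break


# Hypothetical stand-in for the real responses: three pages of uneven size
FAKE_PAGES = {
    None: {'rows': [{'f': [{'v': '1'}]}, {'f': [{'v': '2'}]}], 'pageToken': 'p2'},
    'p2': {'rows': [{'f': [{'v': '3'}]}], 'pageToken': 'p3'},
    'p3': {'rows': [{'f': [{'v': '4'}]}]},  # last page: no pageToken
}


def fake_fetch(page_token):
    return FAKE_PAGES[page_token]


values = [row['f'][0]['v'] for row in iter_rows(fake_fetch)]
print(values)
```

With the real service, `fetch_page` could be a small wrapper around the `getQueryResults` helper from the question. The loop never relies on `totalRows` or on a fixed page size, so short pages do not cause rows to be skipped.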

No. Memory usage is the same even if I ignore the results.
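One way to test this kind of claim on the Python side is to compare peak allocations when rows are collected into a list and when they are consumed one at a time. The sketch below uses `tracemalloc` from the Python 3 standard library on synthetic rows; `fake_rows`, `peak_bytes`, `materialize`, and `stream` are hypothetical helpers. It only sees allocations made through Python's allocator in this process, so it says nothing definite about buffers held inside the HTTP or API client.

```python
import json
import tracemalloc


def fake_rows(n):
    """Generate n synthetic rows in the nested f/v layout."""
    for i in range(n):
        yield {'f': [{'v': str(i)}, {'v': 'some text value'}]}


def peak_bytes(func):
    """Run func and return the peak number of bytes traced while it ran."""
    tracemalloc.start()
    func()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


def materialize():
    rows = list(fake_rows(20000))   # every row is alive at the same time
    return len(rows)


def stream():
    count = 0
    for row in fake_rows(20000):    # only one row is alive at a time
        json.dumps(row)
        count += 1
    return count


peak_list = peak_bytes(materialize)
peak_stream = peak_bytes(stream)
print('list: %d bytes, streaming: %d bytes' % (peak_list, peak_stream))
```

If the same comparison on the real query shows no difference between the two variants, the memory is probably being held somewhere other than the row list, for example in the decoded response of each page.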
