Python: how to export a table to files named by a field value instead of parts

python google-bigquery google-cloud-dataflow

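I have a very large table that is being exported from BigQuery to Google Cloud Storage.

One of the fields in the table is ZipCode (00000).

Is there any way to query the table by ZipCode and export the results to files that use the zip code as the file name? Each file would contain the records for that zip code.

Can this be done?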

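I would use a little Java and Beam/Dataflow DynamicDestinations.

Your pipeline will:

Read the data from BigQuery
For each row, look at the specific column that determines which destination file it should go to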

Take a look at:

This question was asked for Python, but at the moment DynamicDestinations is available only in Java.
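
The routing step described above (look at one column of each row and pick the destination file from it) can be tried out locally without Beam. The following is only a minimal sketch of that idea using plain dictionaries and local files; it is not the Beam API, and the names `destination_for` and `write_by_destination` are hypothetical helpers introduced for illustration.

```python
import json
import os
import tempfile
from collections import defaultdict


def destination_for(row, field="ZipCode"):
    """Return the file name a row should go to, based on one column."""
    return "{}.json".format(row[field])


def write_by_destination(rows, out_dir, field="ZipCode"):
    """Group rows by destination and write one newline-delimited JSON file per group."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[destination_for(row, field)].append(row)

    for filename, group in grouped.items():
        with open(os.path.join(out_dir, filename), "w") as f:
            for row in group:
                f.write(json.dumps(row) + "\n")

    # Return the file names that were written, in a stable order
    return sorted(grouped)


if __name__ == "__main__":
    # Made-up sample rows standing in for rows read from BigQuery
    sample = [
        {"ZipCode": "10001", "col1": "a"},
        {"ZipCode": "94105", "col1": "b"},
        {"ZipCode": "10001", "col1": "c"},
    ]
    with tempfile.TemporaryDirectory() as out_dir:
        print(write_by_destination(sample, out_dir))
```

Holding every group in memory is only reasonable for small inputs; a real pipeline over a very large table would stream rows to their destinations instead.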

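Here is a Python solution in which I did not use a BigQuery export. The end result is still stored in Cloud Storage as newline-delimited JSON files (so they can be loaded back into BigQuery). It involves a query, which can be expensive for very large tables. As an example I used a table with a ZipCode column and two more columns (col1, col2), but that should not matter. I also hardcoded the authentication part.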

#!/usr/bin/env python3

import json
from argparse import ArgumentParser

from google.cloud import bigquery
from google.cloud import storage


def main(project_id, dataset_id, table_id, bucket_name):
    # Authentication is hardcoded to a local service account key file
    client = bigquery.Client.from_service_account_json('service_account.json', project=project_id)
    # Table that holds the intermediate (grouped) results
    table_ref = bigquery.DatasetReference(project_id, dataset_id).table('tmp')

    # Query job with 'tmp' as destination
    # Collect the non-grouped columns of each ZipCode into one array using ARRAY_AGG
    job_config = bigquery.QueryJobConfig()
    job_config.destination = table_ref
    sql = 'SELECT ZipCode, ARRAY_AGG(STRUCT(col1, col2)) FROM `{}.{}.{}` GROUP BY ZipCode'.format(project_id, dataset_id, table_id)
    query_job = client.query(
        sql,
        location='US',
        job_config=job_config)
    query_job.result()  # Wait for the query to finish

    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)

    rows = client.list_rows(client.get_table(table_ref))
    for row in rows:
        # row[1] is a list of dictionaries, one per record for this ZipCode
        # json.dumps emits valid JSON; default=str covers values such as dates
        record = ''.join(json.dumps(r, default=str) + '\n' for r in row[1])
        # row[0] holds the ZipCode, which we use to name the exported file
        filename = '{}.json'.format(row[0])
        blob = bucket.blob(filename)
        print('Exporting to gs://{}/{}'.format(bucket_name, filename))
        blob.upload_from_string(record)

    # Delete the tmp table
    client.delete_table(table_ref)

if __name__ == '__main__':
    parser = ArgumentParser()
    parser.add_argument('-p','--project', help="project where the ZipCode table resides", dest='project_id', required=True)
    parser.add_argument('-d','--dataset', help="dataset with the ZipCode table", dest='dataset_id', required=True)
    parser.add_argument('-t','--table', help="ZipCode table", dest='table_id', required=True)
    parser.add_argument('-b','--bucket', help="destination bucket", dest='bucket', required=True)

    args = parser.parse_args()
    main(args.project_id,args.dataset_id,args.table_id,args.bucket)
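
The answer states that the files end up as newline-delimited JSON so they can be loaded back into BigQuery. As a rough illustration of what that format requires, here is a small sketch of the serialization step on its own, using plain dictionaries in place of BigQuery rows. The helper names `to_ndjson` and `from_ndjson` are hypothetical and the sample values are made up.

```python
import json


def to_ndjson(records):
    """Serialize a list of dictionaries as newline-delimited JSON, one object per line."""
    # default=str is a fallback for values json cannot encode natively (e.g. dates)
    return "".join(json.dumps(r, default=str) + "\n" for r in records)


def from_ndjson(payload):
    """Parse newline-delimited JSON back into a list of dictionaries."""
    return [json.loads(line) for line in payload.splitlines() if line]


if __name__ == "__main__":
    # Made-up records standing in for the ARRAY_AGG result of one ZipCode
    records = [{"col1": "a", "col2": 1}, {"col1": "b", "col2": 2}]
    payload = to_ndjson(records)
    print(payload, end="")
    # Every line is a standalone JSON object, so the payload round-trips
    print(from_ndjson(payload) == records)
```

Note that calling `str()` on a dictionary produces single-quoted keys and values, which is not valid JSON; `json.dumps` is what guarantees each line can be parsed again.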

So this basically uses Python to send calls to BigQuery, using the API to tell BigQuery to save the results as GCS files. I like it. Thanks.