Python: add a date_loaded field when uploading a CSV to BigQuery
Using Python: when loading a CSV file into BigQuery, is there a way to add an extra field? I would like to add a date_loaded field containing the current date. Below is the Google code sample I have used.

Tags: python, google-bigquery
# from google.cloud import bigquery
# client = bigquery.Client()
# dataset_id = 'my_dataset'
dataset_ref = client.dataset(dataset_id)
job_config = bigquery.LoadJobConfig()
job_config.schema = [
    bigquery.SchemaField('name', 'STRING'),
    bigquery.SchemaField('post_abbr', 'STRING'),
]
job_config.skip_leading_rows = 1
# The source format defaults to CSV, so the line below is optional.
job_config.source_format = bigquery.SourceFormat.CSV
uri = 'gs://cloud-samples-data/bigquery/us-states/us-states.csv'
load_job = client.load_table_from_uri(
    uri,
    dataset_ref.table('us_states'),
    job_config=job_config)  # API request
print('Starting job {}'.format(load_job.job_id))
load_job.result() # Waits for table load to complete.
print('Job finished.')
destination_table = client.get_table(dataset_ref.table('us_states'))
print('Loaded {} rows.'.format(destination_table.num_rows))
Adapting this sample to your problem, you can open and read the original CSV file on your local PC, edit it by adding the column, and append a timestamp at the end of each row so that the column is not left empty. You can build a timestamp with a custom date and time format using Python's datetime module.

Then write the resulting data to an output file and load it into Google Cloud Storage. External commands can be run from a Python script with the subprocess module.

I hope this helps.
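As a quick illustration of the timestamp formatting mentioned above (a minimal stdlib-only sketch; the format string is just one common choice that BigQuery's DATETIME type accepts in CSV loads):

```python
import datetime

# Current local time, formatted to second precision in the
# 'YYYY-MM-DD HH:MM:SS' layout that BigQuery DATETIME accepts.
now = datetime.datetime.now()
stamp = now.strftime('%Y-%m-%d %H:%M:%S')
print(stamp)
```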
#Import the dependencies
import csv,datetime,subprocess
from google.cloud import bigquery
#Replace the values for variables with the appropriate ones
#Name of the input csv file
csv_in_name = 'us-states.csv'
#Name of the output csv file to avoid messing up the original
csv_out_name = 'out_file_us-states.csv'
#Name of the NEW COLUMN NAME to be added
new_col_name = 'date_loaded'
#Type of the new column
col_type = 'DATETIME'
#Name of your bucket
bucket_id = 'YOUR BUCKET ID'
#Your dataset name
ds_id = 'YOUR DATASET ID'
#The destination table name
destination_table_name = 'TABLE NAME'
# read and write csv files
with open(csv_in_name, 'r') as r_csvfile:
    # newline='' avoids blank lines in the output on Windows
    with open(csv_out_name, 'w', newline='') as w_csvfile:
        dict_reader = csv.DictReader(r_csvfile, delimiter=',')
        # Add the new column to the existing ones
        fieldnames = dict_reader.fieldnames + [new_col_name]
        writer_csv = csv.DictWriter(w_csvfile, fieldnames, delimiter=',')
        writer_csv.writeheader()
        for row in dict_reader:
            # Put the timestamp after the last comma so the column is not empty
            row[new_col_name] = datetime.datetime.now()
            writer_csv.writerow(row)
#Copy the file to your Google Storage bucket
subprocess.call('gsutil cp ' + csv_out_name + ' gs://' + bucket_id , shell=True)
client = bigquery.Client()
dataset_ref = client.dataset(ds_id)
job_config = bigquery.LoadJobConfig()
#Add a new column to the schema!
job_config.schema = [
bigquery.SchemaField('name', 'STRING'),
bigquery.SchemaField('post_abbr', 'STRING'),
bigquery.SchemaField(new_col_name, col_type)
]
job_config.skip_leading_rows = 1
# The source format defaults to CSV, so the line below is optional.
job_config.source_format = bigquery.SourceFormat.CSV
#Address string of the output csv file
uri = 'gs://' + bucket_id + '/' + csv_out_name
load_job = client.load_table_from_uri(uri,dataset_ref.table(destination_table_name),job_config=job_config) # API request
print('Starting job {}'.format(load_job.job_id))
load_job.result() # Waits for table load to complete.
print('Job finished.')
destination_table = client.get_table(dataset_ref.table(destination_table_name))
print('Loaded {} rows.'.format(destination_table.num_rows))
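The read/rewrite step in the script above can be checked without touching Cloud Storage by running the same DictReader/DictWriter logic over an in-memory CSV (stdlib only; the sample rows are made up):

```python
import csv
import datetime
import io

# Two made-up rows in the same shape as the us-states sample file.
csv_in = io.StringIO('name,post_abbr\nAlabama,AL\nAlaska,AK\n')
csv_out = io.StringIO()

dict_reader = csv.DictReader(csv_in, delimiter=',')
fieldnames = dict_reader.fieldnames + ['date_loaded']
writer_csv = csv.DictWriter(csv_out, fieldnames, delimiter=',')
writer_csv.writeheader()
for row in dict_reader:
    # Same idea as the script: stamp each row as it is copied over.
    row['date_loaded'] = datetime.datetime.now()
    writer_csv.writerow(row)

print(csv_out.getvalue().splitlines()[0])  # prints name,post_abbr,date_loaded
```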
You can keep loading your data the way you load it now, but load it into a table called old_table. Once it is loaded, you can run:
bq --location=US query --destination_table mydataset.newtable --use_legacy_sql=false --replace=true 'select *, current_date() as date_loaded from mydataset.old_table'
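If you would rather run the same transformation from Python instead of the bq CLI, the query string can be built like this (a sketch; the dataset and table names are the placeholders from above, and the execution step is only described in comments since it needs credentials):

```python
# Build the SELECT that appends a date_loaded column to the old table.
dataset = 'mydataset'
sql = (
    'SELECT *, CURRENT_DATE() AS date_loaded '
    'FROM `{}.old_table`'.format(dataset)
)
# To execute it, pass `sql` to google.cloud.bigquery's client.query(...)
# with a QueryJobConfig whose destination points at mydataset.newtable and
# write_disposition='WRITE_TRUNCATE' (this mirrors --replace=true).
print(sql)
```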
This essentially loads the contents of the old table into new_table with a new date_loaded column at the end. That way you now have the new column without downloading the data locally or moving it all around. Does that work for you? If not, perhaps a better approach is to use . If it still doesn't work, then the only way out I see is to bring the data down locally, iterate over it, and add the date field; this is not recommended if you are dealing with a large amount of data. Alternatively, load it into a staging/tmp table in BigQuery, then hit it with SQL, adding the date_loaded field as part of the SQL transformation, and write the result to the main table. If you use an ingestion-time partitioned table, note that it is in UTC unless you address the partition directly.