Python 3.x: Dataflow (Apache Beam) does not write to BigQuery

python-3.x google-bigquery google-cloud-dataflow


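I have a pipeline whose final step must write two records to BigQuery, and I cannot figure out why nothing seems to be inserted. I get no errors. The table exists and already contains records; in fact, I have to use TRUNCATE/INSERT mode.

Can someone help me figure out why it does not work as I expect?

This is my pipeline: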

    p = beam.Pipeline(options=pipeline_options)

    (p
        | 'Read Configuration Table ' >> beam.io.Read(beam.io.BigQuerySource(config['ENVIRONMENT']['configuration_table']))
        | 'Get Files from Server' >> beam.Map(import_file)
        | 'Upload files on Bucket' >> beam.Map(upload_file_on_bucket)
        | 'Set record update' >> beam.Map(set_last_step)
        | 'Update table' >> beam.io.gcp.bigquery.WriteToBigQuery(
            table=config['ENVIRONMENT']['configuration_table'],
            write_disposition='WRITE_TRUNCATE',
            schema='folder:STRING, last_file:STRING',
        )
    )

The input records of the WriteToBigQuery stage, and the records output at the 'Update table' stage, are as follows:

{'folder': '1952', 'last_file': '1952_2019120617.log.gz'}
{'folder': '1951', 'last_file': '1951_2019120617.log.gz'}
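
As a quick sanity check before the write step, rows like these can be compared against the schema string. The following is a minimal sketch using only the standard library; `parse_schema` and `check_row` are hypothetical helper names, not part of Apache Beam, and the check only handles flat `name:TYPE` schemas with STRING fields.

```python
def parse_schema(schema):
    """Split a 'name:TYPE, name:TYPE' schema string into (name, type) pairs."""
    fields = []
    for part in schema.split(','):
        name, _, field_type = part.strip().partition(':')
        fields.append((name, field_type.upper()))
    return fields


def check_row(row, schema):
    """Return a list of problems found in one row; an empty list means it fits."""
    fields = parse_schema(schema)
    expected = {name for name, _ in fields}
    present = set(row)
    problems = []
    for name in sorted(expected - present):
        problems.append('missing field: ' + name)
    for name in sorted(present - expected):
        problems.append('unexpected field: ' + name)
    for name, field_type in fields:
        if field_type == 'STRING' and name in row and not isinstance(row[name], str):
            problems.append('not a string: ' + name)
    return problems


SCHEMA = 'folder:STRING, last_file:STRING'
rows = [
    {'folder': '1952', 'last_file': '1952_2019120617.log.gz'},
    {'folder': '1951', 'last_file': '1951_2019120617.log.gz'},
]
for row in rows:
    # Print the folder and any problems detected for that row.
    print(row['folder'], check_row(row, SCHEMA))
```

Both rows from the question pass this check, which is consistent with the schema not being the cause of the missing data.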

The debug output from Dataflow is:

2019-12-06 18:09:36 DEBUG    Creating or getting table <TableReference
 datasetId: 'MYDATASET'
 projectId: 'MYPROJECT'
 tableId: 'MYTABLE'> with schema {'fields': [{'name': 'folder', 'type': 'STRING', 'mode': 'NULLABLE'}, {'name': 'last_file', 'type': 'STRING', 'mode': 'NULLABLE'}]}.
2019-12-06 18:09:36 DEBUG    Created the table with id MYTABLE
2019-12-06 18:09:36 INFO     Created table MYPROJECT.MYDATASET.MYTABLE with schema <TableSchema
 fields: [<TableFieldSchema
 fields: []
 mode: 'NULLABLE'
 name: 'folder'
 type: 'STRING'>, <TableFieldSchema
 fields: []
 mode: 'NULLABLE'
 name: 'last_file'
 type: 'STRING'>]>. Result: <Table
 creationTime: 1575652176727
 etag: '0/GXOOeXPCmYsMfgGNxl2Q=='
 id: 'MYPROJECT:MYDATASET.MYTABLE'
 kind: 'bigquery#table'
 lastModifiedTime: 1575652176766
 location: 'EU'
 numBytes: 0
 numLongTermBytes: 0
 numRows: 0
 schema: <TableSchema
 fields: [<TableFieldSchema
 fields: []
 mode: 'NULLABLE'
 name: 'folder'
 type: 'STRING'>, <TableFieldSchema
 fields: []
 mode: 'NULLABLE'
 name: 'last_file'
 type: 'STRING'>]>
 selfLink: 'https://www.googleapis.com/bigquery/v2/projects/MYPROJECT/datasets/MYDATASET/tables/MYTABLE'
 tableReference: <TableReference
 datasetId: 'MYDATASET'
 projectId: 'MYPROJECT'
 tableId: 'MYTABLE'>
 type: 'TABLE'>.
2019-12-06 18:09:36 WARNING  Sleeping for 150 seconds before the write as BigQuery inserts can be routed to deleted table for 2 mins after the delete and create.
2019-12-06 18:12:06 DEBUG    Attempting to flush to all destinations. Total buffered: 2
2019-12-06 18:12:06 DEBUG    Flushing data to MYPROJECT:MYDATASET.MYTABLE. Total 2 rows.
2019-12-06 18:12:07 DEBUG    Passed: True. Errors are []
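
The warning in the log suggests that the rows were sent as streaming inserts shortly after the table was deleted and recreated. One variant that may be worth trying, assuming the installed Beam version supports it, is to request batch load jobs through the `method` argument of `WriteToBigQuery`. The sketch below is not part of the original pipeline: `build_write_kwargs` and `add_write_step` are hypothetical helpers, and load jobs additionally need a GCS temp location (for example the pipeline's `temp_location` option).

```python
def build_write_kwargs(table):
    """Keyword arguments for WriteToBigQuery, mirroring the original step."""
    return {
        'table': table,
        'schema': 'folder:STRING, last_file:STRING',
        'write_disposition': 'WRITE_TRUNCATE',
        # Assumption: ask for load jobs instead of streaming inserts.
        'method': 'FILE_LOADS',
    }


def add_write_step(rows, table):
    """Attach the write step to a PCollection of row dicts."""
    # Imported lazily so build_write_kwargs can be inspected without Beam installed.
    import apache_beam as beam
    return rows | 'Update table' >> beam.io.WriteToBigQuery(**build_write_kwargs(table))


kwargs = build_write_kwargs('MYPROJECT:MYDATASET.MYTABLE')
print(kwargs['method'], kwargs['write_disposition'])
```

Whether this changes the outcome depends on the Beam version and runner, so it is a diagnostic step rather than a confirmed fix.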

In this example I parse XML elements into a DataFrame and push it to GBQ. Hopefully you will find something useful here.

import pandas as pd
import xml.etree.ElementTree as ET
import datetime
import requests
import pandas_gbq

# authentication: working now....
login = 'FN.LN@your_email.com' 
password = 'your_AS_pswd'


AsOfDate = datetime.datetime.today().strftime('%m-%d-%Y')

#1) SLA=471162: Execute Query
REQUEST_URL = 'https://www.some_data.com'
response = requests.get(REQUEST_URL, auth=(login, password))
xml_data = response.text.encode('utf-8', 'ignore') 

#print(response.text)

#tree = etree.parse(xml_data)
root = ET.fromstring(xml_data)

# start collecting root elements and headers for data frame 1
desc = root.get("SLA_Description")
frm = root.get("start_date")
thru = root.get("end_date")
dev = root.get("obj_device")
loc = root.get("locations")
loc = loc[:-1]
df1 = pd.DataFrame([['From:',frm],['Through:',thru],['Object:',dev],['Location:',loc]])
df1.columns = ['SLAs','Analytics']
#print(df1)

# start getting the analytics for data frame 2
data=[['Goal:',root[0][0].text],['Actual:',root[0][1].text],['Compliant:',root[0][2].text],['Errors:',root[0][3].text],['Checks:',root[0][4].text]]
df2 = pd.DataFrame(data)
df2.columns = ['SLAs','Analytics']
#print(df2)

# merge data frame 1 with data frame 2
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
df3 = pd.concat([df1, df2], ignore_index=True)
#print(df3)

# append description and today's date onto data frame
df3['Description'] = desc
df3['AsOfDate'] = AsOfDate

#df3.dtypes

# push from data frame, where data has been transformed, into Google BQ
# defaults omitted; private_key and verbose are deprecated in newer pandas-gbq releases
pandas_gbq.to_gbq(df3, 'website.Metrics', project_id='your-firm', if_exists='append')
print('Execute Query, Done!!')
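
Because the code above depends on a live endpoint, the parsing step is easier to follow on an inline document. The sketch below assumes a hypothetical XML layout (a root element carrying the attributes read above, with one child whose sub-elements hold the five metrics); the real feed's structure is not shown in the answer, and all element names and values here are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element names and values are made up.
SAMPLE = """<report SLA_Description="Homepage uptime" start_date="12-01-2019"
        end_date="12-06-2019" obj_device="web-01" locations="EU,">
  <summary>
    <goal>99.9</goal><actual>99.5</actual><compliant>No</compliant>
    <errors>3</errors><checks>600</checks>
  </summary>
</report>"""

root = ET.fromstring(SAMPLE)

# Attributes on the root element, read with root.get(...) as in the answer.
header_rows = [
    ['From:', root.get('start_date')],
    ['Through:', root.get('end_date')],
    ['Object:', root.get('obj_device')],
    ['Location:', root.get('locations')[:-1]],  # drop the trailing separator
]

# Child elements addressed by position, as with root[0][i].text in the answer.
labels = ['Goal:', 'Actual:', 'Compliant:', 'Errors:', 'Checks:']
metric_rows = [[label, root[0][i].text] for i, label in enumerate(labels)]

rows = header_rows + metric_rows
for label, value in rows:
    print(label, value)
```

Positional access such as `root[0][1]` breaks silently if the feed reorders its elements, so looking elements up by tag name may be more robust when the names are known.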

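I ran a simplified version and it seems to work for me. Note that streaming inserts take a while to show up if you use Preview in the BigQuery UI instead of running SELECT folder, last_file ...; a query should return the correct results immediately.

I agree with @GuillemXercavins. How are you verifying that the records do not appear in BigQuery? Can you try running a query for the results?

Hi, thanks for your answers! I know that records inserted via streaming may appear after a delay, but I am sure my records are not being inserted into BigQuery. I ran a query against the data and got no results. I waited 2 hours, and the streaming buffer should be flushed within 90 minutes to make the data accessible. Finally, I forced a buffer flush with: CREATE OR REPLACE TABLE MYPROJECT.MYDATASET.MYTABLE AS SELECT * FROM MYPROJECT.MYDATASET.MYTABLE. Am I missing something?

Given the fully qualified name, it seems your configuration may not be set; the table name in the logs is MYPROJECT:MYDATASET.MYTABLE. Is table=config['ENVIRONMENT']['configuration_table'] working correctly? Could you try logging it?

Could it be that 'WRITE_TRUNCATE' overwrites the table data every time? Maybe try 'WRITE_APPEND'.

Thanks for the answer, but the proposed solution is not what I am looking for. I want to avoid gbq and use Dataflow's native methods instead.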
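
The comments suggest two checks: log the table name that the config resolves to, and verify the result with a query rather than the Preview tab. A minimal sketch of both, using only the standard library; `split_table_spec` and `verification_query` are hypothetical helpers, and the resulting SQL would still have to be run in the BigQuery console or through a client library.

```python
import logging


def split_table_spec(spec):
    """Split 'PROJECT:DATASET.TABLE' (or 'DATASET.TABLE') into its parts."""
    project, sep, rest = spec.partition(':')
    if not sep:
        # No project prefix was given.
        project, rest = None, spec
    dataset, _, table = rest.partition('.')
    if not dataset or not table:
        raise ValueError('not a valid table spec: %r' % spec)
    return project, dataset, table


def verification_query(spec):
    """Build a standard SQL query that selects the two written columns."""
    parts = [part for part in split_table_spec(spec) if part]
    return 'SELECT folder, last_file FROM `%s`' % '.'.join(parts)


# Table name as it appears in the debug log of the question.
spec = 'MYPROJECT:MYDATASET.MYTABLE'
logging.basicConfig(level=logging.INFO)
logging.info('Writing to table %s', spec)
print(verification_query(spec))
```

Logging the resolved table name next to the pipeline definition makes it easy to confirm that the write targets the same table that is later queried.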
