Python 3.x: Monitoring WriteToBigQuery

python-3.x google-bigquery google-cloud-dataflow

In my pipeline I use WriteToBigQuery like this:

| beam.io.WriteToBigQuery(
    'thijs:thijsset.thijstable',
    schema=table_schema,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)

What I would like to do is use the result of that step in the next one, roughly like this (pseudocode):

| beam.io.WriteToBigQuery(
    'thijs:thijsset.thijstable',
    schema=table_schema,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)
| ['FailedRows'] from previous step
| "print" >> beam.Map(print)

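WriteToBigQuery returns a dict, as described in the documentation:

The beam.io.WriteToBigQuery PTransform returns a dictionary whose BigQueryWriteFn.FAILED_ROWS entry contains a PCollection of all the rows that failed to be written.

How do I print this dict and turn it into a PCollection, or how do I print just the FAILED_ROWS?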

If I do:

| "print" >> beam.Map(print)

then I get:

AttributeError: 'dict' object has no attribute 'pipeline'

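I must have read a hundred pipelines, but I have never seen anything come after WriteToBigQuery.

[edit] When I finish the pipeline and store the result in a variable, I get the following: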

{'FailedRows': <PCollection[WriteToBigQuery/StreamInsertRows/ParDo(BigQueryWriteFn).FailedRows] at 0x7f0e0cdcfed0>}

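Dead letters for handling invalid inputs are a common Beam/Dataflow pattern. They work with both the Java and the Python SDK, but there are not many examples for the latter.

Suppose we have some dummy input data with 10 good lines and one bad row that does not conform to the table schema: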

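schema = "index:INTEGER,event:STRING"
# Ten well-formed "index,event" lines...
data = ['{0},good_line{1}'.format(i + 1, i + 1) for i in range(10)]
# ...plus one line that does not match the schema.
data.append('this a bad row')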

Then, what I do is give the write result a name (events, in this case):

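# p is an existing beam.Pipeline; PROJECT holds the GCP project ID.
events = (p
    | "Create data" >> beam.Create(data)
    | "CSV to dict" >> beam.ParDo(CsvToDictFn())
    | "Write to BigQuery" >> beam.io.gcp.bigquery.WriteToBigQuery(
        "{0}:dataflow_test.good_line".format(PROJECT),
        schema=schema,
    )
)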
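
CsvToDictFn is not defined in the snippet above. A minimal sketch, assuming it simply pairs the comma-separated values with the schema's field names; FIELD_NAMES and csv_to_dict are hypothetical names, and the Beam import is guarded only so that the parsing logic can be tried without the SDK installed:

```python
# Hypothetical helper names; only CsvToDictFn appears in the pipeline above.
FIELD_NAMES = ["index", "event"]  # assumed to mirror "index:INTEGER,event:STRING"


def csv_to_dict(line):
    """Pair each comma-separated value with the field name at the same position."""
    # zip() stops at the shorter input, so a line without commas
    # yields a dict that only has the first field.
    return dict(zip(FIELD_NAMES, line.split(",")))


try:
    import apache_beam as beam
except ImportError:  # lets the parsing logic run without the Beam SDK
    beam = None

if beam is not None:
    class CsvToDictFn(beam.DoFn):
        """DoFn wrapper so the parser can be used with beam.ParDo."""

        def process(self, element):
            yield csv_to_dict(element)
```

Under this assumption the DoFn does not reject a malformed line itself. The line becomes a dict with the wrong shape, and it is the BigQuery write that refuses the row.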

and then access the failed-rows side output:

(events[beam.io.gcp.bigquery.BigQueryWriteFn.FAILED_ROWS]
    | "Bad lines" >> beam.io.textio.WriteToText("error_log.txt"))

This works well with the DirectRunner: the good lines are written to BigQuery, and the bad one is saved to a local file:

$ cat error_log.txt-00000-of-00001
('PROJECT_ID:dataflow_test.good_line', {'index': 'this a bad row'})

If you run it with the DataflowRunner, you will need some additional flags. If you hit the error

TypeError: 'PDone' object has no attribute '__getitem__'

you need to add

--experiments=use_beam_bq_sink

to use the new BigQuery sink.

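If you get

KeyError: 'FailedRows'

it is because the new sink defaults to BigQuery load jobs for batch pipelines. The documentation for the method parameter says:

STREAMING_INSERTS, FILE_LOADS, or DEFAULT. DEFAULT will use STREAMING_INSERTS on streaming pipelines and FILE_LOADS on batch pipelines.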

You can override that behavior by specifying method='STREAMING_INSERTS' in WriteToBigQuery.
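
The override and the failed-rows lookup can be combined in one small helper. This is only a sketch: the function name failed_rows_after_write is hypothetical, and it assumes a Beam release where the write result can be indexed with BigQueryWriteFn.FAILED_ROWS, as in the snippets above. The import sits inside the function so that defining the helper does not itself require the SDK.

```python
def failed_rows_after_write(rows, table, schema):
    """Write `rows` with streaming inserts and return the PCollection of failed rows.

    Hypothetical helper: `rows` is a PCollection of dicts, `table` a
    'project:dataset.table' string and `schema` a BigQuery schema string.
    """
    # Local import: the module can be loaded without apache_beam installed.
    from apache_beam.io.gcp.bigquery import BigQueryWriteFn, WriteToBigQuery

    result = rows | "Write to BigQuery" >> WriteToBigQuery(
        table,
        schema=schema,
        # Force streaming inserts so batch pipelines do not fall back to load jobs.
        method="STREAMING_INSERTS",
    )
    # The write result maps 'FailedRows' to the rows BigQuery rejected.
    return result[BigQueryWriteFn.FAILED_ROWS]
```

The returned PCollection can then be consumed like any other, for example with | "print" >> beam.Map(print) or with WriteToText as shown earlier.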

Full code for both DirectRunner and DataflowRunner.