Python: BigQuery does not accept binary data from protobuf

python encoding google-bigquery protocol-buffers google-cloud-dataflow


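I have a Dataflow pipeline that parses data from Pub/Sub and writes it to BigQuery. The data is in proto3 format.

The data I receive from Pub/Sub is encoded with protobuf's SerializeToString() method.
I deserialize it and insert the parsed data into BigQuery, which works perfectly. However, I have been asked to also store the binary protobuf data exactly as I receive it, in case something goes wrong at insertion time.
To do that, I created a simple BigQuery table with a single field, "data", of type BYTES.

So I added a step to my pipeline that just takes the data from the Pub/Sub message and returns it: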

class GetBytes(beam.DoFn):
    def process(self, element):

        obj: Dict = {
            'data': element.data
        }
        logging.info(f'data bytes: {obj}')
        logging.info(f'data type: {type(obj["data"])}')
        return [obj]

Here are the lines in the pipeline that I use to insert into BQ:

    bytes_status = (status | 'Get Bytes Result' >> beam.ParDo(GetBytes()))
    bytes_status | 'Write to BQ BackUp' >> beam.io.WriteToBigQuery('my_project:my_dataset.my_table')

The log seems to show the correct data:

2020-09-29 11:16:40.094 CEST data bytes: [...]\x08\r\x10\x02\n\x04\x08\x0e\x10\x02\n\x04\x08\x0f\x10\x02\n\x04\x08\x10\x10\x02\n\x04\x08\x11\x10\x01\n[...]\x0c\n\x04\x08\r\x10\x02\n\x04\x08\x0e\x10\x02\n\x04\x08\x0f\x10\x02\n\x04\x08\x10\x10\x02\n\x04\x08\x11\x10\x02\n\x04\x08\x12\x10\x02\x10\xb4\x95\x99\xc9\xcd.}

But I keep getting the following error:

UnicodeDecodeError: 'utf-8 [while running 'GeneratedPtTransform-297']' codec can't decode byte 0x89 in position 101: invalid start byte

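(The error may not correspond to the log above, but the message is always of this kind.)

I tried inserting bytes data from the BigQuery UI, and everything went fine.

Any idea what is going wrong?

Thanks :)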

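BigQuery requires BYTES values to be base64-encoded when they are written this way. You can find some documentation and links with more details at [link not preserved].

Oh, OK, so the Apache Beam SDK for Python uses legacy SQL rather than standard SQL... what a pity!

Which version of Beam are you using?

Hey @Pablo, I'm using apache-beam[gcp]==2.24.0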
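To illustrate the base64 suggestion from the answer, here is a minimal sketch that uses only the standard library. The helper name `to_bq_bytes_row` and its `field` parameter are hypothetical and do not come from the original post; the idea is simply to turn the raw protobuf bytes into base64 text before the row reaches `WriteToBigQuery`.

```python
import base64
from typing import Dict


def to_bq_bytes_row(payload: bytes, field: str = "data") -> Dict[str, str]:
    """Build a row dict whose BYTES field holds base64 text instead of raw bytes.

    Hypothetical helper for illustration; not part of Beam or BigQuery.
    """
    # b64encode returns ASCII-only bytes; decode them to str so the row
    # can be serialized as JSON without any UTF-8 decoding problems.
    return {field: base64.b64encode(payload).decode("ascii")}


# Serialized protobuf is generally not valid UTF-8
# (0x89 is the byte reported in the error message).
raw = b"\n\x04\x08\r\x10\x02\x89"
row = to_bq_bytes_row(raw)

# The encoded value decodes back to the original bytes.
assert base64.b64decode(row["data"]) == raw
```

In the question's pipeline, `GetBytes.process` would presumably return `[to_bq_bytes_row(element.data)]` instead of a dict containing the raw bytes. Whether any further change is needed for Beam 2.24.0 is not confirmed by the thread.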