Python Beam / Google Cloud Dataflow: data missing when reading from Pub/Sub

python google-bigquery google-cloud-dataflow

I have two Dataflow pipelines (Pub/Sub to BigQuery) with the following code:

import logging

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class transform_class(beam.DoFn):
    # Logs each Pub/Sub message to Stackdriver and passes it through unchanged.
    def process(self, element, publish_time=beam.DoFn.TimestampParam, *args, **kwargs):
        logging.info(element)
        yield element


class identify_and_transform_tables(beam.DoFn):
    # Adds the publish timestamp.
    # The topic carries data from multiple tables, so this DoFn
    # identifies the tables and splits them apart.
    pass  # body omitted in the original post


def run(pipeline_args=None):
    # `save_main_session` is set to True because some DoFns rely on
    # globally imported modules.
    pipeline_options = PipelineOptions(
        pipeline_args, streaming=True, save_main_session=True)

    # Leaving the `with` block runs the pipeline, so no explicit
    # pipeline.run() call is needed afterwards.
    with beam.Pipeline(options=pipeline_options) as pipeline:
        lines = (
            pipeline
            | 'Read PubSub Messages' >> beam.io.ReadFromPubSub(
                topic='topic name', with_attributes=True)
            | 'Transforming Messages' >> beam.ParDo(transform_class())
            | 'Identify Tables' >> beam.ParDo(
                identify_and_transform_tables()).with_outputs('table_name'))

        table_name = lines.table_name
        table_name = (
            table_name
            | 'Write table_name to BQ' >> beam.io.WriteToBigQuery(
                table='table_name',
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))

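Both pipelines read from the same Pub/Sub topic. While reconciling, I found that some data was missing, and the missing data differed between the two pipelines. For example:

Rows 56-62 are missing in pipeline 1 but present in pipeline 2
Rows 90-95 are missing in pipeline 2 but present in pipeline 1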

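So this means the data does exist in the Pub/Sub topic.
As you can see in the code, the first function logs each Pub/Sub message directly to Stackdriver. In addition to BigQuery, I also double-checked the Stackdriver logs for the missing data.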
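A reconciliation like the one described above can be sketched as follows. This is only an illustration: it assumes every row carries a sortable integer ID, and the function name `missing_ranges` and the sample ID lists are hypothetical, not taken from the original pipelines.

```python
def missing_ranges(expected_ids, seen_ids):
    """Return (start, end) ranges of expected IDs that are absent from seen_ids.

    A range covers IDs that are adjacent in the sorted expected list.
    """
    seen = set(seen_ids)
    ranges = []
    start = prev = None
    for row_id in sorted(expected_ids):
        if row_id in seen:
            # A present row closes any open range of missing rows.
            if start is not None:
                ranges.append((start, prev))
                start = None
        else:
            if start is None:
                start = row_id
            prev = row_id
    if start is not None:
        ranges.append((start, prev))
    return ranges


# Hypothetical sample data mirroring the rows mentioned in the question.
expected = list(range(1, 101))
pipeline_1 = [i for i in expected if not 56 <= i <= 62]
pipeline_2 = [i for i in expected if not 90 <= i <= 95]

gaps_1 = missing_ranges(expected, pipeline_1)
gaps_2 = missing_ranges(expected, pipeline_2)
```

Running the same check against each pipeline's output makes it easy to see that the gaps do not overlap, which is what suggests the messages were published but not delivered to a given pipeline.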

Another thing I found is that the missing data occurs in chunks of time. For example, rows 56-62 have timestamps of "2019-12-03 05:52:18.754150 UTC" or within milliseconds of it.

Therefore, my only conclusion is that Dataflow sometimes drops data when reading from Pub/Sub.

Any help is much appreciated.

I'm not sure what happened in this case, but there is an important rule to follow to prevent data loss:

Don't read from a topic, as in

beam.io.ReadFromPubSub(topic='topic name')

Do read from a subscription, as in

beam.io.ReadFromPubSub(subscription='subscription name')

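This is because, when reading from a topic, a restart creates a new subscription, and that subscription may contain only the data received after it was created. If you create the subscription beforehand, the data is retained in the subscription until it is read (or expires).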
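As a minimal sketch of the advice above, the read step can be built against a subscription that was created beforehand. The helper names `subscription_path` and `build_read_step`, and the project and subscription IDs, are assumptions made for illustration. `ReadFromPubSub` expects the subscription in the form `projects/<project>/subscriptions/<subscription>`.

```python
def subscription_path(project_id, subscription_id):
    """Build the fully qualified subscription name that ReadFromPubSub expects."""
    return f"projects/{project_id}/subscriptions/{subscription_id}"


def build_read_step(project_id, subscription_id):
    """Return a ReadFromPubSub transform bound to an existing subscription."""
    # Imported here so the path helper can be used without Beam installed.
    import apache_beam as beam

    return beam.io.ReadFromPubSub(
        subscription=subscription_path(project_id, subscription_id),
        with_attributes=True,
    )


# Hypothetical identifiers; replace them with your own.
path = subscription_path("my-project", "my-subscription")
```

In the pipeline, the step would then be applied as `pipeline | 'Read PubSub Messages' >> build_read_step('my-project', 'my-subscription')`. Each pipeline needs its own subscription, because pipelines sharing one subscription split the messages between them.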

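Hi, I created a subscription as you suggested, pulling from the same topic. Then I created a third pipeline that reads from that subscription. I got the same result: this time the third pipeline (reading from the subscription) is also missing data, and the missing data differs from that of pipelines 1 and 2. Any ideas?

Interesting! So every row was logged in Stackdriver as it passed through the third pipeline, but never landed in BigQuery? How much data is being transferred? (In case I want to reproduce the scenario.)

No sir, the rows are missing in Stackdriver as well. They were never read. My data volume isn't very large: currently about 1 to 2k records per second at peak, and fewer than 20 records per second when it is slow. If the rows were in Stackdriver but not in BigQuery, I would conclude that something is wrong with my code, but that is currently not the case.

@FelipeHoffa, nice suggestion, although in some cases Beam acks a message before it reaches its destination (e.g. before it is written to BQ).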
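The diagnostic reasoning in these comments (rows missing from Stackdriver were never read, while rows logged but absent from BigQuery were lost after the read) can be sketched as a small check. The function name `locate_loss` and the sample IDs are hypothetical, and the sketch assumes each message carries a unique ID that appears in both the logs and the table.

```python
def locate_loss(published_ids, logged_ids, bigquery_ids):
    """Classify missing messages by the stage at which they disappeared."""
    published = set(published_ids)
    logged = set(logged_ids)
    stored = set(bigquery_ids)
    return {
        # Published but never logged by the first DoFn: not read from Pub/Sub.
        "never_read": sorted(published - logged),
        # Logged by the first DoFn but absent from BigQuery: lost downstream.
        "lost_after_read": sorted((published & logged) - stored),
    }


# Hypothetical sample: messages 5 and 6 were never read, message 10 was
# read and logged but never written to BigQuery.
report = locate_loss(
    published_ids=range(1, 11),
    logged_ids=[1, 2, 3, 4, 7, 8, 9, 10],
    bigquery_ids=[1, 2, 3, 4, 7, 8, 9],
)
```

In the situation described in the thread, every missing row would fall under `never_read`, which points at the read side rather than at the transforms or the BigQuery write.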