How to read multiple Datastore kinds in a Cloud Dataflow Python pipeline

google-cloud-platform google-cloud-dataflow

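I am trying to read multiple Datastore kinds from the default namespace in a Python pipeline, and I want to process them. The function I wrote works fine locally with DirectRunner, but when I run the pipeline in the cloud with DataflowRunner, one kind (containing 1500 records) is read very quickly, while the other (containing millions of records) is read very slowly.

For reference, when I read only the one kind (the one with millions of records) in the pipeline, it takes 10 minutes. When I read both kinds together, it takes almost 1 hour and has still processed only 1/10 of the records.

I can't figure out what the problem is.

Here is my code: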

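import apache_beam as beam
from apache_beam.io.gcp.datastore.v1.datastoreio import ReadFromDatastore
from apache_beam.pvalue import AsIter
from google.cloud.proto.datastore.v1 import query_pb2  # assumed import path for the v1 Datastore API

def read_from_datastore(project, user_options, pipeline_options):
  p = beam.Pipeline(options=pipeline_options)

  # First kind: students, the one with millions of records.
  query = query_pb2.Query()
  query.kind.add().name = user_options.kind
  students = p | 'ReadFromDatastore' >> ReadFromDatastore(project=project, query=query)

  # Second kind: courses, the one with about 1500 records.
  query = query_pb2.Query()
  query.kind.add().name = user_options.kind2
  courses = p | 'ReadFromDatastore2' >> ReadFromDatastore(project=project, query=query)

  # filter_closed_courses and ProfileDataDumpDataFlow are defined elsewhere (not shown).
  open_courses = courses | 'closed' >> beam.FlatMap(filter_closed_courses)
  # The filtered courses are passed to every student as an iterable side input.
  enrolled_students = students | beam.ParDo(ProfileDataDumpDataFlow(), AsIter(open_courses))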

If anyone knows why this is happening, please let me know.

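I see that you are performing a join between two kinds. For that, it would be more appropriate and faster to load the data into BigQuery and perform the required join there.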

It is not reading the entities that takes the time in your job, but the join operation.
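For comparison, here is a minimal sketch of how such a join is often written in Beam with a dictionary side input, so that each student does a keyed lookup instead of iterating over all courses. It assumes both kinds have already been converted to plain dicts, and the field names `id`, `course_id`, `name` and `status` are hypothetical; none of them come from the question. Whether this changes the timing reported above is not something the thread establishes.

```python
def is_open(course):
    """Hypothetical filter: keep courses whose 'status' field is not 'closed'."""
    return course.get("status") != "closed"


def attach_course(student, open_courses_by_id):
    """Yield the student with its course name attached, if that course is open."""
    course = open_courses_by_id.get(student["course_id"])
    if course is not None:
        yield {**student, "course_name": course["name"]}


def build_join(students, courses):
    """Wire the join into a pipeline; both arguments are PCollections of dicts."""
    # Imported lazily so the two helpers above can be exercised without Beam.
    import apache_beam as beam

    open_by_id = (
        courses
        | "KeepOpen" >> beam.Filter(is_open)
        | "KeyById" >> beam.Map(lambda c: (c["id"], c))
    )
    # AsDict materialises the small collection as a dict on each worker.
    return students | "JoinCourses" >> beam.FlatMap(
        attach_course, open_courses_by_id=beam.pvalue.AsDict(open_by_id)
    )
```

The side input suits this case only because the courses kind is small (about 1500 records); for two large collections a `CoGroupByKey` or a BigQuery join, as suggested above, would be the usual choice.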

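Can you share the settings you specified, in particular num_workers, max_num_workers and machine_type? You can also look into how to do relational joins with Dataflow.

Hey @Yurci, the pipeline options are the default Google Dataflow options, with no modifications. I don't think the join is the problem here. When I comment out the ParDo, read the two kinds together and just print them, the behavior is the same: the 1500 or so courses are read almost immediately, while the students, with millions of records, take a very long time. On the other hand, when I hard-code about 200-300 courses and run it, the join works perfectly and I get the output for all the student data in about 15-20 minutes.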
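The worker settings asked about in the comments can be set explicitly instead of being left at their defaults. A minimal sketch, assuming the standard Beam worker flags; the helper names and the example values in the usage note are hypothetical, not a recommendation for this job:

```python
def dataflow_worker_flags(project, num_workers, max_num_workers, machine_type):
    """Build command-line style flags for a Dataflow job with explicit worker settings."""
    return [
        "--runner=DataflowRunner",
        "--project={}".format(project),
        "--num_workers={}".format(num_workers),          # initial worker count
        "--max_num_workers={}".format(max_num_workers),  # autoscaling upper bound
        "--worker_machine_type={}".format(machine_type),
    ]


def build_options(flags):
    """Turn the flags into a PipelineOptions object."""
    # Imported lazily so the flag builder can be used without Beam installed.
    from apache_beam.options.pipeline_options import PipelineOptions

    return PipelineOptions(flags)
```

For example, `build_options(dataflow_worker_flags("my-project", 5, 20, "n1-standard-4"))` could be passed as `pipeline_options` to `read_from_datastore`. A real job would typically also need flags such as a temp or staging location, which are omitted here.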