How do I profile a Python Dataflow job?

python google-cloud-dataflow

I wrote a Python Dataflow job to process some data:

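    (
        pipeline
        | "read" >> beam.io.ReadFromText(known_args.input)  # 9 min 44 sec
        | "parse_line" >> beam.Map(parse_line)  # 4 min 55 sec
        | "add_key" >> beam.Map(add_key)  # 48 sec
        | "group_by_key" >> beam.GroupByKey()  # 11 min 56 sec
        | "map_values" >> beam.ParDo(MapValuesFn())  # 11 min 40 sec
        | "json_encode" >> beam.Map(json.dumps)  # 26 sec
        | "output" >> beam.io.textio.WriteToText(known_args.output)  # 22 sec
    )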

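(I have removed business-specific naming.)

The input is a 1.36 GiB gz-compressed CSV, yet the job takes 37 min 34 sec to run. (I am using Dataflow because I expect the size of the input to grow rapidly.)

How can I identify the bottleneck in the pipeline and speed up its execution? None of the individual functions is computationally expensive.

Autoscaling information from the Dataflow console: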

12:00:35 PM     Starting a pool of 1 workers. 
12:05:02 PM     Autoscaling: Raised the number of workers to 2 based on the rate of progress in the currently running step(s).
12:10:02 PM     Autoscaling: Reduced the number of workers to 1 based on the rate of progress in the currently running step(s).
12:29:09 PM     Autoscaling: Raised the number of workers to 3 based on the rate of progress in the currently running step(s).
12:35:10 PM     Stopping worker pool.

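I searched the dev@beam.apache.org list and found a thread discussing this topic:

You can check that thread for useful information, and raise questions, requests, or discussion there if needed.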

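By accident, I discovered that the problem in this case was the compression of the CSV.

The input was a single gz-compressed CSV. To make the data easier to inspect, I switched to an uncompressed CSV. This reduced the processing time to under 17 minutes, and Dataflow's autoscaling peaked at 10 workers.

(If I still needed compression, I would split the CSV into several parts and compress each part separately.)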
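The split-then-compress idea above could be sketched as follows. This is only an illustration using the Python standard library: the function name `split_and_gzip`, the `part-NNNNN.csv.gz` naming scheme, and the `lines_per_part` default are assumptions, not anything from the original job. It splits on line boundaries, so it assumes no CSV field contains an embedded newline and that the file has no header row that needs repeating.

```python
import gzip
import os


def split_and_gzip(src_path, dest_dir, lines_per_part=100_000):
    """Split a text file into gzip parts of at most lines_per_part lines each.

    Hypothetical helper: returns the list of part paths written, in order.
    """
    os.makedirs(dest_dir, exist_ok=True)
    part_paths = []
    out = None
    try:
        # newline="" keeps the original line endings untouched.
        with open(src_path, "rt", encoding="utf-8", newline="") as src:
            for i, line in enumerate(src):
                if i % lines_per_part == 0:
                    # Start a new part: close the previous one first.
                    if out is not None:
                        out.close()
                    path = os.path.join(
                        dest_dir, f"part-{len(part_paths):05d}.csv.gz"
                    )
                    out = gzip.open(path, "wt", encoding="utf-8", newline="")
                    part_paths.append(path)
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return part_paths
```

After uploading the parts, they could be passed to `beam.io.ReadFromText` as a single file pattern (a glob over the part files), so that each part can be read independently of the others.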

I came across this Python profiler package on Google:
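Whichever package that was, the per-element functions can also be profiled locally on a sample of the input with the standard library's `cProfile`. The following is a minimal sketch, not the method from the linked package: `parse_line` here is a hypothetical stand-in for the real function, and `profile_over_sample` is an illustrative helper name.

```python
import cProfile
import io
import pstats


def parse_line(line):
    # Hypothetical stand-in for the pipeline's real parse_line.
    return line.rstrip("\n").split(",")


def profile_over_sample(fn, sample, top=10):
    """Run fn over every element of sample under cProfile; return a text report."""
    profiler = cProfile.Profile()
    profiler.enable()
    for element in sample:
        fn(element)
    profiler.disable()
    buf = io.StringIO()
    # Sort by cumulative time and keep only the `top` most expensive entries.
    pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(top)
    return buf.getvalue()


if __name__ == "__main__":
    # Synthetic sample; in practice this would be a slice of the real input.
    sample_lines = [f"{i},value-{i}\n" for i in range(10_000)]
    print(profile_over_sample(parse_line, sample_lines))
```

This only measures CPU time spent inside the Python functions themselves. It does not capture time spent in Dataflow's reads, shuffles, or worker scheduling, so it would not by itself have revealed an input-format bottleneck.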

Maybe you could ask on dev@beam.apache.org?