Running a Python script on GCP Dataflow

python google-cloud-platform google-cloud-dataflow


I have started experimenting with Google Cloud Dataflow. After the classic wordcount example, I wrote my own script:

import argparse
import sys

import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io import WriteToText
from apache_beam.options.pipeline_options import PipelineOptions


class Split(beam.DoFn):

    def process(self, element):
        (numfact, bag, type, owner,
         main_owner, client) = element.split('\t')

        return [{
            'numfact': int(numfact),
            'type': type,
            'owner': owner
        }]


parser = argparse.ArgumentParser()

parser.add_argument('--input')
parser.add_argument('--output')

known_args, extra_args = parser.parse_known_args(sys.argv[1:])

options = PipelineOptions(extra_args)
p = beam.Pipeline(options=options)
print(known_args)
print(extra_args)
csv_lines = (
    p
    | "Load" >> ReadFromText(known_args.input, skip_header_lines=1)
    | "Process" >> beam.ParDo(Split())
    | "Write" >> WriteToText(known_args.output)
)

Here is a sample of the input file:

Numfact BAG TYPE    OWNER   MAIN OWNER  CLIENT
728632636   CNT Alternativos    Kramer Ortiz    ACCIDENTES PERSONALES TELETICKET    Rimac
704845964   CNT Alternativos    Kramer Ortiz    SOAT    Canal
701387639   CNT SIN ASIGNAR Sin asignar WEB VEHICULOS   Canal
692571746   CNT Concesionarios  Kramer Ortiz    WEB VEHICULOS   Canal
682823453   CNT Alternativos    Kramer Ortiz    WEB VEHICULOS   Canal
682823452   CNT Alternativos    Kramer Ortiz    WEB VEHICULOS   Canal
682823451   CNT Alternativos    Kramer Ortiz    WEB VEHICULOS   Canal
682823454   CNT Alternativos    Kramer Ortiz    WEB VEHICULOS   Canal
706853395   CNT Alternativos    Kramer Ortiz    ACCIDENTES PERSONALES - WEB Canal
706466281   CNT Alternativos    Kramer Ortiz    SOAT    Canal

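Finally, I invoke it as follows (the file is saved as .txt):

After that, the print output appears in my terminal, but no job is registered in the Dataflow console.

Update

This is what the console looks like: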

(gcp) gocht@~/script$ python -m beam --input gs://dummy_bucket/data_entry/pcd/pcd_ensure.txt --output gs://dummy_bucket/outputs --runner DataflowRunner --project dummyproject-268120 --temp_location gs://dummy_bucket/tmp --region us-central1
Namespace(input='gs://dummy_bucket/data_entry/pcd/pcd_ensure.txt', output='gs://dummy_bucket/outputs')
['--runner', 'DataflowRunner', '--project', 'dummyproject-268120', '--temp_location', 'gs://dummy_bucket/tmp', '--region', 'us-central1']

This only shows the output of the print calls placed in the script.
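As a side note, the printed output above comes purely from `argparse` and can be reproduced without Beam. The sketch below shows how `parse_known_args` keeps the options the script declares and hands everything else back as a list, which the script then passes to `PipelineOptions`. The shortened bucket and project names are placeholders, not real resources.

```python
import argparse

# The same two script-level options as in the question.
parser = argparse.ArgumentParser()
parser.add_argument('--input')
parser.add_argument('--output')

# A shortened command line; bucket and project names are placeholders.
argv = [
    '--input', 'gs://dummy_bucket/in.txt',
    '--output', 'gs://dummy_bucket/outputs',
    '--runner', 'DataflowRunner',
    '--project', 'dummyproject',
]

# Declared options land in known_args; unrecognized ones are returned as-is.
known_args, extra_args = parser.parse_known_args(argv)
print(known_args)
print(extra_args)
```

Seeing both values printed therefore only confirms that the arguments were parsed; it says nothing about whether a job was submitted.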

What am I missing?

Thanks.

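You will need

result = p.run()

at the end of the file to run the pipeline.

Basically, I think you have built the pipeline but never actually asked for it to be run.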

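Since the answer is in the comments, I am writing it here as well :)

You need to actually run the pipeline by doing the following:

p.run().wait_until_finish()

If you feel stuck and are not sure what is going wrong, try looking at the provided examples. The Java version really helped me a lot when I was getting started with Dataflow :)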
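Another way to make sure the pipeline is always run is to build it inside a `with` block: in the Beam Python SDK, leaving the block calls `run()` and then waits for the result to finish. The following is only a sketch under assumptions, not the asker's final code: `parse_line` and `run` are hypothetical names, and `beam.Map` with a plain function stands in for the `Split` DoFn from the question.

```python
import argparse


def parse_line(line):
    """Turn one tab-separated input line into a dict with the kept fields."""
    numfact, bag, type_, owner, main_owner, client = line.split('\t')
    return {'numfact': int(numfact), 'type': type_, 'owner': owner}


def run(argv):
    # Imported lazily so parse_line can be exercised without Beam installed.
    import apache_beam as beam
    from apache_beam.io import ReadFromText, WriteToText
    from apache_beam.options.pipeline_options import PipelineOptions

    parser = argparse.ArgumentParser()
    parser.add_argument('--input', required=True)
    parser.add_argument('--output', required=True)
    known_args, extra_args = parser.parse_known_args(argv)

    # Exiting the with block runs the pipeline and waits for it to finish.
    with beam.Pipeline(options=PipelineOptions(extra_args)) as p:
        (p
         | 'Load' >> ReadFromText(known_args.input, skip_header_lines=1)
         | 'Process' >> beam.Map(parse_line)
         | 'Write' >> WriteToText(known_args.output))
```

To use this as a script, the file would end with a call such as `run(sys.argv[1:])` (after `import sys`), invoked with the same command-line flags as in the question.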

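Can you provide the output of the command?

@Pievis Updated.

This is the first time I have seen the Python version, but isn't p.run().wait_until_finish() missing from the code?

@Pievis It worked, man! So simple, so effective. Consider posting it as an answer. Thanks.

I'm glad I helped! I followed your advice and posted it as an answer :)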
