How do I run a Dataflow job from Datalab in Python?
I'm having some trouble running Dataflow jobs from Datalab. What I could really use is a minimal working Python example for this situation, since there doesn't seem to be one in the Google Cloud Platform or Apache Beam documentation. It would be very helpful to see some Python code that can be run from a Datalab cell and that does the following:
# 1. Sets up the job
# 2. Defines the processing logic to be applied to the input data files
# 3. Saves the processed files to an output folder
# 4. Submits the job to Google Cloud Dataflow
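To get started, I tried taking the word-count examples from the Google and Apache docs and adapting them for use in Datalab. The code for this is below, but it isn't clear to me which parts I can strip out to turn it into a truly minimal working example.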
from __future__ import absolute_import

import argparse
import logging
import re

from past.builtins import unicode

import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io import WriteToText
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions


def run(argv=None):
    """Main entry point; defines and runs the wordcount pipeline."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--input',
                        dest='input',
                        default='gs://data-analytics/kinglear.txt',
                        help='Input file to process.')
    parser.add_argument('--output',
                        dest='output',
                        default='gs://data-analytics/output',
                        help='Output file to write results to.')
    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_args.extend([
        '--runner=DataflowRunner',
        '--project=project',
        '--staging_location=gs://staging',
        '--temp_location=gs://tmp',
        '--job_name=your-wordcount-job',
    ])

    # We use the save_main_session option because one or more DoFn's in this
    # workflow rely on global context (e.g., a module imported at module level).
    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = True

    with beam.Pipeline(options=pipeline_options) as p:
        # Read the text file[pattern] into a PCollection.
        lines = p | ReadFromText(known_args.input)

        # Count the occurrences of each word.
        counts = (
            lines
            | 'Split' >> (beam.FlatMap(lambda x: re.findall(r'[A-Za-z\']+', x))
                          .with_output_types(unicode))
            | 'PairWithOne' >> beam.Map(lambda x: (x, 1))
            | 'GroupAndSum' >> beam.CombinePerKey(sum))

        # Format the counts into a PCollection of strings.
        def format_result(word_count):
            (word, count) = word_count
            return '%s: %s' % (word, count)

        output = counts | 'Format' >> beam.Map(format_result)

        # Write the output using a "Write" transform that has side effects.
        output | WriteToText(known_args.output)


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()
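One detail that may matter when adapting this to a notebook cell: `parse_known_args(argv)` falls back to `sys.argv[1:]` when `argv` is `None`, and inside a notebook kernel that list can contain the kernel's own launch arguments. The sketch below uses only the standard library to show how the call splits a list of flags into the script's own arguments and the leftovers that the code above forwards to `PipelineOptions`. `split_args` is a hypothetical helper (not part of Beam), and the bucket and project names are placeholders.

```python
import argparse


def split_args(argv):
    """Split argv into (known_args, pipeline_args), as run() above does."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--input', default='gs://data-analytics/kinglear.txt')
    parser.add_argument('--output', default='gs://data-analytics/output')
    return parser.parse_known_args(argv)


# Passing an explicit list (rather than None) keeps the notebook kernel's
# own command-line arguments out of the pipeline options.
known_args, pipeline_args = split_args([
    '--output=gs://my-bucket/results',  # placeholder bucket
    '--runner=DataflowRunner',
    '--project=my-project',             # placeholder project id
])
print(known_args.input)
print(known_args.output)
print(pipeline_args)
```

Under the same reasoning, calling `run(argv=[])` from a cell would use the default input and output paths together with the hard-coded Dataflow flags only.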
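Thanks in advance,
Josh
I think you are confusing what Datalab does with what Dataflow does. These are two different programming platforms, and you are mixing them together. Your comment:
Defines the processing logic to be applied to the input data files
The processing logic is what the source code (or template) for Cloud Dataflow provides, not the code running in a Cloud Datalab notebook.
As an option: if you install the Cloud Dataflow libraries and use Python 2.x, you can write Cloud Dataflow (Apache Beam) software in a Datalab notebook. This code will run locally inside Datalab and will not launch a Dataflow job.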
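As a rough illustration of the local option just described: the processing logic is ordinary Python that Beam applies to each element. The sketch below assumes Python 3 and that the `apache_beam` package is installed in the notebook environment. `split_words` and `count_words_locally` are hypothetical names; with no runner specified, Beam falls back to its local DirectRunner, so nothing is submitted to Cloud Dataflow.

```python
import re


def split_words(line):
    """Processing logic: return the words found in one line of text."""
    return re.findall(r"[A-Za-z']+", line)


def count_words_locally(lines):
    """Count words with Beam's default local runner and print the results."""
    import apache_beam as beam  # imported here so split_words works without Beam

    with beam.Pipeline() as p:
        (p
         | 'Create' >> beam.Create(lines)
         | 'Split' >> beam.FlatMap(split_words)
         | 'PairWithOne' >> beam.Map(lambda w: (w, 1))
         | 'GroupAndSum' >> beam.CombinePerKey(sum)
         | 'Print' >> beam.Map(print))


print(split_words("Blow, winds, and crack your cheeks!"))
# ['Blow', 'winds', 'and', 'crack', 'your', 'cheeks']
```

Calling `count_words_locally(['to be or not to be'])` in a cell would then print one `(word, count)` tuple per distinct word, in no guaranteed order.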
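Here are some links to help you write software that creates Cloud Dataflow jobs.
Here is a StackOverflow answer that shows how to launch a Dataflow job in Python:
The Google Dataflow docs for Java, which nonetheless explain the required steps well:
Here is the link to the Dataflow Python client API: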
I worked this out with the help of the tutorials here: It is now possible to launch Dataflow jobs from Datalab with the following code:
import apache_beam as beam
# Pipeline options:
options = beam.options.pipeline_options.PipelineOptions()
gcloud_options = options.view_as(beam.options.pipeline_options.GoogleCloudOptions)
gcloud_options.job_name = 'test'
gcloud_options.project = 'project'
gcloud_options.staging_location = 'gs://staging'
gcloud_options.temp_location = 'gs://tmp'
gcloud_options.region = 'europe-west2'
# Worker options:
worker_options = options.view_as(beam.options.pipeline_options.WorkerOptions)
worker_options.disk_size_gb = 30
worker_options.max_num_workers = 10
# Standard options:
options.view_as(beam.options.pipeline_options.StandardOptions).runner = 'DataflowRunner'
# Pipeline:
PL = beam.Pipeline(options=options)
(
    PL | 'read' >> beam.io.ReadFromText('gs://input.txt')
       | 'write' >> beam.io.WriteToText('gs://output.txt', num_shards=1)
)
PL.run()
Thanks,
Josh
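A possible refinement of the code above, offered as a sketch rather than a tested recipe: `Pipeline.run()` returns a result object, and calling `wait_until_finish()` on it blocks the cell until the job ends. Dataflow also tends to reject a job name that is still in use by an active job, so the hypothetical `unique_job_name` helper below appends a UTC timestamp. The naming rule assumed here is lowercase letters, digits and hyphens, with a prefix that starts with a letter.

```python
import re
import time


def unique_job_name(prefix, now=None):
    """Return prefix plus a UTC timestamp, e.g. 'test-19700101-000000'."""
    stamp = time.strftime('%Y%m%d-%H%M%S', time.gmtime(now))
    # Assumed rule: only lowercase letters, digits and hyphens are allowed.
    cleaned = re.sub(r'[^a-z0-9-]+', '-', prefix.lower()).strip('-')
    return '%s-%s' % (cleaned, stamp)


def run_and_wait(pipeline):
    """Run a Beam pipeline, block until it finishes, return its final state."""
    result = pipeline.run()
    result.wait_until_finish()
    return result.state


print(unique_job_name('test', now=0))  # test-19700101-000000
```

With the code above, this would be used as `gcloud_options.job_name = unique_job_name('test')` and `run_and_wait(PL)` in place of `PL.run()`.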