Python: How do I connect to Snowflake using Apache Beam?

I see that BigQuery has a built-in I/O connector, but a lot of our data is stored in Snowflake. Is there a workaround for connecting to Snowflake? The only thing I can think of is to run the query with sqlalchemy and then dump the output to a Cloud Storage bucket, so that Apache Beam can pick up the input data from the files stored in the bucket.
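A minimal sketch of that dump step, assuming the snowflake-sqlalchemy dialect and the google-cloud-storage client are installed; the connection URL, query, and bucket names are placeholders:

```python
# Hypothetical export step: run a Snowflake query via SQLAlchemy and
# stage the result as a CSV file in a Cloud Storage bucket for Beam to read.
import csv
import io

from google.cloud import storage
from sqlalchemy import create_engine, text

# Placeholder connection URL using the snowflake-sqlalchemy dialect.
engine = create_engine("snowflake://<user>:<password>@<account>/<db>/<schema>")

buf = io.StringIO()
writer = csv.writer(buf)
with engine.connect() as conn:
    for row in conn.execute(text("SELECT id, name FROM my_table")):  # example query
        writer.writerow(row)

# Upload the CSV into the bucket the Beam pipeline will read from.
storage.Client().bucket("my-staging-bucket").blob(
    "exports/my_table.csv").upload_from_string(buf.getvalue())
```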

Google Cloud Support

There is no direct connection between Snowflake and Cloud Dataflow, but one workaround is exactly what you mentioned: first dump the output to Cloud Storage, and then connect Cloud Storage to Cloud Dataflow.


I hope this helps.
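For the Beam/Dataflow side of that workaround, a minimal sketch that reads the staged CSV files back out of the bucket (the path and column layout are assumptions):

```python
# Hypothetical Beam pipeline reading the staged CSV files from Cloud Storage.
import csv

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_csv_line(line):
    # Columns here match whatever the export step wrote; this layout is assumed.
    row = next(csv.reader([line]))
    return {"id": row[0], "name": row[1]}

with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        | "ReadStagedFiles" >> beam.io.ReadFromText("gs://my-staging-bucket/exports/*.csv")
        | "ParseCsv" >> beam.Map(parse_csv_line)
        | "Print" >> beam.Map(print)
    )
```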

Recently, Beam added Snowflake Python and Java connectors.

Currently (version 2.24), it only supports the ReadFromSnowflake operation, in apache_beam.io.external.snowflake.
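A minimal read sketch against that 2.24 module path; the parameter names follow the connector's pydoc as I understand it, everything in angle brackets is a placeholder, and csv_mapper turns each record (a list of strings) into whatever Python value you want downstream:

```python
# Hypothetical ReadFromSnowflake usage (Beam 2.24, cross-language transform).
import apache_beam as beam
from apache_beam.io.external.snowflake import ReadFromSnowflake
from apache_beam.options.pipeline_options import PipelineOptions

def csv_mapper(strings_array):
    # Maps one Snowflake record, delivered as a list of strings, to a dict.
    return {"id": strings_array[0], "name": strings_array[1]}

with beam.Pipeline(options=PipelineOptions()) as p:
    rows = p | ReadFromSnowflake(
        server_name="<account>.snowflakecomputing.com",
        username="<user>",
        password="<password>",
        schema="<schema>",
        database="<database>",
        staging_bucket_name="<gcs-staging-bucket>",
        storage_integration_name="<storage-integration>",
        csv_mapper=csv_mapper,
        table="my_table",
    )
    rows | beam.Map(print)
```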

In version 2.25, WriteToSnowflake will also be available, in the apache_beam.io.snowflake module. You will still be able to use the old path, but in that version it will be considered deprecated.
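A corresponding write sketch against the 2.25 module path; the parameter names and the table_schema JSON layout are taken from the connector's documentation as I understand it, so treat them as assumptions, and everything in angle brackets is a placeholder:

```python
# Hypothetical WriteToSnowflake usage (Beam 2.25+, cross-language transform).
import apache_beam as beam
from apache_beam.io.snowflake import WriteToSnowflake
from apache_beam.options.pipeline_options import PipelineOptions

def user_data_mapper(element):
    # Maps one pipeline element to the list of column values to write.
    return [str(element["id"]), element["name"]]

with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        | beam.Create([{"id": 1, "name": "example"}])
        | WriteToSnowflake(
            server_name="<account>.snowflakecomputing.com",
            username="<user>",
            password="<password>",
            schema="<schema>",
            database="<database>",
            staging_bucket_name="<gcs-staging-bucket>",
            storage_integration_name="<storage-integration>",
            create_disposition="CREATE_IF_NEEDED",
            write_disposition="APPEND",
            # Assumed schema JSON format for table creation; check the pydoc.
            table_schema='{"schema": ['
                         '{"dataType": {"type": "integer", "precision": 38, "scale": 0}, "name": "id", "nullable": false},'
                         '{"dataType": {"type": "text"}, "name": "name", "nullable": true}]}',
            user_data_mapper=user_data_mapper,
            table="my_table",
        )
    )
```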

At the moment it only runs on the Flink Runner, but work is in progress to make it available on other runners as well.

Also, it is a cross-language transform, so some extra setup may be needed. It is well documented in the pydoc here (I paste it below):

Snowflake (like most portable IOs) has its own Java expansion service, which should be downloaded automatically when you don't specify your own custom one. I don't think that will be necessary, but I mention it just to be safe. You can download the jar, start it with java -jar, and then pass it to snowflake.ReadFromSnowflake as expansion_service='localhost:<port>'. Link to the 2.24 release:
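For illustration, a compact sketch of pointing the read at a manually started expansion service (the port is whatever you launched the jar on; all other values are placeholders):

```python
# Hypothetical: same read as above, but using a custom expansion service.
from apache_beam.io.external.snowflake import ReadFromSnowflake

read = ReadFromSnowflake(
    server_name="<account>.snowflakecomputing.com",
    username="<user>",
    password="<password>",
    schema="<schema>",
    database="<database>",
    staging_bucket_name="<gcs-staging-bucket>",
    storage_integration_name="<storage-integration>",
    csv_mapper=lambda strings_array: strings_array,
    table="my_table",
    expansion_service="localhost:<port>",  # address of the java -jar service
)
```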


Note that it is still experimental, and feel free to report issues on the Beam Jira.

For anyone coming here later who wants to learn how to get started with Snowflake and Apache Beam, I can recommend the following tutorial written by the creators of the connector.


Thanks Alejandro. Are you suggesting a simple Python script hosted on Kubernetes Engine + Cloud Scheduler to facilitate the data dump to Cloud Storage, or should I use PySpark + Dataproc?
Glad to hear it! Have you been playing with it / involved in it?
Thanks, I was involved in the cross-language effort :)
Snowflake transforms tested against Flink portable runner.
  **Setup**
  Transforms provided in this module are cross-language transforms
  implemented in the Beam Java SDK. During pipeline construction, the Python SDK
  will connect to a Java expansion service to expand these transforms.
  To facilitate this, a small amount of setup is needed before using these
  transforms in a Beam Python pipeline.
  There are several ways to set up cross-language Snowflake transforms.
  * Option 1: use the default expansion service
  * Option 2: specify a custom expansion service
  See below for details regarding each of these options.
  *Option 1: Use the default expansion service*
  This is the recommended and easiest setup option for using Python Snowflake
  transforms. This option requires the following prerequisites
  before running the Beam pipeline.
  * Install a Java runtime on the computer from which the pipeline is constructed
    and make sure that the 'java' command is available.
  In this option, the Python SDK will either download (for released Beam versions) or
  build (when running from a Beam Git clone) an expansion service jar and use
  that to expand transforms. Currently Snowflake transforms use the
  'beam-sdks-java-io-expansion-service' jar for this purpose.
  *Option 2: specify a custom expansion service*
  In this option, you startup your own expansion service and provide that as
  a parameter when using the transforms provided in this module.
  This option requires the following prerequisites before running the Beam
  pipeline.
  * Start up your own expansion service.
  * Update your pipeline to provide the expansion service address when
    initiating Snowflake transforms provided in this module.
  Flink Users can use the built-in Expansion Service of the Flink Runner's
  Job Server. If you start Flink's Job Server, the expansion service will be
  started on port 8097. For a different address, please set the
  expansion_service parameter.
  **More information**
  For more information regarding cross-language transforms see:
  - https://beam.apache.org/roadmap/portability/
  For more information specific to Flink runner see:
  - https://beam.apache.org/documentation/runners/flink/
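
As a concrete illustration of Option 2 with the Flink Runner's built-in expansion service (port 8097 as noted in the pydoc above; all other values are placeholders and the Flink master address is an assumption):

```python
# Hypothetical: running the Snowflake read on the Flink Runner, pointing the
# cross-language transform at the Job Server's built-in expansion service.
import apache_beam as beam
from apache_beam.io.external.snowflake import ReadFromSnowflake
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=FlinkRunner",
    "--flink_master=localhost:8081",  # assumed local Flink cluster
])

with beam.Pipeline(options=options) as p:
    (
        p
        | ReadFromSnowflake(
            server_name="<account>.snowflakecomputing.com",
            username="<user>",
            password="<password>",
            schema="<schema>",
            database="<database>",
            staging_bucket_name="<gcs-staging-bucket>",
            storage_integration_name="<storage-integration>",
            csv_mapper=lambda strings_array: strings_array,
            table="my_table",
            expansion_service="localhost:8097",  # Flink Job Server's expansion service
        )
        | beam.Map(print)
    )
```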