PyFlink tumbling window with Kafka and Pandas won't sink

python pandas apache-kafka apache-flink

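I'm using Flink 1.11 (through the Python API and an Anaconda virtual environment), with Kafka as both my source and my sink. I'm submitting my Flink job to a cluster. All of this runs on Docker (locally).

Since I'm new to this, for now I've set it up as basically a passthrough with some windowing, and I plan to build it up slowly from there.

My setup so far: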

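from pyflink.datastream import StreamExecutionEnvironment, TimeCharacteristic
from pyflink.table import DataTypes, StreamTableEnvironment, TableConfig
from pyflink.table.descriptors import Json, Kafka, Schema
from pyflink.table.window import Tumble

# Note: bootstrap_host (the Kafka bootstrap address) is defined elsewhere and is not shown here.

if __name__ == "__main__":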

  env = StreamExecutionEnvironment.get_execution_environment()
  env.set_stream_time_characteristic(TimeCharacteristic.IngestionTime)
  table_config = TableConfig()
  table_env = StreamTableEnvironment.create(env, table_config)

  ## Setup environment
  # Use our previously configured Anaconda environment
  table_env.add_python_archive("venv.zip")
  table_env.get_config().set_python_executable("venv.zip/venv/bin/python")

  shared_fields = {'a': DataTypes.STRING(), 'b': DataTypes.STRING(), 'c': DataTypes.STRING()}

  source_data_topic = "eddn_topic"

  table_env.connect( 
      Kafka()
      .version("0.11")
      .topic("test_sink")
      .property("bootstrap.servers", bootstrap_host)
    ) \
    .with_format(
      Json()
      .fail_on_missing_field(False)
    ) \
    .with_schema(
        Schema()
        .fields(shared_fields)
    ) \
    .create_temporary_table("stream_sink")

  source_ddl = f"""
          CREATE TABLE testSource(
              a STRING,
              b STRING,
              c STRING,
              `timestamp` TIMESTAMP(3),
              WATERMARK FOR `timestamp` AS `timestamp`
          ) with (
              'connector' = 'kafka-0.11',
              'properties.bootstrap.servers' = '{bootstrap_host}',
              'topic' = 'test_source',
              'properties.group.id' = 'testGroup',
              'format' = 'json',
              'scan.startup.mode' = 'latest-offset',
              'json.fail-on-missing-field' = 'false',
              'json.timestamp-format.standard' = 'ISO-8601',
              'json.ignore-parse-errors' = 'false'
          )
          """

  table_env.execute_sql(source_ddl)

  # Setup a 10-second Tumbling window
  table = table_env.from_path("testSource") \
            .select("a, b, c, timestamp") \
            .window(Tumble.over("10.second").on("timestamp").alias("testWindow")) \
            .group_by("testWindow, a, b, c") \
            .select("*")

Yes, I'm mixing execute_sql() and connect() to set up my tables, but that is for my own learning purposes.
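For comparison, the sink could be declared with DDL too, so that both tables go through execute_sql(). The helper below is a minimal sketch that only builds the statement text. The function name kafka_json_table_ddl and the broker address kafka:9092 are hypothetical, and the option keys are copied from the source table DDL in the question rather than checked against a running cluster.

```python
def kafka_json_table_ddl(table_name, topic, bootstrap_host, columns):
    """Build a CREATE TABLE statement for a JSON-formatted Kafka table.

    columns maps column names to SQL type names, e.g. {"a": "STRING"}.
    The option keys mirror those used for testSource in the question.
    """
    column_sql = ",\n".join(
        f"    `{name}` {sql_type}" for name, sql_type in columns.items()
    )
    return (
        f"CREATE TABLE {table_name} (\n"
        f"{column_sql}\n"
        ") WITH (\n"
        "    'connector' = 'kafka-0.11',\n"
        f"    'properties.bootstrap.servers' = '{bootstrap_host}',\n"
        f"    'topic' = '{topic}',\n"
        "    'format' = 'json'\n"
        ")"
    )


# "kafka:9092" is a placeholder address, not taken from the question.
sink_ddl = kafka_json_table_ddl(
    "stream_sink",
    "test_sink",
    "kafka:9092",
    {"a": "STRING", "b": "STRING", "c": "STRING"},
)
print(sink_ddl)
```

The resulting string would then be passed to table_env.execute_sql(sink_ddl) in place of the connect() call.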

From here, this all works fine, and messages show up in the new Kafka topic:

  table.insert_into("stream_sink") 
  table_env.execute("TestEnrichmentJob")

However, merely converting to a DataFrame and back produces no messages:

  pandasTable = table.to_pandas()
  enriched_table = table_env.from_pandas(pandasTable, [DataTypes.STRING(), DataTypes.STRING(), DataTypes.STRING()])
  enriched_table.insert_into("stream_sink") 

  table_env.execute("TestEnrichmentJob")

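Looking at the job in the Flink web UI shows that the sink task is receiving data but not sending any (the job does not fail either, it just keeps running). Kafka shows that messages are being consumed from the source topic, but none are being produced to the sink topic.

I feel like I'm missing something obvious, since I'm new to streaming data.

Am I missing something?

Once I need to perform more advanced operations, will I need to implement them as one self-contained unit, or can they be written as "normal" operations?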

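One thing you are missing: Flink sources never show any records coming in, and Flink sinks never show any records going out. In other words, the numRecordsIn, numRecordsOut, numRecordsInPerSecond, and numRecordsOutPerSecond metrics shown in the Flink web UI only measure traffic within Flink, and ignore communication with external systems such as Kafka.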
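To make that concrete, here is a toy model in plain Python (not Flink code, and not how Flink is implemented) of a three-operator chain in which only operator-to-operator traffic is counted. The function run_chain and the operator names are made up for illustration.

```python
def run_chain(external_input, operators):
    """Push records through a chain of operators and count internal traffic only.

    operators is a list of (name, transform) pairs, where transform maps one
    record to a list of output records. Reading from the external input and
    writing to the external output are deliberately not counted.
    """
    metrics = {name: {"numRecordsIn": 0, "numRecordsOut": 0} for name, _ in operators}
    batch = list(external_input)  # read from the external system: not counted
    last = len(operators) - 1
    for index, (name, transform) in enumerate(operators):
        if index > 0:
            # Records arriving from the previous operator in the chain.
            metrics[name]["numRecordsIn"] = len(batch)
        batch = [out for record in batch for out in transform(record)]
        if index < last:
            # Records handed to the next operator in the chain.
            metrics[name]["numRecordsOut"] = len(batch)
    return metrics, batch  # batch is what the last operator wrote externally


operators = [
    ("source", lambda record: [record]),
    ("window", lambda record: [record.upper()]),
    ("sink", lambda record: [record]),
]
metrics, written = run_chain(["a", "b", "c"], operators)

for name, counts in metrics.items():
    print(name, counts)
print("written externally:", written)
```

In this model the sink reports numRecordsOut = 0 even though three records were written externally, which matches the "receiving but not sending" picture described in the question.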

Edit:

I haven't tried this myself, but an upcoming tutorial has an example showing this:

  enriched_table.execute_insert("stream_sink").get_job_client().get_job_execution_result().result()
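The chained call above can be unpacked into a small helper, which makes each step easier to inspect. This is only a sketch: insert_and_wait is a hypothetical name, the method names are the ones in the line above, and the None check assumes get_job_client() may return nothing when no job was submitted.

```python
def insert_and_wait(table, sink_name):
    """Submit an insert into sink_name and block until the job finishes.

    table is expected to behave like a PyFlink Table, i.e. to offer
    execute_insert(), whose result offers get_job_client().
    """
    # Submits the job and returns immediately with a result handle.
    table_result = table.execute_insert(sink_name)

    # Assumption: there may be no job client if no job was started.
    job_client = table_result.get_job_client()
    if job_client is None:
        return None

    # Blocks until the job completes. For an unbounded Kafka source that
    # would mean until the job is cancelled or fails.
    return job_client.get_job_execution_result().result()
```

With the objects from the question, the call would be insert_and_wait(enriched_table, "stream_sink").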

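FWIW, a tutorial about this is being written. Hopefully it will be merged soon and be available in the main documentation.

That's fair. Kafka still shows messages being consumed from my source topic, so I at least know they are coming out of there. However, nothing arrives in the Kafka sink topic. As far as my question goes, the Flink metrics I mentioned are a bit of a distraction.

Are you using the blink planner?

I just switched to it using the link above. No luck, though.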