Apache Spark Structured Streaming throws an error when sending aggregated results to a Kafka topic


Hi all big data experts,

I am running into a problem when sending aggregated results to a Kafka topic. The pipeline works fine for transformations without aggregation. Can someone help me with this? The aggregated results are important for triggering downstream events and further logic. Below is a simulation of the problem; all code shown has been tested and is running.

Spark version: 2.4.4

Kafka connector: org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4

Data source, transformation (without groupBy aggregation) and output all work fine (a reconstructed sketch of this non-aggregated variant follows the consumer output below):

/usr/lib/kafka/bin/kafka-console-consumer.sh     --bootstrap-server ${CLUSTER_NAME}-w-1:9092 --topic test_output_non_aggregate

{"name":"2844","txn_datetime":"2020-01-29T15:16:36.000+08:00"}
{"name":"2845","txn_datetime":"2020-01-29T15:16:37.000+08:00"}
Group By aggregation: I tried both with and without a watermark; neither works.

from pyspark.sql.functions import col, count, from_json, struct, to_json
from pyspark.sql.types import LongType, StringType, StructField, StructType, TimestampType

table = 'test_input'
schema = StructType([StructField("dt", LongType(), True),
                     StructField("name", StringType(), True)])

# Parse the JSON value from Kafka, count records per (name, txn_datetime),
# and write the JSON-encoded result back to another Kafka topic.
wallet_txn_log = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "xxx.xxx.xxx.xxx:9092,xxx.xxx.xxx.xxx:9092,xxx.xxx.xxx.xxx:9092") \
    .option("subscribe", table) \
    .load() \
    .selectExpr("CAST(value AS STRING) AS string") \
    .select(from_json("string", schema).alias("x")) \
    .select("x.*") \
    .select("name", col("dt").cast(TimestampType()).alias("txn_datetime")) \
    .withWatermark("txn_datetime", "5 seconds") \
    .groupBy("name", "txn_datetime") \
    .agg(count("name").alias("is_txn_count")) \
    .select(to_json(struct("name", "is_txn_count")).alias("value")) \
    .writeStream \
    .format("kafka") \
    .outputMode("update") \
    .option("kafka.bootstrap.servers", "xxx.xxx.xxx.xxx:9092,xxx.xxx.xxx.xxx:9092,xxx.xxx.xxx.xxx:9092") \
    .option("topic", "test_aggregated_output") \
    .option("checkpointLocation", "gs://gcp-datawarehouse/streaming/checkpoints/streaming_test1-aggregated_{}".format(table)) \
    .start()
Error

[Stage 1:>                                                        (0 + 3) / 200]20/01/29 16:20:57 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 2, cep-m.asia-southeast1-c.c.tngd-poc.internal, executor 1): org.apache.spark.util.TaskCompletionListenerException: null
        at org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:138)
        at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:116)
        at org.apache.spark.scheduler.Task.run(Task.scala:139)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
YARN logs - application ID xxx

Query validation

The Group By aggregation query itself is correct: it works with both the console and the memory sink (a memory-sink sketch follows the console output below). With the Kafka sink, however, it keeps throwing the error.

wallet_txn_log = spark \
...     .readStream \
...     .format("kafka") \
...     .option("kafka.bootstrap.servers", "10.148.15.235:9092,10.148.15.236:9092,10.148.15.233:9092") \
...     .option("subscribe", table) \
...     .load() \
...     .selectExpr("CAST(value AS STRING) as string").select( from_json("string", schema= StructType([StructField("dt",LongType(),True),StructField("name",StringType(),True)])  ).alias("x")).select('x.*')\
...     .select(['name',col('dt').cast(TimestampType()).alias("txn_datetime")]) \
...     .withWatermark("txn_datetime", "5 seconds") \
...     .groupBy('name','txn_datetime').agg( 
...      count("name").alias("is_txn_count")) \
...     .select([to_json(struct('name','is_txn_count')).alias("value")]) 
>>> 
>>> df=wallet_txn_log.writeStream \
...     .outputMode("update") \
...     .option("truncate", False) \
...     .format("console") \
...     .start()
-------------------------------------------                                     
Batch: 0
-------------------------------------------
+-----+
|value|
+-----+
+-----+

-------------------------------------------                                     
Batch: 1
-------------------------------------------
+--------------------------------+
|value                           |
+--------------------------------+
|{"name":"4296","is_txn_count":1}|
|{"name":"4300","is_txn_count":1}|
|{"name":"4297","is_txn_count":1}|
|{"name":"4303","is_txn_count":1}|
|{"name":"4299","is_txn_count":1}|
|{"name":"4305","is_txn_count":1}|
|{"name":"4298","is_txn_count":1}|
|{"name":"4304","is_txn_count":1}|
|{"name":"4307","is_txn_count":1}|
|{"name":"4302","is_txn_count":1}|
|{"name":"4301","is_txn_count":1}|
|{"name":"4306","is_txn_count":1}|
|{"name":"4310","is_txn_count":1}|
|{"name":"4309","is_txn_count":1}|
|{"name":"4308","is_txn_count":1}|
+--------------------------------+
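
For the memory-sink check mentioned above, a minimal sketch reusing the same wallet_txn_log DataFrame; the query name agg_debug is illustrative, not taken from the question.

# Write the aggregation to an in-memory table instead of Kafka for debugging.
mem_query = wallet_txn_log.writeStream \
    .outputMode("update") \
    .format("memory") \
    .queryName("agg_debug") \
    .start()

# Once a micro-batch has completed, the results can be inspected with SQL.
spark.sql("SELECT * FROM agg_debug").show(truncate=False)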

The Group By aggregation code is correct; the Kafka connector shipped with Spark 2.4.4 appears to have a minor bug in this case. After downgrading Spark from 2.4.4 to 2.4.3, the error above disappeared.
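
A hypothetical way to keep the Kafka connector in sync after installing Spark 2.4.3; the app name and the choice of setting the package when building the session are assumptions, not part of the original answer.

from pyspark.sql import SparkSession

# Assumed setup after downgrading to Spark 2.4.3: pin the Kafka connector
# to the same version when creating the session.
spark = SparkSession.builder \
    .appName("streaming_test1") \
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3") \
    .getOrCreate()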

Check
groupBy('name','txn_datetime').agg(count("name").alias("is_txn_count"))
Here the grouping columns include the column being aggregated, i.e.
name
appears in both groupBy and agg. Is that correct? I don't have a Spark setup at hand, so I can't verify whether this causes the described problem. Hi @xenodevil, the query is correct; I have updated the question above with the debug output. @xenodevil, even
groupBy("column").count()
does not work with the Kafka sink. All three available output modes were tested as well; it throws the same error as shown above. A minimal sketch of this simplest case follows.
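
A minimal sketch of that simplest failing case, a plain groupBy().count() written to a Kafka sink; the output topic, checkpoint path and broker address are placeholders, not taken from the question.

from pyspark.sql.functions import struct, to_json

# Simplest aggregation-to-Kafka case from the comment above: group on the raw
# value and write JSON-encoded counts to a Kafka topic.
minimal = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "xxx.xxx.xxx.xxx:9092") \
    .option("subscribe", "test_input") \
    .load() \
    .selectExpr("CAST(value AS STRING) AS name") \
    .groupBy("name") \
    .count() \
    .select(to_json(struct("name", "count")).alias("value")) \
    .writeStream \
    .format("kafka") \
    .outputMode("update") \
    .option("kafka.bootstrap.servers", "xxx.xxx.xxx.xxx:9092") \
    .option("topic", "test_minimal_output") \
    .option("checkpointLocation", "/tmp/checkpoints/test_minimal_output") \
    .start()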