
Apache Spark: how to write the same stream with two different DataFrames to the console sink?


I'm new to Spark Structured Streaming and I'm running into a problem in a simple scenario:

I'm trying to write a single stream out as two different DataFrames:

from pyspark.sql import functions as f

# Read the source stream from Kafka.
df = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "localhost:9092") \
        .option("subscribe", "topic1") \
        .option("failOnDataLoss", False) \
        .option("startingOffsets", "earliest") \
        .load()

# Split the stream on the status column.
data1 = df.filter(f.col('status') == 'true')
data2 = df.filter(f.col('status') == 'false')

# For the 'false' branch, collect a per-id set of the row values.
data2 = data2.select(df.id, f.struct(df.col1, df.col2, df.col3).alias('value'))
data2 = data2.groupBy("id").agg(f.collect_set('value').alias('history'))

# Write both branches to the console, triggering every 15 seconds.
data1 = data1.writeStream.format("console").option("truncate", False).trigger(processingTime='15 seconds').start()
data2 = data2.writeStream.format("console").option("truncate", False).trigger(processingTime='15 seconds').start()

spark.streams.awaitAnyTermination()
I get the same error every time:

Traceback (most recent call last):
  File "/home/adarshbajpai/Downloads/spark-2.4.4-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/home/adarshbajpai/Downloads/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o186.start.
: org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;;
Aggregate [customerid#93L], [customerid#93L, collect_set(hist_value#278, 0, 0) AS histleadstatus#284]
+- Project [customerid#93L, named_struct(islaststatus, islaststatus#46, statusid, statusid#43, status, statusname#187, createdOn, statusCreatedDate#59, updatedOn, statusUpdatedDate#60) AS hist_value#278]
   +- Filter (islaststatus#46 = 0)
I don't think I should have to use a watermark, because my stream has no late or delayed data.


Please advise! Thanks in advance.
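For reference, the AnalysisException comes from the data2 query: groupBy plus collect_set is a streaming aggregation, and the console sink's default append output mode only supports streaming aggregations that are bounded by a watermark. Since this aggregation groups by id with no event-time window, a minimal sketch of one workaround (not from the original thread; it reuses the variable names of the simplified snippet above) is to run the aggregated query in update mode, which re-emits changed groups on each trigger:

# Minimal sketch: keep the plain filter query in the default append mode,
# but run the aggregated query in "update" mode so no watermark is needed.
# Caveat: the aggregation state still grows without bound, since nothing
# ever expires a group.
q1 = data1.writeStream \
        .format("console") \
        .option("truncate", False) \
        .outputMode("append") \
        .trigger(processingTime='15 seconds') \
        .start()

q2 = data2.writeStream \
        .format("console") \
        .option("truncate", False) \
        .outputMode("update") \
        .trigger(processingTime='15 seconds') \
        .start()

spark.streams.awaitAnyTermination()

outputMode("complete") would also run without a watermark, at the cost of re-printing the entire aggregate table on every trigger.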

Comments:

"This error doesn't suggest any problem with printing two things. Have you tried commenting one of them out?"

"If I comment out the data2 query, then yes, it works. But if I comment out the data1 query, it still gives me the same error."

"Is this really your entire code? Why does the output think you are aggregating with collect_set?"

"Sorry for the late reply, @cricket_007. Yes, I am doing some transformations; initially I only posted a snippet of my code. I have now updated my question with the transformation logic."
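Since the comments establish that only the aggregated data2 query triggers the exception, here is one more hedged sketch for the case where append mode is actually required. It assumes the rows carry an event-time column, called eventTime here purely for illustration (it does not appear in the question's schema): declaring a watermark and grouping by a time window in addition to the key makes the aggregation legal in append mode.

# Hedged sketch: "eventTime" is a hypothetical event-time column, not part
# of the original snippet. The watermark bounds the aggregation state and
# lets append mode emit each (window, id) group once the watermark passes
# the end of that window.
data2 = df.filter(f.col('status') == 'false') \
        .select(df.id, df.eventTime, f.struct(df.col1, df.col2, df.col3).alias('value')) \
        .withWatermark("eventTime", "10 minutes") \
        .groupBy(f.window("eventTime", "15 seconds"), "id") \
        .agg(f.collect_set('value').alias('history'))

The trade-off is latency: a window's result is only emitted after the watermark passes its end, so this fits streams where per-window histories are acceptable rather than one ever-growing history per id.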