Apache Spark structured streaming aggregation returning wrong values
I have written a structured streaming aggregation that takes events from a Kafka source, performs a simple count, and writes them back to a Cassandra database. The code looks like this:
val data = stream
  .groupBy(functions.to_date($"timestamp").as("date"), $"type".as("type"))
  .agg(functions.count("*").as("value"))

val query: StreamingQuery = data
  .writeStream
  .queryName("group by type")
  .format("org.apache.spark.sql.streaming.cassandra.CassandraSinkProvider")
  .outputMode(OutputMode.Complete())
  .option("checkpointLocation", config.getString("checkpointLocation") + "/" + "group by type")
  .option("keyspace", "analytics")
  .option("table", "summary")
  .option("partitionKeyColumns", "project, type")
  .option("clusteringKeyColumns", "date")
  .start()
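To make the intended semantics concrete, the aggregation above is just a group-by count per (date, type) pair; a minimal plain-Scala sketch (no Spark, with a hypothetical `Event` type and made-up sample values) of what each day's counts should look like:

```scala
// Plain-Scala sketch of the aggregation's semantics: count events per
// (date, type). Event and the sample values are hypothetical.
case class Event(date: String, eventType: String)

val events = Seq(
  Event("2018-05-01", "click"),
  Event("2018-05-01", "click"),
  Event("2018-05-01", "view")
)

val counts: Map[(String, String), Int] =
  events.groupBy(e => (e.date, e.eventType)).map { case (k, vs) => k -> vs.size }
// counts(("2018-05-01", "click")) == 2
```

Within one day, adding more events can only keep a count the same or increase it, which is why the dropping values described below indicate a bug.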
The problem is that with every batch the values just get overwritten, so I see counts drop in Cassandra. A count should never drop within a day. How can I achieve that?
Edit:
I have also tried a windowed aggregation, with the same result, so the bug in this case is actually not in my query or in Spark. To find out where the problem was, I used the console sink, and it did not show the problem. The problem was in my Cassandra sink, which looked like this:
class CassandraSink(sqlContext: SQLContext, keyspace: String, table: String) extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    data.write.mode(SaveMode.Append).cassandraFormat(table, keyspace).save()
  }
}
It writes the DataFrames using the DataStax Spark Cassandra Connector.
The problem is that the variable data contains a streaming Dataset. As in the ConsoleSink that Spark itself provides, the Dataset has to be copied into a static Dataset before writing. So I changed the sink accordingly, and now it works. The finished version looks like this:
class CassandraSink(sqlContext: SQLContext, keyspace: String, table: String) extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // Copy the streaming Dataset into a static DataFrame before writing,
    // as Spark's built-in ConsoleSink does.
    val ds = data.sparkSession.createDataFrame(
      data.sparkSession.sparkContext.parallelize(data.collect()),
      data.schema
    )
    ds.write.mode(SaveMode.Append).cassandraFormat(table, keyspace).save()
  }
}
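The underlying issue, writing a lazily re-evaluated dataset instead of a materialized snapshot of it, can be illustrated with plain Scala collections, a sketch without Spark: a view is recomputed every time it is consumed, while a copy forced at one point in time (the role `collect()` plays in the sink above) stays fixed.

```scala
// Analogy in plain Scala: a lazy view vs. a materialized snapshot.
import scala.collection.mutable.ArrayBuffer

val source = ArrayBuffer(1, 2, 3)
val lazyDoubled = source.view.map(_ * 2) // evaluated on demand, like the streaming plan
val snapshot = lazyDoubled.toVector      // materialized copy, like the static Dataset

source += 4 // the underlying source keeps changing

// Forcing the view again re-evaluates it against the changed source,
// while the snapshot keeps the values from the moment it was copied.
val reEvaluated = lazyDoubled.toVector // Vector(2, 4, 6, 8)
val frozen = snapshot                  // Vector(2, 4, 6)
```

This is only an analogy for why copying to a static Dataset fixes the sink; since Spark 2.4 the same effect can be had without a custom Sink at all via writeStream.foreachBatch, which hands each micro-batch to the caller as an ordinary batch DataFrame.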