Apache Spark structured streaming aggregation returns wrong values

apache-spark, cassandra, spark-structured-streaming

I have written a structured streaming aggregation that takes events from a Kafka source, does a simple count and writes the results back into a Cassandra database. The code looks like this:

import org.apache.spark.sql.functions
import org.apache.spark.sql.streaming.{OutputMode, StreamingQuery}
import spark.implicits._ // for the $"column" syntax

val data = stream
  .groupBy(functions.to_date($"timestamp").as("date"), $"type".as("type"))
  .agg(functions.count("*").as("value"))

val query: StreamingQuery = data
  .writeStream
  .queryName("group-by-type")
  .format("org.apache.spark.sql.streaming.cassandra.CassandraSinkProvider")
  .outputMode(OutputMode.Complete())
  .option("checkpointLocation", config.getString("checkpointLocation") + "/" + "group-by-type")
  .option("keyspace", "analytics")
  .option("table", "summary")
  .option("partitionKeyColumns", "project, type")
  .option("clusteringKeyColumns", "date")
  .start()
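
The stream value used above comes from a Kafka source that is not shown in the question. A minimal sketch of how such a source could be set up, assuming the events arrive as JSON with timestamp, project and type fields (broker address, topic name and schema are placeholders, not from the original code):

import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

// Hypothetical event schema; the real one is not part of the question.
val eventSchema = new StructType()
  .add("timestamp", TimestampType)
  .add("project", StringType)
  .add("type", StringType)

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()
  // Kafka delivers the payload as bytes; parse the JSON value into typed columns.
  .select(functions.from_json(functions.col("value").cast("string"), eventSchema).as("event"))
  .select("event.*")
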
The problem is that each batch just overwrites the counts, so I see the values in Cassandra go down. The counts should never decrease within a day. How can I achieve that?

Edit:
I also tried it with a window aggregation and got the same result.
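
A windowed version of the aggregation would look roughly like the following sketch (the one-day window duration is an assumption, not taken from the original code):

// Group by a 1-day event-time window instead of to_date() on the timestamp.
val windowed = stream
  .groupBy(functions.window($"timestamp", "1 day").as("window"), $"type".as("type"))
  .agg(functions.count("*").as("value"))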

It turned out that the error in this case was actually not in my query or in Spark. To find out where the problem was, I used the console sink, and there the problem did not show up.
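
For reference, checking with the console sink amounts to pointing the same aggregation at Spark's built-in console format (a sketch; only the sink-related options differ from the query above):

// The same aggregation written to the built-in console sink for debugging.
val debugQuery = data
  .writeStream
  .queryName("group-by-type-console")
  .format("console")
  .outputMode(OutputMode.Complete())
  .start()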

The problem was in my Cassandra sink, which looked like this:

import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.cassandra._ // provides cassandraFormat on DataFrameWriter
import org.apache.spark.sql.execution.streaming.Sink

class CassandraSink(sqlContext: SQLContext, keyspace: String, table: String) extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // Writes the streaming DataFrame handed to addBatch straight to Cassandra.
    data.write.mode(SaveMode.Append).cassandraFormat(table, keyspace).save()
  }
}
It writes the DataFrame using the DataStax Spark Cassandra Connector.

The problem is that the data variable passed to addBatch contains a streaming Dataset. In the ConsoleSink that ships with Spark, the Dataset is copied into a static Dataset before it is written, so I changed my sink to do the same, and now it works. The finished version looks like this:

class CassandraSink(sqlContext: SQLContext, keyspace: String, table: String) extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // Copy the streaming micro-batch into a static DataFrame first, the same way Spark's ConsoleSink does.
    val ds = data.sparkSession.createDataFrame(
      data.sparkSession.sparkContext.parallelize(data.collect()),
      data.schema
    )
    ds.write.mode(SaveMode.Append).cassandraFormat(table, keyspace).save()
  }
}
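
The streaming query in the question refers to org.apache.spark.sql.streaming.cassandra.CassandraSinkProvider, whose code is not shown. A minimal sketch of such a provider, assuming the CassandraSink above and Spark 2.x's StreamSinkProvider API, could look like this:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.StreamSinkProvider
import org.apache.spark.sql.streaming.OutputMode

// Wires the sink into writeStream.format(...), reading keyspace and table from the query options.
class CassandraSinkProvider extends StreamSinkProvider {
  override def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink = {
    new CassandraSink(sqlContext, parameters("keyspace"), parameters("table"))
  }
}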
