Spark Streaming: reading from Kafka and applying Spark SQL aggregations in Java
I have a Spark job that reads data from a database and applies Spark SQL aggregations. Now I want to create another job that reads messages from Kafka through Spark Streaming and then applies the same aggregations with Spark SQL. My code is as follows:
Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", "192.168.99.100:9092");
kafkaParams.put("key.deserializer", StringDeserializer.class);
kafkaParams.put("value.deserializer", KafkaStatisticsPayloadDeserializer.class);
kafkaParams.put("group.id", "Group1");
kafkaParams.put("auto.offset.reset", "earliest");
kafkaParams.put("enable.auto.commit", false);
Collection<String> topics = Arrays.asList(topic);
SparkConf conf = new SparkConf().setAppName(topic).setMaster("local");
/*
* Spark streaming context
*/
JavaStreamingContext streamingContext = new JavaStreamingContext(conf, Durations.seconds(2));
/*
 * Create an input DStream that receives records from Kafka
 */
JavaInputDStream<ConsumerRecord<String, StatisticsRecord>> stream =
    KafkaUtils.createDirectStream(
        streamingContext,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, StatisticsRecord>Subscribe(topics, kafkaParams)
    );
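As a side note on the consumer settings at the top of this snippet: Kafka also accepts deserializers as fully qualified class names given as strings, so the map can be built in a small helper that does not depend on the Kafka classes at compile time. The following is only a sketch; the helper name `kafkaParams` and the package `com.example` are assumptions made for illustration and do not come from the original code.

```java
import java.util.HashMap;
import java.util.Map;

public class KafkaParamsSketch {

    /**
     * Builds consumer settings with the same keys as the snippet above.
     * Deserializers are passed as class-name strings instead of Class objects.
     */
    public static Map<String, Object> kafkaParams(String bootstrapServers,
                                                  String groupId,
                                                  String valueDeserializerClassName) {
        Map<String, Object> params = new HashMap<>();
        params.put("bootstrap.servers", bootstrapServers);
        params.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        params.put("value.deserializer", valueDeserializerClassName);
        params.put("group.id", groupId);
        // Start from the oldest available offset when the group has no committed offset
        params.put("auto.offset.reset", "earliest");
        // Offsets are not committed automatically by the consumer
        params.put("enable.auto.commit", false);
        return params;
    }

    public static void main(String[] args) {
        // "com.example" is a hypothetical package for the custom deserializer
        Map<String, Object> params = kafkaParams(
                "192.168.99.100:9092",
                "Group1",
                "com.example.KafkaStatisticsPayloadDeserializer");
        params.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

The returned map can be passed to `ConsumerStrategies.Subscribe` in the same way as the hand-built one.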
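So far I can read and deserialize the messages successfully. My question is how to actually apply the Spark SQL aggregations to them. I tried the following, but it does not work. I think I first need to isolate the "value" field, which contains the actual message: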
SQLContext sqlContext = new SQLContext(streamingContext.sparkContext());
stream.foreachRDD(rdd -> {
Dataset<Row> df = sqlContext.createDataFrame(rdd.rdd(), StatisticsRecord.class);
df.createOrReplaceTempView("data");
df.cache();
Dataset aggregators = sqlContext.sql(SQLContextAggregations.ORDER_TYPE_DB);
aggregators.show();
});
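One likely reason the attempt above does not work is that every element of the RDD is still a `ConsumerRecord` wrapper (key, value, offset and so on), not a `StatisticsRecord` bean, so the bean class passed to `createDataFrame` does not describe the elements. The sketch below shows the unwrapping step with plain Java collections standing in for the stream; `Envelope` is a hypothetical stand-in for `ConsumerRecord`, not a Kafka class. On the real stream the same step would be `stream.map(ConsumerRecord::value)`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ValueExtractionSketch {

    /** Hypothetical stand-in for Kafka's ConsumerRecord: a key/value wrapper. */
    public static final class Envelope<K, V> {
        private final K key;
        private final V value;

        public Envelope(K key, V value) {
            this.key = key;
            this.value = value;
        }

        public K key() { return key; }

        public V value() { return value; }
    }

    /** Drops the wrapper and keeps only the payload of each record. */
    public static <K, V> List<V> values(List<Envelope<K, V>> records) {
        return records.stream()
                .map(Envelope::value)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Envelope<String, String>> batch = Arrays.asList(
                new Envelope<String, String>("k1", "payload-1"),
                new Envelope<String, String>("k2", "payload-2"));
        System.out.println(values(batch)); // prints [payload-1, payload-2]
    }
}
```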
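You should call the context inside the function that you apply to the stream. I solved the problem with the following code. Note that I now store the messages in JSON format rather than as actual objects: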
SparkConf conf = new SparkConf().setAppName(topic).setMaster("local");
JavaStreamingContext streamingContext = new JavaStreamingContext(conf, Durations.seconds(2));
SparkSession spark = SparkSession.builder().appName(topic).getOrCreate();
/*
* Kafka conf
*/
Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", dbUri);
kafkaParams.put("key.deserializer", StringDeserializer.class);
kafkaParams.put("value.deserializer", StringDeserializer.class);
kafkaParams.put("group.id", "Group4");
kafkaParams.put("auto.offset.reset", "earliest");
kafkaParams.put("enable.auto.commit", false);
Collection<String> topics = Arrays.asList("Statistics");
/*
 * Create an input DStream that receives records from Kafka
 */
JavaInputDStream<ConsumerRecord<String, String>> stream =
    KafkaUtils.createDirectStream(
        streamingContext,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams)
    );
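/*
 * Keep only the actual message in JSON format
 */
JavaDStream<String> recordStream = stream.map(ConsumerRecord::value);
/*
 * Extract RDDs from stream and apply aggregation in each one
 */
recordStream.foreachRDD(rdd -> {
    // Skip empty micro-batches: there is nothing to infer a schema from
    if (!rdd.isEmpty()) {
        Dataset<Row> df = spark.read().json(rdd.rdd());
        df.createOrReplaceTempView("data");
        df.cache();
        Dataset<Row> aggregators = spark.sql(SQLContextAggregations.ORDER_TYPE_DB);
        aggregators.show();
        // Release the cached batch once the aggregation has run
        df.unpersist();
    }
});
// Nothing is consumed until the streaming context is started
streamingContext.start();
streamingContext.awaitTermination();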
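To make the per-batch pattern concrete without a cluster, here is a plain-Java sketch of the same control flow: iterate over micro-batches, skip the empty ones, and aggregate each remaining batch on its own. It is only an illustration. The JSON field `orderType` and the count-per-type aggregation are assumptions, since the actual query behind `SQLContextAggregations.ORDER_TYPE_DB` is not shown, and a regular expression stands in for real JSON parsing.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MicroBatchSketch {

    // Assumption: each message is a flat JSON object with a string field "orderType"
    private static final Pattern ORDER_TYPE =
            Pattern.compile("\"orderType\"\\s*:\\s*\"([^\"]*)\"");

    /** Counts messages per order type in one batch; an empty batch yields an empty map. */
    public static Map<String, Long> countByOrderType(List<String> batch) {
        Map<String, Long> counts = new TreeMap<>();
        for (String json : batch) {
            Matcher matcher = ORDER_TYPE.matcher(json);
            if (matcher.find()) {
                counts.merge(matcher.group(1), 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Three simulated micro-batches; the second one is empty
        List<List<String>> batches = Arrays.asList(
                Arrays.asList("{\"orderType\":\"BUY\"}",
                              "{\"orderType\":\"SELL\"}",
                              "{\"orderType\":\"BUY\"}"),
                Collections.<String>emptyList(),
                Arrays.asList("{\"orderType\":\"SELL\"}"));

        for (List<String> batch : batches) {
            if (batch.isEmpty()) {
                continue; // same role as the empty-RDD guard in the streaming job
            }
            System.out.println(countByOrderType(batch));
        }
        // prints {BUY=2, SELL=1} and then {SELL=1}
    }
}
```

As in the streaming job, each batch is aggregated independently, so the results are per-interval counts and not running totals.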