How do I write the results of a JavaPairDStream to an output Kafka topic with Spark Streaming?
I'm looking for a way to write a DStream to an output Kafka topic, but only when the micro-batch RDD actually produces something. I'm using Spark Streaming and the Spark Streaming Kafka connector in Java 8 (both latest versions), and I can't figure out how to do it.
Any help is appreciated. In my example, I want to send the events from one Kafka topic to another. I do a simple word count: I take the data from the Kafka input topic, count the words, and write the counts to the output Kafka topic. Keep in mind that the goal is to write the results of a JavaPairDStream to an output Kafka topic using Spark Streaming.
//Spark Configuration
SparkConf sparkConf = new SparkConf().setAppName("SendEventsToKafka");
String brokerUrl = "locahost:9092"
String inputTopic = "receiverTopic";
String outputTopic = "producerTopic";
//Create the java streaming context
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));
//Prepare the list of topics we listen for
Set<String> topicList = new TreeSet<>();
topicList.add(inputTopic);
//Kafka direct stream parameters
Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", brokerUrl);
kafkaParams.put("group.id", "kafka-cassandra" + new SecureRandom().nextInt(100));
kafkaParams.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
kafkaParams.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
//Kafka producer properties for the output topic (used by publishToKafka below)
Properties props = new Properties();
props.put("bootstrap.servers", brokerUrl);
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("acks", "1");
props.put("retries", "3");
props.put("linger.ms", 5);
//Here we create a direct stream for kafka input data.
final JavaInputDStream<ConsumerRecord<String, String>> messages = KafkaUtils.createDirectStream(jssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(topicList, kafkaParams));
JavaPairDStream<String, String> results = messages
        .mapToPair(new PairFunction<ConsumerRecord<String, String>, String, String>() {
            @Override
            public Tuple2<String, String> call(ConsumerRecord<String, String> record) {
                return new Tuple2<>(record.key(), record.value());
            }
        });
JavaDStream<String> lines = results.map(new Function<Tuple2<String, String>, String>() {
    @Override
    public String call(Tuple2<String, String> tuple2) {
        return tuple2._2();
    }
});
JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public Iterator<String> call(String x) {
        log.info("Line retrieved {}", x);
        return Arrays.asList(SPACE.split(x)).iterator();
    }
});
JavaPairDStream<String, Integer> wordCounts = words.mapToPair(new PairFunction<String, String, Integer>() {
    @Override
    public Tuple2<String, Integer> call(String s) {
        log.info("Word to count {}", s);
        return new Tuple2<>(s, 1);
    }
}).reduceByKey(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer i1, Integer i2) {
        log.info("Count with reduceByKey {}", i1 + i2);
        return i1 + i2;
    }
});
//Here we iterate over the JavaPairDStream to write the words and their counts to Kafka (and Cassandra)
wordCounts.foreachRDD(new VoidFunction<JavaPairRDD<String, Integer>>() {
    @Override
    public void call(JavaPairRDD<String, Integer> arg0) throws Exception {
        Map<String, Integer> wordCountMap = arg0.collectAsMap();
        List<WordOccurence> wordOccurences = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : wordCountMap.entrySet()) {
            //Here we send the event to the kafka output topic
            publishToKafka(entry.getKey(), entry.getValue(), outputTopic);
            //Collect the occurrence for the Cassandra write below
            //(assumes WordOccurence has a (word, count) constructor)
            wordOccurences.add(new WordOccurence(entry.getKey(), entry.getValue()));
        }
        //keyspace and table are assumed to be defined elsewhere in the class
        JavaRDD<WordOccurence> wordOccurenceRDD = jssc.sparkContext().parallelize(wordOccurences);
        CassandraJavaUtil.javaFunctions(wordOccurenceRDD)
                .writerBuilder(keyspace, table, CassandraJavaUtil.mapToRow(WordOccurence.class))
                .saveToCassandra();
        log.info("Word counts successfully saved to keyspace {}, table {}", keyspace, table);
    }
});
jssc.start();
jssc.awaitTermination();
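Note that publishToKafka is not shown above. A minimal sketch of what it could look like, assuming the producer Properties from earlier are kept in a field and the KafkaProducer is created once and reused (these names are illustrative, not from the original code):

//Assumed helper, not part of the original snippet: a lazily created,
//reused KafkaProducer built from the `props` defined above (kept as a field
//so it is reachable here).
private KafkaProducer<String, String> producer;

private void publishToKafka(String word, Integer count, String topic) {
    if (producer == null) {
        producer = new KafkaProducer<>(props);
    }
    //The word becomes the record key and its count the value
    producer.send(new ProducerRecord<>(topic, word, String.valueOf(count)));
}

Reusing a single producer is much cheaper than constructing a new one per record, since each KafkaProducer maintains its own connections and buffers.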
Hope it helps!

If the DStream contains data that you want to send to Kafka:
dStream.foreachRDD(rdd -> {
    rdd.foreachPartition(iter -> {
        Producer producer = createKafkaProducer();
        while (iter.hasNext()) {
            sendToKafka(producer, iter.next());
        }
        //Close the producer once the partition has been written
        producer.close();
    });
});
This way you create one producer per RDD partition. (From the comments: "What have you tried so far?" "Online there are only Scala snippets, and I couldn't find this one in the official documentation :(")
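Since only Scala snippets seem to be available online, here is a fuller, self-contained Java version of the same per-partition pattern with the helpers filled in. The broker address, topic name and String serializers are assumptions for illustration:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.spark.streaming.api.java.JavaDStream;

public class KafkaSink {

    //Build a producer with the minimum required settings (values are illustrative)
    private static Producer<String, String> createKafkaProducer() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        return new KafkaProducer<>(props);
    }

    public static void writeToKafka(JavaDStream<String> dStream, String topic) {
        dStream.foreachRDD(rdd -> {
            rdd.foreachPartition(iter -> {
                //One producer per partition: KafkaProducer is not serializable,
                //so it must be created on the executor, inside foreachPartition
                Producer<String, String> producer = createKafkaProducer();
                while (iter.hasNext()) {
                    producer.send(new ProducerRecord<>(topic, iter.next()));
                }
                producer.close();
            });
        });
    }
}

Creating the producer inside foreachPartition (rather than outside the closure) is what makes this work on a cluster: the producer holds network connections that cannot be serialized and shipped from the driver to the executors.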