Apache Spark: splitting a dataset based on a column value
I have a dataset that is the result of a Kafka readStream, as shown in the Java code snippet below:
m_oKafkaEvents = getSparkSession().readStream().format("kafka")
.option("kafka.bootstrap.servers", strKafkaAddress)
.option("subscribe", getInsightEvent().getTopic())
.option("maxOffsetsPerTrigger", "100000")
.option("startingOffsets", "latest")
.option("failOnDataLoss", false)
.load()
.select(functions.from_json(functions.col("value").cast("string"), oSchema).as("events"))
.select("events.*");
m_oKafkaEvents
{
{"EventTime":"1527005246864000000","InstanceID":"231","Model":"Opportunity_1","Milestone":"OrderProcessed"},
{"EventTime":"1527005246864000002","InstanceID":"232","Model":"Opportunity_2","Milestone":"OrderProcessed"},
{"EventTime":"1527005246864000001","InstanceID":"233","Model":"Opportunity_1","Milestone":"OrderProcessed"},
{"EventTime":"1527005246864000002","InstanceID":"234","Model":"Opportunity_2","Milestone":"OrderProcessed"}
}
I need to split this dataset based on the "Model" column, which should result in two datasets as shown below:
m_oKafkaEvents_for_Opportunity_1_topic
{
{"EventTime":"1527005246864000000","InstanceID":"231","Model":"Opportunity_1","Milestone":"OrderProcessed"},
{"EventTime":"1527005246864000001","InstanceID":"233","Model":"Opportunity_1","Milestone":"OrderProcessed"}
}
m_oKafkaEvents_for_Opportunity_2_topic
{
{"EventTime":"1527005246864000002","InstanceID":"232","Model":"Opportunity_2","Milestone":"OrderProcessed"},
{"EventTime":"1527005246864000002","InstanceID":"234","Model":"Opportunity_2","Milestone":"OrderProcessed"}
}
These datasets will then be published to a Kafka sink, with the Model value as the topic name, i.e. Opportunity_1 and Opportunity_2.
So I need a handle on each "Model" value and its corresponding list of events.
Since I am new to Spark, I am looking for help on how to achieve this through Java code. Thanks for any help.

The simplest solution looks like this:
allEvents.selectExpr(
        "CONCAT('m_oKafkaEvents_for_', Model, '_topic') AS topic",  // per-row topic column
        "to_json(struct(*)) AS value")                              // Kafka sink expects a "value" column
    .write()
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    .save();
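Note that the question's m_oKafkaEvents comes from readStream, so a batch write() like the one above fails on a streaming Dataset. A minimal streaming sketch of the same idea, assuming the column names from the question, strKafkaAddress as the broker list, and a hypothetical checkpoint path, could look like this:

// Streaming variant (sketch): when no "topic" option is set, the Kafka sink
// takes the target topic from the per-row "topic" column and the message body
// from the "value" column.
m_oKafkaEvents
    .selectExpr(
        "CONCAT('m_oKafkaEvents_for_', Model, '_topic') AS topic",
        "to_json(struct(*)) AS value")                            // re-serialize each event as JSON
    .writeStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", strKafkaAddress)
    .option("checkpointLocation", "/tmp/kafka-split-checkpoint")  // hypothetical path
    .start();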
You can see an example of this here. However, after looking at Spark's code, it seems that only one topic per write is supported, i.e. it will pick the topic from the first row it encounters:
def write(
    sparkSession: SparkSession,
    queryExecution: QueryExecution,
    kafkaParameters: ju.Map[String, Object],
    topic: Option[String] = None): Unit = {
  val schema = queryExecution.analyzed.output
  validateQuery(schema, kafkaParameters, topic)
  queryExecution.toRdd.foreachPartition { iter =>
    val writeTask = new KafkaWriteTask(kafkaParameters, schema, topic)
    Utils.tryWithSafeFinally(block = writeTask.execute(iter))(
      finallyBlock = writeTask.close())
  }
}
You can try this approach and report back here whether it works as described above. If it does not, you can go with an alternative solution, such as:
SparkSession spark = SparkSession
.builder()
.appName("JavaStructuredNetworkWordCount")
.getOrCreate();
Dataset<Row> allEvents = spark.readStream().format("kafka")
.option("kafka.bootstrap.servers", "")
.option("subscribe", "event")
.option("maxOffsetsPerTrigger", "100000")
.option("startingOffsets", "latest")
.option("failOnDataLoss", false)
.load()
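// NOTE: the second argument of from_json below should be your JSON schema
// (oSchema in the question), not null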
.select(functions.from_json(functions.col("value").cast("string"), null).as("events"))
.select("events.*");
// First solution
Dataset<Row> opportunity1Events = allEvents.filter("Model = 'Opportunity_1'");
opportunity1Events.write().format("kafka").option("kafka.bootstrap.servers", "")
.option("topic", "m_oKafkaEvents_for_Opportunity_1_topic").save();
Dataset<Row> opportunity2Events = allEvents.filter("Model = 'Opportunity_2'");
opportunity2Events.write().format("kafka").option("kafka.bootstrap.servers", "")
.option("topic", "m_oKafkaEvents_for_Opportunity_2_topic").save();
// Note: Kafka writer was added in 2.2.0 https://github.com/apache/spark/commit/b0a5cd89097c563e9949d8cfcf84d18b03b8d24c
// Another approach with iteration throughout messages accumulated within each partition
allEvents.foreachPartition(new ForeachPartitionFunction<Row>() {
private KafkaProducer<String, Row> localProducer = new KafkaProducer<>(new HashMap<>());
private final Map<String, String> modelsToTopics = new HashMap<>();
{
modelsToTopics.put("Opportunity_1", "m_oKafkaEvents_for_Opportunity_1_topic");
modelsToTopics.put("Opportunity_2", "m_oKafkaEvents_for_Opportunity_2_topic");
}
@Override
public void call(Iterator<Row> rows) throws Exception {
// Route each row to the topic matching its Model value
// (Opportunity_1 or Opportunity_2)
while (rows.hasNext()) {
Row currentRow = rows.next();
// you can reformat your row here or directly in Spark's map transformation
localProducer.send(new ProducerRecord<>(modelsToTopics.get(currentRow.getAs("Model")),
"some_message_key", currentRow));
}
// KafkaProducer accumulates messages in an in-memory buffer and sends them when a threshold is reached
// Flush them synchronously here to be sure that every stored message was correctly
// delivered
// You can also play with features added in Kafka 0.11: the idempotent producer and the transactional producer
localProducer.flush();
}
});
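One gap in the foreachPartition sketch above is that a KafkaProducer built from an empty map will not start; it needs at least the bootstrap servers and serializers. A minimal configuration sketch follows; the broker address is a placeholder, and it assumes each Row is first converted to a JSON String (for example with to_json(struct(*)) in a preceding select), so the standard StringSerializer can be used:

// Minimal producer configuration sketch (placeholder broker address).
Properties producerProps = new Properties();
producerProps.put("bootstrap.servers", "host1:9092");
producerProps.put("key.serializer",
    "org.apache.kafka.common.serialization.StringSerializer");
producerProps.put("value.serializer",
    "org.apache.kafka.common.serialization.StringSerializer");
// Rows have no Kafka serializer, so send them as JSON Strings instead of Row objects.
KafkaProducer<String, String> localProducer = new KafkaProducer<>(producerProps);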