
Apache Spark: splitting a Dataset based on a column value


I have a Dataset which is the result of a Kafka readStream, as shown in the Java code snippet below:

m_oKafkaEvents = getSparkSession().readStream().format("kafka")  
  .option("kafka.bootstrap.servers", strKafkaAddress)  
  .option("subscribe", getInsightEvent().getTopic())  
  .option("maxOffsetsPerTrigger", "100000")  
  .option("startingOffsets", "latest")  
  .option("failOnDataLoss", false)  
  .load()  
  .select(functions.from_json(functions.col("value").cast("string"), oSchema).as("events"))  
  .select("events.*");  

m_oKafkaEvents  
{  
    {"EventTime":"1527005246864000000","InstanceID":"231","Model":"Opportunity_1","Milestone":"OrderProcessed"},  
    {"EventTime":"1527005246864000002","InstanceID":"232","Model":"Opportunity_2","Milestone":"OrderProcessed"},  
    {"EventTime":"1527005246864000001","InstanceID":"233","Model":"Opportunity_1","Milestone":"OrderProcessed"},  
    {"EventTime":"1527005246864000002","InstanceID":"234","Model":"Opportunity_2","Milestone":"OrderProcessed"}  
}  
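
For context, the oSchema passed to from_json above simply describes the four event fields; a minimal sketch of what it could look like (all fields kept as strings, matching the sample events):

import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Sketch of the event schema used in from_json; every field appears as a string in the sample events
StructType oSchema = new StructType()
    .add("EventTime", DataTypes.StringType)
    .add("InstanceID", DataTypes.StringType)
    .add("Model", DataTypes.StringType)
    .add("Milestone", DataTypes.StringType);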
I need to split this Dataset based on the "Model" column, so that it results in two Datasets like the ones below:

 m_oKafkaEvents_for_Opportunity_1_topic 
   {  
       {"EventTime":"1527005246864000000","InstanceID":"231","Model":"Opportunity_1","Milestone":"OrderProcessed"},  
       {"EventTime":"1527005246864000001","InstanceID":"233","Model":"Opportunity_1","Milestone":"OrderProcessed"}   
   }  

   m_oKafkaEvents_for_Opportunity_2_topic  
   {  
      {"EventTime":"1527005246864000002","InstanceID":"232","Model":"Opportunity_2","Milestone":"OrderProcessed"},  
      {"EventTime":"1527005246864000002","InstanceID":"234","Model":"Opportunity_2","Milestone":"OrderProcessed"}  
   }  
These Datasets would then be published into a Kafka sink. The topic name would be the Model value, i.e.
Opportunity_1
Opportunity_2

So, for each value of the "Model" column, I need a handle on the corresponding list of events.
Since I am new to Spark, I am looking for help on how to achieve this in Java code.

Thanks for any help.

The simplest solution would look like this:

allEvents.selectExpr("topic", "CONCAT('m_oKafkaEvents_for_', Model, '_topic')")
        .write()
        .format("kafka")
        .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
        .save();
You can see an example here. But after looking at Spark's code, it seems that we can only have one topic per write, i.e. it will take the topic from the first row it encounters:

def write(
    sparkSession: SparkSession,
    queryExecution: QueryExecution,
    kafkaParameters: ju.Map[String, Object],
    topic: Option[String] = None): Unit = {
  val schema = queryExecution.analyzed.output
  validateQuery(schema, kafkaParameters, topic)
  queryExecution.toRdd.foreachPartition { iter =>
    val writeTask = new KafkaWriteTask(kafkaParameters, schema, topic)
    Utils.tryWithSafeFinally(block = writeTask.execute(iter))(
      finallyBlock = writeTask.close())
  }
}
Could you try this approach and report back here whether it works as described above? If it doesn't, you could go with one of the alternative solutions, such as:

  • Cache the main DataFrame and create 2 other DataFrames, filtered by the Model attribute
  • Use foreachPartition and a Kafka writer to send the messages without splitting the main Dataset

The first solution is pretty straightforward to implement and lets you use all of Spark's facilities for it. On the other hand, at least in theory, splitting the dataset should be slightly slower than the second approach. But try to measure before choosing one option or the other; maybe the difference is really small, and it is always better to use a clear approach approved by the community.

Below you can find some code showing both cases:

    SparkSession spark = SparkSession
                .builder()
                .appName("JavaStructuredNetworkWordCount")
                .getOrCreate();
        Dataset<Row> allEvents = spark.readStream().format("kafka")
                .option("kafka.bootstrap.servers", "")
                .option("subscribe", "event")
                .option("maxOffsetsPerTrigger", "100000")
                .option("startingOffsets", "latest")
                .option("failOnDataLoss", false)
                .load()
                // NOTE: replace null with the actual event schema (e.g. the oSchema from the question)
                .select(functions.from_json(functions.col("value").cast("string"), null).as("events"))
                .select("events.*");
    
    
        // First solution
        Dataset<Row> opportunity1Events = allEvents.filter("Model = 'Opportunity_1'");
        opportunity1Events.write().format("kafka").option("kafka.bootstrap.servers", "")
                .option("topic", "m_oKafkaEvents_for_Opportunity_1_topic").save();
        Dataset<Row> opportunity2Events = allEvents.filter("Model = 'Opportunity_2'");
        opportunity2Events.write().format("kafka").option("kafka.bootstrap.servers", "")
                .option("topic", "m_oKafkaEvents_for_Opportunity_2_topic").save();
        // Note: Kafka writer was added in 2.2.0 https://github.com/apache/spark/commit/b0a5cd89097c563e9949d8cfcf84d18b03b8d24c
    
        // Another approach with iteration throughout messages accumulated within each partition
        allEvents.foreachPartition(new ForeachPartitionFunction<Row>() {
            private KafkaProducer<String, Row> localProducer = new KafkaProducer<>(new HashMap<>());
    
            private final Map<String, String> modelsToTopics = new HashMap<>();
            {
                modelsToTopics.put("Opportunity_1", "m_oKafkaEvents_for_Opportunity_1_topic");
                modelsToTopics.put("Opportunity_2", "m_oKafkaEvents_for_Opportunity_2_topic");
            }
    
            @Override
            public void call(Iterator<Row> rows) throws Exception {
                // Route each row to the Kafka topic that corresponds to its Model value
                while (rows.hasNext()) {
                    Row currentRow = rows.next();
                    // you can reformat your row here or directly in Spark's map transformation
                    localProducer.send(new ProducerRecord<>(modelsToTopics.get(currentRow.getAs("Model")),
                            "some_message_key", currentRow));
                }
                // KafkaProducer accumulates messages in an in-memory buffer and sends them when a threshold is reached
                // Flush them synchronously here to be sure that every stored message was correctly
                // delivered
                // You can also play with features added in Kafka 0.11: the idempotent producer and the transactional producer
                localProducer.flush();
            }
        });
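
One caveat about the foreachPartition variant: localProducer is created above with an empty HashMap just to keep the example short; a real KafkaProducer needs at least the bootstrap servers and the key/value serializers. A minimal sketch of such a configuration (hypothetical helper; since Kafka has no built-in serializer for Spark's Row, you would normally convert each row to a JSON String before sending, hence the String serializers):

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.common.serialization.StringSerializer;

    // Hypothetical helper building a minimal configuration for the per-partition producer
    public final class LocalProducerFactory {
        public static KafkaProducer<String, String> create(String bootstrapServers) {
            Map<String, Object> config = new HashMap<>();
            // Broker list, e.g. "host1:port1,host2:port2"
            config.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
            // Keys and JSON-encoded values are sent as plain strings
            config.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            config.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            return new KafkaProducer<>(config);
        }
    }

Also keep in mind that m_oKafkaEvents comes from readStream, so the batch-style write() and foreachPartition calls shown here apply to a batch Dataset; for the streaming case you would run the equivalent logic per micro-batch (for example with foreachBatch, available from Spark 2.4).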
    