Scala Kafka + Spark Streaming: multi-topic processing in a single job


There are 40 topics in Kafka, and Spark Streaming jobs have been written to process 5 tables (topics) each. The only goal of each Spark Streaming job is to read its 5 Kafka topics and write them to the corresponding 5 HDFS paths. Most of the time this works fine, but sometimes it writes topic 1's data to one of the other HDFS paths.

The code below is an attempt to have a single Spark Streaming job process 5 topics and write each to its corresponding HDFS path, but it occasionally writes topic 1's data to HDFS path 5 instead of HDFS path 1.

Please share your suggestions:

import java.text.SimpleDateFormat
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.{ SparkConf, TaskContext }
import org.apache.spark.streaming.{ Seconds, StreamingContext }
import org.apache.spark.streaming.kafka010._

object SparkKafkaMultiConsumer extends App {

  override def main(args: Array[String]) {
    if (args.length < 1) {
      System.err.println(s"""
        |Usage: KafkaStreams auto.offset.reset latest/earliest table1,table2,etc 
        |
        """.stripMargin)
      System.exit(1)
    }

    val date_today = new SimpleDateFormat("yyyy_MM_dd");
    val date_today_hour = new SimpleDateFormat("yyyy_MM_dd_HH");
    val PATH_SEPERATOR = "/";

    import com.typesafe.config.ConfigFactory

    val conf = ConfigFactory.load("env.conf")
    val topicconf = ConfigFactory.load("topics.conf")


// Create context with custom second batch interval
val sparkConf = new SparkConf().setAppName("pt_streams")
val ssc = new StreamingContext(sparkConf, Seconds(conf.getString("kafka.duration").toLong))
var kafka_topics="kafka.topics"



// Create direct kafka stream with brokers and topics
var topicsSet = topicconf.getString(kafka_topics).split(",").toSet
if (args.length == 2) {
  print("This stream job will process table(s): " + args(1))
  topicsSet = args(1).split(",").toSet
}


val topicList = topicsSet.toList

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> conf.getString("kafka.brokers"),
  "zookeeper.connect" -> conf.getString("kafka.zookeeper"),
  "group.id" -> conf.getString("kafka.consumergroups"),
  "auto.offset.reset" -> args { 0 },
  "enable.auto.commit" -> (conf.getString("kafka.autoCommit").toBoolean: java.lang.Boolean),
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "security.protocol" -> "SASL_PLAINTEXT")



val messages = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams))

for (i <- 0 until topicList.length) {
  /**
   * Set a timer to see how much time the filter operation takes for each topic
   */
  val topicStream = messages.filter(_.topic().equals(topicList(i)))


  val data = topicStream.map(_.value())
  data.foreachRDD((rdd, batchTime) => {
    //        val data = rdd.map(_.value())
    if (!rdd.isEmpty()) {
      rdd.coalesce(1).saveAsTextFile(conf.getString("hdfs.streamoutpath") + PATH_SEPERATOR + topicList(i) + PATH_SEPERATOR + date_today.format(System.currentTimeMillis())
        + PATH_SEPERATOR + date_today_hour.format(System.currentTimeMillis()) + PATH_SEPERATOR + System.currentTimeMillis())
    }
  })
}



 try{
     // After all successful processing, commit the offsets to kafka
    messages.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      messages.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }
  } catch {
    case e: Exception => 
      e.printStackTrace()
      print("error while commiting the offset")

  }
// Start the computation
ssc.start()
ssc.awaitTermination()

  }

}
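
For illustration, here is a minimal sketch of an alternative to the per-topic filter loop above. It reuses messages, conf, PATH_SEPERATOR, date_today and date_today_hour from the code above and derives the output path from each record's own topic(); the caching, the distinct() step and the use of the batch time for the path timestamp are illustrative choices, not part of the original job.

// Sketch only: route every record by the topic recorded on the message itself,
// so a record can only land under the directory of the topic it arrived on.
messages.foreachRDD { (rdd, batchTime) =>
  if (!rdd.isEmpty()) {
    // Extract (topic, value) pairs; only serializable strings leave this map.
    val byTopic = rdd.map(record => (record.topic(), record.value())).cache()

    // Topics actually present in this batch (usually a short list).
    val topicsInBatch = byTopic.keys.distinct().collect()

    topicsInBatch.foreach { topic =>
      val ts = batchTime.milliseconds
      val outPath = conf.getString("hdfs.streamoutpath") + PATH_SEPERATOR + topic +
        PATH_SEPERATOR + date_today.format(ts) +
        PATH_SEPERATOR + date_today_hour.format(ts) +
        PATH_SEPERATOR + ts
      byTopic.filter(_._1 == topic).map(_._2).coalesce(1).saveAsTextFile(outPath)
    }

    byTopic.unpersist()
  }
}

Because the path is computed from the record itself rather than from a loop index captured in a closure, a batch from topic 1 cannot end up under topic 5's directory.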

You're better off using Kafka Connect with its HDFS connector for this. It is open source and available standalone or as part of a larger platform. A simple configuration file streams Kafka topics to HDFS, and if you have a schema for your data it will create the Hive tables for you.


If you try to write this code yourself, you are reinventing the wheel; this is a solved problem.
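
For reference, a minimal sink configuration sketch, assuming the open-source kafka-connect-hdfs sink connector; the connector name, topic list, HDFS URL and flush size below are placeholders, not values from the question.

# Hypothetical standalone-worker properties for an HDFS sink (placeholder values)
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=table1,table2,table3,table4,table5
hdfs.url=hdfs://namenode:8020
flush.size=1000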

Sounds like a problem with your code around conf.getString("hdfs.streamoutpath"). In any case, you might want to look at the Kafka Connect API with the HDFS connector.

The program works most of the time; only occasionally does it load data into another path. The problem is in the line below:

messages.filter(_.topic().equals(topicList(i)))

But do keep in mind that the HDFS connector mainly handles Avro and JSON data; string format is an open PR.

Thanks for the suggestion. Right now we are not doing any data transformation, but in a few months we plan to implement transformations on the streaming data. So, could you suggest **how to consume 150+ Kafka topics efficiently with Spark Streaming?** Note: all 150+ topics need to be written to HDFS_PATH/topic_NAME/yymmdd/HH/SSSS/file.

@NSSaravanan Kafka Connect can write to partitions. If you want to do further processing, you may also want to look at KSQL or Kafka Streams.
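
If the requirement stays with Spark Streaming, one hedged option for covering 150+ topics with a single direct stream is to subscribe by pattern instead of maintaining an explicit topic list. SubscribePattern is part of the spark-streaming-kafka-0-10 ConsumerStrategies API; the "table.*" regex and the manyTopics name are assumptions, and ssc and kafkaParams are reused from the job above. Records can then be routed per topic and timestamp as in the earlier sketch.

import java.util.regex.Pattern

// Sketch: one direct stream over every topic matching an assumed naming convention.
val manyTopics = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.SubscribePattern[String, String](Pattern.compile("table.*"), kafkaParams))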
,但请认真记住,它主要是Avro和JSON数据。字符串格式是一个开放的PR。感谢您的建议。现在我们不做任何数据转换,但在几个月内计划在流式数据上实现数据转换。所以,你能建议,**如何使用spark流媒体高效地消费150+卡夫卡主题?**注意:所有150多个主题都需要写入HDFS_PATH/topic_NAME/yymmdd/HH/SSSS/file?@NSSaravanan-Kafka-Connect可以写入分区。如果您想做进一步的处理,您可能还想看看KSQL或Kafka流