Scala Spark with Kafka - can't get enough parallelization
I am running Spark with a local[8] configuration. The input is a Kafka stream with 8 brokers. But, as seen in the system monitor, it is not parallel enough; it seems that only about one node is working. The input from the Kafka streamer is about 1.6 GB, so processing should be much faster.

Kafka producer:
import java.io.{BufferedReader, FileReader}
import java.util
import java.util.{Collections, Properties}
import logparser.LogEvent
import org.apache.hadoop.conf.Configuration
import org.apache.kafka.clients.producer.{KafkaProducer, Producer, ProducerRecord}
import org.apache.kafka.common.serialization.StringDeserializer
object sparkStreaming{
private val NUMBER_OF_LINES = 100000000
val brokers ="localhost:9092,localhost:9093,localhost:9094,localhost:9095,localhost:9096,localhost:9097,localhost:9098,localhost:9099"
val topicName = "log-1"
val fileName = "data/HDFS.log"
val producer = getProducer()
// no hdfs , read from text file.
def produce(): Unit = {
try { //1. Get the instance of Configuration
val configuration = new Configuration
val fr = new FileReader(fileName)
val br = new BufferedReader(fr)
var line = ""
line = br.readLine
var count = 1
while (line != null && count < NUMBER_OF_LINES) {
System.out.println("Sending batch " + count + " " + line)
producer.send(new ProducerRecord[String, LogEvent](topicName, new LogEvent(count,line,System.currentTimeMillis())))
line = br.readLine
count = count + 1
}
producer.close()
System.out.println("Producer exited successfully for " + fileName)
} catch {
case e: Exception =>
System.out.println("Exception while producing for " + fileName)
System.out.println(e)
}
}
private def getProducer() : KafkaProducer[String,LogEvent] = { // create instance for properties to access producer configs
val props = new Properties
//Assign localhost id
props.put("bootstrap.servers", brokers)
props.put("auto.create.topics.enable", "true")
//Set acknowledgements for producer requests.
props.put("acks", "all")
//If the request fails, the producer can automatically retry,
props.put("retries", "100")
//Specify buffer size in config
props.put("batch.size", "16384")
//Wait up to linger.ms so that records can be batched together, reducing the number of requests
props.put("linger.ms", "1")
//The buffer.memory controls the total amount of memory available to the producer for buffering.
props.put("buffer.memory", "33554432")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "logparser.LogEventSerializer")
props.put("topic.metadata.refresh.interval.ms", "1")
val producer = new KafkaProducer[String, LogEvent](props)
producer
}
def sendBackToKafka(logEvent: LogEvent): Unit ={
producer.send(new ProducerRecord[String, LogEvent] ("times",logEvent))
}
def main (args: Array[String]): Unit = {
println("Starting to produce");
this.produce();
}
}
Everything that consumes from Kafka is limited by the number of partitions of the topic. One consumer per partition. How many do you have?
Although Spark can redistribute the work, it is not recommended, because you may spend more time exchanging information between executors than actually processing it.
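As an illustration of that trade-off, here is a minimal sketch of what such a redistribution could look like on the direct stream (messages) from the question's consumer code; the target of 8 partitions is an assumption chosen to match local[8], and the repartition forces a full shuffle of every record:

// Hypothetical sketch: force 8 RDD partitions per micro-batch regardless of the Kafka partition count.
// Every record is shuffled between executors, which is exactly the cost warned about above.
val redistributed = messages
  .map(_.value)     // ConsumerRecord[String, LogEvent] -> LogEvent
  .repartition(8)   // assumed value, matching local[8]
redistributed.foreachRDD(rdd => println("Partitions after repartition: " + rdd.getNumPartitions))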
There is one piece of information missing from your problem statement: how many partitions does your input topic log-1 have?
My guess is that such a topic has fewer than 8 partitions.
The parallelism of Spark Streaming (in the case of a Kafka source) is tied (modulo re-partitioning) to the total number of Kafka partitions it consumes (i.e. the partitions of the RDDs are taken from the Kafka partitions).
If, as I suspect, your input topic has only a few partitions, then for each micro-batch Spark Streaming will task only an equal number of nodes with the computation. All the other nodes will sit idle.
The fact that you see all the nodes working in an almost round-robin fashion is because Spark does not always choose the same node to process the data of the same partition, but actually actively mixes things up.
To get a better idea of what is happening, I would need to see some stats from the Spark UI Streaming page.
However, given the information you have provided so far, insufficient Kafka partitioning would be my best bet for this behaviour.
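To confirm the partition count, here is a minimal self-contained sketch (assuming a kafka-clients version that ships the AdminClient API, i.e. 0.11 or newer) that asks the brokers how many partitions log-1 actually has; the bootstrap address is reused from the question:

import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.AdminClient

object PartitionCheck {
  def main(args: Array[String]): Unit = {
    val props = new Properties
    props.put("bootstrap.servers", "localhost:9092")
    val admin = AdminClient.create(props)
    // describeTopics returns futures; all().get() blocks until the broker metadata arrives
    val partitionCount = admin.describeTopics(Collections.singleton("log-1"))
      .all().get()
      .get("log-1").partitions().size()
    println("log-1 has " + partitionCount + " partitions") // Spark Streaming parallelism is capped by this number
    admin.close()
  }
}

If the count comes back lower than 8, recreating the topic (or adding partitions) so that it matches the number of cores is a more direct fix than repartitioning in Spark.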
I would guess that Kafka is sharing resources with Spark on the 8-core processor.
Thank you for the detailed answer. I also found a good text that explains the case and the ideal number of partitions, in case anyone needs it:
package logparser
import java.io._
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka010._
object consumer extends App { // extends App gives the job an entry point; the streaming pipeline in the object body runs when it is launched
var tFromKafkaToSpark: Long = 0
var tParsing : Long = 0
val startTime = System.currentTimeMillis()
val CPUNumber = 8
val pw = new PrintWriter(new FileOutputStream("data/Streaming"+CPUNumber+"config2x.txt",false))
pw.write("Writing Started")
def printstarttime(): Unit ={
pw.print("StartTime : " + System.currentTimeMillis())
}
def printendtime(): Unit ={
pw.print("EndTime : " + System.currentTimeMillis());
}
val producer = getProducer()
private def getProducer() : KafkaProducer[String,TimeList] = { // create instance for properties to access producer configs
val props = new Properties
val brokers ="localhost:9090,"
//Assign localhost id
props.put("bootstrap.servers", brokers)
props.put("auto.create.topics.enable", "true")
//Set acknowledgements for producer requests.
props.put("acks", "all")
//If the request fails, the producer can automatically retry,
props.put("retries", "100")
//Specify buffer size in config
props.put("batch.size", "16384")
//Wait up to linger.ms so that records can be batched together, reducing the number of requests
props.put("linger.ms", "1")
//The buffer.memory controls the total amount of memory available to the producer for buffering.
props.put("buffer.memory", "33554432")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "logparser.TimeListSerializer")
props.put("topic.metadata.refresh.interval.ms", "1")
val producer = new KafkaProducer[String, TimeList](props)
producer
}
def sendBackToKafka(timeList: TimeList): Unit ={
producer.send(new ProducerRecord[String, TimeList]("times",timeList))
}
val topics = "log-1"
//val Array(brokers, ) = Array("localhost:9092","log-1")
val brokers = "localhost:9092"
// Create context with 2 second batch interval
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount").setMaster("local[" + CPUNumber + "]")
val ssc = new StreamingContext(sparkConf, Seconds(1))
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
var kafkaParams = Map[String, AnyRef]("metadata.broker.list" -> brokers)
kafkaParams = kafkaParams + ("bootstrap.servers" -> "localhost:9092,localhost:9093,localhost:9094,localhost:9095,localhost:9096,localhost:9097,localhost:9098,localhost:9099")
kafkaParams = kafkaParams + ("auto.offset.reset"-> "latest")
kafkaParams = kafkaParams + ("group.id" -> "test-consumer-group")
kafkaParams = kafkaParams + ("key.deserializer" -> classOf[StringDeserializer])
kafkaParams = kafkaParams + ("value.deserializer"-> "logparser.LogEventDeserializer")
//kafkaParams.put("zookeeper.connect", "192.168.101.165:2181");
kafkaParams = kafkaParams + ("enable.auto.commit"-> "true")
kafkaParams = kafkaParams + ("auto.commit.interval.ms"-> "1000")
kafkaParams = kafkaParams + ("session.timeout.ms"-> "20000")
kafkaParams = kafkaParams + ("metadata.max.age.ms"-> "1000")
val messages = KafkaUtils.createDirectStream[String, LogEvent](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, LogEvent](topicsSet, kafkaParams))
var started = false
val lines = messages.map(_.value)
val lineswTime = lines.map(event =>
{
event.addNextEventTime(System.currentTimeMillis())
event
}
)
lineswTime.foreachRDD(a => a.foreach(e => println(e.getTimeList)))
val logLines = lineswTime.map(
(event) => {
//println(event.getLogline.stringMessages.toString)
event.setLogLine(event.getContent)
println("Got event with id = " + event.getId)
event.addNextEventTime(System.currentTimeMillis())
println(event.getLogline.stringMessages.toString)
event
}
)
//logLines.foreachRDD(a => a.foreach(e => println(e.getTimeList + e.getLogline.stringMessages.toString)))
val x = logLines.map(le => {
le.addNextEventTime(System.currentTimeMillis())
sendBackToKafka(new TimeList(le.getTimeList))
le
})
x.foreachRDD(a => a.foreach(e => println(e.getTimeList)))
//logLines.map(ll => ll.addNextEventTime(System.currentTimeMillis()))
println("--------------***///*****-------------------")
//logLines.print(10)
/*
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
wordCounts.print()
*/
// Start the computation
ssc.start()
ssc.awaitTermination()
ssc.stop(false)
pw.close()
}