
Scala DSE Spark Streaming: long queue of active batches

Tags: scala, apache-spark, spark-streaming, cassandra-3.0, spark-cassandra-connector


I have the following code:

import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf()
  .setAppName("KafkaReceiver")
  .set("spark.cassandra.connection.host", "192.168.0.78")
  .set("spark.cassandra.connection.keep_alive_ms", "20000")
  .set("spark.executor.memory", "2g")
  .set("spark.driver.memory", "4g")
  .set("spark.submit.deployMode", "cluster")
  .set("spark.executor.instances", "3")
  .set("spark.executor.cores", "3")
  .set("spark.shuffle.service.enabled", "false")
  .set("spark.dynamicAllocation.enabled", "false")
  .set("spark.io.compression.codec", "snappy")
  .set("spark.rdd.compress", "true")
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.backpressure.initialRate", "200")
  .set("spark.streaming.receiver.maxRate", "500")

val sc = SparkContext.getOrCreate(conf)
val ssc = new StreamingContext(sc, Seconds(10))
val sqlContext = new SQLContext(sc)
val kafkaParams = Map[String, String](
  "bootstrap.servers" -> "192.168.0.113:9092",
  "group.id" -> "test-group-aditya",
  "auto.offset.reset" -> "largest")

val topics = Set("random")
val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
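
One caveat about the configuration above: spark.streaming.receiver.maxRate only throttles receiver-based streams, and this code uses the direct (receiver-less) Kafka API, whose intake is capped per Kafka partition through a different key. A minimal sketch, assuming the 10-second batch interval from the code; the value 100 is a placeholder to tune:

// Cap for the direct Kafka stream: each batch reads at most
// maxRatePerPartition * number_of_partitions * 10 (seconds) records.
conf.set("spark.streaming.kafka.maxRatePerPartition", "100")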
I am running the code with spark-submit using the following command:

dse> bin/dse spark-submit --class test.kafkatesting /home/aditya/test.jar
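
Note that settings such as spark.submit.deployMode, spark.driver.memory and the executor sizing are read when the driver JVM launches, so setting them in SparkConf from application code generally has no effect; they belong on the submit command line. A sketch of an equivalent invocation, reusing the class and jar path from the question (the flag values mirror the SparkConf above; --total-executor-cores 9 assumes 3 executors with 3 cores each on the DSE-managed standalone master):

dse> bin/dse spark-submit \
  --class test.kafkatesting \
  --deploy-mode cluster \
  --driver-memory 4g \
  --executor-memory 2g \
  --total-executor-cores 9 \
  /home/aditya/test.jar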

I have a three-node Cassandra DSE cluster installed on separate machines. Whenever I run the application, it pulls in a large amount of data and starts building up a queue of active batches, which in turn creates a backlog and long scheduling delays. How can I improve performance and control the queue so that it only takes a new batch once the current batch has finished processing?

I found the solution and made some optimizations in the code. Instead of saving the RDD, try creating a DataFrame; writing a DataFrame to Cassandra is much faster than writing an RDD. Also, increase the number of cores and the executor memory to get good results.
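
A minimal sketch of what that change could look like inside the streaming loop, using the Spark Cassandra Connector's DataFrame writer; the single-column schema and the keyspace/table names (test_ks, random_events) are placeholders, not taken from the question:

import org.apache.spark.sql.SaveMode

kafkaStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    import sqlContext.implicits._
    // The direct stream yields (key, value) pairs; keep only the values
    // and expose them as a one-column DataFrame.
    val df = rdd.map(_._2).toDF("payload")
    df.write
      .format("org.apache.spark.sql.cassandra")                          // connector's DataFrame source
      .options(Map("keyspace" -> "test_ks", "table" -> "random_events")) // placeholder names
      .mode(SaveMode.Append)
      .save()
  }
}

ssc.start()            // the snippet in the question stops before this,
ssc.awaitTermination() // but the context must be started for batches to run

Writing through the DataFrame API lets the connector batch mutations per partition, which is presumably the speed-up the answer refers to.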

Thanks.