Java Spark Streaming dynamic executors override Kafka parameters in cluster mode

I have written a Spark Streaming consumer to read data from Kafka and noticed some strange behaviour in the logs. The Kafka topic has 3 partitions, and the Spark Streaming job starts one executor for each partition.

The executor with ID 1 always picks up the parameters I provide when creating the streaming context, but the executors with IDs 2 and 3 always override the Kafka parameters.

20/01/14 12:15:05 WARN StreamingContext: Dynamic Allocation is enabled for this application. Enabling Dynamic allocation for Spark Streaming applications can cause data loss if Write Ahead Log is not enabled for non-replayable sources like Flume. See the programming guide for details on how to enable the Write Ahead Log.
20/01/14 12:15:05 INFO FileBasedWriteAheadLog_ReceivedBlockTracker: Recovered 2 write ahead log files from hdfs://tlabnamenode/checkpoint/receivedBlockMetadata
20/01/14 12:15:05 INFO DirectKafkaInputDStream: Slide time = 5000 ms
20/01/14 12:15:05 INFO DirectKafkaInputDStream: Storage level = Serialized 1x Replicated
20/01/14 12:15:05 INFO DirectKafkaInputDStream: Checkpoint interval = null
20/01/14 12:15:05 INFO DirectKafkaInputDStream: Remember interval = 5000 ms
20/01/14 12:15:05 INFO DirectKafkaInputDStream: Initialized and validated org.apache.spark.streaming.kafka010.DirectKafkaInputDStream@12665f3f
20/01/14 12:15:05 INFO ForEachDStream: Slide time = 5000 ms
20/01/14 12:15:05 INFO ForEachDStream: Storage level = Serialized 1x Replicated
20/01/14 12:15:05 INFO ForEachDStream: Checkpoint interval = null
20/01/14 12:15:05 INFO ForEachDStream: Remember interval = 5000 ms
20/01/14 12:15:05 INFO ForEachDStream: Initialized and validated org.apache.spark.streaming.dstream.ForEachDStream@a4d83ac
20/01/14 12:15:05 INFO ConsumerConfig: ConsumerConfig values: 
        auto.commit.interval.ms = 5000
        auto.offset.reset = latest
        bootstrap.servers = [1,2,3]
        check.crcs = true
        client.id = client-0
        connections.max.idle.ms = 540000
        default.api.timeout.ms = 60000
        enable.auto.commit = false
        exclude.internal.topics = true
        fetch.max.bytes = 52428800
        fetch.max.wait.ms = 500
        fetch.min.bytes = 1
        group.id = telemetry-streaming-service
        heartbeat.interval.ms = 3000
        interceptor.classes = []
        internal.leave.group.on.close = true
        isolation.level = read_uncommitted
        key.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
Here are the logs from the other executors:

20/01/14 12:15:04 INFO Executor: Starting executor ID 2 on host 1
20/01/14 12:15:04 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40324.
20/01/14 12:15:04 INFO NettyBlockTransferService: Server created on 1
20/01/14 12:15:04 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/01/14 12:15:04 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(2, matrix-hwork-data-05, 40324, None)
20/01/14 12:15:04 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(2, matrix-hwork-data-05, 40324, None)
20/01/14 12:15:04 INFO BlockManager: external shuffle service port = 7447
20/01/14 12:15:04 INFO BlockManager: Registering executor with local external shuffle service.
20/01/14 12:15:04 INFO TransportClientFactory: Successfully created connection to matrix-hwork-data-05/10.83.34.25:7447 after 1 ms (0 ms spent in bootstraps)
20/01/14 12:15:04 INFO BlockManager: Initialized BlockManager: BlockManagerId(2, matrix-hwork-data-05, 40324, None)
20/01/14 12:15:19 INFO CoarseGrainedExecutorBackend: Got assigned task 1
20/01/14 12:15:19 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
20/01/14 12:15:19 INFO TorrentBroadcast: Started reading broadcast variable 0
20/01/14 12:15:19 INFO TransportClientFactory: Successfully created connection to matrix-hwork-data-05/10.83.34.25:38759 after 2 ms (0 ms spent in bootstraps)
20/01/14 12:15:20 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 8.1 KB, free 6.2 GB)
20/01/14 12:15:20 INFO TorrentBroadcast: Reading broadcast variable 0 took 163 ms
20/01/14 12:15:20 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 17.9 KB, free 6.2 GB)
20/01/14 12:15:20 INFO KafkaRDD: Computing topic telemetry, partition 1 offsets 237352170 -> 237352311
20/01/14 12:15:20 INFO CachedKafkaConsumer: Initializing cache 16 64 0.75
20/01/14 12:15:20 INFO CachedKafkaConsumer: Cache miss for CacheKey(spark-executor-telemetry-streaming-service,telemetry,1)
20/01/14 12:15:20 INFO ConsumerConfig: ConsumerConfig values: 
        auto.commit.interval.ms = 5000
        auto.offset.reset = none
        bootstrap.servers = [1,2,3]
        check.crcs = true
        client.id = client-0
        connections.max.idle.ms = 540000
        default.api.timeout.ms = 60000
        enable.auto.commit = false
        exclude.internal.topics = true
        fetch.max.bytes = 52428800
        fetch.max.wait.ms = 500
If we look closely, auto.offset.reset on the first executor is latest, but for the other executors auto.offset.reset = none.
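
For comparison, here is a small diagnostic sketch (not part of the original job; the helper name and the call site are my own) that prints a Kafka parameter map, so the driver-side values built in init() below can be diffed line by line against the ConsumerConfig block each executor logs at startup:

import java.util.Map;

// Hypothetical helper: prints every entry of a Kafka parameter map so the driver-side
// values can be compared with the ConsumerConfig dump each executor writes on startup.
final class KafkaParamsDump {
    static void dump(String label, Map<String, Object> kafkaParams) {
        kafkaParams.forEach((key, value) ->
                System.out.println(label + " kafka param " + key + " = " + value));
    }
}

// Example call, e.g. just before KafkaUtils.createDirectStream(...) in execute():
// KafkaParamsDump.dump("driver-side", kafkaParams);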

Here is how I create the streaming context:

public void init() throws Exception {

        final String BOOTSTRAP_SERVERS = PropertyFileReader.getInstance()
                .getProperty("spark.streaming.kafka.broker.list");
        final String DYNAMIC_ALLOCATION_ENABLED = PropertyFileReader.getInstance()
                .getProperty("spark.streaming.dynamicAllocation.enabled");
        final String DYNAMIC_ALLOCATION_SCALING_INTERVAL = PropertyFileReader.getInstance()
                .getProperty("spark.streaming.dynamicAllocation.scalingInterval");
        final String DYNAMIC_ALLOCATION_MIN_EXECUTORS = PropertyFileReader.getInstance()
                .getProperty("spark.streaming.dynamicAllocation.minExecutors");
        final String DYNAMIC_ALLOCATION_MAX_EXECUTORS = PropertyFileReader.getInstance()
                .getProperty("spark.streaming.dynamicAllocation.maxExecutors");
        final String DYNAMIC_ALLOCATION_EXECUTOR_IDLE_TIMEOUT = PropertyFileReader.getInstance()
                .getProperty("spark.streaming.dynamicAllocation.executorIdleTimeout");
        final String DYNAMIC_ALLOCATION_CACHED_EXECUTOR_IDLE_TIMEOUT = PropertyFileReader.getInstance()
                .getProperty("spark.streaming.dynamicAllocation.cachedExecutorIdleTimeout");
        final String SPARK_SHUFFLE_SERVICE_ENABLED = PropertyFileReader.getInstance()
                .getProperty("spark.shuffle.service.enabled");
        final String SPARK_LOCALITY_WAIT = PropertyFileReader.getInstance().getProperty("spark.locality.wait");
        final String SPARK_KAFKA_CONSUMER_POLL_INTERVAL = PropertyFileReader.getInstance()
                .getProperty("spark.streaming.kafka.consumer.poll.ms");
        final String SPARK_KAFKA_MAX_RATE_PER_PARTITION = PropertyFileReader.getInstance()
                .getProperty("spark.streaming.kafka.maxRatePerPartition");
        final String SPARK_BATCH_DURATION_IN_SECONDS = PropertyFileReader.getInstance()
                .getProperty("spark.batch.duration.in.seconds");
        final String KAFKA_TOPIC = PropertyFileReader.getInstance().getProperty("spark.streaming.kafka.topic");

        LOGGER.debug("connecting to brokers ::" + BOOTSTRAP_SERVERS);
        LOGGER.debug("bootstrapping properties to create consumer");

        kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", BOOTSTRAP_SERVERS);
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "telemetry-streaming-service");
        kafkaParams.put("auto.offset.reset", "latest");
        kafkaParams.put("enable.auto.commit", false);
        kafkaParams.put("client.id","client-0");
        // Below property should be enabled in properties and changed based on
        // performance testing
        kafkaParams.put("max.poll.records",
                PropertyFileReader.getInstance().getProperty("spark.streaming.kafka.max.poll.records"));

        LOGGER.info("registering as a consumer with the topic :: " + KAFKA_TOPIC);
        topics = Arrays.asList(KAFKA_TOPIC);
        sparkConf = new SparkConf()
//                .setMaster(PropertyFileReader.getInstance().getProperty("spark.master.url"))
                .setAppName(PropertyFileReader.getInstance().getProperty("spark.application.name"))
                .set("spark.streaming.dynamicAllocation.enabled", DYNAMIC_ALLOCATION_ENABLED)
                .set("spark.streaming.dynamicAllocation.scalingInterval", DYNAMIC_ALLOCATION_SCALING_INTERVAL)
                .set("spark.streaming.dynamicAllocation.minExecutors", DYNAMIC_ALLOCATION_MIN_EXECUTORS)
                .set("spark.streaming.dynamicAllocation.maxExecutors", DYNAMIC_ALLOCATION_MAX_EXECUTORS)
                .set("spark.streaming.dynamicAllocation.executorIdleTimeout", DYNAMIC_ALLOCATION_EXECUTOR_IDLE_TIMEOUT)
                .set("spark.streaming.dynamicAllocation.cachedExecutorIdleTimeout",
                        DYNAMIC_ALLOCATION_CACHED_EXECUTOR_IDLE_TIMEOUT)
                .set("spark.shuffle.service.enabled", SPARK_SHUFFLE_SERVICE_ENABLED)
                .set("spark.locality.wait", SPARK_LOCALITY_WAIT)
                .set("spark.streaming.kafka.consumer.poll.ms", SPARK_KAFKA_CONSUMER_POLL_INTERVAL)
                .set("spark.streaming.kafka.maxRatePerPartition", SPARK_KAFKA_MAX_RATE_PER_PARTITION);

        LOGGER.debug("creating streaming context with minutes batch interval  ::: " + SPARK_BATCH_DURATION_IN_SECONDS);
        streamingContext = new JavaStreamingContext(sparkConf,
                Durations.seconds(Integer.parseInt(SPARK_BATCH_DURATION_IN_SECONDS)));

        /*
         * todo: add checkpointing to the streaming context to recover from driver
         * failures and also for offset management
         */
        LOGGER.info("checkpointing the streaming transactions at hdfs path :: /checkpoint");
        streamingContext.checkpoint("/checkpoint");
        streamingContext.addStreamingListener(new DataProcessingListener());
    }

    @Override
    public void execute() throws InterruptedException {
        LOGGER.info("started telemetry pipeline executor to consume data");
        // Consume data from the Kafka topic
        JavaInputDStream<ConsumerRecord<String, String>> telemetryStream = KafkaUtils.createDirectStream(
                streamingContext, LocationStrategies.PreferConsistent(),
                ConsumerStrategies.Subscribe(topics, kafkaParams));

        telemetryStream.foreachRDD(rawRDD -> {
            if (!rawRDD.isEmpty()) {
                OffsetRange[] offsetRanges = ((HasOffsetRanges) rawRDD.rdd()).offsetRanges();
                LOGGER.debug("list of OffsetRanges getting processed as a string :: "
                        + Arrays.asList(offsetRanges).toString());
                System.out.println("offsetRanges : " + offsetRanges.length);
                SparkSession spark = JavaSparkSessionSingleton.getInstance(rawRDD.context().getConf());
                JavaPairRDD<String, String> flattenedRawRDD = rawRDD.mapToPair(record -> {
                    //LOGGER.debug("flattening JSON record with telemetry json value ::: " + record.value());
                    ObjectMapper om = new ObjectMapper();
                    JsonNode root = om.readTree(record.value());
                    Map<String, JsonNode> flattenedMap = new FlatJsonGenerator(root).flatten();
                    JsonNode flattenedRootNode = om.convertValue(flattenedMap, JsonNode.class);
                    //LOGGER.debug("creating Tuple for the JSON record Key :: " + flattenedRootNode.get("/name").asText()
                    //      + ", value :: " + flattenedRootNode.toString());
                    return new Tuple2<String, String>(flattenedRootNode.get("/name").asText(),
                            flattenedRootNode.toString());
                });

                Dataset<Row> rawFlattenedDataRDD = spark
                        .createDataset(flattenedRawRDD.rdd(), Encoders.tuple(Encoders.STRING(), Encoders.STRING()))
                        .toDF("sensor_path", "sensor_data");
                Dataset<Row> groupedDS = rawFlattenedDataRDD.groupBy(col("sensor_path"))
                        .agg(collect_list(col("sensor_data").as("sensor_data")));

                Dataset<Row> lldpGroupedDS = groupedDS.filter((FilterFunction<Row>) r -> r.getString(0).equals("Cisco-IOS-XR-ethernet-lldp-oper:lldp/nodes/node/neighbors/devices/device"));

                LOGGER.info("printing the LLDP GROUPED DS ------------------>");
                lldpGroupedDS.show(2);
                LOGGER.info("creating telemetry pipeline to process the telemetry data");

                HashMap<Object, Object> params = new HashMap<>();
                params.put(DPConstants.OTSDB_CONFIG_F_PATH, ExternalizedConfigsReader.getPropertyValueFromCache("/opentsdb.config.file.path"));
                params.put(DPConstants.OTSDB_CLIENT_TYPE, ExternalizedConfigsReader.getPropertyValueFromCache("/opentsdb.client.type"));

                try {
                    LOGGER.info("<-------------------processing lldp data and write to hive STARTED ----------------->");
                    Pipeline lldpPipeline = PipelineFactory.getPipeline(PipelineType.LLDPTELEMETRY);
                    lldpPipeline.process(lldpGroupedDS, null);
                    LOGGER.info("<-------------------processing lldp data and write to hive COMPLETED ----------------->");

                    LOGGER.info("<-------------------processing groupedDS data and write to OPENTSDB STARTED ----------------->");
                    Pipeline pipeline = PipelineFactory.getPipeline(PipelineType.TELEMETRY);
                    pipeline.process(groupedDS, params);
                    LOGGER.info("<-------------------processing groupedDS data and write to OPENTSDB COMPLETED ----------------->");

                }catch (Throwable t){
                    t.printStackTrace();
                }

                LOGGER.info("commiting offsets after processing the batch");
                ((CanCommitOffsets) telemetryStream.inputDStream()).commitAsync(offsetRanges);

            }
        });

        streamingContext.start();
        streamingContext.awaitTermination();
    }
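
To see which executor actually serves each Kafka partition in a given batch, a small diagnostic fragment like the following (hypothetical, not in the original code; it relies on Spark's TaskContext and SparkEnv developer APIs) could be dropped into the existing foreachRDD above, next to the offset-range logging:

                // Hypothetical diagnostic inside foreachRDD(rawRDD -> { ... }): log which executor
                // processed which partition of the batch, to correlate with the per-executor
                // ConsumerConfig output shown in the logs above.
                rawRDD.foreachPartition(records -> {
                    int partitionId = org.apache.spark.TaskContext.get().partitionId();
                    String executorId = org.apache.spark.SparkEnv.get().executorId();
                    System.out.println("partition " + partitionId + " handled by executor " + executorId);
                });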