Java Spark v3.0.0 - DAGScheduler: Broadcasting large task binary with size xx


I'm new to Spark. I'm writing a machine learning algorithm on Spark standalone (v3.0.0) with the following configuration:

SparkConf conf = new SparkConf();
conf.setMaster("local[*]");
conf.set("spark.driver.memory", "8g");
conf.set("spark.driver.maxResultSize", "8g");
conf.set("spark.memory.fraction", "0.6");
conf.set("spark.memory.storageFraction", "0.5");
conf.set("spark.sql.shuffle.partitions", "5");
conf.set("spark.memory.offHeap.enabled", "false");
conf.set("spark.reducer.maxSizeInFlight", "96m");
conf.set("spark.shuffle.file.buffer", "256k");
conf.set("spark.sql.debug.maxToStringFields", "100");

This is how I create the CrossValidator:

ParamMap[] paramGrid = new ParamGridBuilder()
        .addGrid(gbt.maxBins(), new int[]{50})
        .addGrid(gbt.maxDepth(), new int[]{2, 5, 10})
        .addGrid(gbt.maxIter(), new int[]{5, 20, 40})
        .addGrid(gbt.minInfoGain(), new double[]{0.0d, .1d, .5d})
        .build();

CrossValidator gbcv = new CrossValidator()
        .setEstimator(gbt)
        .setEstimatorParamMaps(paramGrid)
        .setEvaluator(gbevaluator)
        .setNumFolds(5)
        .setParallelism(8)
        .setSeed(session.getArguments().getTrainingRandom());

The problem is that when (in paramGrid) maxDepth is only {2, 5} and maxIter is {5, 20}, everything works fine, but with the code as shown above it keeps logging:

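DAGScheduler: Broadcasting large task binary with size xx

where xx ranges from 1000 KiB to 2.9 MiB, and this often ends in a timeout exception.

Which Spark parameters should I change to avoid this?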
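
For a sense of scale, here is a small sketch that counts how many models CrossValidator fits for the two grids described in the question. It assumes the working grid differs only in maxDepth and maxIter (maxBins and minInfoGain unchanged), and the class and method names are made up for illustration. It does not count the final refit of the best model on the full training set.

```java
class GridSize {
    // The number of parameter combinations is the product of the
    // number of values supplied for each grid axis.
    static int combinations(int... axisSizes) {
        int n = 1;
        for (int size : axisSizes) {
            n *= size;
        }
        return n;
    }

    // During cross-validation one model is fitted per (combination, fold) pair.
    static int fits(int combinations, int numFolds) {
        return combinations * numFolds;
    }

    public static void main(String[] args) {
        int numFolds = 5; // setNumFolds(5) in the question

        // maxBins{50}, maxDepth{2,5}, maxIter{5,20}, minInfoGain{0.0,0.1,0.5}
        int small = combinations(1, 2, 2, 3);
        // maxBins{50}, maxDepth{2,5,10}, maxIter{5,20,40}, minInfoGain{0.0,0.1,0.5}
        int large = combinations(1, 3, 3, 3);

        System.out.println("small grid: " + small + " combinations, "
                + fits(small, numFolds) + " fits");
        System.out.println("large grid: " + large + " combinations, "
                + fits(large, numFolds) + " fits");
    }
}
```

The larger grid more than doubles the number of fits and also adds the most expensive settings (maxDepth 10, maxIter 40). Whether those settings are what pushes the task binary past the warning size is a guess here, not something the question confirms.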
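
For the timeout issue, consider changing the following setting:
set spark.sql.autoBroadcastJoinThreshold to -1.
This disables automatic broadcast joins, so the default 10 MB broadcast threshold no longer applies.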
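
A minimal sketch of how that setting could be applied in the style of the question's SparkConf setup. The class and method names are hypothetical, and it needs Spark on the classpath. Note that this property governs broadcast joins in Spark SQL. Whether it changes the task binary warning itself is not established in this thread.

```java
import org.apache.spark.SparkConf;

class BroadcastJoinConfig {
    // Returns the same SparkConf with automatic broadcast joins turned off.
    static SparkConf withoutAutoBroadcastJoin(SparkConf conf) {
        // -1 disables automatic broadcast joins (the default threshold is 10 MB).
        return conf.set("spark.sql.autoBroadcastJoinThreshold", "-1");
    }
}
```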
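
The solution that worked for me:
Reduce the task size => reduce the amount of data each task processes.
First, check the number of partitions with df.rdd.getNumPartitions()
Then increase the number of partitions: df.repartition(100)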
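
Since the question uses the Java API, the same two steps might look roughly like this in Java. This is only a sketch: the dataset is a stand-in built with spark.range, the class name is made up, and 100 is just the value from the answer rather than a recommended setting. It needs Spark on the classpath.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

class RepartitionSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                .appName("repartition-sketch")
                .getOrCreate();

        // Stand-in data; in the question this would be the training DataFrame.
        Dataset<Row> df = spark.range(0, 1_000_000).toDF();

        // Step 1: check how many partitions the data currently has.
        int before = df.rdd().getNumPartitions();

        // Step 2: spread the rows over more partitions so each task handles less data.
        Dataset<Row> repartitioned = df.repartition(100);
        int after = repartitioned.rdd().getNumPartitions();

        System.out.println("partitions: " + before + " -> " + after);
        spark.stop();
    }
}
```

repartition returns a new Dataset rather than modifying df in place, so the repartitioned Dataset is the one to pass on to the CrossValidator.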
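
Consider increasing the number of partitions so that your tasks are lightweight: each task will then process a smaller amount of data. Check this --> ..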