How to measure the time Spark needs to run an operation on a partitioned RDD?

java caching apache-spark

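I wrote a small Spark application that is supposed to measure the time Spark needs to run an operation on a partitioned RDD (a combineByKey function that sums a value).

My problem is that the first iteration seems to work correctly (the calculation takes about 25 ms), but the following iterations take much less time (about 5 ms). It looks to me as if Spark keeps data around without being asked to. Can I avoid this programmatically?

I need to know how long Spark takes to compute a new RDD (without any caching or persisting from earlier iterations). I think the duration should always be around 20-25 ms.

To force a recomputation, I moved the creation of the SparkContext into the for loop, but that did not change anything.

Thanks for your advice!

Here is my code, which seems to keep some data around: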

public static void main(String[] args) {

    switchOffLogging();

    try {
        // Setup: Read out parameters & initialize SparkContext
        String path = args[0];
        SparkConf conf = new SparkConf(true);
        JavaSparkContext sc;

        // Print the header of the result table
        System.out.println("\npar.\tCount\tinput.p\tcons.p\tTime");

        // The RDDs used for the benchmark
        JavaRDD<String> input = null;
        JavaPairRDD<Integer, String> pairRDD = null;
        JavaPairRDD<Integer, String> partitionedRDD = null;
        JavaPairRDD<Integer, Float> consumptionRDD = null;

        // Run the same benchmark 10 times
        for (int i = 0; i < 10; i++) {
            boolean partitioning = true;
            int partitionsCount = 8;

            sc = new JavaSparkContext(conf);
            setS3credentials(sc, path);

            input = sc.textFile(path);
            pairRDD = mapToPair(input);

            partitionedRDD = partition(pairRDD, partitioning, partitionsCount);

            // Measure the duration
            long duration = System.currentTimeMillis();
            // Do the relevant function
            consumptionRDD = partitionedRDD.combineByKey(createCombiner, mergeValue, mergeCombiners);
            duration = System.currentTimeMillis() - duration;

            // Run an action to trigger the computation
            System.out.println(consumptionRDD.collect().size());

            // Print the results
            System.out.println("\n" + partitioning + "\t" + partitionsCount + "\t" + input.partitions().size() + "\t" + consumptionRDD.partitions().size() + "\t" + duration + " ms");

            input = null;
            pairRDD = null;
            partitionedRDD = null;
            consumptionRDD = null;

            // close() also stops the underlying SparkContext, so a separate stop() is not needed
            sc.close();

        }
    } catch (Exception e) {
        e.printStackTrace();
        System.out.println(e.getMessage());
    }
}
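One thing worth checking in the code above: combineByKey is a transformation, and Spark evaluates transformations lazily. The timed region may therefore only cover building the RDD lineage, while the actual work happens later inside collect(). The following Spark-free sketch uses java.util.stream, whose intermediate operations are lazy as well, purely as an analogy. The names LazyTimingDemo, buildPipeline and processed are hypothetical and exist only for this illustration.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.LongStream;

public class LazyTimingDemo {

    // Counts how many elements have actually been processed so far
    public static final AtomicLong processed = new AtomicLong();

    // Builds a lazy pipeline (analogous to calling a transformation on an RDD).
    // No element is processed until a terminal operation runs.
    public static LongStream buildPipeline(long n) {
        return LongStream.rangeClosed(1, n).map(x -> {
            processed.incrementAndGet();
            return x * x;
        });
    }

    public static void main(String[] args) {
        // Timing only the construction measures almost nothing
        long t0 = System.nanoTime();
        LongStream pipeline = buildPipeline(1_000_000);
        long buildNanos = System.nanoTime() - t0;
        System.out.println("processed after build: " + processed.get());

        // Timing the terminal operation (analogous to a Spark action) measures the work
        long t1 = System.nanoTime();
        long sum = pipeline.sum();
        long runNanos = System.nanoTime() - t1;
        System.out.println("processed after sum:   " + processed.get());

        System.out.println("sum = " + sum);
        System.out.println("build: " + buildNanos / 1_000 + " us, run: " + runNanos / 1_000 + " us");
    }
}
```

Applied to the benchmark, this would suggest taking the end timestamp after consumptionRDD.collect() has returned rather than before it, and using System.nanoTime() for elapsed time since it is a monotonic clock.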

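If the shuffle output is small enough, Spark's shuffle files end up in the OS buffer cache, because fsync is never called explicitly. This means that, as long as there is room, your data stays in memory.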

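If you truly need a cold performance test, you can try something like flushing the OS disk cache, but that will slow things down between tests. Could you just spin the context up and down instead? That might meet your need.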

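I have now found a solution that works for me: I wrote a separate class that calls the spark-submit command in a new process. This can be done in a loop, so every benchmark starts in a fresh process and each process has its own SparkContext. That way garbage collection is done and everything works fine.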

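import java.io.BufferedReader;
import java.io.InputStreamReader;

// submitParams and javaFlags are assumed to be defined elsewhere in the launcher class
String submitCommand = "/root/spark/bin/spark-submit " + submitParams + " --class partitioning.PartitionExample /root/partitioning.jar " + javaFlags;
Process p = Runtime.getRuntime().exec(submitCommand);

// Read the child's output before waiting, so a full pipe buffer cannot block the child
String line;
try (BufferedReader reader = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
    while ((line = reader.readLine()) != null) {
        System.out.println(line);
    }
}

// Print the exit code of spark-submit
System.out.println(p.waitFor());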
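The loop described above could look roughly like the following sketch. It uses ProcessBuilder instead of Runtime.exec so that the arguments are passed as a list and stderr is merged into stdout. SubmitLoopSketch, buildCommand and runOnce are hypothetical names, the extra submit parameters from the snippet above are left out, and the paths are simply the ones quoted in this answer.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SubmitLoopSketch {

    // Builds the argument list: <sparkSubmit> --class <mainClass> <jar> <appArgs...>
    public static List<String> buildCommand(String sparkSubmit, String mainClass,
                                            String jar, List<String> appArgs) {
        List<String> cmd = new ArrayList<>();
        cmd.add(sparkSubmit);
        cmd.add("--class");
        cmd.add(mainClass);
        cmd.add(jar);
        cmd.addAll(appArgs);
        return cmd;
    }

    // Starts one child process, echoes its output and returns its exit code
    public static int runOnce(List<String> command) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(command)
                .redirectErrorStream(true) // merge stderr into stdout
                .start();
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        return p.waitFor();
    }

    public static void main(String[] args) throws Exception {
        List<String> cmd = buildCommand("/root/spark/bin/spark-submit",
                "partitioning.PartitionExample", "/root/partitioning.jar",
                Arrays.asList(args));
        // Each iteration runs the benchmark in a fresh JVM with its own SparkContext
        for (int i = 0; i < 10; i++) {
            int exitCode = runOnce(cmd);
            System.out.println("run " + i + " finished with exit code " + exitCode);
        }
    }
}
```

Starting a new JVM per run avoids state shared inside one driver process, but it does not by itself clear the OS buffer cache mentioned in the other answer.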

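I tried the suggested Linux commands, which say to run 1) "sudo sync" and 2) "echo 3 > sudo /proc/sys/vm/drop_caches", but without success. I also tried the SparkConf.set method, SparkConf conf = new SparkConf(true).set("spark.files.useFetchCache", "false"); but that had no effect either.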

part.   Count   input.p cons.p  Time
true    8       6       8       20 ms
true    8       6       8       23 ms
true    8       6       8       7 ms        // Too low!
true    8       6       8       21 ms
true    8       6       8       13 ms
true    8       6       8       6 ms        // Too low!
true    8       6       8       5 ms        // Too low!
true    8       6       8       6 ms        // Too low!
true    8       6       8       4 ms        // Too low!
true    8       6       8       7 ms        // Too low!
