
Poor SparkSQL join performance in Java


I am using SparkSQL to compute a fact table with 5 dimensions. I am running into performance problems (the job takes several hours to complete), and even after searching Google thoroughly I cannot find a solution. These are the settings I tried tuning, without success:

sqlContext.sql("set spark.sql.shuffle.partitions=10"); // tried values from 10 to 5000
sqlContext.sql("set spark.sql.autoBroadcastJoinThreshold=500000000"); // 500 MB; also tried 1 GB
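Besides raising the threshold, a broadcast can also be requested explicitly per table. This is a minimal sketch, assuming Spark 1.5+ and the existing sqlContext; the table and column names (t11, Dmn4, c25, c26) are taken from the query below, and the temp-table name r_with_dmn4 is hypothetical:

    import org.apache.spark.sql.DataFrame;
    import static org.apache.spark.sql.functions.broadcast;

    // Dmn4 holds only 12 rows, so shipping a copy to every executor is cheap.
    DataFrame r    = sqlContext.table("t11");
    DataFrame dmn4 = sqlContext.table("Dmn4");

    // broadcast() marks the small side so the planner picks a map-side
    // (broadcast) join instead of shuffling both inputs.
    DataFrame joined = r.join(broadcast(dmn4), r.col("c25").equalTo(dmn4.col("c26")));
    joined.registerTempTable("r_with_dmn4"); // hypothetical name, for reuse in SQL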
I suspect a data skew problem, because I can see the task and record distribution issues below. Most of the RDDs are well partitioned (500 records per partition), but the largest dimension is not partitioned at all (). Maybe that points to a solution? Below is the code I use to compute the dimensions and the fact table.
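One hedged way to attack the unpartitioned large dimension is to spread it by its join key before the query runs. A sketch, assuming Dmn5 joins on Date as in the query below; repartitioning by column needs Spark 1.6+, and older versions can fall back to a fixed partition count:

    import org.apache.spark.sql.DataFrame;

    DataFrame dmn5 = sqlContext.table("Dmn5");

    // Spread the ~1.27M-row dimension across partitions on its join key
    // so a single task does not receive all of its records.
    DataFrame dmn5Spread = dmn5.repartition(dmn5.col("Date")); // Spark 1.6+
    // Older Spark: DataFrame dmn5Spread = dmn5.repartition(200);
    dmn5Spread.registerTempTable("Dmn5"); // re-register under the same name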

Before this computation, Dmn1 has 56 rows, Dmn2 has 11, Dmn3 has 10, Dmn4 has 12, and Dmn5 has 1,275,533 rows. Everything runs on an AWS EMR cluster with 3 m3.2xlarge nodes (master + 2 slaves).
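Given that Dmn5 dwarfs every other dimension, if a handful of hot Date values dominate the data, key salting is a common workaround (not from the question): the large side gets a random salt, the small side is replicated once per salt value, and the join key becomes (key, salt). A rough sketch with hypothetical bigDf/smallDf and column names:

    import org.apache.spark.sql.DataFrame;
    import static org.apache.spark.sql.functions.*;

    int SALTS = 8; // hypothetical fan-out factor

    // Large, skewed side: append a random salt in [0, SALTS).
    DataFrame salted = bigDf.withColumn("salt",
            rand().multiply(SALTS).cast("int"));

    // Small side: replicate once per salt value so every (key, salt)
    // combination still finds its match.
    DataFrame replicated = null;
    for (int i = 0; i < SALTS; i++) {
        DataFrame slice = smallDf.withColumn("salt", lit(i));
        replicated = (replicated == null) ? slice : replicated.unionAll(slice);
    }

    // Join on the original key plus the salt; a skewed key now splits
    // across up to SALTS tasks instead of landing in one.
    DataFrame joined = salted.join(replicated,
            salted.col("key").equalTo(replicated.col("key"))
                  .and(salted.col("salt").equalTo(replicated.col("salt"))));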

Can you post the result of calling .explain() on the SQL? In the end, here it is:
    resultDmn1.registerTempTable("Dmn1");
    resultDmn2.registerTempTable("Dmn2");
    resultDmn3.registerTempTable("Dmn3");
    resultDmn4.registerTempTable("Dmn4");
    resultDmn5.registerTempTable("Dmn5");

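    // Join the raw tables (t1..t11) to the five registered dimensions and
    // project each dimension's surrogate DmnId into the fact row.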
    DataFrame resultFact = sqlContext.sql("SELECT DISTINCT\n" +
            "    0 AS FactId,\n" +
            "    rs.c28 AS c28,\n" +
            "    dop.DmnId AS dmn_id_dim4,\n" +
            "    dh.DmnId AS dmn_id_dim5,\n" +
            "    op.DmnId AS dmn_id_dim3,\n" +
            "    du.DmnId AS dmn_id_dim2,\n" +
            "    dc.DmnId AS dmn_id_dim1\n" +
            "FROM\n" +
            "    t10 rs\n" +
            "        JOIN\n" +
            "    t11 r ON rs.c29 = r.id\n" +
            "        JOIN\n" +
            "    Dmn4 dop ON dop.c26 = r.c25\n" +
            "        JOIN\n" +
            "    Dmn5 dh ON dh.Date = r.c27\n" +
            "        JOIN\n" +
            "    Dmn3 du ON du.c9 = r.c16\n" +
            "        JOIN\n" +
            "    t1 d ON r.c5 = d.id\n" +
            "        JOIN\n" +
            "    t2 di ON d.id = di.c5\n" +
            "        JOIN\n" +
            "    t3 s ON d.c6 = s.id\n" +
            "        JOIN\n" +
            "    t4 p ON s.c7 = p.id\n" +
            "        JOIN\n" +
            "    t5 o ON p.c8 = o.id\n" +
            "        JOIN\n" +
            "    Dmn1 op ON op.c1 = di.c1\n" +
            "        JOIN\n" +
            "    t9 ci ON ci.id = r.c24\n" +
            "        JOIN\n" +
            "    Dmn3 dc ON dc.c18 = ci.c23\n" +
            "WHERE\n" +
            "    op.c2 = di.c2\n" +
            "        AND o.name = op.c30\n" +
            "        AND di.c3 = op.c3\n" +
            "        AND di.c4 = op.c4").toSchemaRDD();

     resultFact.cache();   // mark for caching before the first action...
     resultFact.count();   // ...so that this count materializes the cached data
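To produce the plan the commenter asked for, .explain() can be called on resultFact; it prints the physical plan without running the job and shows whether the broadcast threshold actually turned any of the joins into broadcast joins. A minimal sketch:

     // Prints the physical plan to stdout; look for BroadcastHashJoin vs.
     // a shuffle-based join to see which strategy the planner chose.
     resultFact.explain();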