Spark join performance on an EMR cluster (Scala, Apache Spark, HDFS)

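We have a 3-node Spark EMR cluster (m3.xlarge). We are trying to join some large tables of about 4 GB (250+ columns) with some small reference tables (15 of them), each with 2-3 columns. We are using Spark dynamic allocation, which is enabled by default on EMR.

Writing the result to HDFS takes more than an hour (this is because we use coalesce(1) on the final DataFrame).

We also tried broadcast joins, but with no luck so far. How can we improve performance in these areas?

What execution time should we ultimately expect for this process?

What are the possible ways to improve performance?

Any help would be appreciated.

Here is my join function: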

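import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.broadcast
import scala.collection.mutable.MutableList

def multiJoins(MasterTablesDF: DataFrame, tmpReferenceTablesDF_List: MutableList[DataFrame], tmpReferenceTableJoinDetailsList: MutableList[Array[String]], DrivingTable: String): DataFrame = {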

// Define final output of Driving Table
var final_df: DataFrame = null

if (MasterTablesDF != null) {

  if (!MasterTablesDF.head(1).isEmpty && tmpReferenceTablesDF_List.length >= 1) {

    for (i <- 0 until tmpReferenceTablesDF_List.length) {

      val eachReferenceTableDF = tmpReferenceTablesDF_List(i)
      var eachJoinDetails = tmpReferenceTableJoinDetailsList(i)

      //for first ref table Join
      if (i == 0) {
        println("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
        if (eachJoinDetails(0).equals(eachJoinDetails(1))) {
          println("############## Driving table and Ref table Joining columns are same joining first Drive table ==>" + DrivingTable + "With Ref table ==>" + eachJoinDetails(3))
          //if reftable and Driving table have same join columns using seq() to remove duplicate columns after Joins
          final_df = MasterTablesDF.join(broadcast(eachReferenceTableDF), Seq(eachJoinDetails(0)), eachJoinDetails(2)) //.select(ReqCols.head, ReqCols.tail: _*)
        } else {
          //if the joining column names of the driving and ref tables are not same then
          //using  driving table join col and reftable join cols
          println("############### Driving table and Ref table joining columns are not same joining first Drive table ==>" + DrivingTable + "With Ref table ==>" + eachJoinDetails(3) + "\n")
          final_df = MasterTablesDF.join(broadcast(eachReferenceTableDF), MasterTablesDF(eachJoinDetails(0)) === eachReferenceTableDF(eachJoinDetails(1)), eachJoinDetails(2))

        }

      } //Joining Next reference table dataframes with final DF
      else {
        if (eachJoinDetails(0).equals(eachJoinDetails(1))) {
          println("###### drive table and another ref table join cols are same joining driving table ==>" + DrivingTable + "With RefTable" + eachJoinDetails(3))
          final_df = final_df.join(broadcast(eachReferenceTableDF), Seq(eachJoinDetails(0)), eachJoinDetails(2)) //.select(ReqCols.head, ReqCols.tail: _*)
          // final_df.unpersist()
        } else {
          println("######  drive table and another ref table join cols are not same joining driving table ==>" + DrivingTable + "With RefTable" + eachJoinDetails(3) + "\n")
          final_df = final_df.join(broadcast(eachReferenceTableDF), MasterTablesDF(eachJoinDetails(0)) === eachReferenceTableDF(eachJoinDetails(1)), eachJoinDetails(2))

        }
      }
    }

  }
}

return final_df

//Writing is too slow
//final_df.coalesce(1).write.format("com.databricks.spark.csv").option("delimiter", "|").option("header", "true")
//      .csv(hdfsPath)

}

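Probably Spark cannot optimize a very long execution plan as well as it could. I ran into the same situation, and we made a series of optimizations:
1) Drop all unnecessary columns and apply filters as early as possible.
2) "Materialize" some tables before joining. This helps Spark break the lineage and optimize your flow to some degree (in our case, two SortJoin operations were replaced by broadcast joins because Spark realized the DataFrames were very small).
3) We partitioned all datasets by the same key and the same number of partitions (right after reading).
And some other optimizations. Together they reduced the job time from 45 minutes to 4. You need to look carefully at the Spark UI, where we found many useful optimization insights (for example, only one of our executors was working instead of 10, because all the data sat in a single partition), and so on. Good luck!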
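The three steps in the answer above could be sketched roughly as follows. This is only an illustration under assumptions: the answer does not say how the tables were materialized, so `cache()` followed by `count()` is one possible choice, and the object name `JoinTuningSketch`, the helper names, the toy tables, and the partition count of 4 are all hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{broadcast, col}

object JoinTuningSketch {

  // 1) Keep only the columns that are needed downstream, as early as possible.
  def pruneColumns(df: DataFrame, keep: Seq[String]): DataFrame =
    df.select(keep.map(col): _*)

  // 2) "Materialize" a DataFrame (assumed approach): cache it and force
  //    evaluation with an action, so later joins read the cached data.
  def materialize(df: DataFrame): DataFrame = {
    val cached = df.cache()
    cached.count() // action that triggers the computation and fills the cache
    cached
  }

  // 3) Repartition by the join key with an explicit partition count.
  def partitionByKey(df: DataFrame, key: String, numPartitions: Int): DataFrame =
    df.repartition(numPartitions, col(key))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("join-tuning-sketch")
      .master("local[2]") // local master only for trying the sketch out
      .getOrCreate()
    import spark.implicits._

    // Toy stand-ins for the driving table and one reference table.
    val master = Seq((1, "a", "unused"), (2, "b", "unused")).toDF("id", "value", "extra")
    val ref = Seq((1, "one"), (2, "two")).toDF("id", "label")

    val slimMaster = partitionByKey(pruneColumns(master, Seq("id", "value")), "id", 4)
    val smallRef = materialize(ref)

    val joined = slimMaster.join(broadcast(smallRef), Seq("id"), "left")
    joined.explain() // inspect the physical plan to see which join strategy was chosen
    joined.show()

    spark.stop()
  }
}
```

Checking the output of `explain()` (or the SQL tab of the Spark UI) is a quick way to confirm whether a broadcast join is actually being used for each reference table.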
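Thanks @Andrei. Is it OK to remove coalesce(1), let Spark write the data to HDFS, and afterwards call some hdfs command or shell script to merge all the part files into a single file?

... the tables in the join. I considered this, but a quick experiment showed that it did not bring much improvement, so I skipped this optimization, since we achieved it another way. Try it and share the results; it could be interesting. In any case, the write should not take that long, and you need to find the problem before the save. For comparison, writing 3 GB of output to a single csv file with coalesce(1) takes at most 1 minute on my EMR cluster.

Removing coalesce(1) and merging the part files with an external script reduced the execution time from 1 hour to 18 minutes. Also, I am reading the master file, i.e. the big table file, from S3 using s3LoadToDF = spark.read.option("header", "false").option("delimiter", "|").csv(s3ReadPath). Below is the spark-submit I am using; can you suggest some edits? EMR cluster, 4 nodes, m3.xlarge instance type:

spark-submit --class sparkJoins.main --master yarn --deploy-mode cluster --driver-memory 10g --num-executors 4 --executor-cores 3 --executor-memory 4g --conf spark.dynamicAllocation.enabled=false --conf spark.sql.shuffle.partitions=1000 --files spark_join.conf,/home/hadoop/scripts/hdpScript.sh sparkJoins.jar

What should sql.shuffle.partitions / parallelism be? Can you suggest edits to the spark-submit?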
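The comments mention dropping coalesce(1) and merging the part files afterwards with an external script, but the script itself is not shown. As a hedged alternative, the merge could also be done from Scala with the Hadoop `FileSystem` API. This is a minimal sketch, assuming the part files are plain text whose names start with `part-` and that the destination lives on the same file system; the object name `MergePartFiles` and its parameters are hypothetical.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.IOUtils

object MergePartFiles {

  // Concatenate every "part-*" file under srcDir into dstFile, in file-name order.
  // dstFile should be outside srcDir so it is never picked up as an input.
  def merge(conf: Configuration, srcDir: String, dstFile: String): Unit = {
    val src = new Path(srcDir)
    val fs = src.getFileSystem(conf)

    val parts = fs.listStatus(src)
      .filter(s => s.isFile && s.getPath.getName.startsWith("part-"))
      .sortBy(_.getPath.getName)

    val out = fs.create(new Path(dstFile), true) // true = overwrite if it exists
    try {
      parts.foreach { part =>
        val in = fs.open(part.getPath)
        // close = false: copyBytes leaves both streams open, so `out` can take the next part
        try IOUtils.copyBytes(in, out, conf, false)
        finally in.close()
      }
    } finally {
      out.close()
    }
  }
}
```

A possible usage, assuming a `SparkSession` named `spark` and a hypothetical `mergedPath`: write with `final_df.write.option("delimiter", "|").option("header", "false").csv(hdfsPath)` and then call `MergePartFiles.merge(spark.sparkContext.hadoopConfiguration, hdfsPath, mergedPath)`. With `header` set to `"true"`, every part file would carry its own header line, so the merged file would contain repeated headers.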