Performance issue in Spark Java

java performance apache-spark

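I am using Spark version 2.11, and my application performs only 3 basic operations:

Fetch records from the database: 2.2 million

Use contains to check which records from a file (5,000) are present in the database (2.2 million)

Write the matching records to a file in CSV format

These 3 operations take almost 20 minutes. If I do the same operations in SQL, they take less than 1 minute.

I started using Spark because it is supposed to produce results quickly, but it is taking far too much time. How can I improve performance?

Step 1: Fetch records from the database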

        Properties connectionProperties = new Properties();
        connectionProperties.put("user", "test");
        connectionProperties.put("password", "test##");
        // The parenthesized subquery is passed to Spark as the "table" to read.
        String query = "(SELECT * from items)";
        dataFileContent = spark.read().jdbc("jdbc:oracle:thin:@//172.20.0.11/devad", query, connectionProperties);

Step 2: Use contains to check which records of file A (5k) are present in file B (2M)

Dataset<Row> NewSet=source.join(target,target.col("ItemIDTarget").contains(source.col("ItemIDSource")),"inner");

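To improve performance, I tried several things, such as caching and data serialization

set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"),

shuffle settings

sqlContext.setConf("spark.sql.shuffle.partitions", "10"),

and data structure tuning

-XX:+UseCompressedOops

None of these approaches produced better performance.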

Improving performance here is mostly a matter of improving parallelism.

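Parallelism depends on the number of partitions in the RDD.

Make sure the Dataset/DataFrame/RDD has neither too many partitions nor too few.

Please check the suggestions below for improving your code. I am more comfortable with Scala, so the snippets may need minor adjustments for Java.

Step 1: Make sure you control the number of connections to the database by specifying numPartitions.

Number of connections = number of partitions

Below I have simply set num_partitions to 10; you will have to tune this value to get better performance.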

  int num_partitions = 10;
  Properties connectionProperties = new Properties();
  connectionProperties.put("user", "test");
  connectionProperties.put("password", "test##");
  // Concatenate num_partitions into the SQL text; inside the string literal it is not substituted.
  String query = "(SELECT mod(A.id, " + num_partitions + ") as hash_code, A.* from items A)";
  // Java has no named arguments. The order is: url, table, columnName,
  // lowerBound, upperBound, numPartitions, connectionProperties.
  // The partition column is passed as columnName, so no extra property is needed.
  dataFileContent = spark.read()
    .jdbc("jdbc:oracle:thin:@//172.20.0.11/devad",
      query,
      "hash_code",
      0L,
      num_partitions,
      num_partitions,
      connectionProperties);
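
To make the link between numPartitions and the number of database connections concrete, here is a minimal plain-Java sketch of how a partitioned JDBC read can be split into one range predicate per partition. It approximates the stride logic Spark applies to lowerBound, upperBound and numPartitions. The class and method names are hypothetical and are not part of the Spark API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Spark class: it only illustrates the idea.
class JdbcPartitionSketch {

    // Returns one WHERE-clause predicate per partition. Each predicate would be
    // issued as its own query, which is why connections = partitions.
    static List<String> predicates(String column, long lowerBound, long upperBound, int numPartitions) {
        if (numPartitions < 2) {
            // Kept simple: a single partition needs no range predicate at all.
            throw new IllegalArgumentException("sketch expects at least 2 partitions");
        }
        long stride = upperBound / numPartitions - lowerBound / numPartitions;
        List<String> result = new ArrayList<>();
        long current = lowerBound;
        for (int i = 0; i < numPartitions; i++) {
            // The first partition has no lower bound, the last one no upper bound,
            // so values outside [lowerBound, upperBound) are still read.
            String lower = (i == 0) ? null : column + " >= " + current;
            current += stride;
            String upper = (i == numPartitions - 1) ? null : column + " < " + current;
            if (lower == null) {
                result.add(upper + " OR " + column + " IS NULL");
            } else if (upper == null) {
                result.add(lower);
            } else {
                result.add(lower + " AND " + upper);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Same values as in step 1: hash_code = mod(id, 10), bounds 0 and 10.
        for (String predicate : predicates("hash_code", 0, 10, 10)) {
            System.out.println(predicate);
        }
    }
}
```

With the values from step 1 the stride is 1, so each of the 10 partitions reads exactly one hash_code value. The bounds only decide how rows are split across partitions; they do not filter any rows out.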

步骤2：

  Dataset<Row> NewSet = source.join(target,
    target.col("ItemIDTarget").contains(source.col("ItemIDSource")),
    "inner");
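
The join condition above is a substring test rather than an equality, so it cannot be evaluated with a hash lookup and row pairs have to be compared one by one. The plain-Java sketch below shows only the semantics and the resulting number of comparisons. The class name, method names and sample IDs are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of what an inner join on contains(...) has to do.
class ContainsJoinSketch {

    // Every target value is tested against every source value.
    static List<String[]> containsJoin(List<String> source, List<String> target) {
        List<String[]> matches = new ArrayList<>();
        for (String t : target) {
            for (String s : source) {
                if (t.contains(s)) {
                    matches.add(new String[] {s, t});
                }
            }
        }
        return matches;
    }

    // Number of pairwise checks a nested-loop evaluation needs.
    static long comparisons(long sourceRows, long targetRows) {
        return sourceRows * targetRows;
    }

    public static void main(String[] args) {
        List<String> source = Arrays.asList("A10", "B20");            // made-up IDs
        List<String> target = Arrays.asList("X-A10-01", "B20", "C30"); // made-up IDs
        for (String[] pair : containsJoin(source, target)) {
            System.out.println(pair[0] + " is contained in " + pair[1]);
        }
        // Sizes from the question: 5,000 file records against 2.2 million rows.
        System.out.println(comparisons(5_000L, 2_200_000L) + " comparisons");
    }
}
```

For the sizes in the question this comes to 11 billion substring checks. If the IDs are in fact expected to match exactly, an equality condition would let the join be evaluated with a hash-based strategy instead.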

Step 3: Use coalesce to reduce the number of partitions; this avoids a full shuffle

NewSet.coalesce(1).select("*")
        .write().format("com.databricks.spark.csv")
        .option("delimiter", ",")
        .option("header", "true")
        .option("treatEmptyValuesAsNulls", "true")  
        .option("nullValue", "")  
        .save(fileAbsolutePath);

I hope my answer helps you.

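Is there a reason to use Spark for this use case? In my view, writing the 5k records to the database and issuing a SQL join inside the database would be the most efficient approach. I mean, how long does it take to materialize this query in Spark:

SELECT * from items

?

Well, the

fetchsize

option could perhaps be added to the list. This option controls how many rows are fetched per round trip. Warning: too high a value can lead to OOM exceptions.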
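
As a rough sketch of the fetchsize suggestion, the option can be added to the same connection properties that are passed to spark.read().jdbc(...). The helper name and the value 1000 below are assumptions for illustration, not recommendations. The right value depends on row width and executor memory.

```java
import java.util.Properties;

// Hypothetical helper that builds the connection properties used in step 1.
class FetchSizeSketch {

    static Properties connectionProperties(String user, String password, int fetchSize) {
        Properties props = new Properties();
        props.put("user", user);
        props.put("password", password);
        // "fetchsize" is the Spark JDBC option for rows fetched per round trip.
        props.put("fetchsize", String.valueOf(fetchSize));
        return props;
    }

    public static void main(String[] args) {
        // 1000 is an assumed starting point; very large values risk OOM errors.
        Properties props = connectionProperties("test", "test##", 1000);
        System.out.println("fetchsize = " + props.getProperty("fetchsize"));
        // The properties would then be passed on unchanged, for example:
        // spark.read().jdbc(url, query, props);
    }
}
```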
