Apache Spark: using a plain SQL query vs. using Spark SQL methods

sql apache-spark

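I am new to Apache Spark and have a very basic question: which of the following two syntaxes performs better, using a plain SQL query or using Spark SQL methods such as select and filter? Below is a short Java example that should make the question clearer.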

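    import static org.apache.spark.sql.functions.col;

    import org.apache.spark.SparkConf;
    import org.apache.spark.sql.AnalysisException;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    private static void queryVsSparkSQL() throws AnalysisException {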
        SparkConf conf = new SparkConf();

        SparkSession spark = SparkSession
                .builder()
                .master("local[4]")
                .config(conf)
                .appName("queryVsSparkSQL")
                .getOrCreate();

        //using predefined query
        Dataset<Row> ds1 = spark
                .read()
                .format("jdbc")
                .option("url", "jdbc:oracle:thin:hr/hr@localhost:1521/orcl")
                .option("user", "hr")
                .option("password", "hr")
                .option("query","select * from hr.employees t where t.last_name = 'King'")
                .load();
        ds1.show();

        //using spark sql methods: select, filter
        Dataset<Row> ds2 = spark
                .read()
                .format("jdbc")
                .option("url", "jdbc:oracle:thin:hr/hr@localhost:1521/orcl")
                .option("user", "hr")
                .option("password", "hr")
                .option("dbtable", "hr.employees")
                .load()
                .select("*")
                .filter(col("last_name").equalTo("King"));

        ds2.show();
    }

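Run explain() and check whether predicate pushdown is used for the second query.

It should be used in the second case. If so, it is equivalent in performance to passing an explicit query through the query option, since that query already contains the filter.

Here is a simulated version of your approach, run against MySQL.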

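Case 1: a select statement, passed through the query option, that contains the filter

Here PushedFilters is empty because only the query option is used; the filter is part of the query that is actually sent to the database.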

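Case 2: no select statement; the filter is expressed through the Spark SQL API instead

PushedFilters is set to the filter condition, so the filtering is applied in the database itself before any data is returned to Spark. Note the * in front of each entry in PushedFilters: it indicates that the filter is evaluated at the data source, which here is the database.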
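
If you want to check this programmatically rather than by eye, something like the following could work. It is only a minimal sketch in plain Java: `PushedFiltersCheck` and its methods are hypothetical helpers, not part of Spark, and it assumes the plan text has the `PushedFilters: [...], ReadSchema` layout shown in this answer, which may differ between Spark versions. The sample plan strings are shortened for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Spark API): inspects the text printed by explain().
class PushedFiltersCheck {

    // Assumption: the plan contains "PushedFilters: [...], ReadSchema" on one line.
    private static final Pattern PUSHED =
            Pattern.compile("PushedFilters: \\[(.*?)\\], ReadSchema");

    // Returns the entries listed under PushedFilters, or an empty list if there are none.
    static List<String> pushedFilters(String planText) {
        List<String> filters = new ArrayList<>();
        Matcher m = PUSHED.matcher(planText);
        if (!m.find()) {
            return filters;
        }
        StringBuilder current = new StringBuilder();
        int depth = 0; // parenthesis depth, so the comma inside EqualTo(a,b) does not split
        for (char c : m.group(1).toCharArray()) {
            if (c == '(') depth++;
            if (c == ')') depth--;
            if (c == ',' && depth == 0) {
                addIfNotBlank(filters, current);
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        addIfNotBlank(filters, current);
        return filters;
    }

    // True when at least one filter is listed and every entry carries the * marker.
    static boolean allMarkedWithStar(String planText) {
        List<String> filters = pushedFilters(planText);
        return !filters.isEmpty() && filters.stream().allMatch(f -> f.startsWith("*"));
    }

    private static void addIfNotBlank(List<String> out, StringBuilder sb) {
        String s = sb.toString().trim();
        if (!s.isEmpty()) {
            out.add(s);
        }
    }

    public static void main(String[] args) {
        // Shortened, illustrative plan lines modelled on the two cases in this answer.
        String case1 = "Scan JDBCRelation((select 1) SUBQ) [numPartitions=1] [rfam_acc#1] "
                + "PushedFilters: [], ReadSchema: struct<rfam_acc:string>";
        String case2 = "Scan JDBCRelation(family) [numPartitions=1] [rfam_acc#1] "
                + "PushedFilters: [*IsNotNull(rfam_acc), *EqualTo(rfam_acc,RF01527)], "
                + "ReadSchema: struct<rfam_acc:string>";
        System.out.println(pushedFilters(case1));
        System.out.println(pushedFilters(case2));
        System.out.println(allMarkedWithStar(case2));
    }
}
```

An empty list for case 1 does not mean the filter is missing; it only means Spark added nothing on top of the query you supplied.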

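Summary

I ran both options and both were fast. In terms of how the database processes them, they are equivalent: only the filtered rows are returned to Spark. The two approaches use different mechanisms but give the same performance and the same physical result.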
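
To make the "two different mechanisms" a bit more concrete, here is a rough illustration of the kind of SQL text that ends up at the database in each case. This is a simplified sketch, not Spark's actual code: `JdbcSqlSketch` and its methods are made-up names, the real statement selects an explicit column list rather than `*`, and quoting and alias names depend on the Spark version and JDBC dialect.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; Spark builds its JDBC statements internally.
class JdbcSqlSketch {

    // Case 1: the user-supplied query is wrapped as an aliased subquery.
    // No extra WHERE clause is added because the filter is already inside the query.
    static String fromQueryOption(String query, String alias) {
        return "SELECT * FROM (" + query + ") " + alias;
    }

    // Case 2: the pushed filters are turned into a WHERE clause on the table.
    static String fromTableWithFilters(String table, List<String> pushedPredicates) {
        if (pushedPredicates.isEmpty()) {
            return "SELECT * FROM " + table;
        }
        StringBuilder where = new StringBuilder();
        for (String predicate : pushedPredicates) {
            if (where.length() > 0) {
                where.append(" AND ");
            }
            where.append('(').append(predicate).append(')');
        }
        return "SELECT * FROM " + table + " WHERE " + where;
    }

    public static void main(String[] args) {
        System.out.println(fromQueryOption(
                "select * from family where rfam_acc = 'RF01527'", "SPARK_GEN_SUBQ_4"));
        System.out.println(fromTableWithFilters(
                "family", Arrays.asList("rfam_acc IS NOT NULL", "rfam_acc = 'RF01527'")));
    }
}
```

Either way the database sees a statement that restricts rows to `rfam_acc = 'RF01527'`, which is why only the matching rows travel back to Spark in both cases.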

I hope you will accept the answer.

    // Case 1: the filter is part of the SQL passed through the "query" option
    val dataframe_mysql = sqlContext.read.format("jdbc")
      .option("url", "jdbc:mysql://mysql-rfam-public.ebi.ac.uk:4497/Rfam")
      .option("driver", "org.mariadb.jdbc.Driver")
      .option("query", "select * from family where rfam_acc = 'RF01527'")
      .option("user", "rfamro")
      .load()
      .explain()

    == Physical Plan ==
    *(1) Scan JDBCRelation((select * from family where rfam_acc = 'RF01527') SPARK_GEN_SUBQ_4) [numPartitions=1] [rfam_acc#867,rfam_id#868,auto_wiki#869L,description#870,author#871,seed_source#872,gathering_cutoff#873,trusted_cutoff#874,noise_cutoff#875,comment#876,previous_id#877,cmbuild#878,cmcalibrate#879,cmsearch#880,num_seed#881L,num_full#882L,num_genome_seq#883L,num_refseq#884L,type#885,structure_source#886,number_of_species#887L,number_3d_structures#888,num_pseudonokts#889,tax_seed#890,... 11 more fields] PushedFilters: [], ReadSchema: struct<rfam_acc:string,rfam_id:string,auto_wiki:bigint,description:string,author:string,seed_sour...

    // Case 2: the filter is expressed with the Spark SQL API on top of "dbtable"
    import org.apache.spark.sql.functions.col

    val dataframe_mysql = sqlContext.read.format("jdbc")
      .option("url", "jdbc:mysql://mysql-rfam-public.ebi.ac.uk:4497/Rfam")
      .option("driver", "org.mariadb.jdbc.Driver")
      .option("dbtable", "family")
      .option("user", "rfamro")
      .load()
      .select("*")
      .filter(col("rfam_acc").equalTo("RF01527"))
      .explain()

    == Physical Plan ==
    *(1) Scan JDBCRelation(family) [numPartitions=1] [rfam_acc#1149,rfam_id#1150,auto_wiki#1151L,description#1152,author#1153,seed_source#1154,gathering_cutoff#1155,trusted_cutoff#1156,noise_cutoff#1157,comment#1158,previous_id#1159,cmbuild#1160,cmcalibrate#1161,cmsearch#1162,num_seed#1163L,num_full#1164L,num_genome_seq#1165L,num_refseq#1166L,type#1167,structure_source#1168,number_of_species#1169L,number_3d_structures#1170,num_pseudonokts#1171,tax_seed#1172,... 11 more fields] PushedFilters: [*IsNotNull(rfam_acc), *EqualTo(rfam_acc,RF01527)], ReadSchema: struct<rfam_acc:string,rfam_id:string,auto_wiki:bigint,description:string,author:string,seed_sour...