
Scala: breaking a big Spark SQL query into smaller queries and merging the results


I have a big Spark SQL statement that I am trying to break into smaller chunks to improve code readability. I do not want to join the pieces, just merge (union) the results.

Currently working SQL statement -

val dfs = x.map(field => spark.sql(s"""
   select 'test' as Table_Name,
          '$field' as Column_Name,
          min($field) as Min_Value,
          max($field) as Max_Value,
          approx_count_distinct($field) as Unique_Value_Count,
          (
            SELECT 100 * approx_count_distinct($field)/count(1)
            from tempdftable
          ) as perc
   from tempdftable
"""))
I am trying to pull the query below out of the SQL above -

(SELECT 100 * approx_count_distinct($field)/count(1) from tempdftable) as perc
following this logic -

 val Perce = x.map(field => spark.sql(s"(SELECT 100 * approx_count_distinct($field)/count(1) from parquetDFTable)"))
and then merge this val Perce into the first big SQL statement as shown below, but it does not work -

val dfs = x.map(field => spark.sql(s"""
  select 'test' as Table_Name,
         '$field' as Column_Name,
         min($field) as Min_Value,
         max($field) as Max_Value,
         approx_count_distinct($field) as Unique_Value_Count,
         '"+Perce+ "'
  from tempdftable
"""))

How should we write this?
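One way to stitch the separately computed percentage back onto the per-field query, sketched here under the assumption that x holds the column names and tempdftable is registered as a temporary view, is to keep both pieces as single-row DataFrames and cross join them:

val dfs = x.map { field =>
  // per-field statistics, exactly one row
  val stats = spark.sql(s"""
    select 'test' as Table_Name,
           '$field' as Column_Name,
           min($field) as Min_Value,
           max($field) as Max_Value,
           approx_count_distinct($field) as Unique_Value_Count
    from tempdftable""")

  // the percentage, also exactly one row
  val perc = spark.sql(s"""
    select 100 * approx_count_distinct($field) / count(1) as perc
    from tempdftable""")

  // both sides are single-row DataFrames, so the cross join simply glues the columns together
  stats.crossJoin(perc)
}

val merged = dfs.reduce(_ union _)

Since each side has exactly one row per field, crossJoin (available from Spark 2.1 onward) just attaches the scalar column to the statistics row; the union at the end merges the per-field rows as before.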

Why not go all in and convert the whole expression into Spark code?

import spark.implicits._
import org.apache.spark.sql.functions._

// percentage of (approximately) distinct values over the total row count
val fraction = udf((approxCount: Double, totalCount: Double) => 100 * approxCount / totalCount)

val fields = Seq("colA", "colB", "colC")

// one single-row DataFrame of statistics per column
val dfs = fields.map(field => {
  tempdftable
    .select(
      min(field) as "Min_Value",
      max(field) as "Max_Value",
      approx_count_distinct(field) as "Unique_Value_Count",
      count(field) as "Total_Count")
    .withColumn("Table_Name", lit("test"))
    .withColumn("Column_Name", lit(field))
    .withColumn("Perc", fraction('Unique_Value_Count, 'Total_Count))
    .select('Table_Name, 'Column_Name, 'Min_Value, 'Max_Value, 'Unique_Value_Count, 'Perc)
})

val df = dfs.reduce(_ union _)
On a test example like this:

val tempdftable = spark.sparkContext.parallelize(List((3.0, 7.0, 2.0), (1.0, 4.0, 10.0), (3.0, 7.0, 2.0), (5.0, 0.0, 2.0))).toDF("colA", "colB", "colC")

tempdftable.show

+----+----+----+
|colA|colB|colC|
+----+----+----+
| 3.0| 7.0| 2.0|
| 1.0| 4.0|10.0|
| 3.0| 7.0| 2.0|
| 5.0| 0.0| 2.0|
+----+----+----+
we get

df.show

+----------+-----------+---------+---------+------------------+----+
|Table_Name|Column_Name|Min_Value|Max_Value|Unique_Value_Count|Perc|
+----------+-----------+---------+---------+------------------+----+
|      test|       colA|      1.0|      5.0|                 3|75.0|
|      test|       colB|      0.0|      7.0|                 3|75.0|
|      test|       colC|      2.0|     10.0|                 2|50.0|
+----------+-----------+---------+---------+------------------+----+
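As a side note, the fraction UDF above could also be replaced with built-in column arithmetic, which keeps the whole computation inside Spark's optimizer; a minimal sketch, assuming the same fields and tempdftable as above:

import org.apache.spark.sql.functions._

val dfsNoUdf = fields.map { field =>
  tempdftable
    .select(
      min(field) as "Min_Value",
      max(field) as "Max_Value",
      approx_count_distinct(field) as "Unique_Value_Count",
      count(field) as "Total_Count")
    .withColumn("Table_Name", lit("test"))
    .withColumn("Column_Name", lit(field))
    // plain column arithmetic instead of the fraction UDF
    .withColumn("Perc", lit(100) * col("Unique_Value_Count") / col("Total_Count"))
    .select("Table_Name", "Column_Name", "Min_Value", "Max_Value", "Unique_Value_Count", "Perc")
}

val dfNoUdf = dfsNoUdf.reduce(_ union _)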


Thanks, Glennie! This helps and I have accepted the answer, but I am good at SQL and a few of my expressions use analytic functions such as RANK; in pure Spark I have no idea how to achieve those results.

Importing org.apache.spark.sql.functions._ gives you most of the SQL functions, including rank ;)

Thanks again, Glennie! May I ask where I can get this kind of information, any documentation I can refer to in order to become a master :P :)

Well, yeah, hmm... I have to say the Spark documentation is not the best I have seen ;) Personally I learned a lot from reading blog posts (from Databricks, Cloudera, etc.), but mostly it is just that I have been working with Spark for almost 3 years. I will say, though, that learning the Spark syntax is well worth it even if you are good at SQL, and if you are familiar with LINQ or the Java Streams API it will not take you more than a few weeks to become proficient with Spark :)

Thanks, Glennie! That certainly helps...! :)
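On the RANK point from the comments: analytic functions are available in the DataFrame API through window specifications. A minimal sketch, assuming a hypothetical sales DataFrame with columns "dept" and "amount":

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// rank rows within each department by amount, highest first
// (sales is a hypothetical DataFrame used only for illustration)
val byDeptDesc = Window.partitionBy("dept").orderBy(col("amount").desc)

val ranked = sales.withColumn("rank", rank().over(byDeptDesc))
ranked.show()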