
Spark SQL: how to apply a specific function to all specified columns


I'm wondering whether there is a simple way to call the same SQL over multiple columns in Spark SQL.

For example, suppose I have a query that should be applied to most of the columns:

select
  min(c1) as min,
  max(c1) as max,
  max(c1) - min(c1) as range
from tb1
If there are many columns, is there a way to run the query over all of them and get the results in one pass?

Something like what df.describe does.
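For reference, describe already computes count, mean, stddev, min, and max for the named columns in one pass; what it cannot do is apply an arbitrary expression such as the range above. A minimal sketch, assuming a DataFrame df with these placeholder column names:

scala> df.describe("id", "some_int", "some_float").show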

Use the metadata contained in the DataFrame (the columns, in this case), which you can obtain via spark.table("") if you don't already have the column names; apply the desired functions and pass the result to df.select (or df.selectExpr).

Build some test data:

scala> val r = new scala.util.Random   // random source for the test values below

scala> var seq = Seq[(Int, Int, Float)]()
seq: Seq[(Int, Int, Float)] = List()

scala> (1 to 1000).foreach(n => { seq = seq :+ (n, r.nextInt, r.nextFloat) })

scala> val df = seq.toDF("id", "some_int", "some_float")
Declare some functions we want to run over all the columns:

scala> val functions_to_apply = Seq("min", "max")
functions_to_apply: Seq[String] = List(min, max)
Set up the final sequence of SQL columns:

scala> var select_columns = Seq[org.apache.spark.sql.Column]()
select_columns: Seq[org.apache.spark.sql.Column] = List()
Iterate over the columns and the functions to apply, filling select_columns:

scala> import org.apache.spark.sql.functions.expr

scala> val cols = df.columns

scala> cols.foreach(col => { functions_to_apply.foreach(f => { select_columns = select_columns :+ expr(s"$f($col)") }) })
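As an aside, the same select_columns can be built without the mutable var by flat-mapping over the columns; a sketch of an equivalent construction:

scala> val select_columns = df.columns.flatMap(c => functions_to_apply.map(f => expr(s"$f($c)")))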
Run the actual query:

scala> df.select(select_columns:_*).show

+-------+-------+-------------+-------------+---------------+---------------+
|min(id)|max(id)|min(some_int)|max(some_int)|min(some_float)|max(some_float)|
+-------+-------+-------------+-------------+---------------+---------------+
|      1|   1000|  -2143898568|   2147289642|   1.8781424E-4|     0.99964607|
+-------+-------+-------------+-------------+---------------+---------------+
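The df.selectExpr route mentioned at the top works the same way with plain SQL strings instead of Column objects, which also makes it easy to fold in the range expression from the question. A sketch; the ${c}_range aliases are my own naming:

scala> val exprs = df.columns.flatMap(c => Seq(s"min($c)", s"max($c)", s"max($c) - min($c) as ${c}_range"))

scala> df.selectExpr(exprs:_*).show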