Query a Spark dataframe based on the max column value
I have a Hive external partitioned table with the following structure:
hdfs://my_server/stg/my_table/project=foo/project_version=2.0/dt=20210105/file1.parquet
hdfs://my_server/stg/my_table/project=foo/project_version=2.0/dt=20210110/file2.parquet
hdfs://my_server/stg/my_table/project=foo/project_version=2.1/dt=20210201/file3.parquet
hdfs://my_server/stg/my_table/project=bar/project_version=2.0/dt=20210103/file4.parquet
hdfs://my_server/stg/my_table/project=bar/project_version=2.1/dt=20210210/file5.parquet
hdfs://my_server/stg/my_table/project=bar/project_version=2.1/dt=20210311/file6.parquet
hdfs://my_server/stg/my_table/project=big_project/project_version=1.1/dt=20210401/file7.parquet
hdfs://my_server/stg/my_table/project=big_project/project_version=1.1/dt=20210401/file8.parquet
hdfs://my_server/stg/my_table/project=big_project/project_version=1.1/dt=20210401/file9.parquet
For the max `project_version`, I want to return a dataframe containing the foo project's data, while avoiding reading records from any other project.

Due to limitations in the ETL process I can't query this table directly, so I'm trying to read the Parquet files directly:
val df_foo = spark.read.parquet("hdfs://my_server/stg/my_table/project=foo")
df_foo.printSchema
root
|-- clientid: string (nullable = true)
|-- some_field_i_care_about: string (nullable = true)
|-- project_version: double (nullable = true)
|-- dt: string (nullable = true)
df_foo.groupBy("project_version", "dt").count().show
+---------------+--------+------+
|project_version| dt| count|
+---------------+--------+------+
| 2.0|20210105|187234|
| 2.0|20210110|188356|
| 2.1|20210201|188820|
+---------------+--------+------+
val max_version = df_foo.groupBy().max("project_version")
max_version.show
+--------------------+
|max(project_version)|
+--------------------+
| 2.1|
+--------------------+
val df_foo_latest = df_foo.filter($"project_version" === max_version).count()
java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.Dataset [max(project_version): double]
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:77)
at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:163)
at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:163)
at scala.util.Try.getOrElse(Try.scala:79)
The `project_version` column is a double, and the `max_version` value is also a double, so why can't I compare these values in the filter?

Any help is appreciated.
`max_version` is of type `org.apache.spark.sql.DataFrame`, not `Double`. You have to extract the value from the dataframe. Check the code below:
scala> val df_foo = Seq((2.0,20210105,187234),(2.0,20210110,188356),(2.1,20210201,188820)).toDF("project_version","dt","count")
df_foo: org.apache.spark.sql.DataFrame = [project_version: double, dt: int ... 1 more field]
scala> val max_version = df_foo.groupBy().agg(max("project_version").as("version")).as[Double].collect.head
max_version: Double = 2.1
scala> val df_foo_latest = df_foo.filter($"project_version" === max_version).count()
df_foo_latest: Long = 1
scala> val df_foo_latest = df_foo.filter($"project_version" === max_version)
df_foo_latest: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [project_version: double, dt: int ... 1 more field]
scala> df_foo_latest.count
res1: Long = 1
scala> df_foo_latest.show(false)
+---------------+--------+------+
|project_version|dt |count |
+---------------+--------+------+
|2.1 |20210201|188820|
+---------------+--------+------+
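As an aside (my own sketch, not part of the original answer): the same scalar can also be pulled out without the `.as[Double]` encoder, by reading the single aggregated `Row` directly. This assumes the same `df_foo` dataframe, an active SparkSession, and `import org.apache.spark.sql.functions.max`:

```scala
import org.apache.spark.sql.functions.max

// agg(max(...)) produces a one-row dataframe; first() returns that Row,
// and getDouble(0) reads its only column as a Scala Double.
val maxVersion: Double = df_foo
  .agg(max("project_version"))
  .first()
  .getDouble(0)

val dfFooLatest = df_foo.filter($"project_version" === maxVersion)
```

Either way, the key point is that the filter needs a plain `Double` literal, not a dataframe.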
Instead of extracting the value from the dataframe, try using an inner join. It is safer:
scala> val max_version = df_foo.groupBy().max("project_version")
max_version: org.apache.spark.sql.DataFrame = [max(project_version): double]
scala> val max_version = df_foo.groupBy().agg(max("project_version").as("project_version"))
scala> val df_foo_latest = df_foo.join(max_version, Seq("project_version"), "inner")
scala> df_foo_latest.show(false)
+---------------+--------+------+
|project_version|dt |count |
+---------------+--------+------+
|2.1 |20210201|188820|
+---------------+--------+------+
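Another option, not shown in the answers above (my own sketch, assuming the same `df_foo`): compute the maximum with a window function spanning the whole dataframe, so no separate aggregation dataframe or join is needed:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.max

// An empty partitionBy() makes the window span the entire dataframe,
// so max("project_version") is attached to every row; rows that match
// the maximum are kept and the helper column is dropped.
val w = Window.partitionBy()
val dfFooLatest = df_foo
  .withColumn("max_version", max($"project_version").over(w))
  .filter($"project_version" === $"max_version")
  .drop("max_version")
```

Note that an unpartitioned window moves all rows into a single partition, so for large dataframes the aggregate-then-join approach above may scale better.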
Thanks @srinivas, that makes sense. What is the purpose of `Seq("project_version")` in the last line of code? Does it let you specify a join column that the two dataframes have in common?

Thanks. You can specify `Seq("project_version")` when the column is the same in both dataframes.