Query a Spark DataFrame by max column value


I have a Hive external partitioned table with the following directory structure:

hdfs://my_server/stg/my_table/project=foo/project_version=2.0/dt=20210105/file1.parquet
hdfs://my_server/stg/my_table/project=foo/project_version=2.0/dt=20210110/file2.parquet
hdfs://my_server/stg/my_table/project=foo/project_version=2.1/dt=20210201/file3.parquet
hdfs://my_server/stg/my_table/project=bar/project_version=2.0/dt=20210103/file4.parquet
hdfs://my_server/stg/my_table/project=bar/project_version=2.1/dt=20210210/file5.parquet
hdfs://my_server/stg/my_table/project=bar/project_version=2.1/dt=20210311/file6.parquet
hdfs://my_server/stg/my_table/project=big_project/project_version=1.1/dt=20210401/file7.parquet
hdfs://my_server/stg/my_table/project=big_project/project_version=1.1/dt=20210401/file8.parquet
hdfs://my_server/stg/my_table/project=big_project/project_version=1.1/dt=20210401/file9.parquet
For the max project_version, I want to return a DataFrame containing the foo project's data.
I want to avoid reading records from any other project.
Because of limitations in the ETL process I can't query this table directly, so I am trying to read the Parquet files directly:

val df_foo = spark.read.parquet("hdfs://my_server/stg/my_table/project=foo")
df_foo.printSchema

root
 |-- clientid: string (nullable = true)
 |-- some_field_i_care_about: string (nullable = true)
 |-- project_version: double (nullable = true)
 |-- dt: string (nullable = true)

df_foo.groupBy("project_version", "dt").count().show

+---------------+--------+------+
|project_version|      dt| count|
+---------------+--------+------+
|            2.0|20210105|187234|
|            2.0|20210110|188356|
|            2.1|20210201|188820|
+---------------+--------+------+

val max_version = df_foo.groupBy().max("project_version")
max_version.show

+--------------------+
|max(project_version)|
+--------------------+
|                 2.1|
+--------------------+

val df_foo_latest = df_foo.filter($"project_version" === max_version).count()

java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.Dataset [max(project_version): double]
  at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:77)
  at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:163)
  at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:163)
  at scala.util.Try.getOrElse(Try.scala:79)
The project_version column is a double and the max_version value is also a double, so why can't I compare these values in the filter?


Any help is appreciated.

max_version is of type org.apache.spark.sql.DataFrame, not Double. You have to extract the value from the DataFrame.

Check the code below.

scala> val max_version = df.groupBy().agg(max("project_version").as("version")).as[Double].collect.head
max_version: Double = 2.1

scala> val df_foo = Seq((2.0,20210105,187234),(2.0,20210110,188356),(2.1,20210201,188820)).toDF("project_version","dt","count")
df_foo: org.apache.spark.sql.DataFrame = [project_version: double, dt: int ... 1 more field]

scala> val max_version = df_foo.groupBy().agg(max("project_version").as("version")).as[Double].collect.head
max_version: Double = 2.1

scala> val df_foo_latest = df_foo.filter($"project_version" === max_version).count()
df_foo_latest: Long = 1

scala> val df_foo_latest = df_foo.filter($"project_version" === max_version)
df_foo_latest: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [project_version: double, dt: int ... 1 more field]

scala> df_foo_latest.count
res1: Long = 1

scala> df_foo_latest.show(false)
+---------------+--------+------+
|project_version|dt      |count |
+---------------+--------+------+
|2.1            |20210201|188820|
+---------------+--------+------+
Instead of extracting the value from the DataFrame, try an inner join. It is safer: the max stays inside the query plan and nothing has to be collected to the driver.

scala> val max_version = df_foo.groupBy().max("project_version")
max_version: org.apache.spark.sql.DataFrame = [max(project_version): double]

scala> // rename the aggregated column so it matches df_foo's project_version and the join can use the column name
scala> val max_version = df_foo.groupBy().agg(max("project_version").as("project_version"))

scala> val df_foo_latest = df_foo.join(max_version, Seq("project_version"), "inner")


scala> df_foo_latest.show(false)
+---------------+--------+------+
|project_version|dt      |count |
+---------------+--------+------+
|2.1            |20210201|188820|
+---------------+--------+------+

Thanks @srinivas, that makes sense. In the last line of code, what is the purpose of Seq("project_version")? Does it let you specify the join column that the two DataFrames have in common? Thanks

When the join column has the same name in both DataFrames, you can specify it as Seq("project_version").
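
To make the difference concrete, here is a minimal sketch reusing df_foo and max_version from the example above (the byName and byExpr names are just illustrative), contrasting a join on a shared column name with a join on an explicit expression:

// Joining on the column name keeps a single project_version column in the result:
val byName = df_foo.join(max_version, Seq("project_version"), "inner")

// Joining on an explicit expression keeps both project_version columns,
// one from each side, which then have to be disambiguated or dropped:
val byExpr = df_foo.join(max_version,
  df_foo("project_version") === max_version("project_version"), "inner")

Because max_version has exactly one row, both joins act as a filter that keeps only the rows of df_foo whose project_version equals the maximum.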