Apache Spark: Converting a SQL query to a Spark DataFrame


I want to convert the query below to a Spark DataFrame (I am very new to Spark):

-- Create the group number

select distinct *, DENSE_RANK() OVER(ORDER BY person_id, trust_id) AS group_number;

-- This is what I have so far

df = self.spark.sql("select person_id, trust_id, insurance_id, amount, time_of_app, place_of_app from {}".format(self.tables['people']))

df = df.withColumn("group_number", dense_rank().over(Window.partitionBy("person_id", "trust_id").orderBy("person_id", "trust_id")))

-- Distinct query 1

where group_number in (select group_number from etl_table_people where code like 'H%') group by group_number having count(distinct amount) > 1;

-- Distinct query 2

where insurance_id = 'V94.12'
group by group_number having count(distinct amount) = 2;

What you are looking for is Spark's window specification feature.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, dense_rank}

val windowSpec = Window.partitionBy("person_id", "trust_id").orderBy(col("person_id").desc, col("trust_id").desc)

df.withColumn("group_number", dense_rank() over windowSpec)

You can obtain the DataFrame with Spark according to your data source; if your source is Hive, refer to the Hive-specific documentation.

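Do you have any Spark SQL code to show?

This is what I have so far, but I am not sure whether it is correct: df = df.withColumn("group_number", dense_rank().over(Window.partitionBy("person_id", "trust_id").orderBy("person_id", "trust_id"))). I want it in a PySpark DataFrame; I am currently using the one from the pyspark.sql package/library.

Please edit your question to include the formatted code. And pyspark.sql is the only way to create a DataFrame. Why not print it out and see whether it is correct? Also, you can type raw SQL in Spark. You do not have to use any functions.

Thank you! Do you have any suggestions for the other two queries? That is where I am mainly confused.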