SQL logic to generate a new column value based on matching fields
Tags: sql, apache-spark-sql, hiveql

If the same customer has multiple tax types, I should mark it as 'n'; if the same customer has the same tax type throughout, I should mark it as 'y' in the valid column. Can anyone suggest an optimized Spark SQL, or standard SQL (so that I can convert it to a Spark SQL query), for this scenario?
Use case and window functions:
Output:
CUST  TAX_TYPE  VALID
a     TIN       n
a     TIN       n
a     SSN       n
b     TIN       y
b     TIN       y
b     TIN       y
c     SSN       n
c     SSN       n
c     null      n
The second condition verifies that there are no NULL values, since count(tax_type) skips NULLs while count(*) does not. Run the query below to get the desired output:
select t.*,
(case when min(TAX_TYPE) over (partition by cust) = max(tax_type) over (partition by cust) and
count(*) over (partition by cust) = count(tax_type) over (partition by cust)
then 'y' else 'n'
end) as valid
from t;
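As a side note, comparing min and max over the partition is a common stand-in for checking that the distinct count is 1, because many engines do not support COUNT(DISTINCT ...) as a window function; the count(*) vs count(tax_type) comparison catches NULLs, since count(col) skips them. A minimal pure-Scala sketch of the same per-customer rule (the name validFlag and the Option-based encoding of NULL are illustrative, not from the post):

```scala
// Mirrors the SQL: 'y' only when every row of a customer has the same
// non-null tax type; the size comparison plays the role of
// count(*) = count(tax_type), and distinct.size == 1 plays min = max.
def validFlag(rows: Seq[(String, Option[String])]): Seq[(String, Option[String], String)] = {
  val groups = rows.groupBy(_._1)
  rows.map { case (cust, tax) =>
    val group   = groups(cust)
    val nonNull = group.flatMap(_._2)  // drops NULLs, like count(tax_type)
    val valid =
      if (nonNull.size == group.size &&      // no NULLs in the group
          nonNull.distinct.size == 1) "y"    // exactly one tax type
      else "n"
    (cust, tax, valid)
  }
}
```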
It will avoid all the per-row min/max calculations. Thanks for the logic; it eliminates the join we are currently using to solve this.
scala> import org.apache.spark.sql.expressions.Window
scala> import org.apache.spark.sql.functions._
scala> val df = Seq(("a", "TIN"), ("a", "TIN"), ("a", "SSN"), ("b", "TIN"), ("b", "TIN"), ("b", "TIN"), ("c", "SSN"), ("c", "SSN"), ("c", "null")).toDF("cust", "tax_type")
scala> df.withColumn("valid", when(size(collect_set(col("tax_type")).over(Window.partitionBy(col("cust")))) > 1, "N").otherwise("Y")).orderBy("cust").show()
+----+--------+-----+
|cust|tax_type|valid|
+----+--------+-----+
| a| TIN| N|
| a| SSN| N|
| a| TIN| N|
| b| TIN| Y|
| b| TIN| Y|
| b| TIN| Y|
| c| SSN| N|
| c| SSN| N|
| c| null| N|
+----+--------+-----+
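One caveat: collect_set silently drops genuine (SQL) NULLs, so a customer with one tax type plus real NULLs would still come out "Y" with this approach; the demo works because "null" here is a literal string. The set-size rule itself can be sketched in plain Scala (the names below are illustrative):

```scala
// Set-size version of the rule: more than one distinct tax type per
// customer => "N". Mirrors size(collect_set(tax_type)) > 1 from the
// Spark example; "null" is a plain string, as in the demo data above.
val data = Seq(("a", "TIN"), ("a", "TIN"), ("a", "SSN"),
               ("b", "TIN"), ("b", "TIN"), ("b", "TIN"),
               ("c", "SSN"), ("c", "SSN"), ("c", "null"))

val distinctPerCust: Map[String, Int] =
  data.groupBy(_._1).map { case (cust, rows) => cust -> rows.map(_._2).toSet.size }

val flagged = data.map { case (cust, tax) =>
  (cust, tax, if (distinctPerCust(cust) > 1) "N" else "Y")
}
```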