SQL logic to generate a new column-based match-flag value


If the same customer has more than one tax type, it should be flagged 'n'; if the same customer has a single, identical tax type throughout, it should be flagged 'y' in a VALID column. Can anyone suggest an optimized Spark SQL (or standard SQL that I can convert to a Spark SQL query) for this scenario, using
case
and window functions:

Output:
CUST TAX_TYPE VALID 
a      TIN     n
a      TIN     n
a      SSN     n
b      TIN     y
b      TIN     y 
b      TIN     y
c      SSN     n
c      SSN     n
c      null    n

The second condition is to verify that there are no
NULL
values.

Use
case
with window functions to get the desired output:

select t.*,
       (case when min(TAX_TYPE) over (partition by cust) = max(tax_type) over (partition by cust) and
                  count(*) over (partition by cust) = count(tax_type) over (partition by cust)
             then 'y' else 'n'
        end) as valid
from t;
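The rule the query encodes (one distinct tax type per customer and no NULLs) can be sketched outside of Spark as well. A minimal plain-Scala illustration, where the object name `ValidCheck` and its helper are hypothetical, not part of the question:

```scala
// Plain-Scala sketch of the validity rule: a customer is 'y' only
// when every row has the same, non-null tax type.
object ValidCheck {
  def valid(rows: Seq[(String, Option[String])]): Map[String, String] =
    rows.groupBy(_._1).map { case (cust, rs) =>
      val types = rs.map(_._2)
      // forall(_.isDefined) mirrors count(*) = count(tax_type);
      // distinct.size == 1 mirrors min(tax_type) = max(tax_type)
      val ok = types.forall(_.isDefined) && types.distinct.size == 1
      cust -> (if (ok) "y" else "n")
    }

  def main(args: Array[String]): Unit = {
    val data = Seq(
      ("a", Some("TIN")), ("a", Some("TIN")), ("a", Some("SSN")),
      ("b", Some("TIN")), ("b", Some("TIN")), ("b", Some("TIN")),
      ("c", Some("SSN")), ("c", Some("SSN")), ("c", None)
    )
    println(ValidCheck.valid(data)) // a -> n, b -> y, c -> n
  }
}
```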

It avoids repeating all of the min/max calculations for each row.

Thanks for the logic; it eliminates the join we are currently using to solve this problem.
scala> import org.apache.spark.sql.expressions.Window

scala> import org.apache.spark.sql.functions._

scala> val df = Seq(("a", "TIN"), ("a", "TIN"), ("a", "SSN"),
     |              ("b", "TIN"), ("b", "TIN"), ("b", "TIN"),
     |              ("c", "SSN"), ("c", "SSN"), ("c", "null"))
     |   .toDF("cust", "tax_type")

scala> df.withColumn("valid",
     |     when(size(collect_set(col("tax_type"))
     |       .over(Window.partitionBy(col("cust")))) > 1, "N")
     |     .otherwise("Y"))
     |   .orderBy("cust").show()
+----+--------+-----+
|cust|tax_type|valid|
+----+--------+-----+
|   a|     TIN|    N|
|   a|     SSN|    N|
|   a|     TIN|    N|
|   b|     TIN|    Y|
|   b|     TIN|    Y|
|   b|     TIN|    Y|
|   c|     SSN|    N|
|   c|     SSN|    N|
|   c|    null|    N|
+----+--------+-----+
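One caveat: the sample data stores the literal string "null", not a real SQL NULL, which is why customer c is caught by the size check. `collect_set` silently drops real NULLs, so with true NULLs the size check alone could mark c as valid. A minimal sketch of a variant that also counts real NULLs, assuming a local SparkSession (the object name `TaxTypeValid` is illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object TaxTypeValid {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]").appName("TaxTypeValid").getOrCreate()
    import spark.implicits._

    // Option[String] so that customer c carries a real SQL NULL.
    val df = Seq(("a", Some("TIN")), ("a", Some("TIN")), ("a", Some("SSN")),
                 ("b", Some("TIN")), ("b", Some("TIN")), ("b", Some("TIN")),
                 ("c", Some("SSN")), ("c", Some("SSN")), ("c", None))
      .toDF("cust", "tax_type")

    val w = Window.partitionBy(col("cust"))
    df.withColumn("valid",
        when(size(collect_set(col("tax_type")).over(w)) > 1 ||              // more than one distinct type
             sum(when(col("tax_type").isNull, 1).otherwise(0)).over(w) > 0, // or any NULL in the group
             "n")
          .otherwise("y"))
      .orderBy("cust")
      .show()

    spark.stop()
  }
}
```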