Pyspark-SQL: using a case-when statement

sql apache-spark pyspark

I have a DataFrame like this:

>>> df_w_cluster.select('high_income', 'aml_cluster_id').show(10)
+-----------+--------------+
|high_income|aml_cluster_id|
+-----------+--------------+
|          0|             0|
|          0|             0|
|          0|             1|
|          0|             1|
|          0|             0|
|          0|             0|
|          0|             1|
|          1|             1|
|          1|             0|
|          1|             0|
+-----------+--------------+
only showing top 10 rows

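The high_income column is a binary column that holds either 0 or 1. The aml_cluster_id column holds values from 0 to 3. I want to create a new column whose value depends on the values of high_income and aml_cluster_id in that particular row. I am trying to do this with SQL.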

df_w_cluster.createTempView('event_rate_holder')

To do this, I wrote the following query:

q = """select * , case 
 when "aml_cluster_id" = 0 and  "high_income" = 1 then "high_income_encoded" = 0.162 else 
 when "aml_cluster_id" = 0 and  "high_income" = 0 then "high_income_encoded" = 0.337 else 
 when "aml_cluster_id" = 1 and  "high_income" = 1 then "high_income_encoded" = 0.049 else 
 when "aml_cluster_id" = 1 and  "high_income" = 0 then "high_income_encoded" = 0.402 else 
 when "aml_cluster_id" = 2 and  "high_income" = 1 then "high_income_encoded" = 0.005 else 
 when "aml_cluster_id" = 2 and  "high_income" = 0 then "high_income_encoded" = 0.0 else 
 when "aml_cluster_id" = 3 and  "high_income" = 1 then "high_income_encoded" = 0.023 else 
 when "aml_cluster_id" = 3 and  "high_income" = 0 then "high_income_encoded" = 0.022 else 
 from event_rate_holder"""

When I run it with Spark

spark.sql(q)

I get the following error:

mismatched input 'aml_cluster_id' expecting <EOF>(line 1, pos 22)

Even without the quotes around the column names, I still get an error:

== SQL ==
select * , case 
when aml_cluster_id = 0 and  high_income = 1 then high_income_encoded = 0.162 else 
-----^^^

followed by

pyspark.sql.utils.ParseException: "\nmismatched input 'aml_cluster_id' expecting <EOF>(line 2, pos 5)\n\n== SQL ==\nselect * ,

The correct syntax for the CASE variant you are using is

CASE  
   WHEN e1 THEN e2 [ ...n ]   
   [ ELSE else_result_expression ]   
END

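So:

THEN should be followed by an expression. There is no place here for name=something.

ELSE is allowed once per CASE, not after every WHEN.

Your original code is missing the closing END.

Finally, the column names should not be quoted. By default Spark SQL reads a double-quoted name as a string literal, not as a column reference.
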

You probably meant:

CASE
   WHEN aml_cluster_id = 0 AND high_income = 1 THEN 0.162
   WHEN aml_cluster_id = 0 AND high_income = 0 THEN 0.337
   ...
END AS high_income_encoded
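
As a rough sketch of how the single flat CASE described in this answer could be written out in full, the snippet below renders the query string from a lookup dictionary. The eight values are copied from the question. The helper name `build_case_query` and the `RATES` dictionary are hypothetical, introduced only for illustration.

```python
# (aml_cluster_id, high_income) -> encoded value, copied from the question.
RATES = {
    (0, 1): 0.162, (0, 0): 0.337,
    (1, 1): 0.049, (1, 0): 0.402,
    (2, 1): 0.005, (2, 0): 0.0,
    (3, 1): 0.023, (3, 0): 0.022,
}


def build_case_query(rates, view="event_rate_holder", alias="high_income_encoded"):
    """Hypothetical helper: render one flat CASE expression from a lookup dict."""
    # One WHEN ... THEN ... branch per dictionary entry, with no ELSE in between.
    branches = "\n".join(
        f"    WHEN aml_cluster_id = {cluster} AND high_income = {income} THEN {value}"
        for (cluster, income), value in rates.items()
    )
    # A single END closes the CASE, and the alias names the new column.
    return f"SELECT *,\n  CASE\n{branches}\n  END AS {alias}\nFROM {view}"


q = build_case_query(RATES)
print(q)
```

The resulting string would then be passed to spark.sql(q) as in the question. Rows that match none of the branches get NULL, since no ELSE is given.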
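If you keep an else after every when, each when needs its own case ... end. The column names should be wrapped in backticks (`) rather than double quotes, and the high_income_encoded column name should be given as an alias at the end. So a working query looks like this: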
q = """select * ,
case when `aml_cluster_id` = 0 and  `high_income` = 1 then 0.162 else
  case when `aml_cluster_id` = 0 and  `high_income` = 0 then 0.337 else
    case when `aml_cluster_id` = 1 and  `high_income` = 1 then 0.049 else
      case when `aml_cluster_id` = 1 and  `high_income` = 0 then 0.402 else
        case when `aml_cluster_id` = 2 and  `high_income` = 1 then 0.005 else
          case when `aml_cluster_id` = 2 and  `high_income` = 0 then 0.0 else
            case when `aml_cluster_id` = 3 and  `high_income` = 1 then 0.023 else
              case when `aml_cluster_id` = 3 and  `high_income` = 0 then 0.022
              end
            end
          end
        end
      end
    end
  end
end as `high_income_encoded`
from event_rate_holder"""
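
The nested form above and a single flat CASE should produce the same column. The following is a minimal sketch of that check. It assumes SQLite (Python's built-in sqlite3 module) as a stand-in engine rather than Spark, and it uses only the branches for clusters 0 and 1 to keep the example short. It illustrates the CASE semantics only and says nothing about Spark-specific behaviour.

```python
import sqlite3

# Stand-in engine: SQLite instead of Spark, used only to compare CASE semantics.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event_rate_holder (high_income INTEGER, aml_cluster_id INTEGER)"
)
conn.executemany(
    "INSERT INTO event_rate_holder VALUES (?, ?)",
    [(0, 0), (0, 1), (1, 1), (1, 0)],
)

# Nested form: every ELSE opens a new CASE, so every CASE needs its own END.
nested = """
SELECT high_income, aml_cluster_id,
  CASE WHEN aml_cluster_id = 0 AND high_income = 1 THEN 0.162 ELSE
    CASE WHEN aml_cluster_id = 0 AND high_income = 0 THEN 0.337 ELSE
      CASE WHEN aml_cluster_id = 1 AND high_income = 1 THEN 0.049 ELSE
        CASE WHEN aml_cluster_id = 1 AND high_income = 0 THEN 0.402
        END
      END
    END
  END AS high_income_encoded
FROM event_rate_holder
ORDER BY aml_cluster_id, high_income"""

# Flat form: one CASE, several WHEN branches, one END.
flat = """
SELECT high_income, aml_cluster_id,
  CASE
    WHEN aml_cluster_id = 0 AND high_income = 1 THEN 0.162
    WHEN aml_cluster_id = 0 AND high_income = 0 THEN 0.337
    WHEN aml_cluster_id = 1 AND high_income = 1 THEN 0.049
    WHEN aml_cluster_id = 1 AND high_income = 0 THEN 0.402
  END AS high_income_encoded
FROM event_rate_holder
ORDER BY aml_cluster_id, high_income"""

nested_rows = conn.execute(nested).fetchall()
flat_rows = conn.execute(flat).fetchall()
print(nested_rows == flat_rows)
```

The flat form is usually easier to read and extend, because adding a branch does not require adding another END.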


Unrelated question: how did you import the data?
Are there newlines in your query string? Try doing q = q.replace("\n", "") before executing the query.
Was the answer helpful?
I tried the answer given by Ramesh and it worked great! Thanks.