
PySpark SQL: using a CASE WHEN statement

I have a DataFrame that looks like this:

>>> df_w_cluster.select('high_income', 'aml_cluster_id').show(10)
+-----------+--------------+
|high_income|aml_cluster_id|
+-----------+--------------+
|          0|             0|
|          0|             0|
|          0|             1|
|          0|             1|
|          0|             0|
|          0|             0|
|          0|             1|
|          1|             1|
|          1|             0|
|          1|             0|
+-----------+--------------+
only showing top 10 rows
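The high_income column is binary and holds either 0 or 1. The aml_cluster_id column holds values from 0 to 3. I want to create a new column whose value depends on the values of high_income and aml_cluster_id in that particular row. I am trying to do this with SQL: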

df_w_cluster.createTempView('event_rate_holder')
To achieve this, I wrote the following query:

q = """select * , case 
 when "aml_cluster_id" = 0 and  "high_income" = 1 then "high_income_encoded" = 0.162 else 
 when "aml_cluster_id" = 0 and  "high_income" = 0 then "high_income_encoded" = 0.337 else 
 when "aml_cluster_id" = 1 and  "high_income" = 1 then "high_income_encoded" = 0.049 else 
 when "aml_cluster_id" = 1 and  "high_income" = 0 then "high_income_encoded" = 0.402 else 
 when "aml_cluster_id" = 2 and  "high_income" = 1 then "high_income_encoded" = 0.005 else 
 when "aml_cluster_id" = 2 and  "high_income" = 0 then "high_income_encoded" = 0.0 else 
 when "aml_cluster_id" = 3 and  "high_income" = 1 then "high_income_encoded" = 0.023 else 
 when "aml_cluster_id" = 3 and  "high_income" = 0 then "high_income_encoded" = 0.022 else 
 from event_rate_holder"""
When I run it with Spark:

spark.sql(q)
I get the following error:

mismatched input 'aml_cluster_id' expecting <EOF>(line 1, pos 22)
Removing the quotes around the column names does not help; I still get an error:

== SQL ==
select * , case 
when aml_cluster_id = 0 and  high_income = 1 then high_income_encoded = 0.162 else 
-----^^^

pyspark.sql.utils.ParseException: "\nmismatched input 'aml_cluster_id' expecting <EOF>(line 2, pos 5)\n\n== SQL ==\nselect * ,

The correct syntax for the CASE variant you are using is:

CASE  
   WHEN e1 THEN e2 [ ...n ]   
   [ ELSE else_result_expression ]   
END  
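So:

  • THEN must be followed by an expression. There is no place for name = something there.
  • ELSE is allowed once per CASE, not after every WHEN.
  • Your original code is missing the closing END.
  • Finally, the columns should not be quoted. By default, Spark SQL treats a double-quoted token as a string literal, so "aml_cluster_id" = 0 compares a string with 0 rather than reading the column.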
You probably meant:

CASE
   WHEN aml_cluster_id = 0 AND high_income = 1 THEN 0.162
   WHEN aml_cluster_id = 0 AND high_income = 0 THEN 0.337
   ...
END AS high_income_encoded
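The flat CASE form described in this answer can be written out in full for all eight combinations. The sketch below builds the query string from a dictionary so that each (aml_cluster_id, high_income) pair appears exactly once. The helper name build_case_query and the EVENT_RATES dictionary are hypothetical and introduced only for illustration; the rates, column names, and view name are copied from the question.

```python
# Event rates keyed by (aml_cluster_id, high_income); values copied from the question.
EVENT_RATES = {
    (0, 1): 0.162, (0, 0): 0.337,
    (1, 1): 0.049, (1, 0): 0.402,
    (2, 1): 0.005, (2, 0): 0.0,
    (3, 1): 0.023, (3, 0): 0.022,
}


def build_case_query(rates, table="event_rate_holder", alias="high_income_encoded"):
    """Return a SELECT with a single flat CASE ... END expression (hypothetical helper)."""
    branches = "\n".join(
        f"    WHEN aml_cluster_id = {cluster} AND high_income = {income} THEN {rate}"
        for (cluster, income), rate in rates.items()
    )
    # No ELSE branch: rows that match no WHEN get NULL in the new column.
    return f"SELECT *,\n  CASE\n{branches}\n  END AS {alias}\nFROM {table}"


q = build_case_query(EVENT_RATES)
print(q)

# With an active SparkSession named `spark` and the temp view registered:
# df_encoded = spark.sql(q)
```

Because the query has one CASE and one END, there is nothing to nest, and adding a ninth combination only means adding a dictionary entry.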

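In this version, each WHEN condition gets its own CASE ... END, chained through ELSE. Column names should be wrapped in backticks (`) rather than double quotes, and the high_income_encoded column name should be given as an alias at the end. So a correct query looks like this: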

q = """select * ,
case when `aml_cluster_id` = 0 and  `high_income` = 1 then 0.162 else
  case when `aml_cluster_id` = 0 and  `high_income` = 0 then 0.337 else
    case when `aml_cluster_id` = 1 and  `high_income` = 1 then 0.049 else
      case when `aml_cluster_id` = 1 and  `high_income` = 0 then 0.402 else
        case when `aml_cluster_id` = 2 and  `high_income` = 1 then 0.005 else
          case when `aml_cluster_id` = 2 and  `high_income` = 0 then 0.0 else
            case when `aml_cluster_id` = 3 and  `high_income` = 1 then 0.023 else
              case when `aml_cluster_id` = 3 and  `high_income` = 0 then 0.022
              end
            end
          end
        end
      end
    end
  end
end as `high_income_encoded`
from event_rate_holder"""
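As an alternative to a SQL string, the same mapping can be expressed with the DataFrame API by chaining pyspark.sql.functions.when calls. This is only a sketch, not the method used in either answer: the function name add_high_income_encoded and the EVENT_RATES dictionary are hypothetical, and the code assumes PySpark is installed and that df has the aml_cluster_id and high_income columns from the question.

```python
# Event rates keyed by (aml_cluster_id, high_income); values copied from the question.
EVENT_RATES = {
    (0, 1): 0.162, (0, 0): 0.337,
    (1, 1): 0.049, (1, 0): 0.402,
    (2, 1): 0.005, (2, 0): 0.0,
    (3, 1): 0.023, (3, 0): 0.022,
}


def add_high_income_encoded(df, rates=EVENT_RATES):
    """Add a high_income_encoded column to a Spark DataFrame (hypothetical helper)."""
    # Imported lazily so this module still loads where PySpark is not installed.
    from pyspark.sql import functions as F

    def condition(cluster, income):
        return (F.col("aml_cluster_id") == cluster) & (F.col("high_income") == income)

    items = list(rates.items())
    (cluster, income), rate = items[0]
    expr = F.when(condition(cluster, income), rate)
    for (cluster, income), rate in items[1:]:
        # Column.when appends another WHEN branch to the same CASE expression.
        expr = expr.when(condition(cluster, income), rate)

    # No otherwise(): unmatched rows get NULL, like a CASE without ELSE.
    return df.withColumn("high_income_encoded", expr)


# Usage, assuming the DataFrame from the question:
# df_encoded = add_high_income_encoded(df_w_cluster)
```

This avoids the temp view and string quoting altogether, at the cost of leaving plain SQL.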

Unrelated question: how did you import the data? Are there newlines in the query string? Try running q = q.replace("\n", "") before executing the query.
Did the answer help?
I tried the answer given by Ramesh and it worked great!! Thanks.