PySpark: AnalysisException from withColumn after converting a column to lowercase

Tags: pyspark, apache-spark-sql

I have a dataframe that looks like this:

+------------+------+
|        food|pounds|
+------------+------+
|       bacon|   4.0|
|STRAWBERRIES|   3.5|
|       Bacon|   7.0|
|STRAWBERRIES|   3.0|
|       BACON|   6.0|
|strawberries|   9.0|
|Strawberries|   1.0|
|      pecans|   3.0|
+------------+------+
The expected output is:

+------------+------+---------+
|        food|pounds|food_type|
+------------+------+---------+
|       bacon|   4.0|     meat|
|STRAWBERRIES|   3.5|    fruit|
|       Bacon|   7.0|     meat|
|STRAWBERRIES|   3.0|    fruit|
|       BACON|   6.0|     meat|
|strawberries|   9.0|    fruit|
|Strawberries|   1.0|    fruit|
|      pecans|   3.0|    other|
+------------+------+---------+
So I defined new_column based on my logic and applied it with .withColumn:

new_column = when((col('food') == 'bacon') | (col('food') == 'BACON') | (col('food') == 'Bacon'), 'meat'
                   ).when((col('food') == 'STRAWBERRIES') | (col('food') == 'strawberries') | (col('food') == 'Strawberries'), 'fruit'
                   ).otherwise('other')
Then:

df.withColumn("food_type", new_column).show()
This works fine, but I wanted to shorten the new_column statement, so I rewrote it as follows:

new_column = when(lower(col('food') == 'bacon') , 'meat'
                   ).when(lower(col('food') == 'strawberries'), 'fruit'
                   ).otherwise('other')
Now when I run

df.withColumn("food_type", new_column).show()

I get an error:

AnalysisException: "cannot resolve 'CASE WHEN lower(CAST((`food` = 'bacon') AS STRING)) THEN 'meat' WHEN lower(CAST((`food` = 'strawberries') AS STRING)) THEN 'fruit' ELSE 'other' END' due to data type mismatch: WHEN expressions in CaseWhen should all be boolean type, but the 1th when expression's type is lower(cast((food#165 = bacon) as string));;\n'Project [food#165, pounds#166, CASE WHEN lower(cast((food#165 = bacon) as string)) THEN meat WHEN lower(cast((food#165 = strawberries) as string)) THEN fruit ELSE other END AS food_type#197]\n+- Relation[food#165,pounds#166] csv\n"

What am I missing?

Your parentheses are mismatched: lower is being applied to the boolean comparison col('food') == 'bacon' instead of to the column itself. It should be:

new_column = when(lower(col('food')) == 'bacon', 'meat'
                   ).when(lower(col('food')) == 'strawberries', 'fruit'
                   ).otherwise('other')

Simplified:

new_column = when(lower(col('food')) == 'bacon', 'meat').when(lower(col('food')) == 'strawberries', 'fruit').otherwise('other')

df.withColumn("food_type", new_column).show()
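To sanity-check the branch logic without a Spark session, here is a plain-Python analogue of the corrected when/otherwise chain (the helper name food_type is hypothetical, just for illustration). The point is that each condition is a real boolean, which is exactly what Spark's CaseWhen requires:

```python
def food_type(food: str) -> str:
    """Plain-Python analogue of the corrected when/otherwise chain."""
    f = food.lower()            # lower(col('food'))
    if f == 'bacon':            # when(lower(col('food')) == 'bacon', 'meat')
        return 'meat'
    if f == 'strawberries':     # .when(lower(col('food')) == 'strawberries', 'fruit')
        return 'fruit'
    return 'other'              # .otherwise('other')

for food in ['bacon', 'STRAWBERRIES', 'Bacon', 'pecans']:
    print(food, '->', food_type(food))
# bacon -> meat, STRAWBERRIES -> fruit, Bacon -> meat, pecans -> other
```

In the broken version, lower(col('food') == 'bacon') is like calling (food == 'bacon').lower() here: lower-casing a boolean, which Spark handles by casting it to a string, hence the "should all be boolean type" error.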


I'd like to share another approach, which reads more like a SQL query and also scales better to more complex, nested conditions.

from pyspark.sql.functions import *
cond = """case when lower(food) in ('bacon') then 'meat'
            else case when lower(food) in ('strawberries') then 'fruit'
                 else 'other'
                end
            end"""

newdf = df.withColumn("food_type", expr(cond))
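As a side note, the nested else case ... end above can be flattened into a single CASE with multiple WHEN branches; SQL evaluates them top to bottom. A minimal sketch, assuming the same df and expr as above (the Spark calls are left commented so the string itself is the focus):

```python
# from pyspark.sql.functions import expr   # as imported in the answer above

# Flattened version of the nested CASE: one CASE, two WHEN branches.
cond = """case when lower(food) in ('bacon') then 'meat'
               when lower(food) in ('strawberries') then 'fruit'
               else 'other'
          end"""

# newdf = df.withColumn("food_type", expr(cond))  # same rows as the nested form
print(cond)
```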
Hope it helps.

Regards,

Neeraj
