pyspark: withColumn AnalysisException after converting a column to lowercase
I have a dataframe as shown below
+------------+------+
| food|pounds|
+------------+------+
| bacon| 4.0|
|STRAWBERRIES| 3.5|
| Bacon| 7.0|
|STRAWBERRIES| 3.0|
| BACON| 6.0|
|strawberries| 9.0|
|Strawberries| 1.0|
| pecans| 3.0|
+------------+------+
The expected output is
+------------+------+---------+
| food|pounds|food_type|
+------------+------+---------+
| bacon| 4.0| meat|
|STRAWBERRIES| 3.5| fruit|
| Bacon| 7.0| meat|
|STRAWBERRIES| 3.0| fruit|
| BACON| 6.0| meat|
|strawberries| 9.0| fruit|
|Strawberries| 1.0| fruit|
| pecans| 3.0| other|
+------------+------+---------+
So I defined a new_column based on my logic and applied it with .withColumn:
new_column = when((col('food') == 'bacon') | (col('food') == 'BACON') | (col('food') == 'Bacon'), 'meat'
).when((col('food') == 'STRAWBERRIES') | (col('food') == 'strawberries') | (col('food') == 'Strawberries'), 'fruit'
).otherwise('other')
Then
df.withColumn("food_type", new_column).show()
This works fine. But I wanted to shorten the new_column statement, so I rewrote it as follows:
new_column = when(lower(col('food') == 'bacon') , 'meat'
).when(lower(col('food') == 'strawberries'), 'fruit'
).otherwise('other')
Now when I run df.withColumn("food_type", new_column).show()
I get an error:
AnalysisException: "cannot resolve 'CASE WHEN lower(CAST((`food` = 'bacon') AS STRING)) THEN 'meat' WHEN lower(CAST((`food` = 'strawberries') AS STRING)) THEN 'fruit' ELSE 'other' END' due to data type mismatch: WHEN expressions in CaseWhen should all be boolean type, but the 1th when expression's type is lower(cast((food#165 = bacon) as string));;\n'Project [food#165, pounds#166, CASE WHEN lower(cast((food#165 = bacon) as string)) THEN meat WHEN lower(cast((food#165 = strawberries) as string)) THEN fruit ELSE other END AS food_type#197]\n+- Relation[food#165,pounds#166] csv\n"
What am I missing?
Your parentheses don't match: lower() should wrap only the column, so the comparison outside it yields a boolean.
new_column = when(lower(col('food')) == 'bacon', 'meat').when(lower(col('food')) == 'strawberries', 'fruit').otherwise('other')
Simplified:
new_column = when(lower(col('food')) == 'bacon', 'meat').when(lower(col('food')) == 'strawberries', 'fruit').otherwise('other')
df.withColumn("food_type", new_column).show()
I'd like to share another approach, which is closer to a SQL query and also works well for more complex, nested conditions:
from pyspark.sql.functions import *
cond = """case when lower(food) in ('bacon') then 'meat'
else case when lower(food) in ('strawberries') then 'fruit'
else 'other'
end
end"""
newdf = df.withColumn("food_type", expr(cond))
Hope it helps.
Regards,
Neeraj