Python: add a column based on conditions met in other columns

python pyspark


I am new to PySpark and am stuck on the following problem. I have a Spark DataFrame df that looks like this:

DeviceID     max(A)    max(B)    max(INUT)
0023002      2.5       3.7       8.1
0023045      2.2       1.3       11.3
0023008      4.7       2.3       1.9

How can I add another column named "Status" whose values are based on the following logic?

if 0.20 * max(INUT) > max(max(A),max(B)) then Status = 'Imbalance' else 'Balance'

The logic above is expected to produce the following DataFrame:

DeviceID     max(A)    max(B)    max(INUT)    Status
0023002      2.5       3.7       8.1          'Balance'
0023045      2.2       1.3       11.3         'Imbalance'
0023008      4.7       2.3       1.9          'Balance'
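The expected Status values can be checked by hand before involving Spark at all. The following is a minimal plain-Python sketch of the rule stated above, applied to the three sample rows; the function name status and the variable names are illustrative only and are not part of the question.

```python
# Rows copied from the question: (DeviceID, max(A), max(B), max(INUT))
rows = [
    ("0023002", 2.5, 3.7, 8.1),
    ("0023045", 2.2, 1.3, 11.3),
    ("0023008", 4.7, 2.3, 1.9),
]


def status(max_a, max_b, max_inut):
    """Return 'Imbalance' when 20% of max(INUT) exceeds the larger of max(A) and max(B)."""
    return "Imbalance" if 0.20 * max_inut > max(max_a, max_b) else "Balance"


# Map each device to its computed status.
statuses = {device: status(a, b, inut) for device, a, b, inut in rows}

for device, value in statuses.items():
    print(device, value)
```

Only the second device satisfies the condition (0.20 * 11.3 = 2.26, which is greater than 2.2), so it is the only row marked Imbalance.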

To produce the df above, this is the code I am using:

from pyspark.sql.function import col
import pyspark.sql.function as F
df_final = df.withColumn(
             'Status',
             F.when(col('max(INUT)')*0.20 > F.greatest(col('max(A)'),col('max(B)'),
             'Imbalance')\
         .otherwise('Balance')

The snippet above raises the following error:

AttributeError: 'tuple' object has no attribute 'otherwise'

What am I missing? Any hints would be greatly appreciated.

There are a few small syntax errors. Here is the corrected code:

import pyspark.sql.functions as F

df = spark.createDataFrame(
[("0023002", 2.5, 3.7, 8.1),
("0023045", 2.2, 1.3, 11.3),
("0023008", 4.7, 2.3, 1.9)], ["DeviceID", "max_A", "max_B", "max_INUT"])

df_final = df.withColumn('Status', \
             F.when(F.col('max_INUT')*0.20 > F.greatest(F.col('max_A'),F.col('max_B')), 'Imbalance') \
         .otherwise('Balance'))

And a few comments:

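To use the functions in pyspark.sql.functions (note the plural; the question imports pyspark.sql.function), just use the F alias. You do not need to import the module twice.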

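Some parentheses were missing: the F.greatest(...) call was never closed before 'Imbalance', and the closing parenthesis of withColumn(...) was absent.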

I also renamed max(A) to max_A (and likewise for the other columns), because I think it is easier to read.

Output:

+--------+-----+-----+--------+---------+
|DeviceID|max_A|max_B|max_INUT|   Status|
+--------+-----+-----+--------+---------+
| 0023002|  2.5|  3.7|     8.1|  Balance|
| 0023045|  2.2|  1.3|    11.3|Imbalance|
| 0023008|  4.7|  2.3|     1.9|  Balance|
+--------+-----+-----+--------+---------+

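Missing ) at the end?

Oh, that was a typo. Even when I end it with ) I still get this error.

Thanks. Regarding point 3: is it necessary to rename the column from max(A) to max_A? I ask because the column names are derived from another DataFrame through an iterative process.

No, it just confused me because it looks like a function call :). That said, you can always rename columns with withColumnRenamed on the DataFrame or alias on a Column.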
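Since the column names come out of an iterative process, renaming them one at a time may be tedious. The following is a minimal sketch of one possible approach, assuming every name should be reduced to letters, digits and underscores; sanitize and rename_all are hypothetical helper names, not part of PySpark. The only PySpark call used is DataFrame.toDF(*cols), which returns a new DataFrame with the given column names.

```python
import re


def sanitize(name):
    """Replace each run of non-alphanumeric characters with '_' and trim the ends."""
    return re.sub(r"[^0-9A-Za-z]+", "_", name).strip("_")


def rename_all(df):
    """Return a new PySpark DataFrame with every column name sanitized.

    Hypothetical helper: df is assumed to be a pyspark.sql.DataFrame.
    """
    # DataFrame.toDF(*cols) returns a new DataFrame with the given column names.
    return df.toDF(*[sanitize(c) for c in df.columns])


# The column names from the question, checked without needing a Spark session.
original = ["DeviceID", "max(A)", "max(B)", "max(INUT)"]
renamed = [sanitize(c) for c in original]
print(renamed)
```

With a real DataFrame this would be used as df = rename_all(df) before building the Status column. Note that two different names could collapse to the same sanitized name, so it may be worth checking for duplicates first.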