Add a new column to a DataFrame based on the values in 2 other columns (PySpark required)


I want to add a column named "joint_pred_x" (x = 0, 1, 2) based on the pair of values in "nb_pred_x" and "svm_pred_x": if nb = 1 and svm = 1, the new value should be 0; if nb = 1 and svm = 0, it should be 1; if nb = 0 and svm = 1, it should be 2; and if nb = 0 and svm = 0, it should be 3.
I think withColumn can do the job, but I'm confused about the conditional logic. The solution needs to use only PySpark. Thanks in advance.

You can use a CASE statement:

+---------+---------+---------+----------+----------+----------+
|nb_pred_0|nb_pred_1|nb_pred_2|svm_pred_0|svm_pred_1|svm_pred_2|
+---------+---------+---------+----------+----------+----------+
|0.0      |1.0      |0.0      |0.0       |1.0       |0.0       |
+---------+---------+---------+----------+----------+----------+


from pyspark.sql.functions import expr

# Prefixes of the two prediction columns being combined.
p1, p2 = 'nb', 'svm'

for i in range(0, 3):

    index = str(i)

    # Encode each (nb, svm) pair as a value from 0 to 3
    # with a SQL CASE expression.
    df = df.withColumn('joint_pred_' + index, expr(f'''
            CASE
                WHEN {p1}_pred_{index} == 1 AND {p2}_pred_{index} == 1 THEN 0
                WHEN {p1}_pred_{index} == 1 AND {p2}_pred_{index} == 0 THEN 1
                WHEN {p1}_pred_{index} == 0 AND {p2}_pred_{index} == 1 THEN 2
                WHEN {p1}_pred_{index} == 0 AND {p2}_pred_{index} == 0 THEN 3
            END
        '''))

df.show(10, False)

+---------+---------+---------+----------+----------+----------+------------+------------+------------+
|nb_pred_0|nb_pred_1|nb_pred_2|svm_pred_0|svm_pred_1|svm_pred_2|joint_pred_0|joint_pred_1|joint_pred_2|
+---------+---------+---------+----------+----------+----------+------------+------------+------------+
|0.0      |1.0      |0.0      |0.0       |1.0       |0.0       |3           |0           |3           |
+---------+---------+---------+----------+----------+----------+------------+------------+------------+
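Since the four cases form a 2-bit code (nb is the high bit, svm the low bit, with 1 mapping to 0 and 0 mapping to 1), the same result can also be computed arithmetically instead of with a CASE. A minimal plain-Python sketch of that identity (the helper name `joint_code` is illustrative, not part of any API):

```python
def joint_code(nb: int, svm: int) -> int:
    # nb contributes the high bit, svm the low bit;
    # 1 maps to 0 and 0 maps to 1 in each position.
    return 2 * (1 - nb) + (1 - svm)

# The four cases from the question:
print(joint_code(1, 1))  # 0
print(joint_code(1, 0))  # 1
print(joint_code(0, 1))  # 2
print(joint_code(0, 0))  # 3
```

In PySpark the same formula could replace the CASE expression, e.g. `expr(f'int(2 * (1 - {p1}_pred_{index}) + (1 - {p2}_pred_{index}))')`, assuming the prediction columns only ever hold 0 or 1.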

Thank you very much, this answers my question perfectly.