Add a new column to a DataFrame based on the values in two other columns (PySpark required)
I want to add a column named "joint_pred_x" (x = 0, 1, 2) based on the two values in "nb_pred_x" and "svm_pred_x": 0 if nb=1 and svm=1; 1 if nb=1 and svm=0; 2 if nb=0 and svm=1; 3 if nb=0 and svm=0.
I think withColumn can do the job, but I am confused about the conditional logic. The solution only needs to use PySpark. Thanks in advance.

You can use a CASE statement:
+---------+---------+---------+----------+----------+----------+
|nb_pred_0|nb_pred_1|nb_pred_2|svm_pred_0|svm_pred_1|svm_pred_2|
+---------+---------+---------+----------+----------+----------+
|0.0      |1.0      |0.0      |0.0       |1.0       |0.0       |
+---------+---------+---------+----------+----------+----------+
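from pyspark.sql.functions import expr

# Column-name prefixes of the two classifiers (left undefined in the original snippet)
p1, p2 = 'nb', 'svm'

for i in range(0, 3):
    index = str(i)
    # Derive joint_pred_<i> from nb_pred_<i> and svm_pred_<i>
    df = df.withColumn('joint_pred_' + index, expr(f'''
        CASE
            WHEN {p1}_pred_{index} == 1 AND {p2}_pred_{index} == 1 THEN 0
            WHEN {p1}_pred_{index} == 1 AND {p2}_pred_{index} == 0 THEN 1
            WHEN {p1}_pred_{index} == 0 AND {p2}_pred_{index} == 1 THEN 2
            WHEN {p1}_pred_{index} == 0 AND {p2}_pred_{index} == 0 THEN 3
        END
    '''))

# Show up to 10 rows without truncating cell contents
df.show(10, False)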
+---------+---------+---------+----------+----------+----------+------------+------------+------------+
|nb_pred_0|nb_pred_1|nb_pred_2|svm_pred_0|svm_pred_1|svm_pred_2|joint_pred_0|joint_pred_1|joint_pred_2|
+---------+---------+---------+----------+----------+----------+------------+------------+------------+
|0.0      |1.0      |0.0      |0.0       |1.0       |0.0       |3           |0           |3           |
+---------+---------+---------+----------+----------+----------+------------+------------+------------+
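If you would rather avoid building a SQL string, the same mapping can probably be written with the Column API using when. This is only a sketch: the helper names joint_code and add_joint_columns are hypothetical and not part of the answer above, and the prefixes nb and svm are taken from the question. The plain-Python joint_code just spells out the truth table so the logic can be checked without a Spark session.

```python
def joint_code(nb, svm):
    """Plain-Python version of the mapping from the question (hypothetical helper)."""
    table = {(1, 1): 0, (1, 0): 1, (0, 1): 2, (0, 0): 3}
    # Any other combination returns None, like a CASE expression without ELSE.
    # Floats such as 1.0 match the integer keys because 1.0 == 1 in Python.
    return table.get((nb, svm))


def add_joint_columns(df, n=3, p1="nb", p2="svm"):
    """Add joint_pred_0 .. joint_pred_{n-1} to a Spark DataFrame (sketch)."""
    # Imported lazily so joint_code stays usable without Spark installed.
    from pyspark.sql import functions as F

    for i in range(n):
        a = F.col(f"{p1}_pred_{i}")
        b = F.col(f"{p2}_pred_{i}")
        df = df.withColumn(
            f"joint_pred_{i}",
            F.when((a == 1) & (b == 1), 0)
            .when((a == 1) & (b == 0), 1)
            .when((a == 0) & (b == 1), 2)
            .when((a == 0) & (b == 0), 3),  # no otherwise(): unmatched rows become null
        )
    return df
```

Calling df = add_joint_columns(df) on the example row is expected to give the same joint_pred values as the CASE version (3, 0, 3). Note the parentheses around each comparison: & binds more tightly than == in Python, so they are required.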
Thank you very much, this answers my question perfectly.