How do I change values in a PySpark dataframe based on a condition on the same column?

Consider an example dataframe:

df = 
+-------+-----+
|   tech|state|
+-------+-----+
|     70|wa   |
|     50|mn   |
|     20|fl   |
|     50|mo   |
|     10|ar   |
|     90|wi   |
|     30|al   |
|     50|ca   |
+-------+-----+
I want to change the "tech" column so that every value of 50 becomes 1 and all other values become 0.

The output would look like this:

df = 
+-------+-----+
|   tech|state|
+-------+-----+
|     0 |wa   |
|     1 |mn   |
|     0 |fl   |
|     1 |mo   |
|     0 |ar   |
|     0 |wi   |
|     0 |al   |
|     1 |ca   |
+-------+-----+
Here is what I have so far:

from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import IntegerType  # IntegerType, not StringType

changing_column = 'tech'
# One conditional UDF covers both cases: 1 where the value is 50, 0 otherwise.
# (Two separate passes cannot work: after the first pass rewrites the column,
# the original 50s can no longer be told apart from the other values.)
udf_flag = UserDefinedFunction(lambda x: 1 if x == 50 else 0, IntegerType())
second_df = df.select(*[udf_flag(column).alias(column) if column == changing_column
                        else column
                        for column in df.columns])
second_df.show()
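
For reference, a minimal sketch to build the eight-row example dataframe from the question, so the snippet above runs as-is (it assumes an active SparkSession, created here via the standard builder):

from pyspark.sql import SparkSession

# Recreate the example dataframe from the question
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(70, 'wa'), (50, 'mn'), (20, 'fl'), (50, 'mo'),
     (10, 'ar'), (90, 'wi'), (30, 'al'), (50, 'ca')],
    ["tech", "state"])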
Hope this helps:

from pyspark.sql.functions import when

df = spark.createDataFrame(
    [(70, 'wa'),
     (50, 'mn'),
     (20, 'fl')],
    ["tech", "state"])

# when/otherwise builds the 0/1 flag column without a UDF;
# select("*", ...) appends it alongside the original columns
df.select("*", when(df.tech == 50, 1).otherwise(0).alias("tech")).show()

+----+-----+----+
|tech|state|tech|
+----+-----+----+
|  70|   wa|   0|
|  50|   mn|   1|
|  20|   fl|   0|
+----+-----+----+

No UDF needed; just use the Spark SQL when/otherwise functions. This worked perfectly, thank you very much.
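
Note that select("*", ...) keeps the original tech column alongside the new one, which is why the output above shows two tech columns. If the goal is to replace the column in place, a minimal sketch using the standard withColumn API with the same when/otherwise logic:

from pyspark.sql.functions import when

# withColumn overwrites a column when the name already exists,
# so tech is recoded to 0/1 while the schema stays the same
df = df.withColumn("tech", when(df.tech == 50, 1).otherwise(0))
df.show()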