How do I change values in a PySpark dataframe based on a condition on that same column?
Consider an example dataframe:
df =
+-------+-----+
|   tech|state|
+-------+-----+
|     70|wa   |
|     50|mn   |
|     20|fl   |
|     50|mo   |
|     10|ar   |
|     90|wi   |
|     30|al   |
|     50|ca   |
+-------+-----+
I want to change the "tech" column so that every value of 50 becomes 1 and all other values become 0.
The output would look like this:
df =
+-------+-----+
|   tech|state|
+-------+-----+
|      0|wa   |
|      1|mn   |
|      0|fl   |
|      1|mo   |
|      0|ar   |
|      0|wi   |
|      0|al   |
|      1|ca   |
+-------+-----+
Here is what I have so far:
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import IntegerType  # the UDFs return integers, so IntegerType is needed, not StringType

changing_column = 'tech'
udf_first = UserDefinedFunction(lambda x: 1, IntegerType())
udf_second = UserDefinedFunction(lambda x: 0, IntegerType())
first_df = zero_df.select(*[udf_first(changing_column) if column == 50 else column for column in zero_df.columns])
second_df = first_df.select(*[udf_second(changing_column) if column != 50 else column for column in first_df.columns])
second_df.show()
Hope this helps:
from pyspark.sql.functions import when

df = spark.createDataFrame(
    [(70, 'wa'),
     (50, 'mn'),
     (20, 'fl')],
    ["tech", "state"])

df.select(
    "*",
    when(df.tech == 50, 1).otherwise(0).alias("tech")
).show()
+----+-----+----+
|tech|state|tech|
+----+-----+----+
|  70|   wa|   0|
|  50|   mn|   1|
|  20|   fl|   0|
+----+-----+----+
No UDF is needed: use the Spark SQL functions when/otherwise.
This worked perfectly, thank you very much!