Python: how to create a UDF that creates a new column and modifies an existing column
I have a DataFrame like this:
id | color
---| -----
1 | red-dark
2 | green-light
3 | red-light
4 | blue-sky
5 | green-dark
I want to create a UDF that turns my DataFrame into:
id | color | shade
---| ----- | -----
1 | red | dark
2 | green | light
3 | red | light
4 | blue | sky
5 | green | dark
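I have already written a UDF for this:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def my_function(data_str):
    return ",".join(data_str.split("-"))
my_function_udf = udf(my_function, StringType())
# Apply the UDF
df = df.withColumn("shade", my_function_udf(df['color']))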
However, this does not transform the DataFrame the way I hoped. Instead, it turns it into:
id | color | shade
---| ---------- | -----
1 | red-dark | red,dark
2 | green-light | green,light
3 | red-light | red,light
4 | blue-sky | blue,sky
5 | green-dark | green,dark
How can I transform the DataFrame in pyspark?
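A quick plain-Python check, run outside Spark, suggests why the result looks like this: my_function returns one string per value, so withColumn("shade", ...) can only fill a single new column and leaves color untouched. The sketch below reuses the function from above; the variable names joined, color and shade are only there for illustration.

```python
def my_function(data_str):
    # Same logic as the UDF body: split on "-" and re-join with ","
    return ",".join(data_str.split("-"))


# One input value yields one string, not two separate fields
joined = my_function("red-dark")
print(joined)  # red,dark

# What the desired output needs instead: two separate values per row
color, shade = "red-dark".split("-", 1)
print(color, shade)  # red dark
```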
Attempt based on the suggested question:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

schema = ArrayType(StructType([
    StructField("color", StringType(), False),
    StructField("shade", StringType(), False)
]))

color_shade_udf = udf(
    lambda s: [tuple(s.split("-"))],
    schema
)

df = df.withColumn("colorshade", color_shade_udf(df['color']))
#Gives the following
id | color | colorshade
---| ---------- | -----
1 | red-dark | [{"color":"red","shade":"dark"}]
2 | green-light | [{"color":"green","shade":"light"}]
3 | red-light | [{"color":"red","shade":"light"}]
4 | blue-sky | [{"color":"blue","shade":"sky"}]
5 | green-dark | [{"color":"green","shade":"dark"}]
I feel like I'm getting closer.

@spark health learn now just use withColumn("color", "colorshade.color"), do the same for shade, and drop the "colorshade" column.

You can use the built-in function split():
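from pyspark.sql.functions import split, col

# Split "color" on "-" into an array column, then pick its parts by index.
# select() keeps only the listed columns, so the temporary "arr" is not carried over.
df.withColumn("arr", split(df.color, "\\-")) \
    .select("id",
            col("arr")[0].alias("color"),
            col("arr")[1].alias("shade")) \
    .show()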
+---+-----+-----+
| id|color|shade|
+---+-----+-----+
|  1|  red| dark|
|  2|green|light|
|  3|  red|light|
|  4| blue|  sky|
|  5|green| dark|
+---+-----+-----+
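If a UDF is still preferred, as in the question, one possible way to finish the attempt is to return a single struct rather than an array of structs and then unpack its fields, roughly along the lines of the comment above. This is only a sketch under assumptions: the helper names split_color and add_color_and_shade and the temporary column name parts are hypothetical, and values without a "-" are assumed to get a null shade. The built-in split() is usually the simpler choice, since a Python UDF has to move every row between the JVM and Python.

```python
from typing import Optional, Tuple


def split_color(value: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Split 'red-dark' into ('red', 'dark'); shade is None when there is no '-'."""
    if value is None:
        return None, None
    color, sep, shade = value.partition("-")
    return color, (shade if sep else None)


def add_color_and_shade(df):
    """Overwrite 'color' with its first part and add 'shade' via a struct-returning UDF."""
    # Imported here so split_color can be tried without a Spark installation.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType, StructField, StructType

    # A plain struct (not an array of structs), so fields are reachable as parts.color
    schema = StructType([
        StructField("color", StringType(), True),
        StructField("shade", StringType(), True),
    ])
    split_udf = udf(split_color, schema)

    return (
        df.withColumn("parts", split_udf(col("color")))   # hypothetical temp column
          .withColumn("color", col("parts.color"))        # modify the existing column
          .withColumn("shade", col("parts.shade"))        # create the new column
          .drop("parts")
    )
```

Calling add_color_and_shade(df).show() on the DataFrame from the question should then list the same id, color and shade columns as the split() version.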