How to create a UDF that creates a new column and modifies an existing column

I have a DataFrame like this:

id | color
---| -----
1  | red-dark
2  | green-light
3  | red-light
4  | blue-sky
5  | green-dark
I want to create a UDF that turns my DataFrame into:

id | color | shade
---| ----- | -----
1  | red   |  dark
2  | green |  light
3  | red   |  light
4  | blue  |  sky
5  | green |  dark
I have written a UDF for this:

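from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def my_function(data_str):
    # Joins the two parts with a comma, so the result is still a single string
    return ",".join(data_str.split("-"))

my_function_udf = udf(my_function, StringType())

# apply the UDF
df = df.withColumn("shade", my_function_udf(df['color']))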
However, this does not transform the DataFrame the way I hoped. Instead, it turns it into:

id | color       | shade
---| ----------- | -----
1  | red-dark    |  red,dark
2  | green-light |  green,light
3  | red-light   |  red,light
4  | blue-sky    |  blue,sky
5  | green-dark  |  green,dark
How can I transform the DataFrame this way in PySpark?
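
A plain-Python check, independent of Spark, shows why this first attempt cannot produce two columns: my_function returns one joined string per row, and a UDF declared with StringType() fills exactly one column with it. The function body is the one from the question; the sample value is simply the first row of the table.

```python
def my_function(data_str):
    # Same body as the UDF in the question
    return ",".join(data_str.split("-"))

joined = my_function("red-dark")
parts = "red-dark".split("-")

print(joined)  # red,dark  (one string, so one column)
print(parts)   # ['red', 'dark']  (two values, one per target column)
```

The two values in parts are what should end up in color and shade, so the UDF has to return them separately rather than joined.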

Attempt based on the suggested question:

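from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# The UDF returns an array holding a single (color, shade) struct
schema = ArrayType(StructType([
    StructField("color", StringType(), False),
    StructField("shade", StringType(), False)
]))

color_shade_udf = udf(
    lambda s: [tuple(s.split("-"))],
    schema
)

df = df.withColumn("colorshade", color_shade_udf(df['color']))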

#Gives the following

id | color       | colorshade
---| ----------- | -----
1  | red-dark    |  [{"color":"red","shade":"dark"}]
2  | green-light |  [{"color":"green","shade":"light"}]
3  | red-light   |  [{"color":"red","shade":"light"}]
4  | blue-sky    |  [{"color":"blue","shade":"sky"}]
5  | green-dark  |  [{"color":"green","shade":"dark"}]

I feel like I'm getting closer.
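
One possible way to finish the UDF route, offered as a sketch rather than the definitive fix: return a single struct instead of an array of structs, then select its two fields as top-level columns. The names split_color_shade and add_color_and_shade are hypothetical, and the sketch assumes df has the id and color columns shown in the question. pyspark is imported inside the function only so that the plain-Python helper can be run without a Spark installation.

```python
def split_color_shade(value):
    """Split 'red-dark' into ('red', 'dark'); shade is None if there is no hyphen."""
    if value is None:
        return (None, None)
    color, _, shade = value.partition("-")
    return (color, shade or None)


def add_color_and_shade(df):
    """Hypothetical helper: replace 'color' and add 'shade' using a struct UDF."""
    # Imported here so the helper above works without pyspark installed.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType, StructField, StructType

    # One struct per row (not an array), with nullable fields.
    schema = StructType([
        StructField("color", StringType(), True),
        StructField("shade", StringType(), True),
    ])
    color_shade_udf = udf(split_color_shade, schema)

    return (
        df.withColumn("colorshade", color_shade_udf(col("color")))
          .select(
              "id",
              col("colorshade.color").alias("color"),
              col("colorshade.shade").alias("shade"),
          )
    )


print(split_color_shade("red-dark"))  # ('red', 'dark')
```

Calling add_color_and_shade(df).show() on the question's DataFrame should then list id, color and shade as separate columns. A built-in function is usually preferable to a Python UDF when one exists, because it avoids moving each row through the Python worker.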

@spark health learn Now just use withColumn("color", "colorshade.color"), something similar for shade, and dropColumn("colorshade").

You can use the built-in function split():
from pyspark.sql.functions import split, col

df.withColumn("arr", split(df.color, "\\-")) \
  .select("id", 
          col("arr")[0].alias("color"),
          col("arr")[1].alias("shade")) \
  .drop("arr") \
  .show()
+---+-----+-----+
| id|color|shade|
+---+-----+-----+
|  1|  red| dark|
|  2|green|light|
|  3|  red|light|
|  4| blue|  sky|
|  5|green| dark|
+---+-----+-----+
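
The second argument of split() is a regular expression, which is why the hyphen is written as "\\-" in the code above. A hyphen has no special meaning outside a character class, so a plain "-" should behave the same way; escaping starts to matter for separators such as "." or "|". The sketch below illustrates this with Python's own re module as a stand-in. Spark uses Java regular expressions, and the assumption here is that the two engines agree for patterns this simple.

```python
import re

# An escaped and an unescaped hyphen split the same way.
escaped = re.split(r"\-", "red-dark")
plain = re.split("-", "red-dark")
print(escaped, plain)  # ['red', 'dark'] ['red', 'dark']

# An unescaped dot matches every character, so only empty strings remain.
unescaped_dot = re.split(".", "red.dark")

# Escaping the dot makes it a literal separator again.
escaped_dot = re.split(r"\.", "red.dark")
print(escaped_dot)  # ['red', 'dark']
```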