Parse each element of an array in PySpark and apply a substring

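Hi, I have a PySpark DataFrame with an array column, shown below.

I want to iterate over each element, keep only the string before the hyphen, and put the result in a new column.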

+------------------------------+
|array_col                     |
+------------------------------+
|[hello-123, abc-111]          |
|[hello-234, def-22, xyz-33]   |
|[hiiii-111, def2-333, lmn-222]|
+------------------------------+

Expected output:

+------------------------------+--------------------+
|col1                          |new_column          |
+------------------------------+--------------------+
|[hello-123, abc-111]          |[hello, abc]        |
|[hello-234, def-22, xyz-33]   |[hello, def, xyz]   |
|[hiiii-111, def2-333, lmn-222]|[hiiii, def2, lmn]  |
+------------------------------+--------------------+

I am trying the approach below, but I can't work out how to apply a regular expression or substring inside the UDF:

cust_udf = udf(lambda arr: [x for x in arr],ArrayType(StringType()))
df1.withColumn('new_column', cust_udf(col("col1")))

Can anyone help? Thanks.

From Spark 2.4 onward, use the transform higher-order function.

Example:

df.show(10,False)
#+---------------------------+
#|array_col                  |
#+---------------------------+
#|[hello-123, abc-111]       |
#|[hello-234, def-22, xyz-33]|
#+---------------------------+

df.printSchema()
#root
# |-- array_col: array (nullable = true)
# |    |-- element: string (containsNull = true)

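from pyspark.sql.functions import expr

# transform applies the lambda to every element of the array;
# split(x, "-")[0] keeps the part before the first hyphen.
df.withColumn("new_column", expr('transform(array_col, x -> split(x, "-")[0])')).show()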
#+--------------------+-----------------+
#|           array_col|       new_column|
#+--------------------+-----------------+
#|[hello-123, abc-111]|     [hello, abc]|
#|[hello-234, def-2...|[hello, def, xyz]|
#+--------------------+-----------------+
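
If you are on a Spark version older than 2.4, or you simply want to stay with the UDF from the question, a minimal sketch could look like the one below. The helper names `prefixes_before_hyphen` and `with_prefix_column` are made up for illustration, and the column name `array_col` is assumed from the example above. A Python UDF is generally slower than the built-in `transform`, so treat this as a fallback rather than the preferred route.

```python
def prefixes_before_hyphen(arr):
    """Return the part of each element before the first hyphen."""
    if arr is None:  # the array column itself may be null
        return None
    # split("-", 1)[0] leaves elements without a hyphen unchanged;
    # null elements are passed through as None.
    return [None if x is None else x.split("-", 1)[0] for x in arr]


def with_prefix_column(df, src="array_col", dst="new_column"):
    """Hypothetical wrapper: add dst, holding the prefixes of src's elements."""
    # PySpark is imported lazily so the pure-Python helper above can be
    # checked without a Spark installation.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import ArrayType, StringType

    prefix_udf = udf(prefixes_before_hyphen, ArrayType(StringType()))
    return df.withColumn(dst, prefix_udf(col(src)))
```

With a DataFrame like the one in the question, `with_prefix_column(df).show()` should add a `new_column` containing the prefixes. Keeping the per-element logic in a plain function makes it easy to check on ordinary Python lists before wrapping it in a UDF.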
