PySpark: Create a New Column by Splitting Text


I have a PySpark dataframe like the one below:

spark.createDataFrame(
    [
        (1, '1234ESPNnonzodiac'), 
        (2, '1234ESPNzodiac'),
        (3, '963CNNnonzodiac'), 
        (4, '963CNNzodiac'),
    ],
    ['id', 'col1'] 
)
I'd like to create a new column where I split off the zodiac / nonzodiac part of col1, so that I can eventually group by this new column.

I'd like the final output to look like this:

spark.createDataFrame(
    [
        (1, '1234ESPNnonzodiac', '1234ESPN'), 
        (2, '1234ESPNzodiac', '1234ESPN'),
        (3, '963CNNnonzodiac', '963CNN'), 
        (4, '963CNNzodiac', '963CNN'),
    ],
    ['id', 'col1', 'col2'] 
)

I would use regexp_extract from pyspark.sql.functions. The non-greedy capturing group ([\s\S]+?) grabs everything before an optional "non" followed by "zodiac", which is exactly the prefix you want in col2:

from pyspark.sql.functions import regexp_extract

df.withColumn("col2", regexp_extract(df.col1, r"([\s\S]+?)(?:non)?zodiac", 1)).show()
+---+-----------------+--------+
| id|             col1|    col2|
+---+-----------------+--------+
|  1|1234ESPNnonzodiac|1234ESPN|
|  2|   1234ESPNzodiac|1234ESPN|
|  3|  963CNNnonzodiac|  963CNN|
|  4|     963CNNzodiac|  963CNN|
+---+-----------------+--------+
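
Since the stated goal is to group on the new column, here is a minimal end-to-end sketch; the groupBy().count() at the end is just an assumed placeholder for whatever aggregation you actually need:

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        (1, '1234ESPNnonzodiac'),
        (2, '1234ESPNzodiac'),
        (3, '963CNNnonzodiac'),
        (4, '963CNNzodiac'),
    ],
    ['id', 'col1'],
)

# Derive the prefix column: group 1 of the match is everything
# before an optional "non" followed by "zodiac".
df = df.withColumn('col2', regexp_extract(df.col1, r'([\s\S]+?)(?:non)?zodiac', 1))

# Group on the derived column; count() stands in for the real aggregation.
df.groupBy('col2').count().show()

Note that [\s\S] matches any character, including newlines; for single-line values like these, a plain (.+?) capturing group would work just as well.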