Pyspark: create a new column by splitting text
I have a pyspark dataframe like the following:
spark.createDataFrame(
    [
        (1, '1234ESPNnonzodiac'),
        (2, '1234ESPNzodiac'),
        (3, '963CNNnonzodiac'),
        (4, '963CNNzodiac'),
    ],
    ['id', 'col1']
)
I want to create a new column by splitting col1 on zodiac or nonzodiac, so that I can eventually group by this new column.
I'd like the final output to look like this:
spark.createDataFrame(
    [
        (1, '1234ESPNnonzodiac', '1234ESPN'),
        (2, '1234ESPNzodiac', '1234ESPN'),
        (3, '963CNNnonzodiac', '963CNN'),
        (4, '963CNNzodiac', '963CNN'),
    ],
    ['id', 'col1', 'col2']
)
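Note that the question text asks to split on zodiac / nonzodiac for grouping, while the sample output keeps the prefix. If the suffix itself is what you want to group on, an anchored pattern such as `((?:non)?zodiac)$` would capture it. A minimal local sketch using Python's `re` module (my assumption; on Spark the same pattern would be passed to `regexp_extract`):

```python
import re

# Anchored pattern: capture the trailing "zodiac" or "nonzodiac".
# This is a hypothetical alternative, not part of the original post.
suffix_pat = re.compile(r"((?:non)?zodiac)$")

for value in ["1234ESPNnonzodiac", "1234ESPNzodiac",
              "963CNNnonzodiac", "963CNNzodiac"]:
    print(value, "->", suffix_pat.search(value).group(1))
```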
I would use regexp_extract from pyspark.sql.functions:

from pyspark.sql.functions import regexp_extract

df.withColumn("col2", regexp_extract(df.col1, r"([\s\S]+?)(?:non)?zodiac", 1)).show()
+---+-----------------+--------+
| id|             col1|    col2|
+---+-----------------+--------+
|  1|1234ESPNnonzodiac|1234ESPN|
|  2|   1234ESPNzodiac|1234ESPN|
|  3|  963CNNnonzodiac|  963CNN|
|  4|     963CNNzodiac|  963CNN|
+---+-----------------+--------+
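The pattern can be checked locally before running it on Spark. The lazy `([\s\S]+?)` stops at the shortest prefix that is followed by an optional `non` and the literal `zodiac`, so `non` stays out of the captured group. A sketch using Python's `re` module (an assumption; `regexp_extract` applies the same regex semantics on the JVM side):

```python
import re

# Same pattern as the regexp_extract call above: lazy prefix capture,
# optional non-capturing "non", then the literal "zodiac".
pat = re.compile(r"([\s\S]+?)(?:non)?zodiac")

for value in ["1234ESPNnonzodiac", "1234ESPNzodiac",
              "963CNNnonzodiac", "963CNNzodiac"]:
    print(value, "->", pat.match(value).group(1))
```

With col2 in place, the grouping the question asks about would be, for example, `df.groupBy('col2').count()`.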