Split a DataFrame string column on two different delimiters
Here is my dataset:
Itemcode
DB9450//DB9450/AD9066
DA0002/DE2396//DF2345
HWC72
GG7183/EB6693
TA444/B9X8X4:7-2-
Below is the code I have been trying to use:
df.withColumn("item1", split(col("Itemcode"), "/").getItem(0))
  .withColumn("item2", split(col("Itemcode"), "/").getItem(1))
  .withColumn("item3", split(col("Itemcode"), "//").getItem(0))
But it fails when there is a double slash between the first and second items, and it also fails when the double slash is between the second and third items.
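To see why it fails, note that splitting on a single "/" turns each "//" into an empty token, so `getItem(1)` lands on an empty string instead of the second code. A minimal plain-Python illustration (independent of Spark, but the split semantics are the same):

```python
# Splitting on a single "/" when the text contains "//" yields an empty
# token where the double slash was, so index 1 is "" rather than the
# second code.
tokens = "DB9450//DB9450/AD9066".split("/")
print(tokens)  # ['DB9450', '', 'DB9450', 'AD9066']
```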
Desired output is:
item1 item2 item3
DB9450 DB9450 AD9066
DA0002 DE2396 DF2345
HWC72
GG7183 EB6693
TA444 B9X8X4
You can replace // with / first and then split. Please try the below and let us know if it works.
Input:
df_b = spark.createDataFrame([('DB9450//DB9450/AD9066',"a"),('DA0002/DE2396//DF2345',"a"),('HWC72',"a"),('GG7183/EB6693',"a"),('TA444/B9X8X4:7-2-',"a")],[ "reg","postime"])
Logic:
from pyspark.sql import functions as F

df_b = df_b.withColumn('split_col', F.regexp_replace(F.col('reg'), "//", "/"))
df_b = df_b.withColumn('split_col', F.split(df_b['split_col'], '/'))
df_b = df_b.withColumn('col1' , F.col('split_col').getItem(0))
df_b = df_b.withColumn('col2' , F.col('split_col').getItem(1))
df_b = df_b.withColumn('col2', F.regexp_replace(F.col('col2'), ":7-2-", ""))
df_b = df_b.withColumn('col3' , F.col('split_col').getItem(2))
Output:
+--------------------+-------+--------------------+------+------+------+
| reg|postime| split_col| col1| col2| col3|
+--------------------+-------+--------------------+------+------+------+
|DB9450//DB9450/AD...| a|[DB9450, DB9450, ...|DB9450|DB9450|AD9066|
|DA0002/DE2396//DF...| a|[DA0002, DE2396, ...|DA0002|DE2396|DF2345|
| HWC72| a| [HWC72]| HWC72| null| null|
| GG7183/EB6693| a| [GG7183, EB6693]|GG7183|EB6693| null|
| TA444/B9X8X4:7-2-| a|[TA444, B9X8X4:7-2-]| TA444|B9X8X4| null|
+--------------------+-------+--------------------+------+------+------+
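The replace-then-split idea can be sketched in plain Python with a hypothetical helper (an illustration, not the PySpark code; unlike the logic above, it strips any `:suffix` generically rather than the literal `:7-2-`):

```python
def split_itemcode(code):
    """Collapse '//' to '/', split on '/', drop any ':suffix', pad to 3 items."""
    parts = [p.split(":")[0] for p in code.replace("//", "/").split("/")]
    return (parts + [None] * 3)[:3]

print(split_itemcode("DB9450//DB9450/AD9066"))  # ['DB9450', 'DB9450', 'AD9066']
print(split_itemcode("HWC72"))                  # ['HWC72', None, None]
```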
Perhaps this is helpful (Spark >= 2.4). The split and transform Spark SQL functions will do the magic, as below.
Load the test data provided:
val data =
  """
    |Itemcode
    |DB9450//DB9450/AD9066
    |DA0002/DE2396//DF2345
    |HWC72
    |GG7183/EB6693
    |TA444/B9X8X4:7-2-
  """.stripMargin
val stringDS = data.split(System.lineSeparator())
  .map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString("|"))
  .toSeq.toDS()
val df = spark.read
  .option("sep", "|")
  .option("inferSchema", "true")
  .option("header", "true")
  .option("nullValue", "null")
  .csv(stringDS)
df.show(false)
df.printSchema()
/**
  * +---------------------+
  * |Itemcode             |
  * +---------------------+
  * |DB9450//DB9450/AD9066|
  * |DA0002/DE2396//DF2345|
  * |HWC72                |
  * |GG7183/EB6693        |
  * |TA444/B9X8X4:7-2-    |
  * +---------------------+
  *
  * root
  * |-- Itemcode: string (nullable = true)
  */
Using split and TRANSFORM (you can run this query directly in pyspark):
df.withColumn("item_code", expr("TRANSFORM(split(Itemcode, '/+'), x -> split(x, ':')[0])"))
  .selectExpr("item_code[0] item1", "item_code[1] item2", "item_code[2] item3")
  .show(false)
/**
  * +------+------+------+
  * |item1 |item2 |item3 |
  * +------+------+------+
  * |DB9450|DB9450|AD9066|
  * |DA0002|DE2396|DF2345|
  * |HWC72 |null  |null  |
  * |GG7183|EB6693|null  |
  * |TA444 |B9X8X4|null  |
  * +------+------+------+
  */
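The regex `/+` treats one or more consecutive slashes as a single delimiter, which is why no empty tokens appear, and the per-element `split(x, ':')[0]` drops the trailing `:7-2-`. A plain-Python equivalent of that SQL expression (an illustration, not the Spark code itself):

```python
import re

def parse(code):
    # '/+' collapses runs of slashes into one delimiter; each piece is
    # then truncated at the first ':'.
    return [x.split(":")[0] for x in re.split(r"/+", code)]

print(parse("TA444/B9X8X4:7-2-"))  # ['TA444', 'B9X8X4']
```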
Processing the text as CSV works well for this.
First, let's read the text, replacing the double slashes on the fly.
Edit: also removing everything after the colon.
val items = """
Itemcode
DB9450//DB9450/AD9066
DA0002/DE2396//DF2345
HWC72
GG7183/EB6693
TA444/B9X8X4:7-2-
""".replaceAll("//", "/").split(":")(0)
Get the maximum number of items in a row and create an appropriate header:
val numItems = items.split("\n").map(_.split("/").size).reduce(_ max _)
val header = (1 to numItems).map("Itemcode" + _).mkString("/")
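The header computation can be checked in plain Python (same logic as the Scala above, with the sample rows inlined as an assumption):

```python
# The widest row decides how many ItemcodeN header columns the CSV needs.
rows = ["DB9450/DB9450/AD9066", "DA0002/DE2396/DF2345",
        "HWC72", "GG7183/EB6693", "TA444/B9X8X4"]
num_items = max(len(r.split("/")) for r in rows)
header = "/".join(f"Itemcode{i}" for i in range(1, num_items + 1))
print(header)  # Itemcode1/Itemcode2/Itemcode3
```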
Then we are ready to create a DataFrame:
val df = spark.read
.option("ignoreTrailingWhiteSpace", "true")
.option("delimiter", "/")
.option("header", "true")
.csv(spark.sparkContext.parallelize((header + items).split("\n")).toDS)
.filter("Itemcode1 <> 'Itemcode'")
df.show(false)
+---------+-----------+---------+
|Itemcode1|Itemcode2 |Itemcode3|
+---------+-----------+---------+
|DB9450 |DB9450 |AD9066 |
|DA0002 |DE2396 |DF2345 |
|HWC72 |null |null |
|GG7183 |EB6693 |null |
|TA444 |B9X8X4 |null |
+---------+-----------+---------+
Please check and confirm: :7-2- is not a fixed value. I need to remove any text after the last delimiter at the end of the string, which in this case is a colon. If your Itemcode has a fixed length, substr can be used: df_b = df_b.withColumn('col2', df_b['col2'].substr(0, 6))
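Both options from this exchange can be sketched in plain Python (illustrative only; `code` is a sample value, not from the DataFrame):

```python
code = "B9X8X4:7-2-"
generic = code.split(":")[0]  # drop everything after the colon, whatever follows
fixed = code[:6]              # fixed-width analogue of substr(0, 6)
print(generic, fixed)  # B9X8X4 B9X8X4
```

The generic form is safer when the suffix after the colon can vary in length; the fixed-width form only works if every item code is exactly six characters.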