Splitting a DataFrame string column on two different delimiters


Here is my dataset:

Itemcode

DB9450//DB9450/AD9066
DA0002/DE2396//DF2345
HWC72
GG7183/EB6693
TA444/B9X8X4:7-2-
Below is the code I have been trying to use:

df.withColumn("item1", split(col("Itemcode"), "/").getItem(0)).withColumn("item2", split(col("Itemcode"), "/").getItem(1)).withColumn("item3", split(col("Itemcode"), "//").getItem(0))
But it fails when there is a double slash between the first and second items, and it also fails when the double slash sits between the second and third items.
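A minimal sketch of what goes wrong (assuming a SparkSession named spark, as in the code above): splitting on a single "/" keeps the empty token between two consecutive slashes, so getItem(1) picks up an empty string instead of the second code.

from pyspark.sql.functions import col, split

df = spark.createDataFrame([("DB9450//DB9450/AD9066",)], ["Itemcode"])
# The empty token between the two slashes shifts every later item by one
df.select(split(col("Itemcode"), "/").alias("parts")).show(truncate=False)
# +--------------------------+
# |parts                     |
# +--------------------------+
# |[DB9450, , DB9450, AD9066]|
# +--------------------------+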

Desired output is:

item1    item2    item3
DB9450   DB9450   AD9066
DA0002   DE2396   DF2345
HWC72
GG7183   EB6693
TA444    B9X8X4

You can first replace // with / and then split. Please try the approach below and let us know if it works.

Input

df_b = spark.createDataFrame([('DB9450//DB9450/AD9066',"a"),('DA0002/DE2396//DF2345',"a"),('HWC72',"a"),('GG7183/EB6693',"a"),('TA444/B9X8X4:7-2-',"a")],[ "reg","postime"])

Logic

from pyspark.sql import functions as F

df_b = df_b.withColumn('split_col', F.regexp_replace(F.col('reg'), "//", "/"))
df_b = df_b.withColumn('split_col', F.split(df_b['split_col'], '/'))
df_b = df_b.withColumn('col1', F.col('split_col').getItem(0))
df_b = df_b.withColumn('col2', F.col('split_col').getItem(1))
df_b = df_b.withColumn('col2', F.regexp_replace(F.col('col2'), ":7-2-", ""))
df_b = df_b.withColumn('col3', F.col('split_col').getItem(2))
Output

+--------------------+-------+--------------------+------+------+------+
|                 reg|postime|           split_col|  col1|  col2|  col3|
+--------------------+-------+--------------------+------+------+------+
|DB9450//DB9450/AD...|      a|[DB9450, DB9450, ...|DB9450|DB9450|AD9066|
|DA0002/DE2396//DF...|      a|[DA0002, DE2396, ...|DA0002|DE2396|DF2345|
|               HWC72|      a|             [HWC72]| HWC72|  null|  null|
|       GG7183/EB6693|      a|    [GG7183, EB6693]|GG7183|EB6693|  null|
|   TA444/B9X8X4:7-2-|      a|[TA444, B9X8X4:7-2-]| TA444|B9X8X4|  null|
+--------------------+-------+--------------------+------+------+------+
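As a compact variant (a sketch, not part of the original answer), the replace-then-split steps can be folded into a single select, with a generic colon cleanup instead of the literal ":7-2-":

from pyspark.sql import functions as F

parts = F.split(F.regexp_replace(F.col('reg'), '//', '/'), '/')
df_alt = df_b.select(
    'reg', 'postime',
    parts.getItem(0).alias('col1'),
    # splitting on ':' drops any suffix after the colon, fixed or not
    F.split(parts.getItem(1), ':').getItem(0).alias('col2'),
    parts.getItem(2).alias('col3'))
df_alt.show()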
Perhaps this is useful (spark >= 2.4) - the split and transform Spark SQL functions will do the magic, as below.

Load the test data provided:

val data =
  """
    |Itemcode
    |
    |DB9450//DB9450/AD9066
    |
    |DA0002/DE2396//DF2345
    |
    |HWC72
    |
    |GG7183/EB6693
    |
    |TA444/B9X8X4:7-2-
  """.stripMargin
val stringDS = data.split(System.lineSeparator())
  .map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString("|"))
  .toSeq.toDS()
val df = spark.read
  .option("sep", "|")
  .option("inferSchema", "true")
  .option("header", "true")
  .option("nullValue", "null")
  .csv(stringDS)
df.show(false)
df.printSchema()
/**
  * +---------------------+
  * |Itemcode             |
  * +---------------------+
  * |DB9450//DB9450/AD9066|
  * |DA0002/DE2396//DF2345|
  * |HWC72                |
  * |GG7183/EB6693        |
  * |TA444/B9X8X4:7-2-    |
  * +---------------------+
  *
  * root
  * |-- Itemcode: string (nullable = true)
  */
Using split and TRANSFORM (you can run this query directly in pyspark):

df.withColumn("itemcode", expr("TRANSFORM(split(Itemcode, '/+'), x -> split(x, ':')[0])"))
  .selectExpr("itemcode[0] item1", "itemcode[1] item2", "itemcode[2] item3")
  .show(false)
/**
  * +------+------+------+
  * |item1 |item2 |item3 |
  * +------+------+------+
  * |DB9450|DB9450|AD9066|
  * |DA0002|DE2396|DF2345|
  * |HWC72 |null  |null  |
  * |GG7183|EB6693|null  |
  * |TA444 |B9X8X4|null  |
  * +------+------+------+
  */
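For the pyspark side, here is a minimal sketch of the same split + TRANSFORM query (assuming a SparkSession named spark; the sample frame below is rebuilt from the question's data):

from pyspark.sql.functions import expr

df = spark.createDataFrame(
    [("DB9450//DB9450/AD9066",), ("DA0002/DE2396//DF2345",),
     ("HWC72",), ("GG7183/EB6693",), ("TA444/B9X8X4:7-2-",)],
    ["Itemcode"])

# '/+' collapses runs of slashes; split(x, ':')[0] drops any ':7-2-'-style suffix
df.withColumn("itemcode", expr("TRANSFORM(split(Itemcode, '/+'), x -> split(x, ':')[0])")) \
  .selectExpr("itemcode[0] item1", "itemcode[1] item2", "itemcode[2] item3") \
  .show(truncate=False)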

Processing the text as CSV works well for this.

First, let's read the text, replacing the double slashes while reading.

Edit: also removing everything after a colon

val items = """
Itemcode

DB9450//DB9450/AD9066

DA0002/DE2396//DF2345

HWC72

GG7183/EB6693

TA444/B9X8X4:7-2-

""".replaceAll("//", "/").split(":")(0)
Get the maximum number of items in a row and create an appropriate header (here numItems is 3, so the header becomes Itemcode1/Itemcode2/Itemcode3):

val numItems = items.split("\n").map(_.split("/").size).reduce(_ max _)

val header = (1 to numItems).map("Itemcode" + _).mkString("/")
Then we are ready to create a dataframe:

val df = spark.read
  .option("ignoreTrailingWhiteSpace", "true")
  .option("delimiter", "/")
  .option("header", "true")
  .csv(spark.sparkContext.parallelize((header + items).split("\n")).toDS)
  .filter("Itemcode1 <> 'Itemcode'")

df.show(false)


+---------+-----------+---------+
|Itemcode1|Itemcode2  |Itemcode3|
+---------+-----------+---------+
|DB9450   |DB9450     |AD9066   |
|DA0002   |DE2396     |DF2345   |
|HWC72    |null       |null     |
|GG7183   |EB6693     |null     |
|TA444    |B9X8X4     |null     |
+---------+-----------+---------+
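A rough PySpark rendering of the same CSV trick (a sketch; it relies on spark.read.csv accepting an RDD of strings, which PySpark supports, and the variable names are illustrative):

items = """Itemcode
DB9450//DB9450/AD9066
DA0002/DE2396//DF2345
HWC72
GG7183/EB6693
TA444/B9X8X4:7-2-""".replace("//", "/").split(":")[0]

lines = items.split("\n")
num_items = max(len(line.split("/")) for line in lines)
header = "/".join("Itemcode%d" % i for i in range(1, num_items + 1))

df = (spark.read
      .option("ignoreTrailingWhiteSpace", "true")
      .option("delimiter", "/")
      .option("header", "true")
      .csv(spark.sparkContext.parallelize([header] + lines))
      .filter("Itemcode1 <> 'Itemcode'"))   # drop the original header row
df.show(truncate=False)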

Please check and let us know.

The :7-2- is not a fixed value. I need to remove any text after the last delimiter at the end of the string, which in this case is the colon.

If your Itemcode has a fixed length, a substring can be used: df_b = df_b.withColumn('col2', df_b['col2'].substr(0, 6))
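Since the suffix after the colon is not a fixed width, a regex replace is safer than the fixed-width substr above (a small sketch on the first answer's df_b):

from pyspark.sql import functions as F

# Drop everything from the colon to the end of the string, whatever its length
df_b = df_b.withColumn('col2', F.regexp_replace(F.col('col2'), ':.*$', ''))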