Python Spark: string manipulation from column A to column B
Hi everyone, I'm new to Spark and would like to know how to do a string operation so that Column1 - Column2 gives Column3. Note: my data is in a DataFrame. Basically I have two string columns, and I only want the strings that exist in Column2 but not in Column1, so that I can produce them as Column3.
Column1
SAMPLE_OUT_3_APPLE|BANANA|GUAVA|ORANGE
Column2
SAMPLE_OUT_3_APPLE|BANANA|GUAVA|GRAPES|ORANGE|BERRY
Then Column3 should be
Column3
GRAPES,BERRY
But for Column1 and Column2 I also want to show
APPLE,BANANA,ORANGE
that is, just remove
SAMPLE_OUT_3
and make it comma-separated.
For Spark >= 2.4 you can use array_except:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("SAMPLE_OUT_3_APPLE|BANANA|GUAVA|ORANGE", "SAMPLE_OUT_3_APPLE|BANANA|GUAVA|GRAPES|ORANGE|BERRY")
).toDF("column1", "column2")

// Drop the "SAMPLE_OUT_3_" prefix by keeping what follows "3_"
val remove = df.columns.map(column => split(col(column), "3_").getItem(1).as(column))

val resultDF = df.select(remove: _*)
  // split takes a regex, so the pipe has to be escaped
  .withColumn("column1", split($"column1", "\\|"))
  .withColumn("column2", split($"column2", "\\|"))
  // column3 = items of column2 that are missing from column1
  .withColumn("column3", array_except($"column2", $"column1"))
  // remove the column3 items from column1 and column2
  .withColumn("column1", array_except($"column1", $"column3"))
  .withColumn("column2", array_except($"column2", $"column3"))

// Join the arrays back into "|"-separated strings
val convertToString = resultDF.columns.map(column => concat_ws("|", col(column)).as(column))
resultDF.select(convertToString: _*).show(false)
Output:
+-------------------------+-------------------------+------------+
|column1                  |column2                  |column3     |
+-------------------------+-------------------------+------------+
|APPLE|BANANA|GUAVA|ORANGE|APPLE|BANANA|GUAVA|ORANGE|GRAPES|BERRY|
+-------------------------+-------------------------+------------+
If column1 and column2 should keep all of their items and be comma-separated, drop the last two array_except calls and pass "," to concat_ws:
+-------------------------+--------------------------------------+------------+
|column1                  |column2                               |column3     |
+-------------------------+--------------------------------------+------------+
|APPLE,BANANA,GUAVA,ORANGE|APPLE,BANANA,GUAVA,GRAPES,ORANGE,BERRY|GRAPES,BERRY|
+-------------------------+--------------------------------------+------------+
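Since the question title mentions Python, the same array_except steps might look roughly like this in PySpark. This is a sketch under assumptions, not the answerer's code: the names `add_column3` and `except_preserving_order` and the prefix regex `^.*3_` are made up for illustration, and the pure-Python helper only mimics what array_except does for this data so the logic can be checked without a Spark session.

```python
def except_preserving_order(left, right):
    """Pure-Python stand-in for array_except: items of `left` that are not
    in `right`, with duplicates dropped and first-seen order kept."""
    blocked = set(right)
    seen = set()
    result = []
    for item in left:
        if item not in blocked and item not in seen:
            seen.add(item)
            result.append(item)
    return result


def add_column3(df):
    """Hypothetical PySpark wiring; assumes Spark >= 2.4 and string columns
    named column1 and column2."""
    from pyspark.sql import functions as F  # imported lazily on purpose

    def to_array(name):
        # Drop everything up to and including "3_", then split on a literal "|".
        return F.split(F.regexp_replace(F.col(name), r"^.*3_", ""), r"\|")

    return (
        df.withColumn("c1", to_array("column1"))
        .withColumn("c2", to_array("column2"))
        .withColumn("column3", F.concat_ws(",", F.array_except("c2", "c1")))
        .withColumn("column1", F.concat_ws(",", "c1"))
        .withColumn("column2", F.concat_ws(",", "c2"))
        .drop("c1", "c2")
    )


col1 = "APPLE|BANANA|GUAVA|ORANGE".split("|")
col2 = "APPLE|BANANA|GUAVA|GRAPES|ORANGE|BERRY".split("|")
print(",".join(except_preserving_order(col2, col1)))  # GRAPES,BERRY
```

Calling `add_column3(df)` on a DataFrame with the two string columns should give comma-separated column1, column2 and column3 in one pass; the temporary columns c1 and c2 are only there to avoid splitting each string twice.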
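You can split the columns on "|" as shown below:
import spark.implicits._

val df = mainDf.select("Column1", "Column2").map(x => {
  // Strip the prefix, then split on a literal "|" (split takes a regex, so escape it)
  val s1 = x.getString(0).replaceAll("^.*3_", "").split("\\|")
  val s2 = x.getString(1).replaceAll("^.*3_", "").split("\\|")
  // Symmetric difference: items that appear in only one of the two columns
  (x.getString(0), x.getString(1), s2.diff(s1).union(s1.diff(s2)))
}).toDF("Column1", "Column2", "Column3")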
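One thing worth noting about the expression `s2.diff(s1).union(s1.diff(s2))`: it also keeps items that exist only in Column1, whereas the question asks only for items of Column2 that are missing from Column1. The plain-Python sketch below contrasts the two; the function names and the KIWI sample value are made up for illustration, and the list comprehension only approximates Scala's `diff` (which treats duplicates as a multiset).

```python
def one_sided_diff(s1, s2):
    """Items of s2 that do not appear in s1, in s2's order."""
    return [item for item in s2 if item not in s1]


def symmetric_diff(s1, s2):
    """Items that appear in only one of the two lists."""
    return one_sided_diff(s1, s2) + one_sided_diff(s2, s1)


s1 = ["APPLE", "KIWI"]      # KIWI exists only in the first list
s2 = ["APPLE", "GRAPES"]    # GRAPES exists only in the second list

print(one_sided_diff(s1, s2))   # ['GRAPES']
print(symmetric_diff(s1, s2))   # ['GRAPES', 'KIWI']
```

For the sample data in the question both variants give the same result, because every item of Column1 also appears in Column2.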
You can also achieve this with regexp_replace and a udf.
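The code for this approach is not shown here, so what follows is only a guess at how it could look in PySpark, not the answerer's implementation. The names `missing_items` and `with_column3` and the prefix regex `^.*3_` are assumptions; the comparison lives in a plain Python function so it can be tried without Spark and then wrapped in a udf.

```python
def missing_items(col1, col2):
    """Comma-joined items of col2 that do not appear in col1.
    Both inputs are "|"-separated strings; None in gives None out."""
    if col1 is None or col2 is None:
        return None
    present = set(col1.split("|"))
    return ",".join(item for item in col2.split("|") if item not in present)


def with_column3(df):
    """Hypothetical PySpark wiring; assumes string columns column1 and column2."""
    from pyspark.sql import functions as F  # imported lazily on purpose
    from pyspark.sql.types import StringType

    diff_udf = F.udf(missing_items, StringType())
    # Strip everything up to and including "3_" from both columns.
    cleaned = df.select(
        *[F.regexp_replace(F.col(c), r"^.*3_", "").alias(c) for c in ("column1", "column2")]
    )
    return (
        cleaned.withColumn("column3", diff_udf("column1", "column2"))
        # Swap the pipes for commas only after column3 has been computed.
        .withColumn("column1", F.regexp_replace("column1", r"\|", ","))
        .withColumn("column2", F.regexp_replace("column2", r"\|", ","))
    )


print(missing_items("APPLE|BANANA|GUAVA|ORANGE",
                    "APPLE|BANANA|GUAVA|GRAPES|ORANGE|BERRY"))  # GRAPES,BERRY
```

A udf works on any Spark version that supports Python udfs, but it is usually slower than built-in functions such as array_except, so it is mainly useful when Spark >= 2.4 is not available.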
Can you share the schema of the dataframe?
@koiralo
|-- column1: string (nullable = true)
|-- column2: string (nullable = true)
The first part (SAMPLE) can be different, but it always has the 3.
How can I turn it back into a string? For column 1 and column 2 I want it formatted as APPLE,BANANA,GUAVA,GRAPES,ORANGE.