Scala Spark: split a string column when the delimiter also appears in one part
I have a CSV file with two string columns (term, code). The code column has a special format, [num]-[2 letters]-[text], where the text part can itself contain dashes (-). I want to use Spark to read this file into a DataFrame with exactly four columns (term, num, two letters, text).

If the text part of the code column contains no dashes, I can split the code column into three columns. But how do I implement a solution that handles all cases (for example, by putting everything after exactly two dashes into a single column)?

Code for splitting one column into three is explained well in an existing answer.

Here is an option using regexp_extract:
import org.apache.spark.sql.functions.regexp_extract
import spark.implicits._ // assumes a SparkSession named `spark`, as in spark-shell

val df = Seq(("term01", "12-AB-some text"), ("term02", "130-CD-some-other-text")).toDF("term", "code")
// define the pattern that matches the string column;
// the last group (.*) is greedy, so it keeps any further dashes in the text part
val p = "([0-9]+)-([a-zA-Z]{2})-(.*)"
// p: String = ([0-9]+)-([a-zA-Z]{2})-(.*)
// define the map from new column names to the group index in the pattern
val cols = Map("num" -> 1, "letters" -> 2, "text" -> 3)
// cols: scala.collection.immutable.Map[String,Int] = Map(num -> 1, letters -> 2, text -> 3)
// add one new column per capture group, then drop the original column
cols.foldLeft(df) {
  case (acc, (colName, groupIdx)) => acc.withColumn(colName, regexp_extract($"code", p, groupIdx))
}.drop("code").show
+------+---+-------+---------------+
|  term|num|letters|           text|
+------+---+-------+---------------+
|term01| 12|     AB|      some text|
|term02|130|     CD|some-other-text|
+------+---+-------+---------------+
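To see why this pattern copes with extra dashes, it can help to try the same regex on plain Scala strings, outside Spark. The sketch below is only an illustration: `CodePatternDemo` and `parseCode` are hypothetical names that do not appear in the answer, and a Scala `Regex` used in a pattern match must match the whole string, whereas `regexp_extract` looks for the first match and returns an empty string when nothing matches.

```scala
// Plain-Scala sketch (no Spark needed) of how the pattern splits a code value.
object CodePatternDemo {
  // Same pattern as in the answer; in a pattern match it must cover the whole string.
  private val CodePattern = "([0-9]+)-([a-zA-Z]{2})-(.*)".r

  // Hypothetical helper: returns (num, letters, text),
  // or None if the value does not fit [num]-[2 letters]-[text].
  def parseCode(code: String): Option[(String, String, String)] = code match {
    case CodePattern(num, letters, text) => Some((num, letters, text))
    case _                               => None
  }

  def main(args: Array[String]): Unit = {
    println(parseCode("12-AB-some text"))        // Some((12,AB,some text))
    println(parseCode("130-CD-some-other-text")) // Some((130,CD,some-other-text))
    println(parseCode("no dashes here"))         // None
  }
}
```

Because the first two groups cannot contain a dash, only the final greedy group can absorb the remaining dashes, which is what keeps `some-other-text` in one column.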
I would do this by using a UDF:
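import org.apache.spark.sql.functions.udf
import spark.implicits._ // assumes a SparkSession named `spark`, as in spark-shell

case class MyData(num: Int, letters: String, text: String)

// Note: this throws on null input or on values that do not fit [num]-[2 letters]-[text]
val udfSplit = udf(
  (input: String) => {
    // limit = 3: split on at most the first two dashes, so the rest stays in one piece
    val res = input.split("-", 3)
    MyData(res(0).toInt, res(1), res(2))
  }
)

val df = spark.createDataFrame(
  Seq(
    ("term01", "12-AB-some text"),
    ("term02", "130-CD-some-other-text")
  )
).toDF("term", "code")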
df.show(false)
+------+----------------------+
|term  |code                  |
+------+----------------------+
|term01|12-AB-some text       |
|term02|130-CD-some-other-text|
+------+----------------------+

val res = df.withColumn("code", udfSplit($"code"))
res.show(false)
+------+------------------------+
|term  |code                    |
+------+------------------------+
|term01|[12,AB,some text]       |
|term02|[130,CD,some-other-text]|
+------+------------------------+

res.printSchema
root
 |-- term: string (nullable = true)
 |-- code: struct (nullable = true)
 |    |-- num: integer (nullable = false)
 |    |-- letters: string (nullable = true)
 |    |-- text: string (nullable = true)

res.select("term", "code.*").show(false)
+------+---+-------+---------------+
|term  |num|letters|text           |
+------+---+-------+---------------+
|term01|12 |AB     |some text      |
|term02|130|CD     |some-other-text|
+------+---+-------+---------------+
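One thing to watch with the UDF approach is that `res(0).toInt` and the positional indexing fail on null or malformed values. A minimal sketch of a more defensive parse in plain Scala follows; `SafeSplitDemo` and `parse` are hypothetical names that are not part of the answer, and treating bad rows as `None` is an assumption about the desired behaviour.

```scala
import scala.util.Try

// Plain-Scala sketch (no Spark needed) of a defensive variant of the split logic.
object SafeSplitDemo {
  final case class MyData(num: Int, letters: String, text: String)

  // Hypothetical helper: None for null input, for fewer than three parts,
  // or for a first part that is not an integer.
  def parse(input: String): Option[MyData] =
    Option(input).map(_.split("-", 3)) match {
      case Some(Array(num, letters, text)) =>
        Try(num.toInt).toOption.map(n => MyData(n, letters, text))
      case _ => None
    }

  def main(args: Array[String]): Unit = {
    println(parse("130-CD-some-other-text")) // Some(MyData(130,CD,some-other-text))
    println(parse("12-AB"))                  // None
  }
}
```

If such a function is wrapped with `udf`, an `Option` result should surface as a nullable struct, though that is worth verifying on the Spark version in use. On Spark 3.0 and later, the built-in `split(column, pattern, limit)` function offers the same limit semantics without a UDF.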
Thank you for the very clear solution. It also solves the problem.