Scala: split a column into new columns based on the number of delimiters appearing in its values
I have a dataframe in which some columns contain multiple values, always separated by ^.
Input:

phone|contact|
ERN~58XXXXXX7~^EPN~5XXXXX551~|C~MXXX~MSO~^CAxxE~~~~~~3XXX5|

Desired output:

phone1|phone2|contact1|contact2|
ERN~5XXXXXXX7|EPN~58XXXX91551~|C~MXXXH~MSO~|CAxxE~~~~~~3XXX5|
How can I achieve this with a loop, given that the number of delimiters between column values is not constant?
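The crux of the answer below is to first discover, for every column, the maximum number of ^-separated parts across all rows, because that determines how many new columns to create. A minimal pure-Scala sketch of that counting step, using plain collections instead of an RDD (the column names and row values here are made up for illustration):

```scala
// Hypothetical sample rows, keyed by column name.
val columnList = List("phone", "contact")
val rows = List(
  Map("phone" -> "ERN~1~^EPN~2~", "contact" -> "C~A~^B~~^D"),
  Map("phone" -> "ERN~3~",        "contact" -> "C~X~")
)

// Start every column at a count of 0, then keep the largest
// split count seen for each column over all rows.
val zero = columnList.map(_ -> 0).toMap
val colMaxSizeMap = rows.foldLeft(zero) { (acc, row) =>
  columnList.foldLeft(acc) { (m, col) =>
    val n = row(col).split("\\^").length
    if (n > m(col)) m.updated(col, n) else m
  }
}
// colMaxSizeMap: Map(phone -> 2, contact -> 3)
```

The Spark version of this in the answer does the same fold per row inside an RDD and then reduces per column with maxBy.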
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("delimiter", "|").option("charset", "UTF-8").load("test.txt")
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val columnList = df.columns

// Start every column at a split count of 0.
val xx = columnList.map(x => x -> 0).toMap

// For each row, record the number of '^'-separated parts per column.
val opMap = df.rdd.flatMap { row =>
  columnList.foldLeft(xx) { case (y, col) =>
    val s = row.getAs[String](col).split("\\^").length
    if (y(col) < s) y.updated(col, s) else y
  }.toList
}

// Reduce to the maximum split count seen for each column.
val colMaxSizeMap = opMap.groupBy(_._1).map(x => x._2.toList.maxBy(_._2)).collect().toMap

// Split each value and pad with empty strings up to the column maximum.
val x = df.rdd.map { row =>
  val op = columnList.flatMap { col =>
    val parts = row.getAs[String](col).split("\\^")
    parts ++ List.fill(colMaxSizeMap(col) - parts.size)("")
  }
  Row.fromSeq(op)
}

// One StructField per generated column: phone0, phone1, ...
val structFieldList = columnList.flatMap { colName =>
  List.range(0, colMaxSizeMap(colName)).map { i =>
    StructField(s"$colName$i", StringType)
  }
}
val schema = StructType(structFieldList)
val da = spark.createDataFrame(x, schema)
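The padding step above is what keeps every row the same width once the maximum split counts are known. A pure-Scala sketch of that step for a single row, with made-up data (plain collections standing in for the RDD):

```scala
// Assumed result of the counting step: max number of parts per column.
val colMaxSizeMap = Map("phone" -> 2, "contact" -> 3)
val columnList = List("phone", "contact")

// One hypothetical input row, keyed by column name.
val row = Map("phone" -> "ERN~1~^EPN~2~", "contact" -> "C~A~")

// Split each value on '^' and pad with "" up to the column maximum,
// so every row yields the same number of output cells.
val padded = columnList.flatMap { col =>
  val parts = row(col).split("\\^").toList
  parts ++ List.fill(colMaxSizeMap(col) - parts.size)("")
}
// padded: List("ERN~1~", "EPN~2~", "C~A~", "", "")
```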
Have you at least tried anything? data.withColumn("phone", split($"phone", "\\^")).select($"phone".getItem(0).as("phone1"), $"phone".getItem(1).as("phone2"))
I thought of doing it that way, but the problem is that one of the columns has 100+ delimiters between its values.
Did you get an error?
No, I did not get an error, but I want to do this in a loop so that I do not have to list the new columns in the select statement.
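The hard-coded getItem calls in the comment can be generated in a loop once the per-column maximum split count is known, which avoids listing the new columns by hand. A hedged sketch that only builds the selectExpr strings (the colMaxSizeMap here is an assumed result of the counting step in the answer, and the column names are made up):

```scala
// Assumed: max number of '^'-separated parts per column.
val colMaxSizeMap = Map("phone" -> 2, "contact" -> 2)
val columnList = List("phone", "contact")

// Generate one "split(col, '\\^')[i] as colN" expression per output column.
val selectExprs = columnList.flatMap { col =>
  (0 until colMaxSizeMap(col)).map { i =>
    s"split($col, '\\\\^')[$i] as $col${i + 1}"
  }
}
// Then, in Spark: df.selectExpr(selectExprs: _*)
```

This still produces empty-string gaps differently from the answer's padding approach (missing items become null rather than ""), so the two are not byte-for-byte equivalent.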