Split into new columns based on the number of times a delimiter appears between column values - Scala

Tags: scala, apache-spark, spark-dataframe

I have a dataframe in which some columns contain multiple values, always separated by ^:

phone|contact|
ERN~58XXXXXX7~^EPN~5XXXXX551~|C~MXXX~MSO~^CAxxE~~~~~~3XXX5|

I want to split these into new columns based on how many '^'-separated values each column contains, like this:

phone1|phone2|contact1|contact2|
ERN~5XXXXXXX7|EPN~58XXXX91551~|C~MXXXH~MSO~|CAxxE~~~~~~3XXX5|
How can I do this in a loop, given that the number of delimiters between column values is not constant?
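For anyone who wants to reproduce this without the test.txt file used below, here is a minimal sketch that builds an equivalent dataframe in memory; it assumes a Spark 2.x spark-shell session where a SparkSession named spark is available (the column names and masked values are taken from the sample above):

    import spark.implicits._

    // Two columns, each cell holding several '^'-separated values, as in the sample above.
    val sample = Seq(
      ("ERN~58XXXXXX7~^EPN~5XXXXX551~", "C~MXXX~MSO~^CAxxE~~~~~~3XXX5")
    ).toDF("phone", "contact")

    sample.show(false)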

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Read the pipe-delimited file; individual cells may hold several values separated by '^'.
    val df = sqlContext.read.format("com.databricks.spark.csv")
      .option("header", "true")
      .option("delimiter", "|")
      .option("charset", "UTF-8")
      .load("test.txt")

    val columnList = df.columns

    // Pass 1: for every column, find the maximum number of '^'-separated parts across all rows.
    val zeroCounts = columnList.map(col => col -> 0).toMap
    val opMap = df.rdd.flatMap { row =>
      columnList.foldLeft(zeroCounts) { case (counts, col) =>
        val parts = row.getAs[String](col).split("\\^").length
        if (counts(col) < parts) counts.updated(col, parts) else counts
      }.toList
    }
    val colMaxSizeMap = opMap
      .groupBy(_._1)
      .map { case (_, counts) => counts.toList.maxBy(_._2) }
      .collect()
      .toMap

    // Pass 2: split every cell on '^' and right-pad with empty strings up to the
    // column's maximum width, so every row ends up with the same number of fields.
    val splitRows = df.rdd.map { row =>
      val values = columnList.flatMap { col =>
        val parts = row.getAs[String](col).split("\\^")
        parts ++ List.fill(colMaxSizeMap(col) - parts.size)("")
      }
      Row.fromSeq(values)
    }

    // Build the widened schema; numbering starts at 1 to match the desired
    // output column names (phone1, phone2, contact1, contact2, ...).
    val structFieldList = columnList.flatMap { colName =>
      List.range(0, colMaxSizeMap(colName)).map { i =>
        StructField(s"$colName${i + 1}", StringType)
      }
    }
    val schema = StructType(structFieldList)

    val da = spark.createDataFrame(splitRows, schema)
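The answer above makes two passes over the data: a first pass to find, per column, the maximum number of '^'-separated parts, and a second pass that splits every cell and pads it to that width. A quick way to sanity-check the result, assuming the snippet ran in a spark-shell session:

    // Inspect the generated column names and the padded values.
    da.printSchema()
    da.show(false)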
Comments:

- Have you at least tried anything? For example: data.withColumn("phone", split($"phone", "\\^")).select($"phone".getItem(0).as("phone1"), $"phone".getItem(1).as("phone2"))
- I had thought of doing it that way, but the problem is that one of the columns has 100+ delimiters between its values.
- Did you get an error?
- No, I did not get an error, but I want to do this in a loop so that I don't have to name every new column in the select statement.
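The loop the asker is after can also be generated on top of the split/getItem idea from the comment, without listing columns by hand. A minimal sketch, not from the original thread, assuming Spark 2.x, a DataFrame named df, and that an extra aggregation pass per column to find the maximum number of '^'-separated parts is acceptable:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, max, size, split}

    // Hypothetical helper: split every column of `df` on the delimiter into as many
    // numbered columns as the widest row needs (phone1, phone2, ..., contact1, ...).
    def splitByDelimiter(df: DataFrame, delimiter: String = "\\^"): DataFrame = {
      val exprs = df.columns.flatMap { c =>
        // Maximum number of parts seen in this column across all rows.
        val maxParts = df.select(max(size(split(col(c), delimiter)))).head().getInt(0)
        // One getItem per position; rows with fewer parts get null in the extra columns.
        (0 until maxParts).map(i => split(col(c), delimiter).getItem(i).as(s"$c${i + 1}"))
      }
      df.select(exprs: _*)
    }

    // Usage: val wide = splitByDelimiter(df)

Unlike the RDD-based answer, missing positions come back as null rather than empty strings, so a df.na.fill("") step may be needed if empty strings are required.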