Get WrappedArray row value and convert it into a String in Scala

Tags: scala, apache-spark, apache-spark-sql

I have a dataframe that looks like this:

+---------------------------------------------------------------------+
|value                                                                |
+---------------------------------------------------------------------+
|[WrappedArray(LineItem_organizationId, LineItem_lineItemId)]         |
|[WrappedArray(OrganizationId, LineItemId, SegmentSequence_segmentId)]|
+---------------------------------------------------------------------+
From these two rows I want to create strings in the following format:

"LineItem_organizationId", "LineItem_lineItemId"
"OrganizationId", "LineItemId", "SegmentSequence_segmentId"
I want this to be dynamic: if the first column gets a third value, its string should include one more comma-separated column value.

How can I do this in Scala?

This is what I have done to create the dataframe:

val xmlFiles = "C://Users//u6034690//Desktop//SPARK//trfsmallfffile//XML"
val discriptorFileLOcation = "C://Users//u6034690//Desktop//SPARK//trfsmallfffile//FinancialLineItem//REFXML"
import sqlContext.implicits._
import org.apache.spark.sql.functions._

// Read the descriptor XML with spark-xml, one row per FlatFileDescriptor element
val dfDiscriptor = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "FlatFileDescriptor").load(discriptorFileLOcation)
dfDiscriptor.printSchema()

// Pull out the FFField array and take the sixth comma-separated token of the first field
val firstColumn = dfDiscriptor.select($"FFFileType.FFRecord.FFField").as("FFField")
val FirstColumnOfHeaderFile = firstColumn.select(explode($"FFField")).as("ColumnsDetails").select(explode($"col")).first.get(0).toString().split(",")(5)
println(FirstColumnOfHeaderFile)

// The primary-key column names, one WrappedArray per record type
val primaryKeyColumnsFinancialLineItem = dfDiscriptor.select(explode($"FFFileType.FFRecord.FFPrimKey.FFPrimKeyCol"))
primaryKeyColumnsFinancialLineItem.show(false)
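
For readers without the XML files, an equivalent value column can be built in memory for experimenting (a minimal sketch assuming Spark 2.x; the SparkSession setup and sample rows are illustrative, not the real descriptor data):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("repro").getOrCreate()
import spark.implicits._

// A nested Seq[Seq[String]] encodes as array<array<string>>, matching the value column
val df = Seq(
  Seq(Seq("LineItem_organizationId", "LineItem_lineItemId")),
  Seq(Seq("OrganizationId", "LineItemId", "SegmentSequence_segmentId"))
).toDF("value")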
Adding the full schema:

root
 |-- FFColumnDelimiter: string (nullable = true)
 |-- FFContentItem: struct (nullable = true)
 |    |-- _VALUE: string (nullable = true)
 |    |-- _ffMajVers: long (nullable = true)
 |    |-- _ffMinVers: double (nullable = true)
 |-- FFFileEncoding: string (nullable = true)
 |-- FFFileType: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- FFPhysicalFile: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- FFFileName: string (nullable = true)
 |    |    |    |    |-- FFRowCount: long (nullable = true)
 |    |    |-- FFRecord: struct (nullable = true)
 |    |    |    |-- FFField: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- FFColumnNumber: long (nullable = true)
 |    |    |    |    |    |-- FFDataType: string (nullable = true)
 |    |    |    |    |    |-- FFFacets: struct (nullable = true)
 |    |    |    |    |    |    |-- FFMaxLength: long (nullable = true)
 |    |    |    |    |    |    |-- FFTotalDigits: long (nullable = true)
 |    |    |    |    |    |-- FFFieldIsOptional: boolean (nullable = true)
 |    |    |    |    |    |-- FFFieldName: string (nullable = true)
 |    |    |    |    |    |-- FFForKey: struct (nullable = true)
 |    |    |    |    |    |    |-- FFForKeyCol: string (nullable = true)
 |    |    |    |    |    |    |-- FFForKeyRecord: string (nullable = true)
 |    |    |    |-- FFPrimKey: struct (nullable = true)
 |    |    |    |    |-- FFPrimKeyCol: array (nullable = true)
 |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |-- FFRecordType: string (nullable = true)
 |-- FFHeaderRow: boolean (nullable = true)
 |-- FFId: string (nullable = true)
 |-- FFRowDelimiter: string (nullable = true)
 |-- FFTimeStamp: string (nullable = true)
 |-- _env: string (nullable = true)
 |-- _ffMajVers: long (nullable = true)
 |-- _ffMinVers: double (nullable = true)
 |-- _ffPubstyle: string (nullable = true)
 |-- _schemaLocation: string (nullable = true)
 |-- _sr: string (nullable = true)
 |-- _xmlns: string (nullable = true)
 |-- _xsi: string (nullable = true)

Looking at the given dataframe

+---------------------------------------------------------------------+
|value                                                                |
+---------------------------------------------------------------------+
|[WrappedArray(LineItem_organizationId, LineItem_lineItemId)]         |
|[WrappedArray(OrganizationId, LineItemId, SegmentSequence_segmentId)]|
+---------------------------------------------------------------------+
it must have the following schema:

 |-- value: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)
If the above assumption holds, then you should write a udf function:

import org.apache.spark.sql.functions._
def arrayToString = udf((arr: collection.mutable.WrappedArray[collection.mutable.WrappedArray[String]]) => arr.flatten.mkString(", "))
and use it on the dataframe as

df.withColumn("value", arrayToString($"value"))
which should give you

+-----------------------------------------------------+
|value                                                |
+-----------------------------------------------------+
|LineItem_organizationId, LineItem_lineItemId         |
|OrganizationId, LineItemId, SegmentSequence_segmentId|
+-----------------------------------------------------+

with the value column now a plain string:

 |-- value: string (nullable = true)
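
Note that the desired output in the question wraps each name in double quotes. The same flatten-and-join idea covers that with a small change, and the per-row strings can then be collected to the driver if plain Scala strings are needed (a sketch under the same schema assumption; arrayToQuotedString and rowStrings are illustrative names, not part of the original code):

import org.apache.spark.sql.functions._

// Quote each element before joining, yielding "LineItem_organizationId", "LineItem_lineItemId"
def arrayToQuotedString = udf((arr: collection.mutable.WrappedArray[collection.mutable.WrappedArray[String]]) =>
  arr.flatten.map(s => "\"" + s + "\"").mkString(", "))

// Collect the column if the strings are needed on the driver, e.g. to build a query dynamically
val rowStrings: Array[String] = df.withColumn("value", arrayToQuotedString($"value"))
  .collect()
  .map(_.getAs[String]("value"))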