Get WrappedArray row values and convert them to a String in Scala
I have a dataframe as below
+---------------------------------------------------------------------+
|value                                                                |
+---------------------------------------------------------------------+
|[WrappedArray(LineItem_organizationId, LineItem_lineItemId)]         |
|[WrappedArray(OrganizationId, LineItemId, SegmentSequence_segmentId)]|
+---------------------------------------------------------------------+
From the above two rows I want to create strings in this format
"LineItem_organizationId", "LineItem_lineItemId"
"OrganizationId", "LineItemId", "SegmentSequence_segmentId"
I want this to be dynamic: if the first row had a third value, my string would get one more comma-separated column value.
How can I do this in Scala?
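Outside Spark, the target format is just each element wrapped in double quotes and joined with commas. A minimal plain-Scala sketch of that (the helper name `quoteAndJoin` is illustrative, not from the original post):

```scala
// Each inner collection becomes one output line: every element is
// wrapped in double quotes and the pieces are joined with ", ".
def quoteAndJoin(columns: Seq[String]): String =
  columns.map(c => "\"" + c + "\"").mkString(", ")

val rows = Seq(
  Seq("LineItem_organizationId", "LineItem_lineItemId"),
  Seq("OrganizationId", "LineItemId", "SegmentSequence_segmentId")
)

// Works for any number of columns, so a third value in a row
// simply adds one more quoted, comma-separated entry.
rows.map(quoteAndJoin).foreach(println)
```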
This is what I did to create the dataframe:
val xmlFiles = "C://Users//u6034690//Desktop//SPARK//trfsmallfffile//XML"
val discriptorFileLOcation = "C://Users//u6034690//Desktop//SPARK//trfsmallfffile//FinancialLineItem//REFXML"
import sqlContext.implicits._
val dfDiscriptor = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "FlatFileDescriptor").load(discriptorFileLOcation)
dfDiscriptor.printSchema()
val firstColumn = dfDiscriptor.select($"FFFileType.FFRecord.FFField").as("FFField")
val FirstColumnOfHeaderFile = firstColumn.select(explode($"FFField")).as("ColumnsDetails").select(explode($"col")).first.get(0).toString().split(",")(5)
println(FirstColumnOfHeaderFile)
//dfDiscriptor.printSchema()
val primaryKeyColumnsFinancialLineItem = dfDiscriptor.select(explode($"FFFileType.FFRecord.FFPrimKey.FFPrimKeyCol"))
primaryKeyColumnsFinancialLineItem.show(false)
Adding the complete schema
root
 |-- FFColumnDelimiter: string (nullable = true)
 |-- FFContentItem: struct (nullable = true)
 |    |-- _VALUE: string (nullable = true)
 |    |-- _ffMajVers: long (nullable = true)
 |    |-- _ffMinVers: double (nullable = true)
 |-- FFFileEncoding: string (nullable = true)
 |-- FFFileType: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- FFPhysicalFile: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- FFFileName: string (nullable = true)
 |    |    |    |    |-- FFRowCount: long (nullable = true)
 |    |    |-- FFRecord: struct (nullable = true)
 |    |    |    |-- FFField: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- FFColumnNumber: long (nullable = true)
 |    |    |    |    |    |-- FFDataType: string (nullable = true)
 |    |    |    |    |    |-- FFFacets: struct (nullable = true)
 |    |    |    |    |    |    |-- FFMaxLength: long (nullable = true)
 |    |    |    |    |    |    |-- FFTotalDigits: long (nullable = true)
 |    |    |    |    |    |-- FFFieldIsOptional: boolean (nullable = true)
 |    |    |    |    |    |-- FFFieldName: string (nullable = true)
 |    |    |    |    |    |-- FFForKey: struct (nullable = true)
 |    |    |    |    |    |    |-- FFForKeyCol: string (nullable = true)
 |    |    |    |    |    |    |-- FFForKeyRecord: string (nullable = true)
 |    |    |    |-- FFPrimKey: struct (nullable = true)
 |    |    |    |    |-- FFPrimKeyCol: array (nullable = true)
 |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |-- FFRecordType: string (nullable = true)
 |-- FFHeaderRow: boolean (nullable = true)
 |-- FFId: string (nullable = true)
 |-- FFRowDelimiter: string (nullable = true)
 |-- FFTimeStamp: string (nullable = true)
 |-- _env: string (nullable = true)
 |-- _ffMajVers: long (nullable = true)
 |-- _ffMinVers: double (nullable = true)
 |-- _ffPubstyle: string (nullable = true)
 |-- _schemaLocation: string (nullable = true)
 |-- _sr: string (nullable = true)
 |-- _xmlns: string (nullable = true)
 |-- _xsi: string (nullable = true)
Looking at the given dataframe
+---------------------------------------------------------------------+
|value |
+---------------------------------------------------------------------+
|[WrappedArray(LineItem_organizationId, LineItem_lineItemId)] |
|[WrappedArray(OrganizationId, LineItemId, SegmentSequence_segmentId)]|
+---------------------------------------------------------------------+
it must have the following schema
|-- value: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
If the above assumption holds, then you should write a udf function as
import org.apache.spark.sql.functions._
def arrayToString = udf((arr: collection.mutable.WrappedArray[collection.mutable.WrappedArray[String]]) => arr.flatten.mkString(", "))
and use it in the dataframe as
df.withColumn("value", arrayToString($"value"))
which should give you
+-----------------------------------------------------+
|value |
+-----------------------------------------------------+
|LineItem_organizationId, LineItem_lineItemId |
|OrganizationId, LineItemId, SegmentSequence_segmentId|
+-----------------------------------------------------+
|-- value: string (nullable = true)
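The heart of that udf is ordinary collection code: flatten the nested sequence and join with ", ". Its logic can therefore be checked without a SparkSession, since `WrappedArray[WrappedArray[String]]` is just a `Seq[Seq[String]]`. A sketch under that assumption (the function name `flattenToString` is illustrative):

```scala
// Mirrors the udf body: flatten collapses the outer nesting and
// mkString joins the remaining column names with ", ".
def flattenToString(arr: Seq[Seq[String]]): String =
  arr.flatten.mkString(", ")

val value = Seq(Seq("LineItem_organizationId", "LineItem_lineItemId"))
println(flattenToString(value)) // LineItem_organizationId, LineItem_lineItemId
```

Registering this body with `udf(...)`, as in the answer, lifts it to a column expression that Spark applies row by row.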