Scala: How to create columns from a multi-level struct type in a Spark DataFrame?


This is the schema of my main DataFrame:

root
 |-- DataPartition: string (nullable = true)
 |-- TimeStamp: string (nullable = true)
 |-- _lineItemId: long (nullable = true)
 |-- _organizationId: long (nullable = true)
 |-- fl:FinancialConceptGlobal: string (nullable = true)
 |-- fl:FinancialConceptGlobalId: long (nullable = true)
 |-- fl:FinancialConceptLocal: string (nullable = true)
 |-- fl:FinancialConceptLocalId: long (nullable = true)
 |-- fl:InstrumentId: long (nullable = true)
 |-- fl:IsCredit: boolean (nullable = true)
 |-- fl:IsDimensional: boolean (nullable = true)
 |-- fl:IsRangeAllowed: boolean (nullable = true)
 |-- fl:IsSegmentedByOrigin: boolean (nullable = true)
 |-- fl:LineItemName: struct (nullable = true)
 |    |-- _VALUE: string (nullable = true)
 |    |-- _languageId: long (nullable = true)
 |-- fl:LocalLanguageLabel: struct (nullable = true)
 |    |-- _VALUE: string (nullable = true)
 |    |-- _languageId: long (nullable = true)
 |-- fl:SegmentChildDescription: struct (nullable = true)
 |    |-- _VALUE: string (nullable = true)
 |    |-- _languageId: long (nullable = true)
 |-- fl:SegmentGroupDescription: string (nullable = true)
 |-- fl:Segments: struct (nullable = true)
 |    |-- fl:SegmentSequence: struct (nullable = true)
 |    |    |-- _VALUE: long (nullable = true)
 |    |    |-- _segmentId: long (nullable = true)
 |-- fl:StatementTypeCode: string (nullable = true)
 |-- FFAction|!|: string (nullable = true)

From this, I need output in the following form:

LineItem.organizationId|^|LineItem.lineItemId|^|StatementTypeCode|^|LineItemName|^|LocalLanguageLabel|^|FinancialConceptLocal|^|FinancialConceptGlobal|^|IsDimensional|^|InstrumentId|^|LineItemSequence|^|PhysicalMeasureId|^|FinancialConceptCodeGlobalSecondary|^|IsRangeAllowed|^|IsSegmentedByOrigin|^|SegmentGroupDescription|^|SegmentChildDescription|^|SegmentChildLocalLanguageLabel|^|LocalLanguageLabel.languageId|^|LineItemName.languageId|^|SegmentChildDescription.languageId|^|SegmentChildLocalLanguageLabel.languageId|^|SegmentGroupDescription.languageId|^|SegmentMultipleFundbDescription|^|SegmentMultipleFundbDescription.languageId|^|IsCredit|^|FinancialConceptLocalId|^|FinancialConceptGlobalId|^|FinancialConceptCodeGlobalSecondaryId|^|FFAction|!|
4295879842|^|1246|^|CUS|^|Net Sales-Customer Segment|^|相手先別の販売高（相手先別）|^|JCSNTS|^|REXM|^|False|^||^||^||^||^|False|^|False|^|CUS_JCSNTS|^||^||^|505126|^|505074|^|505074|^|505126|^|505126|^||^|505074|^|True|^|3020155|^|3015249|^||^|I|!|

To get the output above, I tried the following:

val dfContentEnvelope = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "env:ContentEnvelope").load("C://Users//u6034690//Desktop//SPARK//trfsmallfffile//XML")
val dfContentItem = dfContentEnvelope.withColumn("column1", explode(dfContentEnvelope("env:Body.env:ContentItem"))).select($"env:Header.fun:DataPartitionId".as("DataPartition"), $"env:Header.env:info.env:TimeStamp".as("TimeStamp"), $"column1.*")
val dfType = dfContentItem.select(getDataPartition($"DataPartition").as("DataPartition"), $"TimeStamp".as("TimeStamp"), $"env:Data.fl:LineItem.*", getFFActionParent($"_action").as("FFAction|!|")).filter($"env:Data.fl:LineItem._organizationId".isNotNull)

With this, I get the output below:

 +------------------+-------------------------+-----------+---------------+-------------------------+---------------------------+------------------------+--------------------------+---------------+-----------+----------------+-----------------+----------------------+-----------------------------------------------------------------------------------------------------+---------------------+--------------------------+--------------------------+-----------+--------------------+-----------+
|DataPartition     |TimeStamp                |_lineItemId|_organizationId|fl:FinancialConceptGlobal|fl:FinancialConceptGlobalId|fl:FinancialConceptLocal|fl:FinancialConceptLocalId|fl:InstrumentId|fl:IsCredit|fl:IsDimensional|fl:IsRangeAllowed|fl:IsSegmentedByOrigin|fl:LineItemName                                                                                      |fl:LocalLanguageLabel|fl:SegmentChildDescription|fl:SegmentGroupDescription|fl:Segments|fl:StatementTypeCode|FFAction|!||
+------------------+-------------------------+-----------+---------------+-------------------------+---------------------------+------------------------+--------------------------+---------------+-----------+----------------+-----------------+----------------------+-----------------------------------------------------------------------------------------------------+---------------------+--------------------------+--------------------------+-----------+--------------------+-----------+
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|3          |4298009288     |XTOT                     |3016350                    |null                    |null                      |null           |true       |false           |false            |false                 |[Total Assets,505074]                                                                                |null                 |null                      |null                      |null       |BAL                 |I|!|       |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|9          |4298009288     |XTCOI                    |3016329                    |null                    |null                      |21521455386    |true       |false           |false            |false                 |[S/O-Ordinary Shares,505074]                                                                         |null                 |null                      |null                      |null       |BAL                 |I|!|       |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|10         |4298009288     |XTCOC                    |3016328                    |null                    |null                      |null           |true       |false           |false            |false                 |[Total Equivalent No of Common Shares O/S,505074]                                                    |null                 |null                      |null                      |null       |BAL                 |I|!|       |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|11         |4298009288     |XTCTI                    |3016331                    |null                    |null                      |21521455386    |true       |false           |false            |false                 |[T/S-Ordinary Shares,505074]                                                                         |null                 |null                      |null                      |null       |BAL                 |I|!|       |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|19         |4298009288     |ESGA                     |3018991                    |null                    |null                      |null           |false      |false           |false            |false                 |[General and administrative expense,505074]                                                          |null                 |null                      |null                      |null       |INC                 |I|!|       |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|20         |4298009288     |XTOE                     |3016349                    |null                    |null                      |null           |false      |false           |false            |false                 |[Total Operating Expense,505074]                                                                     |null                 |null                      |null                      |null       |INC                 |I|!|       |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|21         |4298009288     |XIBT                     |3016299                    |null                    |null                      |null           |true       |false           |false            |false                 |[Net Income Before Taxes,505074]                                                                     |null                 |null                      |null                      |null       |INC                 |I|!|       |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|22         |4298009288     |TTAX                     |3019472                    |null                    |null                      |null           |false      |false           |false            |false                 |[Income tax benefit,505074]                                                                          |null                 |null                      |null                      |null       |INC                 |I|!|       |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|23         |4298009288     |XIAT                     |3016297                    |null                    |null                      |null           |true       |false           |false            |false                 |[Net Income After Taxes,505074]                                                                      |null                 |null                      |null                      |null       |INC                 |I|!|       |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|24         |4298009288     |XBXP                     |3016252                    |null                    |null                      |null           |true       |false           |false            |false                 |[Net Income Before Extra. Items,505074]                                                              |null                 |null                      |null                      |null       |INC                 |I|!|       |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|25         |4298009288     |XNIC                     |3019922                    |null                    |null                      |null           |true       |false           |false            |false                 |[Net loss,505074]                                                                                    |null                 |null                      |null                      |null       |INC                 |I|!|       |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|26         |4298009288     |XNCN                     |3016316                    |null                    |null                      |null           |true       |false           |false            |false                 |[Income Available to Com Excl ExtraOrd,505074]                                                       |null                 |null                      |null                      |null       |INC                 |I|!|       |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|27         |4298009288     |XNCX                     |3016318                    |null                    |null                      |null           |true       |false           |false            |false                 |[Net loss,505074]                                                                                    |null                 |null                      |null                      |null       |INC                 |I|!|       |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|29         |4298009288     |CDNI                     |3018735                    |null                    |null                      |null           |true       |false           |false            |false                 |[Diluted Net Income,505074]                                                                          |null                 |null                      |null                      |null       |INC                 |I|!|       |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|30         |4298009288     |XTAX                     |3019589                    |null                    |null                      |null           |false      |false           |false            |false                 |[Income Taxes - Total,505074]                                                                        |null                 |null                      |null                      |null       |INC                 |I|!|       |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|33         |4298009288     |RNTS                     |3015275                    |null                    |null                      |null           |true       |false           |false            |false                 |[Revenues,505074]                                                                                    |null                 |null                      |null                      |null       |INC                 |I|!|       |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|34         |4298009288     |XTLR                     |3016345                    |null                    |null                      |null           |true       |false           |false            |false                 |[Total revenues,505074]                                                                              |null                 |null                      |null                      |null       |INC                 |I|!|       |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|35         |4298009288     |XTCII                    |3016326                    |null                    |null                      |21521455386    |true       |false           |false            |null                  |[Common Shares Issued - (Instrument Level),505074]                                                   |null                 |null                      |null                      |null       |BAL                 |I|!|       |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|36         |4298009288     |XTCTIPF                  |1002023922                 |null                    |null                      |21521455386    |true       |false           |false            |null                  |[Common Treasury Shares on Instrument Level Multiplied to its Conversion to Primary Factor,505074]   |null                 |null                      |null                      |null       |BAL                 |I|!|       |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|37         |4298009288     |XTCOIPF                  |1002023921                 |null                    |null                      |21521455386    |true       |false           |false            |null                  |[Common Shares Outstanding on Instrument Level Multiplied to its Conversion to Primary Factor,505074]|null                 |null                      |null                      |null       |BAL                 |I|!|       |
+------------------+-------------------------+-----------+---------------+-------------------------+---------------------------+------------------------+--------------------------+---------------+-----------+----------------+-----------------+----------------------+-----------------------------------------------------------------------------------------------------+---------------------+--------------------------+--------------------------+-----------+--------------------+-----------+

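My problem is with the column fl:LineItemName. It is a struct type, and I need to create two separate columns from it: one named LineItemName holding _VALUE, and another named languageId holding _languageId.

I have to do the same for fl:LocalLanguageLabel and fl:SegmentChildDescription.

Do I have to use withColumn for this, or is there another way to do it?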

This works for me, except for the last line:

val dfType = dfContentItem.select(getDataPartition($"DataPartition").as("DataPartition"), $"TimeStamp".as("TimeStamp"), $"env:Data.fl:LineItem.*", getFFActionParent($"_action").as("FFAction|!|")).filter($"env:Data.fl:LineItem._organizationId".isNotNull)

val dfnewTemp = dfType
  .withColumn("LineItemName", $"fl:LineItemName._VALUE")
  .withColumn("LineItemName.languageId", $"fl:LineItemName._languageId")
  .withColumn("LocalLanguageLabel", $"fl:LocalLanguageLabel._VALUE")
  .withColumn("LocalLanguageLabel.languageId", $"fl:LocalLanguageLabel._languageId")
  .withColumn("SegmentChildDescription", $"fl:SegmentChildDescription._VALUE")
  .withColumn("SegmentChildDescription.languageId", $"fl:SegmentChildDescription._languageId")
  .drop($"fl:LineItemName")
  .drop($"fl:LocalLanguageLabel")
  .drop($"fl:SegmentChildDescription")
dfnewTemp.show(false)
val temp = dfnewTemp.select(dfnewTemp.columns.filter(x => !x.equals("fl:Segments")).map(x => col(x).as(x.replace("_", "LineItem_").replace("fl:", ""))): _*)

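What you need to do is use withColumn and simply select the fields inside the struct. The fl:LineItemName column contains a struct with two fields, _VALUE and _languageId, which can be selected as follows: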

val df = dfType.withColumn("LineItemName", $"fl:LineItemName._VALUE")
  .withColumn("LanguageId", $"fl:LineItemName._languageId")
  .drop("fl:LineItemName")

For the other two columns mentioned earlier, just do the same thing.
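Repeating the same withColumn/drop pattern for all three struct columns can be written as a fold. The following is only a sketch under assumptions: the helper name flattenLabels, the local SparkSession, the made-up sample row, and the `_languageId` suffix on the output names are illustrative and not from the original post. The suffix uses an underscore rather than a dot so that the new names need no escaping later.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, struct}

val spark = SparkSession.builder().master("local[1]").appName("flatten-structs").getOrCreate()
import spark.implicits._

// Builds a {_VALUE, _languageId} struct like the ones in the schema above.
def label(value: String, languageId: Long): Column =
  struct(lit(value).as("_VALUE"), lit(languageId).as("_languageId"))

// Made-up one-row stand-in for dfType.
val sample = Seq(3L).toDF("_lineItemId")
  .withColumn("fl:LineItemName", label("Total Assets", 505074L))
  .withColumn("fl:LocalLanguageLabel", label("Total Assets (local)", 505126L))
  .withColumn("fl:SegmentChildDescription", label("Child", 505074L))

// Hypothetical helper: for each struct column, pull out both fields
// as top-level columns and then drop the struct itself.
def flattenLabels(df: DataFrame, structCols: Seq[String]): DataFrame =
  structCols.foldLeft(df) { (acc, c) =>
    val base = c.stripPrefix("fl:")
    acc.withColumn(base, col(s"$c._VALUE"))
      .withColumn(s"${base}_languageId", col(s"$c._languageId"))
      .drop(c)
  }

val flat = flattenLabels(
  sample,
  Seq("fl:LineItemName", "fl:LocalLanguageLabel", "fl:SegmentChildDescription")
)
flat.printSchema()
```

Because each struct gets its own pair of output names derived from the struct's name, the three value columns and three language-id columns do not overwrite each other.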

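So if I understand correctly, you want to split the fl:LineItemName column into two columns, LineItemName and LanguageId?

@Shaido Yes, exactly, thank you. We also need to drop $fl:LineItemName. I was wondering whether we could do this in the explode itself...

@SUDARSHAN: That's right. It is also possible to use explode; however, it is not convenient here, since the column names also differ from the field names in the struct.

@SUDARSHAN: Yes, as I mentioned, you need to do the same for the other two columns, but make sure to use different column names.

Just one more clarification. I have updated my question with the latest changes. I am doing this in the last line, but I get an error: Exception in thread "main" org.apache.spark.sql.AnalysisException: Can't extract value from LineItemName#368: need struct type but got string;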
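The AnalysisException quoted in the last comment is consistent with how Spark resolves column names: col("LineItemName.languageId") is parsed as "field languageId of column LineItemName", and by that point LineItemName is a plain string rather than a struct. Assuming that is the cause here, wrapping each name in backticks makes Spark treat it as a single column name. The sketch below uses a made-up three-column DataFrame in place of dfnewTemp; the SparkSession setup and sample values are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[1]").appName("dotted-names").getOrCreate()
import spark.implicits._

// Made-up stand-in for dfnewTemp: one column name contains a dot,
// as produced by withColumn("LineItemName.languageId", ...).
val dfnewTemp = Seq(("Total Assets", 505074L, 3L))
  .toDF("LineItemName", "LineItemName.languageId", "_lineItemId")

// Backticks stop Spark from splitting the name on the dot.
// The renaming logic mirrors the last line of the question.
val temp = dfnewTemp.select(
  dfnewTemp.columns
    .filter(x => !x.equals("fl:Segments"))
    .map(x => col(s"`$x`").as(x.replace("_", "LineItem_").replace("fl:", ""))): _*
)
temp.show(false)
```

An alternative is to avoid dots in the intermediate column names altogether and only introduce them when writing the final header line.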