Scala: How to create columns from a multi-level struct type in a Spark DataFrame?
scala, apache-spark, apache-spark-sql, spark-dataframe
This is the schema of my main DataFrame:
root
|-- DataPartition: string (nullable = true)
|-- TimeStamp: string (nullable = true)
|-- _lineItemId: long (nullable = true)
|-- _organizationId: long (nullable = true)
|-- fl:FinancialConceptGlobal: string (nullable = true)
|-- fl:FinancialConceptGlobalId: long (nullable = true)
|-- fl:FinancialConceptLocal: string (nullable = true)
|-- fl:FinancialConceptLocalId: long (nullable = true)
|-- fl:InstrumentId: long (nullable = true)
|-- fl:IsCredit: boolean (nullable = true)
|-- fl:IsDimensional: boolean (nullable = true)
|-- fl:IsRangeAllowed: boolean (nullable = true)
|-- fl:IsSegmentedByOrigin: boolean (nullable = true)
|-- fl:LineItemName: struct (nullable = true)
| |-- _VALUE: string (nullable = true)
| |-- _languageId: long (nullable = true)
|-- fl:LocalLanguageLabel: struct (nullable = true)
| |-- _VALUE: string (nullable = true)
| |-- _languageId: long (nullable = true)
|-- fl:SegmentChildDescription: struct (nullable = true)
| |-- _VALUE: string (nullable = true)
| |-- _languageId: long (nullable = true)
|-- fl:SegmentGroupDescription: string (nullable = true)
|-- fl:Segments: struct (nullable = true)
| |-- fl:SegmentSequence: struct (nullable = true)
| | |-- _VALUE: long (nullable = true)
| | |-- _segmentId: long (nullable = true)
|-- fl:StatementTypeCode: string (nullable = true)
|-- FFAction|!|: string (nullable = true)
From this, I need output like the following:
LineItem.organizationId|^|LineItem.lineItemId|^|StatementTypeCode|^|LineItemName|^|LocalLanguageLabel|^|FinancialConceptLocal|^|FinancialConceptGlobal|^|IsDimensional|^|InstrumentId|^|LineItemSequence|^|PhysicalMeasureId|^|FinancialConceptCodeGlobalSecondary|^|IsRangeAllowed|^|IsSegmentedByOrigin|^|SegmentGroupDescription|^|SegmentChildDescription|^|SegmentChildLocalLanguageLabel|^|LocalLanguageLabel.languageId|^|LineItemName.languageId|^|SegmentChildDescription.languageId|^|SegmentChildLocalLanguageLabel.languageId|^|SegmentGroupDescription.languageId|^|SegmentMultipleFundbDescription|^|SegmentMultipleFundbDescription.languageId|^|IsCredit|^|FinancialConceptLocalId|^|FinancialConceptGlobalId|^|FinancialConceptCodeGlobalSecondaryId|^|FFAction|!|
4295879842|^|1246|^|CUS|^|Net Sales-Customer Segment|^|相手先別の販売高(相手先別)|^|JCSNTS|^|REXM|^|False|^||^||^||^||^|False|^|False|^|CUS_JCSNTS|^||^||^|505126|^|505074|^|505074|^|505126|^|505126|^||^|505074|^|True|^|3020155|^|3015249|^||^|I|!|
To get the output above, I tried the following:
val dfContentEnvelope = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "env:ContentEnvelope").load("C://Users//u6034690//Desktop//SPARK//trfsmallfffile//XML")
val dfContentItem = dfContentEnvelope.withColumn("column1", explode(dfContentEnvelope("env:Body.env:ContentItem"))).select($"env:Header.fun:DataPartitionId".as("DataPartition"), $"env:Header.env:info.env:TimeStamp".as("TimeStamp"), $"column1.*")
val dfType = dfContentItem.select(getDataPartition($"DataPartition").as("DataPartition"), $"TimeStamp".as("TimeStamp"), $"env:Data.fl:LineItem.*", getFFActionParent($"_action").as("FFAction|!|")).filter($"env:Data.fl:LineItem._organizationId".isNotNull)
With this, I get the following output:
+------------------+-------------------------+-----------+---------------+-------------------------+---------------------------+------------------------+--------------------------+---------------+-----------+----------------+-----------------+----------------------+-----------------------------------------------------------------------------------------------------+---------------------+--------------------------+--------------------------+-----------+--------------------+-----------+
|DataPartition |TimeStamp |_lineItemId|_organizationId|fl:FinancialConceptGlobal|fl:FinancialConceptGlobalId|fl:FinancialConceptLocal|fl:FinancialConceptLocalId|fl:InstrumentId|fl:IsCredit|fl:IsDimensional|fl:IsRangeAllowed|fl:IsSegmentedByOrigin|fl:LineItemName |fl:LocalLanguageLabel|fl:SegmentChildDescription|fl:SegmentGroupDescription|fl:Segments|fl:StatementTypeCode|FFAction|!||
+------------------+-------------------------+-----------+---------------+-------------------------+---------------------------+------------------------+--------------------------+---------------+-----------+----------------+-----------------+----------------------+-----------------------------------------------------------------------------------------------------+---------------------+--------------------------+--------------------------+-----------+--------------------+-----------+
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|3 |4298009288 |XTOT |3016350 |null |null |null |true |false |false |false |[Total Assets,505074] |null |null |null |null |BAL |I|!| |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|9 |4298009288 |XTCOI |3016329 |null |null |21521455386 |true |false |false |false |[S/O-Ordinary Shares,505074] |null |null |null |null |BAL |I|!| |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|10 |4298009288 |XTCOC |3016328 |null |null |null |true |false |false |false |[Total Equivalent No of Common Shares O/S,505074] |null |null |null |null |BAL |I|!| |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|11 |4298009288 |XTCTI |3016331 |null |null |21521455386 |true |false |false |false |[T/S-Ordinary Shares,505074] |null |null |null |null |BAL |I|!| |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|19 |4298009288 |ESGA |3018991 |null |null |null |false |false |false |false |[General and administrative expense,505074] |null |null |null |null |INC |I|!| |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|20 |4298009288 |XTOE |3016349 |null |null |null |false |false |false |false |[Total Operating Expense,505074] |null |null |null |null |INC |I|!| |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|21 |4298009288 |XIBT |3016299 |null |null |null |true |false |false |false |[Net Income Before Taxes,505074] |null |null |null |null |INC |I|!| |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|22 |4298009288 |TTAX |3019472 |null |null |null |false |false |false |false |[Income tax benefit,505074] |null |null |null |null |INC |I|!| |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|23 |4298009288 |XIAT |3016297 |null |null |null |true |false |false |false |[Net Income After Taxes,505074] |null |null |null |null |INC |I|!| |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|24 |4298009288 |XBXP |3016252 |null |null |null |true |false |false |false |[Net Income Before Extra. Items,505074] |null |null |null |null |INC |I|!| |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|25 |4298009288 |XNIC |3019922 |null |null |null |true |false |false |false |[Net loss,505074] |null |null |null |null |INC |I|!| |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|26 |4298009288 |XNCN |3016316 |null |null |null |true |false |false |false |[Income Available to Com Excl ExtraOrd,505074] |null |null |null |null |INC |I|!| |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|27 |4298009288 |XNCX |3016318 |null |null |null |true |false |false |false |[Net loss,505074] |null |null |null |null |INC |I|!| |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|29 |4298009288 |CDNI |3018735 |null |null |null |true |false |false |false |[Diluted Net Income,505074] |null |null |null |null |INC |I|!| |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|30 |4298009288 |XTAX |3019589 |null |null |null |false |false |false |false |[Income Taxes - Total,505074] |null |null |null |null |INC |I|!| |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|33 |4298009288 |RNTS |3015275 |null |null |null |true |false |false |false |[Revenues,505074] |null |null |null |null |INC |I|!| |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|34 |4298009288 |XTLR |3016345 |null |null |null |true |false |false |false |[Total revenues,505074] |null |null |null |null |INC |I|!| |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|35 |4298009288 |XTCII |3016326 |null |null |21521455386 |true |false |false |null |[Common Shares Issued - (Instrument Level),505074] |null |null |null |null |BAL |I|!| |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|36 |4298009288 |XTCTIPF |1002023922 |null |null |21521455386 |true |false |false |null |[Common Treasury Shares on Instrument Level Multiplied to its Conversion to Primary Factor,505074] |null |null |null |null |BAL |I|!| |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|37 |4298009288 |XTCOIPF |1002023921 |null |null |21521455386 |true |false |false |null |[Common Shares Outstanding on Instrument Level Multiplied to its Conversion to Primary Factor,505074]|null |null |null |null |BAL |I|!| |
+------------------+-------------------------+-----------+---------------+-------------------------+---------------------------+------------------------+--------------------------+---------------+-----------+----------------+-----------------+----------------------+-----------------------------------------------------------------------------------------------------+---------------------+--------------------------+--------------------------+-----------+--------------------+-----------+
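My problem is with the column fl:LineItemName.
It is a struct type, and I need to create two separate columns from it:
one for _VALUE, named LineItemName, and another for _languageId, named languageId.
I have to do the same for fl:LocalLanguageLabel and fl:SegmentChildDescription.
Do I have to use withColumn to do this?
Or is there another way to do it without withColumn?
This works for me, except for the last line: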
val dfType = dfContentItem.select(getDataPartition($"DataPartition").as("DataPartition"), $"TimeStamp".as("TimeStamp"), $"env:Data.fl:LineItem.*", getFFActionParent($"_action").as("FFAction|!|")).filter($"env:Data.fl:LineItem._organizationId".isNotNull)
val dfnewTemp = dfType
.withColumn("LineItemName", $"fl:LineItemName._VALUE")
.withColumn("LineItemName.languageId", $"fl:LineItemName._languageId")
.withColumn("LocalLanguageLabel", $"fl:LocalLanguageLabel._VALUE")
.withColumn("LocalLanguageLabel.languageId", $"fl:LocalLanguageLabel._languageId")
.withColumn("SegmentChildDescription", $"fl:SegmentChildDescription._VALUE")
.withColumn("SegmentChildDescription.languageId", $"fl:SegmentChildDescription._languageId")
.drop($"fl:LineItemName")
.drop($"fl:LocalLanguageLabel")
.drop($"fl:SegmentChildDescription")
dfnewTemp.show(false)
val temp = dfnewTemp.select(dfnewTemp.columns.filter(x => !x.equals("fl:Segments")).map(x => col(x).as(x.replace("_", "LineItem_").replace("fl:", ""))): _*)
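What you need to do is use withColumn and simply select the fields that exist inside the struct. The fl:LineItemName column contains a struct with two fields, _VALUE and _languageId, which can be selected as follows: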
val df = dfType.withColumn("LineItemName", $"fl:LineItemName._VALUE")
.withColumn("LanguageId", $"fl:LineItemName._languageId")
.drop("fl:LineItemName")
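As an alternative to chaining withColumn and drop, the same extraction can probably be written as one select that keeps the non-struct columns and aliases each struct field. The following is only a sketch: the object name FlattenStructs, the method flatten, and the list structCols are hypothetical names introduced for illustration, and the output column names are copied from the question. It assumes each listed column is a struct with _VALUE and _languageId fields, as in the schema shown in the question.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

// Hypothetical helper, not part of Spark or of the original answer.
object FlattenStructs {
  // Struct columns taken from the schema in the question.
  val structCols: Seq[String] =
    Seq("fl:LineItemName", "fl:LocalLanguageLabel", "fl:SegmentChildDescription")

  def flatten(df: DataFrame): DataFrame = {
    // Keep every column that is not one of the structs being flattened.
    // Backticks make Spark read each name as a single column name.
    val kept: Seq[Column] = df.columns.toSeq
      .filterNot(c => structCols.contains(c))
      .map(c => col(s"`$c`"))

    // For each struct, pull out _VALUE and _languageId under new names.
    val extracted: Seq[Column] = structCols.flatMap { c =>
      val base = c.stripPrefix("fl:")
      Seq(
        col(s"`$c`._VALUE").as(base),
        // The dotted name follows the question; referring to it later
        // requires backticks around the name.
        col(s"`$c`._languageId").as(s"$base.languageId")
      )
    }

    df.select(kept ++ extracted: _*)
  }
}
```

With the DataFrame from the question, this would be called as FlattenStructs.flatten(dfType). Because the struct columns are simply not selected, no separate drop is needed.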
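For the other two columns mentioned, just do the same thing.
So if I understand correctly, you want to split the fl:LineItemName column into two columns, LineItemName and LanguageId?
@Shaido Yes, exactly, thank you. We also need to drop $fl:LineItemName. I was wondering whether we could do this in the explode itself...
@SUDARSHAN: That's right. Using explode would also be possible, but it is not convenient here because the column names also differ from the field names in the struct.
@SUDARSHAN: Yes, as I mentioned, you need to do the same for the other two columns, but make sure to use different column names.
Just one more clarification: I have updated my question with the latest changes. I do this in the last line, but I get an error: Exception in thread "main" org.apache.spark.sql.AnalysisException: Can't extract value from LineItemName#368: need struct type but got string;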
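The AnalysisException in the last comment is consistent with how Spark parses column name strings: a dot in the string passed to col is read as struct field access, so col("LineItemName.languageId") is resolved as the field languageId of the string column LineItemName. One likely fix is to wrap each name in backticks so that Spark treats it as a single column name. The sketch below assumes that diagnosis. The object name RenameColumns and the functions outputName and renameAll are hypothetical, and the renaming rule is copied unchanged from the last line of the question.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper, not part of Spark or of the original answer.
object RenameColumns {
  // Same renaming rule as the last line of the question.
  def outputName(name: String): String =
    name.replace("_", "LineItem_").replace("fl:", "")

  def renameAll(df: DataFrame, excluded: Set[String] = Set("fl:Segments")): DataFrame = {
    val selected = df.columns
      .filterNot(c => excluded.contains(c))
      // Backticks stop Spark from parsing "LineItemName.languageId"
      // as field languageId of column LineItemName.
      .map(name => col(s"`$name`").as(outputName(name)))
    df.select(selected: _*)
  }
}
```

With the DataFrame from the question, this would replace the last line: val temp = RenameColumns.renameAll(dfnewTemp). The backtick quoting assumes that no column name itself contains a backtick.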