NotNull condition does not work in a withColumn condition on a Spark DataFrame in Scala

Tags: scala, spark-dataframe, apache-spark-xml


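I am trying to add a column when it is found, but I do not want to add it when the column is not present in the XML schema. This is what I am doing; I think I am getting the condition check wrong.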

  val temp = tempNew1
  .withColumn("BookMark", when($"AsReportedItem.fs:BookMark".isNotNull or $"AsReportedItem.fs:BookMark" =!= "", 0))
  .withColumn("DocByteOffset", when($"AsReportedItem.fs:DocByteOffset".isNotNull or $"AsReportedItem.fs:DocByteOffset" =!= "", 0))
  .withColumn("DocByteLength", when($"AsReportedItem.fs:DocByteLength".isNotNull or $"AsReportedItem.fs:DocByteLength" =!= "", 0))
  .withColumn("EditedDescription", when($"AsReportedItem.fs:EditedDescription".isNotNull or $"AsReportedItem.fs:EditedDescription" =!= "", 0))
  .withColumn("EditedDescription", when($"AsReportedItem.fs:EditedDescription._VALUE".isNotNull or $"AsReportedItem.fs:EditedDescription._VALUE" =!= "", 0))
  .withColumn("EditedDescription_languageId", when($"AsReportedItem.fs:EditedDescription._languageId".isNotNull or $"AsReportedItem.fs:EditedDescription._languageId" =!= "", 0))
  .withColumn("ReportedDescription", when($"AsReportedItem.fs:ReportedDescription._VALUE".isNotNull or $"AsReportedItem.fs:ReportedDescription._VALUE" =!= "", 0))
  .withColumn("ReportedDescription_languageId", when($"AsReportedItem.fs:ReportedDescription._languageId".isNotNull or $"AsReportedItem.fs:ReportedDescription._languageId" =!= "", 0))
  .withColumn("FinancialAsReportedLineItemName_languageId", when($"FinancialAsReportedLineItemName._languageId".isNotNull or $"FinancialAsReportedLineItemName._languageId" =!= "", 0))
  .withColumn("FinancialAsReportedLineItemName", when($"FinancialAsReportedLineItemName._VALUE".isNotNull or $"FinancialAsReportedLineItemName._VALUE" =!= "", 0))
  .withColumn("PeriodPermId_objectTypeId", when($"PeriodPermId._objectTypeId".isNotNull or $"PeriodPermId._objectTypeId" =!= "", 0))
  .withColumn("PeriodPermId", when($"PeriodPermId._VALUE".isNotNull or $"PeriodPermId._VALUE" =!= "", 0))
  .drop($"AsReportedItem").drop($"AsReportedItem")

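This works fine when the column is found, but when the column is not in tempNew1 I get an error.

Basically, if the tag is not found in the schema, I do not want to process the column at all.

I do not know what I am missing here. Please help me identify the problem.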

The error I get is as follows:

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'AsReportedItem.fs:BookMark' given input columns: [IsAsReportedCurrencySetManually

This is what I have also tried:

    def hasColumn(df: DataFrame, path: String) = Try(df(path)).isSuccess
 val temp = tempNew1.withColumn("BookMark", when(hasColumn(tempNew1,"AsReportedItem.fs:BookMark") == true, $"AsReportedItem.fs:BookMark"))

But I could not get it to work fully.

This works, but how can I write it for all the columns?

val temp = if (hasColumn(tempNew1, "AsReportedItem")) {
      tempNew1
        .withColumn("BookMark", $"AsReportedItem.fs:BookMark")
        .withColumn("DocByteOffset", $"AsReportedItem.fs:DocByteOffset")
        .withColumn("DocByteLength", $"AsReportedItem.fs:DocByteLength")
        .withColumn("EditedDescription", $"AsReportedItem.fs:EditedDescription")
        .withColumn("EditedDescription", $"AsReportedItem.fs:EditedDescription._VALUE")
        .withColumn("EditedDescription_languageId", $"AsReportedItem.fs:EditedDescription._languageId")
        .withColumn("ReportedDescription", $"AsReportedItem.fs:ReportedDescription._VALUE")
        .withColumn("ReportedDescription_languageId", $"AsReportedItem.fs:ReportedDescription._languageId")
        .withColumn("FinancialAsReportedLineItemName_languageId", $"FinancialAsReportedLineItemName._languageId")
        .withColumn("FinancialAsReportedLineItemName", $"FinancialAsReportedLineItemName._VALUE")
        .withColumn("PeriodPermId_objectTypeId", $"PeriodPermId._objectTypeId")
        .withColumn("PeriodPermId", $"PeriodPermId._VALUE")
        .drop($"AsReportedItem")
    } else {
      tempNew1
        .withColumn("BookMark", lit(null))
        .withColumn("DocByteOffset", lit(null))
        .withColumn("DocByteLength", lit(null))
        .withColumn("EditedDescription", lit(null))
        .withColumn("EditedDescription", lit(null))
        .withColumn("EditedDescription_languageId", lit(null))
        .withColumn("ReportedDescription", lit(null))
        .withColumn("ReportedDescription_languageId", lit(null))
        .withColumn("FinancialAsReportedLineItemName_languageId", $"FinancialAsReportedLineItemName._languageId")
        .withColumn("FinancialAsReportedLineItemName", $"FinancialAsReportedLineItemName._VALUE")
        .withColumn("PeriodPermId_objectTypeId", $"PeriodPermId._objectTypeId")
        .withColumn("PeriodPermId", $"PeriodPermId._VALUE")
        .drop($"AsReportedItem")

    }

Adding the schema of the main DataFrame:

root
 |-- DataPartition: string (nullable = true)
 |-- TimeStamp: string (nullable = true)
 |-- PeriodId: long (nullable = true)
 |-- SourceId: long (nullable = true)
 |-- FinancialStatementLineItem_lineItemId: long (nullable = true)
 |-- FinancialStatementLineItem_lineItemInstanceKey: long (nullable = true)
 |-- StatementCurrencyId: long (nullable = true)
 |-- StatementTypeCode: string (nullable = true)
 |-- uniqueFundamentalSet: long (nullable = true)
 |-- AuditID: string (nullable = true)
 |-- EstimateMethodCode: string (nullable = true)
 |-- EstimateMethodId: long (nullable = true)
 |-- FinancialAsReportedLineItemName: struct (nullable = true)
 |    |-- _VALUE: string (nullable = true)
 |    |-- _languageId: long (nullable = true)
 |-- FinancialStatementLineItemSequence: long (nullable = true)
 |-- FinancialStatementLineItemValue: double (nullable = true)
 |-- FiscalYear: long (nullable = true)
 |-- IsAnnual: boolean (nullable = true)
 |-- IsAsReportedCurrencySetManually: boolean (nullable = true)
 |-- IsCombinedItem: boolean (nullable = true)
 |-- IsDerived: boolean (nullable = true)
 |-- IsExcludedFromStandardization: boolean (nullable = true)
 |-- IsFinal: boolean (nullable = true)
 |-- IsTotal: boolean (nullable = true)
 |-- PeriodEndDate: string (nullable = true)
 |-- PeriodPermId: struct (nullable = true)
 |    |-- _VALUE: long (nullable = true)
 |    |-- _objectTypeId: long (nullable = true)
 |-- ReportedCurrencyId: long (nullable = true)
 |-- StatementSectionCode: string (nullable = true)
 |-- StatementSectionId: long (nullable = true)
 |-- StatementSectionIsCredit: boolean (nullable = true)
 |-- SystemDerivedTypeCode: string (nullable = true)
 |-- SystemDerivedTypeCodeId: long (nullable = true)
 |-- Unit: double (nullable = true)
 |-- UnitEnumerationId: long (nullable = true)
 |-- FFAction|!|: string (nullable = true)
 |-- PartitionYear: long (nullable = true)
 |-- PartitionStatement: string (nullable = true)

Adding the schema as it looks when the column is present:

|-- uniqueFundamentalSet: long (nullable = true)
 |-- AsReportedItem: struct (nullable = true)
 |    |-- fs:BookMark: string (nullable = true)
 |    |-- fs:DocByteLength: long (nullable = true)
 |    |-- fs:DocByteOffset: long (nullable = true)
 |    |-- fs:EditedDescription: struct (nullable = true)
 |    |    |-- _VALUE: string (nullable = true)
 |    |    |-- _languageId: long (nullable = true)
 |    |-- fs:ItemDisplayedNegativeFlag: boolean (nullable = true)
 |    |-- fs:ItemDisplayedValue: double (nullable = true)
 |    |-- fs:ItemScalingFactor: long (nullable = true)
 |    |-- fs:ReportedDescription: struct (nullable = true)
 |    |    |-- _VALUE: string (nullable = true)
 |    |    |-- _languageId: long (nullable = true)
 |    |-- fs:ReportedValue: double (nullable = true)
 |-- EstimateMethodCode: string (nullable = true)
 |-- EstimateMethodId: long (nullable = true)
 |-- FinancialAsReportedLineItemName: struct (nullable = true)
 |    |-- _VALUE: string (nullable = true)
 |    |-- _languageId: long (nullable = true)
 |-- FinancialLineItemSource: long (nullable = true)

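Posting this as an answer because it is getting too long for a comment.

Suppose you have a set of columns to add:

val cols = Seq("BookMark")

You need to call withColumn repeatedly, starting from the original DataFrame and assigning each result to a new DataFrame:

// Needs: import org.apache.spark.sql.functions.{col, lit}
val result = cols.foldLeft(tempNew1)((df, name) =>
  // Note: df.columns holds top-level column names only, so this check
  // never matches a nested path such as AsReportedItem.fs:BookMark.
  df.withColumn(name, if (df.columns.contains(s"AsReportedItem.fs:$name"))
    col(s"AsReportedItem.fs:$name") else lit(null)))

foldLeft takes the first argument (tempNew1, in your case) and calls the supplied function for each element of cols, each time passing the resulting DataFrame on to the next call.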
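Because the columns in the question live inside the AsReportedItem struct, the existence check has to resolve a nested path. A minimal sketch of the same foldLeft idea, reusing the Try-based hasColumn helper from the question, could look like this. The object name ColumnOrNull, the method addOrNull and the wanted list are hypothetical names chosen for illustration; only a few of the question's columns are listed.

```scala
import scala.util.Try
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, lit}

object ColumnOrNull {
  // True when Spark can resolve the (possibly nested) path against df's schema.
  def hasColumn(df: DataFrame, path: String): Boolean = Try(df(path)).isSuccess

  // Output column name -> source path (a subset of the columns in the question).
  val wanted: Seq[(String, String)] = Seq(
    "BookMark"      -> "AsReportedItem.fs:BookMark",
    "DocByteOffset" -> "AsReportedItem.fs:DocByteOffset",
    "DocByteLength" -> "AsReportedItem.fs:DocByteLength"
  )

  // Adds every wanted column: the real value when the path resolves,
  // a null literal otherwise.
  def addOrNull(df: DataFrame, wanted: Seq[(String, String)]): DataFrame =
    wanted.foldLeft(df) { case (acc, (name, path)) =>
      val value: Column = if (hasColumn(acc, path)) col(path) else lit(null)
      acc.withColumn(name, value)
    }
}
```

It would be called as ColumnOrNull.addOrNull(tempNew1, ColumnOrNull.wanted).drop("AsReportedItem"). The decision is made with a plain Scala if while the query is being built, not with when, because when is evaluated per row and its column references must already resolve against the schema.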


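I will show you a general way to apply the logic to the AsReportedItem struct column (I have commented the code for clarity).

Apply the same logic to the remaining two struct columns, FinancialAsReportedLineItemName and PeriodPermId, but on the transformed DataFrame, i.e. on final_AsReportedItem and not on tempNew1.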
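A minimal sketch of one such general approach, offered as an illustration rather than as the answer's own code: inspect the schema, pull each requested field out of the struct when it exists, fall back to null when it does not, and drop the struct in the same select. The names StructFlattener, hasPath and flatten are hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.StructType

object StructFlattener {
  // Walks the schema along `path`; only struct nesting is followed,
  // so a path through an array or map is reported as missing.
  def hasPath(schema: StructType, path: List[String]): Boolean = path match {
    case Nil => true
    case head :: tail =>
      schema.fields.find(_.name == head) match {
        case Some(field) =>
          field.dataType match {
            case nested: StructType => hasPath(nested, tail)
            case _                  => tail.isEmpty
          }
        case None => false
      }
  }

  // Replaces the struct column `structName` with one flat column per entry in
  // `fields` (output name -> path inside the struct). Missing fields, or a
  // missing struct, yield null columns. All other columns are kept as they are.
  def flatten(df: DataFrame, structName: String,
              fields: Seq[(String, List[String])]): DataFrame = {
    // Backticks protect names such as `FFAction|!|` and `fs:BookMark`.
    val kept = df.columns.toSeq.filterNot(_ == structName).map(c => col(s"`$c`"))
    val extracted = fields.map { case (outName, path) =>
      val fullPath = structName :: path
      val value =
        if (hasPath(df.schema, fullPath)) col(fullPath.map(p => s"`$p`").mkString("."))
        else lit(null)
      value.as(outName)
    }
    df.select(kept ++ extracted: _*)
  }
}
```

For the question's data the calls can be chained, each one working on the result of the previous one, for example flatten(tempNew1, "AsReportedItem", Seq("BookMark" -> List("fs:BookMark"), "EditedDescription" -> List("fs:EditedDescription", "_VALUE"))) followed by flatten(..., "PeriodPermId", Seq("PeriodPermId_objectTypeId" -> List("_objectTypeId"), "PeriodPermId" -> List("_VALUE"))). Building everything in a single select means an output column may reuse the name of the struct it came from. The flattened columns are appended at the end of the column list.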
Comments:

You can check the columns property of the tempNew1 Dataset for the presence of the AsReportedItem.fs:BookMark column and call withColumn conditionally, depending on the result.

@AlexSavitsky But I have 10 such columns. Do I have to do them one by one?

Yes. However, you can put your columns in a sequence, filter it against the Dataset's columns, and then fold your Dataset with withColumn, to keep it somewhat functional-style.

@AlexSavitsky I just tried using hasColumn, but something is missing. If you can spare some time, please take a look at the syntax.

As @AlexSavitsky pointed out, you have to use foldLeft. Use that idea, but the code as given will not work. You have to adapt his idea to the struct column you have, and you should get it solved ;)

I am not getting an error, but I get null values for all records.

I am getting the error: Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'AsReportedItem.*' given input columns 'PartitionYear, EstimateMethodId, EstimateMethodCode

I have added this to the question.

OK, so this does not throw any error, but I need to explode after getting the columns. I am searching for an example.

Sorry, you should take a look at
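The "filter the sequence, then fold" suggestion from the comments can be sketched as follows. Since the columns property lists top-level names only, this sketch assumes the Try-based check from the question for nested paths. Columns that cannot be resolved are skipped entirely, with no null placeholder. The names AddOnlyExisting and addExisting are hypothetical.

```scala
import scala.util.Try
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object AddOnlyExisting {
  // wanted: output column name -> source path (may be nested, e.g. "a.b").
  // Keeps only the entries whose path resolves against df, then adds those.
  def addExisting(df: DataFrame, wanted: Seq[(String, String)]): DataFrame =
    wanted
      .filter { case (_, path) => Try(df(path)).isSuccess }
      .foldLeft(df) { case (acc, (name, path)) => acc.withColumn(name, col(path)) }
}
```

With this variant the resulting schema differs depending on the input file, so any downstream code that selects these columns by name needs the same existence check.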