
Converting a Spark DataFrame to XML throws a NullPointerException in StaxXML when writing to the filesystem

Tags: xml, scala, apache-spark, databricks, stax

I am reading an xml file with a SparkSession based on a given rowTag. The resulting DataFrame then needs to be written back out as an xml file. Below is the code I am trying:

val sparkSession = SparkSession.builder.master("local[*]").getOrCreate()
val xmldf = sparkSession.read.format(SEAConstant.STR_IMPORT_SPARK_DATA_BRICK_XML)
  .option(SEAConstant.STR_ROW_TAG, "Employee").option("nullValue", "").load("demo.xml")
val columnNames = xmldf.columns.toSeq
val sdf = xmldf.select(columnNames.map(c => xmldf.col(c)): _*)
sdf.write.format("com.databricks.spark.xml").option("rootTag", "Company")
  .option("rowTag", "Employee").save("Rel")
Here is the xml file:

    <?xml version="1.0"?>
<Company>
  <Employee id="id47" masterRef="#id53" revision="" nomenclature="">
<ApplicationRef version="J.0" application="Teamcenter"></ApplicationRef>
<UserData id="id52">
<UserValue valueRef="#id4" value="" title="_CONFIG_CONTEXT"></UserValue></UserData></Employee>
<Employee id="id47" masterRef="#id53" revision="" nomenclature="">
<ApplicationRef version="B.0" application="Teamcenter"></ApplicationRef>
<UserData id="id63">
<UserValue valueRef="#id5" value="" title="_CONFIG_CONTEXT"></UserValue></UserData></Employee>
</Company>
I have looked in many places but could not find a solution. Also, using the same sdf generated above, I am able to create a json file successfully. Any ideas?
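For reference, the json write that does succeed with the same sdf would look roughly like the sketch below (the output path "RelJson" is an assumption, not from the question):

sdf.write.json("RelJson")  // same DataFrame, json sink works with no extra options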

xmldf.write.format("com.databricks.spark.xml").option("rootTag", "Company")
     .option("rowTag", "Employee").option("attributePrefix", "_Att")
     .option("valueTag","_VALUE").save("Rel")
Replace the corresponding statement in the OP with this. The StaxParser is actually looking for the attributePrefix and valueTag options, and without them it throws an NPE. I found this out while looking into it.
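To confirm the write worked, the output directory can be read back with the same rowTag; a minimal sketch (the path "Rel" matches the save call above):

val check = sparkSession.read.format("com.databricks.spark.xml")
  .option("rowTag", "Employee")
  .load("Rel")
check.show(false)  // print the rows that were written out as XML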


Your NullPointerException is not reproducible with Spark 2.2.0 and the following xml dependency:

<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-xml_2.11</artifactId>
    <version>0.4.1</version>
</dependency>
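If you build with sbt rather than Maven, the equivalent coordinate would be roughly the following (a sketch, assuming Scala 2.11):

libraryDependencies += "com.databricks" %% "spark-xml" % "0.4.1"  // %% appends the Scala binary version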
Update: after the OP updated the question with the latest XML, I tried it, got the exception, and fixed it with the code below…


Full set of options and descriptions from Databricks:

This package allows reading XML files in local or distributed filesystem as Spark DataFrames.

When reading files the API accepts several options:

- path: Location of files. Similar to Spark, it can accept standard Hadoop globbing expressions.
- rowTag: The row tag of your xml files to treat as a row. For example, in this xml ..., the appropriate value would be book. Default is ROW. At the moment, rows containing self closing xml tags are not supported.
- samplingRatio: Sampling ratio for inferring schema (0.0 ~ 1). Default is 1. Possible types are StructType, ArrayType, StringType, LongType, DoubleType, BooleanType, TimestampType and NullType, unless the user provides a schema for this.
- excludeAttribute: Whether you want to exclude attributes in elements or not. Default is false.
- treatEmptyValuesAsNulls: (DEPRECATED: use nullValue set to "") Whether you want to treat whitespaces as a null value. Default is false.
- mode: The mode for dealing with corrupt records during parsing. Default is PERMISSIVE.
  - PERMISSIVE: When it encounters a corrupted record, it sets all fields to null and puts the malformed string into a new field configured by columnNameOfCorruptRecord. When it encounters a field of the wrong datatype, it sets the offending field to null.
  - DROPMALFORMED: ignores the whole corrupted records.
  - FAILFAST: throws an exception when it meets corrupted records.
- inferSchema: if true, attempts to infer an appropriate type for each resulting DataFrame column, like a boolean, numeric or date type. If false, all resulting columns are of string type. Default is true.
- columnNameOfCorruptRecord: The name of the new field where malformed strings are stored. Default is _corrupt_record.
- attributePrefix: The prefix for attributes so that we can differentiate attributes and elements. This will be the prefix for field names. Default is _.
- valueTag: The tag used for the value when there are attributes in an element having no child. Default is _VALUE.
- charset: Defaults to 'UTF-8' but can be set to other valid charset names.
- ignoreSurroundingSpaces: Defines whether or not surrounding whitespaces from values being read should be skipped. Default is false.

When writing files the API accepts several options:

- path: Location to write files.
- rowTag: The row tag of your xml files to treat as a row. For example, in this xml ..., the appropriate value would be book. Default is ROW.
- rootTag: The root tag of your xml files to treat as the root. For example, in this xml ..., the appropriate value would be books. Default is ROWS.
- nullValue: The value to write for null values. Default is the string null. When this is null, it does not write attributes and elements for fields.
- attributePrefix: The prefix for attributes so that we can differentiate attributes and elements. This will be the prefix for field names. Default is _.
- valueTag: The tag used for the value when there are attributes in an element having no child. Default is _VALUE.
- compression: compression codec to use when saving to file. Should be the fully qualified name of a class implementing org.apache.hadoop.io.compress.CompressionCodec or one of the case-insensitive shortened names (bzip2, gzip, lz4, and snappy). Defaults to no compression when a codec is not specified. Currently it supports the shortened name usage.

You can use just xml instead of com.databricks.spark.xml.
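As the last note above mentions, the short format name can be used once the package is on the classpath; a minimal sketch of a read using it (rowTag and path taken from the question):

val df = spark.read
  .format("xml")                    // short name instead of com.databricks.spark.xml
  .option("rowTag", "Employee")
  .load("demo.xml")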
Comments:

- Strange. I got the same error, and when I tried other options it worked. See my answer below. Which Spark version are you using?
- I am using 2.3.1.
- 19/06/25 16:40:03 INFO SparkContext: Running Spark version 2.2.0. Update: I also tried 2.3.1 successfully; your NPE does not reproduce. You can try removing the .mode option as suggested.
- Removed .mode("overwrite"); no change in the result.
.option("attributePrefix", "_Att")
      .option("valueTag", "_VALUE")
This package allows reading XML files in local or distributed filesystem as Spark DataFrames. When reading files the API accepts several options: path: Location of files. Similar to Spark can accept standard Hadoop globbing expressions. rowTag: The row tag of your xml files to treat as a row. For example, in this xml ..., the appropriate value would be book. Default is ROW. At the moment, rows containing self closing xml tags are not supported. samplingRatio: Sampling ratio for inferring schema (0.0 ~ 1). Default is 1. Possible types are StructType, ArrayType, StringType, LongType, DoubleType, BooleanType, TimestampType and NullType, unless user provides a schema for this. excludeAttribute : Whether you want to exclude attributes in elements or not. Default is false. treatEmptyValuesAsNulls : (DEPRECATED: use nullValue set to "") Whether you want to treat whitespaces as a null value. Default is false mode: The mode for dealing with corrupt records during parsing. Default is PERMISSIVE. PERMISSIVE : When it encounters a corrupted record, it sets all fields to null and puts the malformed string into a new field configured by columnNameOfCorruptRecord. When it encounters a field of the wrong datatype, it sets the offending field to null. DROPMALFORMED : ignores the whole corrupted records. FAILFAST : throws an exception when it meets corrupted records. inferSchema: if true, attempts to infer an appropriate type for each resulting DataFrame column, like a boolean, numeric or date type. If false, all resulting columns are of string type. Default is true. columnNameOfCorruptRecord: The name of new field where malformed strings are stored. Default is _corrupt_record. attributePrefix: The prefix for attributes so that we can differentiate attributes and elements. This will be the prefix for field names. Default is _. valueTag: The tag used for the value when there are attributes in the element having no child. Default is _VALUE. charset: Defaults to 'UTF-8' but can be set to other valid charset names ignoreSurroundingSpaces: Defines whether or not surrounding whitespaces from values being read should be skipped. Default is false. When writing files the API accepts several options: path: Location to write files. rowTag: The row tag of your xml files to treat as a row. For example, in this xml ..., the appropriate value would be book. Default is ROW. rootTag: The root tag of your xml files to treat as the root. For example, in this xml ..., the appropriate value would be books. Default is ROWS. nullValue: The value to write null value. Default is string null. When this is null, it does not write attributes and elements for fields. attributePrefix: The prefix for attributes so that we can differentiating attributes and elements. This will be the prefix for field names. Default is _. valueTag: The tag used for the value when there are attributes in the element having no child. Default is _VALUE. compression: compression codec to use when saving to file. Should be the fully qualified name of a class implementing org.apache.hadoop.io.compress.CompressionCodec or one of case-insensitive shorten names (bzip2, gzip, lz4, and snappy). Defaults to no compression when a codec is not specified. Currently it supports the shortened name usage. You can use just xml instead of com.databricks.spark.xml.
package com.examples

import java.io.File

import org.apache.commons.io.FileUtils
import org.apache.spark.sql.{SQLContext, SparkSession}

/**
  * Created by Ram Ghadiyaram
  */
object SparkXmlTest {
  // org.apache.log4j.Logger.getLogger("org").setLevel(Level.ERROR)
  def main(args: Array[String]) {

    val spark = SparkSession.builder.
      master("local")
      .appName(this.getClass.getName)
      .getOrCreate()
    //  spark.sparkContext.setLogLevel("ERROR")
    val sc = spark.sparkContext
    val sqlContext = new SQLContext(sc)
    //    val str =
    //    """
    //        |<?xml version="1.0"?>
    //        |<Company>
    //        |  <Employee id="1">
    //        |      <Email>tp@xyz.com</Email>
    //        |      <UserData id="id32" type="AttributesInContext">
    //        |      <UserValue value="7in" title="Height"></UserValue>
    //        |      <UserValue value="23lb" title="Weight"></UserValue>
    //        |</UserData>
    //        |  </Employee>
    //        |  <Measures id="1">
    //        |      <Email>newdata@rty.com</Email>
    //        |      <UserData id="id32" type="SitesInContext">
    //        |</UserData>
    //        |  </Measures>
    //        |  <Employee id="2">
    //        |      <Email>tp@xyz.com</Email>
    //        |      <UserData id="id33" type="AttributesInContext">
    //        |      <UserValue value="7in" title="Height"></UserValue>
    //        |      <UserValue value="34lb" title="Weight"></UserValue>
    //        |</UserData>
    //        |  </Employee>
    //        |  <Measures id="2">
    //        |      <Email>nextrow@rty.com</Email>
    //        |      <UserData id="id35" type="SitesInContext">
    //        |</UserData>
    //        |  </Measures>
    //        |  <Employee id="3">
    //        |      <Email>tp@xyz.com</Email>
    //        |      <UserData id="id34" type="AttributesInContext">
    //        |      <UserValue value="7in" title="Height"></UserValue>
    //        |      <UserValue value="" title="Weight"></UserValue>
    //        |</UserData>
    //        |  </Employee>
    //        |</Company>
    //      """.stripMargin
    val str =
    """
      |<Company>
      |  <Employee id="id47" masterRef="#id53" revision="" nomenclature="">
      |<ApplicationRef version="J.0" application="Teamcenter"></ApplicationRef>
      |<UserData id="id52">
      |<UserValue valueRef="#id4" value="" title="_CONFIG_CONTEXT"></UserValue></UserData></Employee>
      |<Employee id="id47" masterRef="#id53" revision="" nomenclature="">
      |<ApplicationRef version="B.0" application="Teamcenter"></ApplicationRef>
      |<UserData id="id63">
      |<UserValue valueRef="#id5" value="" title="_CONFIG_CONTEXT"></UserValue></UserData></Employee>
      |</Company>
    """.stripMargin
    println("save to file ")

    val f = new File("xmltest.xml")
    FileUtils.writeStringToFile(f, str)


    val xmldf = spark.read.format("com.databricks.spark.xml")
      .option("rootTag", "Company")
      .option("rowTag", "Employee")
      .option("nullValue", "")
      .load(f.getAbsolutePath)
    val columnNames = xmldf.columns.toSeq
    val sdf = xmldf.select(columnNames.map(c => xmldf.col(c)): _*)
    sdf.write.format("com.databricks.spark.xml")
      .option("rootTag", "Company")
      .option("rowTag", "Employee")
      .option("attributePrefix", "_Att")
      .option("valueTag", "_VALUE")
      .mode("overwrite")
      .save("./src/main/resources/Rel1")


    println("read back from saved file ....")
    val readbackdf = spark.read.format("com.databricks.spark.xml")
      .option("rootTag", "Company")
      .option("rowTag", "Employee")
      .option("nullValue", "")
      .load("./src/main/resources/Rel1")
    readbackdf.show(false)
  }
}
save to file 
read back from saved file ....
+-----------------+-------------------------------+----+----------+
|ApplicationRef   |UserData                       |_id |_masterRef|
+-----------------+-------------------------------+----+----------+
|[Teamcenter, J.0]|[[_CONFIG_CONTEXT, #id4], id52]|id47|#id53     |
|[Teamcenter, B.0]|[[_CONFIG_CONTEXT, #id5], id63]|id47|#id53     |
+-----------------+-------------------------------+----+----------+