
How do I access XML attributes using a DataFrame and the com.databricks.spark.xml format?

Tags: xml, scala, apache-spark, dataframe

I have an XML file that looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<paml version="2.0" xmlns="paml20.xsd">
  <kmData type="partial">
    <header>
      <log dateTime="2016-11-10T07:01:37" action="created">partial used</log>
    </header>
    <Object class="SSC" version="0.3" dName="p2345" id="600">
      <list name="sscOptions">
        <p>0</p>
        <p>1</p>
        <p>2</p>
        <p>3</p>
        <p>4</p>
      </list>
      <p name="AAA">2</p>
      <p name="BBB">3</p>
      <p name="CCC">NNN</p>
      <p name="DDD">26</p>
      <p name="EEE">30</p>
      <p name="FFF">30</p>
      <p name="GGG">80</p>
      <p name="HHH">20</p>
      <p name="III">100</p>
    </Object>
    <Object class="PLUS2" version="0.5" dName="p2346" id="700">
      <p name="AAA">5</p>
      <p name="BBB">1</p>
      <p name="CCC">0</p>
      <p name="DDD">0</p>
      <p name="EEE">0</p>
      <p name="FFF">0</p>
      <list name="PLUS2Out">
        <p>0</p>
        <p>0</p>
        <p>0</p>
        <p>0</p>
        <p>0</p>
        <p>0</p>
      </list>
      <p name="GGG">8</p>
    </Object>
  </kmData>
</paml>
I want to write the result to a file.

I have tried the code below:

package Dataframeparsing

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object Dataframeparse {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Parse XML Data").setMaster("local[*]"))
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    val df = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "Object")
      .load("D://userdata//sam//Desktop//abc.xml")
    println("xml file read done")

    val data1 = df.filter("class = 'SSC'")
    val d1store = data1.select("AAA", "CCC")
    d1store.show()
    d1store.write.option("header", "true").csv("file:///C:/out1.csv")

    val data2 = df.filter("class = 'PLUS2'")
    val d2store = data2.select("AAA", "BBB", "CCC")
    d2store.show()
    d2store.write.option("header", "true").csv("file:///C:/out2.csv")
  }
}
When I run the code above, I get the following error:

17/02/21 18:16:12 INFO DAGScheduler: ResultStage 0 (treeAggregate at InferSchema.scala:60) finished in 14.905 s
17/02/21 18:16:12 INFO DAGScheduler: Job 0 finished: treeAggregate at InferSchema.scala:60, took 15.088996 s
xml file read done
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'AAA' given input columns p, class, dName, default, list, version, id;
    at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)

I expect the final output to look like this:

For class=SSC:

2016-11-10T07:01:37,SSC,0.3,p2345,600,2,NNN
For class=PLUS2:

2016-11-10T07:01:37,PLUS2,0.5,p2346,700,5,1,0
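A likely reason the select fails: judging by the error's input columns (p, class, dName, list, version, id), spark-xml maps the repeated <p name="...">value</p> children to a single array column named p rather than to top-level columns, so the name/value pairs have to be pivoted into columns before AAA, BBB, etc. can be selected. Below is a library-free illustration of that pivot, with plain Scala collections standing in for the DataFrame row; the ObjectRow shape here is a simplified assumption for the sketch, not spark-xml's actual inferred schema.

```scala
object PivotSketch {
  // Simplified stand-in for one row read with rowTag="Object":
  // the tag's attributes, plus the <p> children as (name, value) pairs.
  case class ObjectRow(attrs: Map[String, String], p: Seq[(String, String)])

  // Pivot the (name, value) pairs into top-level "columns",
  // so AAA, BBB, ... become addressable alongside the attributes.
  def pivot(row: ObjectRow): Map[String, String] = row.attrs ++ row.p.toMap

  def main(args: Array[String]): Unit = {
    val ssc = ObjectRow(
      Map("class" -> "SSC", "version" -> "0.3", "dName" -> "p2345", "id" -> "600"),
      Seq("AAA" -> "2", "CCC" -> "NNN"))
    val wide = pivot(ssc)
    println(Seq("class", "version", "dName", "id", "AAA", "CCC").map(wide).mkString(","))
  }
}
```

On a real DataFrame the same reshaping would be done with explode on the p array followed by a group-and-pivot, but the collection version above shows the idea.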

Comments:

Add a new line df.show(10) after val df = ..., and add the output of that command to the question.

@khumar Did you figure it out?
2016-11-10T07:01:37,PLUS2,0.5,p2346,700,5,1,0
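For completeness, the attribute values the expected output needs (dateTime, class, version, dName, id) can also be pulled with plain scala.xml, which the question's code already imports. This is a minimal standalone sketch over an inlined copy of the sample document, not the spark-xml route the question asks about:

```scala
import scala.xml.XML

object AttrSketch {
  // Inlined copy of the relevant parts of the sample document above
  val sample: String =
    """<paml version="2.0" xmlns="paml20.xsd">
      |  <kmData type="partial">
      |    <header><log dateTime="2016-11-10T07:01:37" action="created">partial used</log></header>
      |    <Object class="SSC" version="0.3" dName="p2345" id="600">
      |      <p name="AAA">2</p><p name="CCC">NNN</p>
      |    </Object>
      |    <Object class="PLUS2" version="0.5" dName="p2346" id="700">
      |      <p name="AAA">5</p><p name="BBB">1</p><p name="CCC">0</p>
      |    </Object>
      |  </kmData>
      |</paml>""".stripMargin

  // One CSV-style line per <Object>: the tag's attributes, then the
  // requested named <p> values. "\\" searches descendants, "\@" reads an attribute.
  def rows(xmlText: String, names: Seq[String]): Seq[String] = {
    val doc = XML.loadString(xmlText)
    val dateTime = doc \\ "log" \@ "dateTime"
    (doc \\ "Object").map { obj =>
      val params = (obj \ "p").map(p => (p \@ "name") -> p.text).toMap
      val attrs = Seq(obj \@ "class", obj \@ "version", obj \@ "dName", obj \@ "id")
      (Seq(dateTime) ++ attrs ++ names.map(params)).mkString(",")
    }
  }

  def main(args: Array[String]): Unit =
    rows(sample, Seq("AAA", "CCC")).foreach(println)
}
```

Running this prints 2016-11-10T07:01:37,SSC,0.3,p2345,600,2,NNN for the first object, matching the expected SSC row; the name list passed to rows can be varied per class to cover the PLUS2 columns as well.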