How to access XML attributes using a DataFrame and the com.databricks.spark.xml format?

Tags: xml, scala, apache-spark, dataframe

I have an XML file that looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<?xml version="1.0" encoding="UTF-8"?>
<paml version="2.0" xmlns="paml20.xsd">
<kmData type="partial">
<header>
<log dateTime="2016-11-10T07:01:37" action="created">partial used</log>
</header>
<Object class="SSC" version="0.3" dName="p2345" id="600">
<list name="sscOptions">
<p>0</p>
<p>1</p>
<p>2</p>
<p>3</p>
<p>4</p>
</list>
<p name="AAA">2</p>
<p name="BBB">3</p>
<p name="CCC">NNN</p>
<p name="DDD">26</p>
<p name="EEE">30</p>
<p name="FFF">30</p>
<p name="GGG">80</p>
<p name="HHH">20</p>
<p name="III">100</p>
</Object>
<Object class="PLUS2" version="0.5" dName="p2346" id="700">
<p name="AAA">5</p>
<p name="BBB">1</p>
<p name="CCC">0</p>
<p name="DDD">0</p>
<p name="EEE">0</p>
<p name="FFF">0</p>
<list name="PLUS2Out">
<p>0</p>
<p>0</p>
<p>0</p>
<p>0</p>
<p>0</p>
<p>0</p>
</list>
<p name="GGG">8</p>
</Object>
</kmData>
</paml>
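For orientation, the data in question lives in two places: in the attributes of `<Object>` (`class`, `version`, `dName`, `id`) and in the text of named `<p>` children (`<p name="AAA">2</p>`). A minimal sketch with plain `scala.xml` (no Spark), run against a trimmed copy of the file above, shows how each is reached; the object and method names here are illustrative only:

```scala
import scala.xml.{Node, XML}

object XmlAttrSketch {
  // Build one CSV-style value list from an <Object> element:
  // the four attributes, then the text of <p name="AAA">.
  def row(obj: Node): String = {
    // Attributes are addressed with the "@" projection.
    val attrs = Seq("class", "version", "dName", "id").map(a => (obj \ ("@" + a)).text)
    // Named <p> children are elements; match on their name attribute, take the text body.
    val aaa = (obj \ "p").collectFirst { case p if (p \ "@name").text == "AAA" => p.text }
                         .getOrElse("")
    (attrs :+ aaa).mkString(",")
  }

  def main(args: Array[String]): Unit = {
    val doc = XML.loadString(
      """<paml version="2.0" xmlns="paml20.xsd">
        |  <kmData type="partial">
        |    <Object class="SSC" version="0.3" dName="p2345" id="600">
        |      <p name="AAA">2</p>
        |      <p name="CCC">NNN</p>
        |    </Object>
        |  </kmData>
        |</paml>""".stripMargin)
    // \\ matches elements by label, so the default namespace is not an obstacle.
    (doc \\ "Object").foreach(o => println(row(o)))
  }
}
```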
I want to write the output to a file. I have tried the code below:
package Dataframeparsing

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import scala.xml.XML

object Dataframeparse {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("Parse XML Data").setMaster("local[*]"))
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    val df = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "Object")
      .load("D://userdata//sam//Desktop//abc.xml")
    println("xml file read done")

    // rows where class = 'SSC'
    val Data1 = df.filter("class = 'SSC'")
    val D1store = Data1.select("AAA", "CCC")
    D1store.show()
    D1store.write.option("header", "true").csv("file:///C:/out1.csv")

    // rows where class = 'PLUS2'
    val Data2 = df.filter("class = 'PLUS2'")
    val D2store = Data2.select("AAA", "BBB", "CCC")
    D2store.show()
    D2store.write.option("header", "true").csv("file:///C:/out2.csv")
  }
}
When I run the code above, I get the following error:

17/02/21 18:16:12 INFO DAGScheduler: ResultStage 0 (treeAggregate at InferSchema.scala:60) finished in 14.905 s
17/02/21 18:16:12 INFO DAGScheduler: Job 0 finished: treeAggregate at InferSchema.scala:60, took 15.088996 s
xml file read done
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'AAA' given input columns p, class, dName, _VALUE, list, version, id;
    at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)

I expect the final output to look like this:

For class=SSC:
2016-11-10T07:01:37,SSC,0.3,p2345,600,2,NNN

For class=PLUS2:
2016-11-10T07:01:37,PLUS2,0.5,p2346,700,5,1,0
Comments:

- Add a new line `df.show(10)` after `val df=..…` and add the output of this command to the question.
- @khumar did you find it out?
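As a cross-check of the mapping, the two expected rows can be produced with plain `scala.xml` (which the attempt above already imports). This is only a sketch of where each field comes from, not the spark-xml answer; the object name and helper are made up for illustration:

```scala
import scala.xml.{Node, XML}

object ExpectedRows {
  // Text of the <p> child whose name attribute matches, or "" if absent.
  def pValue(obj: Node, name: String): String =
    (obj \ "p").collectFirst { case p if (p \ "@name").text == name => p.text }.getOrElse("")

  // One CSV row per <Object>: dateTime from the header <log>, the four
  // attributes, then the class-specific <p> values from the expected output.
  def rows(doc: Node): Seq[String] = {
    val dateTime = (doc \\ "log" \ "@dateTime").text
    (doc \\ "Object").map { o =>
      val base = Seq(dateTime, (o \ "@class").text, (o \ "@version").text,
                     (o \ "@dName").text, (o \ "@id").text)
      val extra = (o \ "@class").text match {
        case "SSC"   => Seq(pValue(o, "AAA"), pValue(o, "CCC"))
        case "PLUS2" => Seq(pValue(o, "AAA"), pValue(o, "BBB"), pValue(o, "CCC"))
        case _       => Seq.empty[String]
      }
      (base ++ extra).mkString(",")
    }
  }

  def main(args: Array[String]): Unit = {
    // Trimmed inline copy of the question's file, so the sketch runs standalone.
    val doc = XML.loadString(
      """<paml version="2.0" xmlns="paml20.xsd"><kmData type="partial">
        |<header><log dateTime="2016-11-10T07:01:37" action="created">partial used</log></header>
        |<Object class="SSC" version="0.3" dName="p2345" id="600">
        |<p name="AAA">2</p><p name="CCC">NNN</p></Object>
        |<Object class="PLUS2" version="0.5" dName="p2346" id="700">
        |<p name="AAA">5</p><p name="BBB">1</p><p name="CCC">0</p></Object>
        |</kmData></paml>""".stripMargin)
    rows(doc).foreach(println)
  }
}
```

On the AnalysisException itself: spark-xml exposes XML attributes as columns with an attribute prefix (`_` by default, so `_class` rather than `class`), and the repeated `<p name="…">` elements come back as a single array-of-struct column `p` (fields like `_VALUE` and `_name`), so `AAA` is not a top-level column. This describes spark-xml's documented defaults, not the asker's exact version.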