Processing a text file with an XML column in Apache Spark (Scala)
I have a file like this:
1,<note><from>Messi</from><body>Don't forget me this weekend!</body></note>
2,<note><from>Ronaldo</from><body>Don't forget Laliga</body></note>
3,<note><from>Neymar</from><body>I am the best </body></note>
4,<note><from>Suarez</from><body>Don't forget me this weekend!</body></note>
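Since the XML in my real scenario is complex, I want to use an XML parser. How can I do that?
You can use Scala's own XML library. However, you first need to parse the string into an Elem object before you can query it: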
import scala.xml._
val str = "<note><from>Messi</from><body>Don't forget me this weekend!</body></note>"
val xml = XML.loadString(str)
xml: scala.xml.Elem = <note><from>Messi</from><body>Don't forget me this weekend!</body></note>
To answer your question:
val rdd = sc.parallelize(Array(
(1,"<note><from>Messi</from><body>Don't forget me this weekend!</body></note>"),
(2,"<note><from>Ronaldo</from><body>Don't forget Laliga</body></note>"),
(3,"<note><from>Neymar</from><body>I am the best </body></note>"),
(4,"<note><from>Suarez</from><body>Don't forget me this weekend!</body></note>")
))
rdd.map{ case (id, xml) =>
(id ,
(XML.loadString(xml) \\ "note" \\ "from").text ,
(XML.loadString(xml) \\ "note" \\ "body").text )
}.collect.foreach(println)
(1,Messi,Don't forget me this weekend!)
(2,Ronaldo,Don't forget Laliga)
(3,Neymar,I am the best )
(4,Suarez,Don't forget me this weekend!)
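The example above builds the RDD with sc.parallelize, while the question starts from a text file. Here is a minimal sketch of one way to bridge that gap. It assumes each line has the form id,<xml> and that the id is an integer. It splits at the first comma only, parses the XML once per record instead of twice, and drops lines that fail to parse. The helper name parseLine and the path notes.txt are hypothetical and do not come from the original answer.

```scala
import scala.util.Try
import scala.xml.XML

// Split one "id,<xml>" line at the FIRST comma only, because the XML body
// may itself contain commas. Returns None when the line has no comma,
// the id is not an integer, or the XML is not well-formed.
def parseLine(line: String): Option[(Int, String, String)] =
  line.split(",", 2) match {
    case Array(id, xmlStr) =>
      Try {
        val note = XML.loadString(xmlStr) // parse once per record
        (id.trim.toInt, (note \ "from").text, (note \ "body").text)
      }.toOption
    case _ => None
  }

// In spark-shell (assumption: "notes.txt" is a placeholder path):
// sc.textFile("notes.txt")
//   .flatMap(line => parseLine(line).toList)
//   .collect.foreach(println)
```

Keeping the parsing in a plain function means it can be checked without a cluster. Whether to drop or log unparseable lines depends on the real data.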
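For reference, once the string has been parsed, the \\ selector returns a NodeSeq and .text extracts its string content (here xml is the Elem parsed earlier):
xml \\ "note" \\ "from"
res19: scala.xml.NodeSeq = NodeSeq(<from>Messi</from>)
(xml \\ "note" \\ "from").text
res20: String = Messi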
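The selectors used here deserve a short illustration, since complex XML makes the difference matter. In scala.xml, \\ matches the node itself and all of its descendants, while \ matches direct children only. The sketch below uses the same sample note. The value names are illustrative only.

```scala
import scala.xml.XML

val note = XML.loadString(
  "<note><from>Messi</from><body>Don't forget me this weekend!</body></note>")

// \\ searches the node itself and every descendant, so "note" matches the root.
val viaDescendants = (note \\ "note" \\ "from").text

// \ looks at direct children only; <from> is a child of the root <note>.
val viaChild = (note \ "from").text

// <note> is the root, not a child of itself, so this selection is empty.
val rootAsChild = (note \ "note").text
```

For deeply nested documents, \\ can pick up elements of the same name at any depth, so \ is the more precise choice when the path is known.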