Processing a text file with an XML column in Apache Spark with Scala


I have a file like this:

1,<note><from>Messi</from><body>Don't forget me this weekend!</body></note>
2,<note><from>Ronaldo</from><body>Don't forget Laliga</body></note>
3,<note><from>Neymar</from><body>I am the best </body></note>
4,<note><from>Suarez</from><body>Don't forget me this weekend!</body></note>

Because the XML in my real scenario is complex, I want to use an XML parser. How can I do that?

You can use Scala's own XML library. However, you first need to parse the string into an Elem object before you can query it:

import scala.xml._

val str = "<note><from>Messi</from><body>Don't forget me this weekend!</body></note>"

val xml = XML.loadString(str)
xml: scala.xml.Elem = <note><from>Messi</from><body>Don't forget me this weekend!</body></note>

To answer your question:

val rdd = sc.parallelize(Array(
(1,"<note><from>Messi</from><body>Don't forget me this weekend!</body></note>"),
(2,"<note><from>Ronaldo</from><body>Don't forget La Liga</body></note>"),
(3,"<note><from>Neymar</from><body>I am the best </body></note>"),
(4,"<note><from>Suarez</from><body>Don't forget me this weekend!</body></note>")
)) 

rdd.map { case (id, xml) =>
    // Parse each record once and reuse the resulting Elem for both fields
    val doc = XML.loadString(xml)
    (id,
    (doc \\ "note" \\ "from").text,
    (doc \\ "note" \\ "body").text)
}.collect.foreach(println)

(1,Messi,Don't forget me this weekend!)
(2,Ronaldo,Don't forget La Liga)
(3,Neymar,I am the best )
(4,Suarez,Don't forget me this weekend!)
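
The answer builds its RDD with sc.parallelize, whereas the question starts from a text file whose lines look like "id,<xml>". A minimal sketch of bridging the two is shown below. It assumes each record sits on a single line and that the id is an integer. The names NoteLines and splitLine and the path "notes.txt" are hypothetical and do not come from the original post. The split happens at the first comma only, because the XML payload may itself contain commas.

```scala
import scala.util.Try

object NoteLines {
  // Split "id,<xml...>" at the FIRST comma only, since the XML payload
  // may contain commas of its own. Returns None for lines that have no
  // comma or whose id part is not an integer.
  def splitLine(line: String): Option[(Int, String)] = {
    val idx = line.indexOf(',')
    if (idx <= 0) None
    else Try(line.substring(0, idx).trim.toInt).toOption
      .map(id => (id, line.substring(idx + 1).trim))
  }
}

// Hypothetical Spark usage (the path is a placeholder, not from the post):
//   val rdd = sc.textFile("notes.txt").flatMap(line => NoteLines.splitLine(line))
```

The resulting RDD of (id, xml) pairs has the same shape as the parallelized one above, so the same map over XML.loadString should apply unchanged. Lines that do not match the expected layout are dropped rather than failing the job, which may or may not be what you want.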
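
For reference, querying the xml value parsed earlier in the REPL shows what each step returns: \\ yields a NodeSeq, and .text extracts its text content:

xml \\ "note" \\ "from"
res19: scala.xml.NodeSeq = NodeSeq(<from>Messi</from>)
(xml \\ "note" \\ "from").text
res20: String = Messi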
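
Since the question mentions that the real XML is more complex, it may help to see how the two scala.xml selectors differ on nested input: \ matches direct children only, while \\ matches at any depth. The sketch below uses a made-up nested payload and a hypothetical SelectorDemo object. It assumes the scala-xml module is on the classpath, as it is in spark-shell.

```scala
import scala.xml.XML

object SelectorDemo {
  // Hypothetical nested payload, not taken from the original question.
  val doc = XML.loadString(
    "<note><from>Messi</from><reply><from>Xavi</from></reply></note>")

  // \ looks only at the direct children of <note>
  val direct: Seq[String] = (doc \ "from").map(_.text)

  // \\ collects every <from> element at any depth, in document order
  val deep: Seq[String] = (doc \\ "from").map(_.text)
}
```

Here direct contains only Messi, while deep contains Messi and Xavi. With deeply nested documents, \\ followed by .text concatenates the text of every match, so the narrower \ is often the safer choice when element names repeat.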