Scala: apply logic to a specific column of a Spark dataframe

Tags: scala, apache-spark, dataframe, apache-spark-sql

I have a Dataframe imported from MySQL:

dataframe_mysql.show()
+----+---------+-------------------------------------------------------+
|  id|accountid|                                                xmldata|
+----+---------+-------------------------------------------------------+
|1001|    12346|<AccountSetup xmlns:xsi="test"><Customers test="test...|
|1002|    12346|<AccountSetup xmlns:xsi="test"><Customers test="test...|
|1003|    12346|<AccountSetup xmlns:xsi="test"><Customers test="test...|
|1004|    12347|<AccountSetup xmlns:xsi="test"><Customers test="test...|
+----+---------+-------------------------------------------------------+
The final output I need is structured:

df.show()

Please suggest how to achieve this, given that the XML content sits in a dataframe column.

Since you are trying to pull the XML data column out into a separate dataframe, you can still use the spark-xml package; you just need to use its reader directly:

case class Data(id: Int, accountid: Int, xmldata: String)

// Sample rows mirroring the MySQL import
val df = Seq(
    Data(1001, 12345, "<AccountSetup xmlns:xsi=\"test\"><Customers test=\"a\">d</Customers></AccountSetup>"),
    Data(1002, 12345, "<AccountSetup xmlns:xsi=\"test\"><Customers test=\"b\">e</Customers></AccountSetup>"),
    Data(1003, 12345, "<AccountSetup xmlns:xsi=\"test\"><Customers test=\"c\">f</Customers></AccountSetup>")
).toDF

import com.databricks.spark.xml.XmlReader

val reader = new XmlReader()

// Set options using the builder methods; each <AccountSetup> element becomes one row
reader.withRowTag("AccountSetup")

// Extract the raw XML strings as an RDD[String] and let spark-xml parse them
val rdd = df.select("xmldata").map(r => r.getString(0)).rdd
val xmlDF = reader.xmlRdd(spark.sqlContext, rdd)
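
If the reader parses the rows as expected, a quick check is to look at the inferred schema and the parsed rows (a minimal verification sketch, assuming the snippet above ran in a session where spark is the SparkSession):

// Schema that spark-xml inferred from the <AccountSetup> elements
xmlDF.printSchema()
// Parsed rows, untruncated
xmlDF.show(false)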
I tried the query below:

val dff1 = Seq(
  Data(1001, 12345, "<AccountSetup xmlns:xsi=\"test\"><Customers test=\"a\">d</Customers></AccountSetup>"),
  Data(1002, 12345, "<AccountSetup xmlns:xsi=\"test\"><Customers test=\"b\">e</Customers></AccountSetup>"),
  Data(1003, 12345, "<AccountSetup xmlns:xsi=\"test\"><Customers test=\"c\">f</Customers></AccountSetup>")
).toDF

dff1.show()
val reader = new XmlReader().withRowTag("AccountSetup")
val xmlrdd = dff1.select("xmldata").map(a => a.getString(0)).rdd
xmlrdd.toDF("newRowXml").show()
val xmldf = reader.xmlRdd(sqlcontext, xmlrdd)
xmldf.show()
I got output for dff1.show() and for xmlrdd.toDF("newRowXml").show(), shown further below.

From the comments:

What is your expected output? Please paste the result.

There is no column-level XML parser in Spark SQL. You have to either write a UDF, or write out intermediate XML and read it back with Databricks' XML parser (a rough UDF sketch is included at the end of this post).

Thanks for your reply. When I tried the query above, I got "value toDF is not a member of Seq[Data]". Could you help me with this? I am actually new to Spark and Scala.

You have to execute import sqlContext.implicits._ first. In Spark 2.x you can get the sqlContext from the SparkSession object, or via the singleton method on SQLContext.
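
A minimal sketch of that fix for Spark 2.x, assuming you build your own SparkSession rather than using the spark-shell (the app name and master below are illustrative):

import org.apache.spark.sql.SparkSession

// Build (or reuse) a SparkSession; in spark-shell this is the prebuilt `spark`
val spark = SparkSession.builder()
  .appName("xml-column-parse")
  .master("local[*]")
  .getOrCreate()

// Brings .toDF into scope for Seq[Data]
import spark.implicits._

val dff1 = Seq(Data(1001, 12345, "<AccountSetup>...</AccountSetup>")).toDF()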
//dff1.show()
+----+---------+--------------------+
|  id|accountid|             xmldata|
+----+---------+--------------------+
|1001|    12345|<AccountSetup xml...|
|1002|    12345|<AccountSetup xml...|
|1003|    12345|<AccountSetup xml...|
+----+---------+--------------------+

xmlrdd.toDF("newRowXml").show()
+--------------------+
|           newRowXml|
+--------------------+
|<AccountSetup xml...|
|<AccountSetup xml...|
|<AccountSetup xml...|
+--------------------+
18/09/20 19:30:29 INFO SparkUI: Stopped Spark web UI at http://192.168.56.1:4040
18/09/20 19:30:29 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/09/20 19:30:29 INFO MemoryStore: MemoryStore cleared
18/09/20 19:30:29 INFO BlockManager: BlockManager stopped
18/09/20 19:30:29 INFO BlockManagerMaster: BlockManagerMaster stopped
18/09/20 19:30:29 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/09/20 19:30:29 INFO SparkContext: Successfully stopped SparkContext
18/09/20 19:30:29 INFO ShutdownHookManager: Shutdown hook called
18/09/20 19:30:29 INFO ShutdownHookManager: Deleting directory C:\Users\rajkiranu\AppData\Local\Temp\spark-16433b5e-01b7-472b-9b88-fea0a67a991a

Process finished with exit code 1
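
As for the UDF route mentioned in the comments, a rough sketch (not a tested implementation) could look like the following. It assumes scala-xml is on the classpath, spark.implicits._ is imported, and that only the text of the Customers element is needed; parseCustomers is a hypothetical helper name.

import org.apache.spark.sql.functions.udf

// Hypothetical column-level parser: load each xmldata string with scala-xml
// and pull out the text of its Customers element. Error handling omitted.
val parseCustomers = udf { xml: String =>
  (scala.xml.XML.loadString(xml) \ "Customers").text
}

dff1.withColumn("customers", parseCustomers($"xmldata")).show(false)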