How to handle invalid XML strings and invalid JSON strings in Dataframe/Spark SQL/Spark Scala
I have a scenario where I have to parse XML and JSON values based on another field.
The Customer_Order table has two fields, named response_id and response_output. response_output contains a mix of JSON strings, XML strings, Error, blanks and null.
I need to solve the following problem statement.
Problem statement
val DF1 = rdd.toDF("customer_id","response_id","response_output")
spark.sql("""select customer_id,response_id,response_output from Customer_Order""").show()
+-----------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|customer_id|response_id|response_output |
+-----------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|100 |1 |<USR_ORD><OrderResult><ORDStatus>COMPLETE</ORDStatus><ORDValue><USR1OrderTotalTime>321</USR1OrderTotalTime><USR1OrderKYC>{ND}</USR1OrderKYC><USR1OrderLoc>{ND}</USR1OrderLoc><USR1Orderqnt>10</USR1Orderqnt><USR1Orderxyz>0</USR1Orderxyz><USR1OrderD>{ND}</USR1OrderD></ORDValue></OrderResult><OrderResponse></USR_ORD> |
|200 |1 |<USR_ORD><OrderResponse><OrderResult><ORDStatus>COMPLETE</ORDStatus><ORDValue><USR1OrderTotalTime>221</USR1OrderTotalTime><USR1OrderKYC>{ND}</USR1OrderKYC><USR1OrderLoc>{ND}</USR1OrderLoc><USR1Orderqnt>10</USR1Orderqnt><USR1Orderxyz>0</USR1Orderxyz><USR1OrderD>{ND}</USR1OrderD></ORDValue></OrderResult></OrderResponse></USR_ORD>|
|300 |1 |Error |
|400 |1 | |
|500 |1 |{ "OrderResponse":"COMPLETE", "OrderTime":300, "USR1Order":null, "USR1Orderqut":10 } |
|600 |2 |{ "OrderResponse":"COMPLETE", "OrderTime":300, "USR1Order":null, "USR1Orderqut":10 } |
|700 |2 |{ "OrderResponse":"COMPLETE", "OrderTime":200, "USR1Order":null "USR1Orderqut":4} |
|800 |2 |{ "OrderResponse":"COMPLETE", "OrderTime":100, "USR1Order":null} "USR1Orderqut":1} |
|900 |2 |<USR_ORD><OrderResponse><OrderResult><ORDStatus>COMPLETE</ORDStatus><ORDValue><USR1OrderTotalTime>221</USR1OrderTotalTime><USR1OrderKYC>{ND}</USR1OrderKYC><USR1OrderLoc>{ND}</USR1OrderLoc><USR1Orderqnt>10</USR1Orderqnt><USR1Orderxyz>0</USR1Orderxyz><USR1OrderD>{ND}</USR1OrderD></ORDValue></OrderResult></OrderResponse></USR_ORD>|
|101 |2 |Error |
|202 |2 | |
+-----------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
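- If response_id=1 and response_output contains valid JSON, apply the JSON logic
- If response_id=1 and response_output does not contain valid JSON, nullify it
- If response_id=1 and response_output is an XML value, return null
- If response_id=1 and response_output is Error, return Error
- If response_id=1 and response_output is blank or null, return null
- If response_id=2 and response_output contains valid XML, apply the XML logic
- If response_id=2 and response_output does not contain valid XML, nullify it
- If response_id=2 and response_output is a JSON value, return null
- If response_id=2 and response_output is Error, return Error
- If response_id=2 and response_output is blank or null, return null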
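The rules above can be sketched as a small pure Scala function. This is only an illustration: the names `OrderTimeRules`, `orderTime`, `fromJson` and `fromXml` are hypothetical, and the two extractors are assumed to return `None` for payloads that are malformed or in the wrong format. The sketch follows the list as written (response_id 1 for JSON, 2 for XML); the query in the question maps the ids the other way around, so swap the branches if that is the intended behaviour.

```scala
object OrderTimeRules {
  // Applies the rule list to one row.
  // fromJson / fromXml are hypothetical extractors supplied by the caller;
  // each is assumed to return None for invalid or wrong-format input.
  def orderTime(
      responseId: Int,
      output: String,
      fromJson: String => Option[String],
      fromXml: String => Option[String]
  ): Option[String] =
    Option(output)            // null     -> None
      .map(_.trim)
      .filter(_.nonEmpty)     // blank    -> None
      .flatMap {
        case "Error"                  => Some("Error")   // pass Error through
        case body if responseId == 1  => fromJson(body)  // JSON logic
        case body if responseId == 2  => fromXml(body)   // XML logic
        case _                        => None            // unknown response_id
      }
}
```

Keeping the extractors as parameters means the null, blank and Error handling can be checked without Spark, and the format-specific parsing can be swapped in later.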
spark.sql("""select
customer_id,
response_id,
CASE WHEN (response_id=2 and response_output!="Error") THEN get_json_object(response_output, '$.Metrics.OrderResponseTime')
WHEN (response_id=1 and response_output!="Error") THEN xpath_string(response_output,'USR_ORD/OrderResponse/USR1OrderTotalTime')
WHEN ((response_id=1 or response_id=2) and response_output="Error") THEN "Error"
ELSE null END as order_time
from Customer_Order""").show()
Below is the error I get when I run the query above. How can I handle invalid XML or JSON?
Driver stacktrace:
21/02/05 00:48:06 INFO scheduler.DAGScheduler: Job 5 failed: show at Engine.scala:221, took 1.099890 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 10.0 failed 4 times, most recent failure: Lost task 3.3 in stage 10.0 (TID 83, srwredf2021.analytics1.test.dev.corp, executor 3): java.lang.RuntimeException: Invalid XML document: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1234; XML document structures must start and end within the same entity.
<USR_ORD><OrderResult><ORDTime>2021-02-02 10:34:19</ORDTime><ORDStatus>COMPLETE</ORDStatus><ORDValue><USR1OrderTotalTime>321</USR1OrderTotalTime><USR1OrderKYC>{ND}</USR1OrderKYC><USR1OrderLoc>{ND}</USR1OrderLoc><USR1Orderqnt>10</USR1Orderqnt><USR1Orderxyz>0</USR1Orderxyz><USR1OrderD>{ND}</USR1OrderD></ORDValue></OrderResult><OrderResponse></USR_ORD>
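The stack trace shows the job failing inside the XML branch, on a document that is not well-formed. One possible way around this, sketched here under assumptions rather than taken from the question, is to do the XPath lookup in a helper that catches the parse failure and returns nothing instead of throwing. `SafeXml` and `xpathOrNone` are hypothetical names; the helper uses only the JDK XML APIs.

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.xpath.XPathFactory
import org.xml.sax.InputSource
import scala.util.Try

object SafeXml {
  // Returns the string value of `path`, or None when the payload is null,
  // blank, does not look like XML, or is not well-formed XML.
  def xpathOrNone(payload: String, path: String): Option[String] =
    Option(payload)
      .map(_.trim)
      .filter(_.startsWith("<"))   // skips "Error", JSON and blank values
      .flatMap { xml =>
        Try {
          val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          // Throws SAXParseException for malformed documents; Try captures it.
          val doc = builder.parse(new InputSource(new StringReader(xml)))
          XPathFactory.newInstance().newXPath().evaluate(path, doc)
        }.toOption
      }
}
```

In a Spark job this could be exposed to SQL with something like `spark.udf.register("safe_xpath_string", (s: String, p: String) => SafeXml.xpathOrNone(s, p).orNull)` and used in place of `xpath_string` in the CASE expression, so that a malformed row yields null instead of aborting the stage. The function name `safe_xpath_string` is an assumption for illustration.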
Forcing the schema column names
val DF1 = rdd.toDF("customer_id","response_id","response_output")
Create the table
DF1.createOrReplaceTempView("Customer_Order")
Print the schema
spark.sql("""select customer_id,response_id,response_output from Customer_Order""").printSchema()
root
|-- customer_id: integer (nullable = false)
|-- response_id: integer (nullable = false)
|-- response_output: string (nullable = true)
Show the records
spark.sql("""select customer_id,response_id,response_output from Customer_Order""").show()