How to handle invalid XML strings and invalid JSON strings in DataFrame / Spark SQL / Spark Scala


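I have a scenario where I have to parse XML and JSON values based on another field.

The Customer_Order table has two fields named response_id and response_output. response_output contains a mix of JSON strings, XML strings, the literal Error, blanks, and nulls.

I need to solve the following problem statement.

Problem statement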

val DF1 = rdd.toDF("customer_id","response_id","response_output")
spark.sql("""select customer_id,response_id,response_output from Customer_Order""").show()

+-----------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|customer_id|response_id|response_output                                                                                                                                                                                                                                                                                                                          |
+-----------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|100        |1          |<USR_ORD><OrderResult><ORDStatus>COMPLETE</ORDStatus><ORDValue><USR1OrderTotalTime>321</USR1OrderTotalTime><USR1OrderKYC>{ND}</USR1OrderKYC><USR1OrderLoc>{ND}</USR1OrderLoc><USR1Orderqnt>10</USR1Orderqnt><USR1Orderxyz>0</USR1Orderxyz><USR1OrderD>{ND}</USR1OrderD></ORDValue></OrderResult><OrderResponse></USR_ORD>                |
|200        |1          |<USR_ORD><OrderResponse><OrderResult><ORDStatus>COMPLETE</ORDStatus><ORDValue><USR1OrderTotalTime>221</USR1OrderTotalTime><USR1OrderKYC>{ND}</USR1OrderKYC><USR1OrderLoc>{ND}</USR1OrderLoc><USR1Orderqnt>10</USR1Orderqnt><USR1Orderxyz>0</USR1Orderxyz><USR1OrderD>{ND}</USR1OrderD></ORDValue></OrderResult></OrderResponse></USR_ORD>|
|300        |1          |Error                                                                                                                                                                                                                                                                                                                                    |
|400        |1          |                                                                                                                                                                                                                                                                                                                                         |
|500        |1          |{ "OrderResponse":"COMPLETE", "OrderTime":300, "USR1Order":null, "USR1Orderqut":10 }                                                                                                                                                                                                                                                     |
|600        |2          |{ "OrderResponse":"COMPLETE", "OrderTime":300, "USR1Order":null, "USR1Orderqut":10 }                                                                                                                                                                                                                                                     |
|700        |2          |{ "OrderResponse":"COMPLETE", "OrderTime":200, "USR1Order":null "USR1Orderqut":4}                                                                                                                                                                                                                                                        |
|800        |2          |{ "OrderResponse":"COMPLETE", "OrderTime":100, "USR1Order":null} "USR1Orderqut":1}                                                                                                                                                                                                                                                       |
|900        |2          |<USR_ORD><OrderResponse><OrderResult><ORDStatus>COMPLETE</ORDStatus><ORDValue><USR1OrderTotalTime>221</USR1OrderTotalTime><USR1OrderKYC>{ND}</USR1OrderKYC><USR1OrderLoc>{ND}</USR1OrderLoc><USR1Orderqnt>10</USR1Orderqnt><USR1Orderxyz>0</USR1Orderxyz><USR1OrderD>{ND}</USR1OrderD></ORDValue></OrderResult></OrderResponse></USR_ORD>|
|101        |2          |Error                                                                                                                                                                                                                                                                                                                                    |
|202        |2          |                                                                                                                                                                                                                                                                                                                                         |
+-----------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
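  • If response_id=1 and response_output is valid JSON, apply the JSON logic

  • If response_id=1 and response_output is not valid JSON, return null

  • If response_id=1 and response_output is an XML value, return null

  • If response_id=1 and response_output is Error, return Error

  • If response_id=1 and response_output is blank or null, return null

  • If response_id=2 and response_output is valid XML, apply the XML logic

  • If response_id=2 and response_output is not valid XML, return null

  • If response_id=2 and response_output is a JSON value, return null

  • If response_id=2 and response_output is Error, return Error

  • If response_id=2 and response_output is blank or null, return null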
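
The rules above can be read as a single decision function. The following is a minimal sketch, assuming response_id 1 selects the JSON logic, response_id 2 selects the XML logic, and the literal string Error is passed through unchanged. The names `ResponseRules`, `resolve`, `fromJson`, and `fromXml` are hypothetical; the two extractor parameters stand in for whatever JSON and XML parsing is actually used and are expected to return None for invalid input rather than throw.

```scala
object ResponseRules {
  // Hypothetical helper: applies the bullet rules to one row.
  // fromJson / fromXml are caller-supplied extractors that must return
  // None (not throw) when the payload is not valid JSON / XML.
  def resolve(
      responseId: Int,
      output: String,
      fromJson: String => Option[String],
      fromXml: String => Option[String]
  ): Option[String] =
    Option(output).map(_.trim).filter(_.nonEmpty) match {
      case None          => None          // blank or null -> null
      case Some("Error") => Some("Error") // Error is passed through
      case Some(body) =>
        responseId match {
          case 1 => fromJson(body) // XML or broken JSON -> None
          case 2 => fromXml(body)  // JSON or broken XML -> None
          case _ => None           // unknown response_id (assumption)
        }
    }
}
```

Returning Option[String] keeps the "null" outcomes explicit; a Spark UDF built on such a function would yield SQL NULL for None.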

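I tried to implement the problem statement above with Spark SQL, but my code breaks when it encounters invalid XML or invalid JSON.

The query is shown below. Can anyone help me handle this?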

   spark.sql("""select 
    customer_id,
    response_id,
    CASE WHEN (response_id=2 and response_output!="Error") THEN get_json_object(response_output, '$.Metrics.OrderResponseTime')
         WHEN (response_id=1 and response_output!="Error") THEN xpath_string(response_output,'USR_ORD/OrderResponse/USR1OrderTotalTime')
         WHEN ((response_id=1 or response_id=2) and  response_output="Error") THEN "Error"
         ELSE null END as order_time 
         from Customer_Order""").show()
Below is the error I get when running the query above. How can I handle invalid XML or JSON?

Driver stacktrace:
21/02/05 00:48:06 INFO scheduler.DAGScheduler: Job 5 failed: show at Engine.scala:221, took 1.099890 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 10.0 failed 4 times, most recent failure: Lost task 3.3 in stage 10.0 (TID 83, srwredf2021.analytics1.test.dev.corp, executor 3): java.lang.RuntimeException: Invalid XML document: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1234; XML document structures must start and end within the same entity.
<USR_ORD><OrderResult><ORDTime>2021-02-02 10:34:19</ORDTime><ORDStatus>COMPLETE</ORDStatus><ORDValue><USR1OrderTotalTime>321</USR1OrderTotalTime><USR1OrderKYC>{ND}</USR1OrderKYC><USR1OrderLoc>{ND}</USR1OrderLoc><USR1Orderqnt>10</USR1Orderqnt><USR1Orderxyz>0</USR1Orderxyz><USR1OrderD>{ND}</USR1OrderD></ORDValue></OrderResult><OrderResponse></USR_ORD>
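
The stack trace shows the XML parser throwing on a document whose tags do not balance, which fails the whole task. One way to avoid that is to evaluate the XPath inside a `Try`, so malformed rows yield no value instead of an exception. This is a minimal sketch using only the JDK's `javax.xml.xpath` API; `SafeXml` and `safeXPathString` are hypothetical names, and treating an empty match as None is an assumption.

```scala
import java.io.StringReader
import javax.xml.xpath.XPathFactory
import org.xml.sax.InputSource
import scala.util.Try

object SafeXml {
  // Evaluates `path` against `xml` and returns the matched text.
  // Returns None for null, blank, non-XML, or malformed input, and
  // (by assumption) also when the path matches nothing.
  def safeXPathString(xml: String, path: String): Option[String] =
    Option(xml)
      .map(_.trim)
      .filter(_.startsWith("<")) // cheap pre-check: skips "Error", JSON, blanks
      .flatMap { doc =>
        Try {
          // A new XPath per call is simple and thread-safe, though not the fastest.
          val xpath = XPathFactory.newInstance().newXPath()
          xpath.evaluate(path, new InputSource(new StringReader(doc)))
        }.toOption // parse failures become None
      }
      .filter(_.nonEmpty)
}
```

In Spark this could be exposed as a UDF, for example `spark.udf.register("safe_xpath_string", SafeXml.safeXPathString _)`, and used in place of `xpath_string` in the CASE expression.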
Imposing schema column names

val DF1 = rdd.toDF("customer_id","response_id","response_output")
Create the table

DF1.createOrReplaceTempView("Customer_Order")
Print the schema

spark.sql("""select customer_id,response_id,response_output from Customer_Order""").printSchema()

root
 |-- customer_id: integer (nullable = false)
 |-- response_id: integer (nullable = false)
 |-- response_output: string (nullable = true)
Display the records

spark.sql("""select customer_id,response_id,response_output from Customer_Order""").show()
