Regex: extract the number that follows a specific string in a Spark DataFrame column (Scala)


I have a DataFrame df in the following format:

+----------------------+------------------+----------------------------------------------------------------------------------------------+
|constraint            |constraint_status |constraint_msg                                                                                |
+----------------------+------------------+----------------------------------------------------------------------------------------------+
|CompletenessConstraint|Success           |Value: 1.0 Notnull condition should be satisfied                                              |
|UniquenessConstraint  |Success           |Value: 1.0 Uniqueness condition should be satisfied                                           |
|PatternMatchConstraint|Failure           |Expected type of column CHD_ACCOUNT_NUMBER to be StringType                                   |
|MinimumConstraint     |Success           |Value: 5.1210650000005 Minimum value should be greater than 10.000000                         |
|HistogramConstraint   |Failure           |Can't execute the assertion: key not found: 1242.0!Percentage should be greater than 10.000000|
+----------------------+------------------+----------------------------------------------------------------------------------------------+

I want to get the numeric value that follows the string "Value:" and put it in a new column named Value.

Expected output:

+----------------------+------------------+----------------------------------------------------------------------------------------------+---------------+
|constraint            |constraint_status |constraint_msg                                                                                |Value          |
+----------------------+------------------+----------------------------------------------------------------------------------------------+---------------+
|CompletenessConstraint|Success           |Value: 1.0 Notnull condition should be satisfied                                              |1.0            |
|UniquenessConstraint  |Success           |Value: 1.0 Uniqueness condition should be satisfied                                           |1.0            |
|PatternMatchConstraint|Failure           |Expected type of column CHD_ACCOUNT_NUMBER to be StringType                                   |null           |
|MinimumConstraint     |Success           |Value: 5.1210650000005 Minimum value should be greater than 10.000000                         |5.1210650000005|
|HistogramConstraint   |Failure           |Can't execute the assertion: key not found: 1242.0!Percentage should be greater than 10.000000|null           |
+----------------------+------------------+----------------------------------------------------------------------------------------------+---------------+

I tried the following code:

      df = df.withColumn("Value",split(df("constraint_msg"), "Value\\: (\\d+)").getItem(0))

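But it throws an error. Any help is appreciated.

org.apache.spark.sql.AnalysisException: cannot resolve 'split(`constraint_msg`, 'Value\: (\d+)')' due to data type mismatch: argument 1 requires string type, however, '`constraint_msg`' is of array type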

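when..otherwise lets you first filter out the records that do not contain "Value:". Assuming constraint_msg always starts with "Value:", I would split on spaces and take the second element as the desired value.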

import org.apache.spark.sql.functions.{col, split, when}

// In spark-shell, sc and the toDF implicits are already in scope
val df = sc.parallelize(Seq(("CompletenessConstraint", "Success", "Value: 1.0 Notnull condition should be satisfied"), ("PatternMatchConstraint", "Failure", "Expected type of column CHD_ACCOUNT_NUMBER to be StringType"))).toDF("constraint", "constraint_status", "constraint_msg")

// Rows containing "Value:" get the second space-separated token; all other rows get null
val df1 = df.withColumn("Value",when(col("constraint_msg").contains("Value:"),split(df("constraint_msg"), " ").getItem(1)).otherwise(null))

df1.show()
+--------------------+-----------------+--------------------+-----+
|          constraint|constraint_status|      constraint_msg|Value|
+--------------------+-----------------+--------------------+-----+
|CompletenessConst...|          Success|Value: 1.0 Notnul...|  1.0|
|PatternMatchConst...|          Failure|Expected type of ...| null|
+--------------------+-----------------+--------------------+-----+
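
To make the assumption in this answer concrete, here is a minimal plain-Scala sketch of the same per-row logic, without Spark. The helper name valueBySplit and the last sample message are hypothetical and only serve to illustrate; they are not part of the question's data.

```scala
// Hypothetical helper mirroring:
//   when(col("constraint_msg").contains("Value:"), split(col("constraint_msg"), " ").getItem(1)).otherwise(null)
// None plays the role of null here.
def valueBySplit(msg: String): Option[String] =
  if (msg.contains("Value:")) {
    val tokens = msg.split(" ")
    // Second space-separated token, if there is one
    if (tokens.length > 1) Some(tokens(1)) else None
  } else None

val messages = Seq(
  "Value: 1.0 Notnull condition should be satisfied",
  "Expected type of column CHD_ACCOUNT_NUMBER to be StringType",
  "Value: 5.1210650000005 Minimum value should be greater than 10.000000",
  // Assumed extra case: "Value:" is present but not at the start of the message
  "Check failed. Value: 0.5 is too low"
)

messages.foreach(m => println(s"${valueBySplit(m)} <- $m"))
```

For the last message the second token is "failed." rather than the number, so the split approach is only safe while every matching message really begins with "Value: ".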

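Check the code below. It uses regexp_extract; note that rows where the pattern does not match get an empty string rather than null.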

scala> df.show(false)
+----------------------+------------------+----------------------------------------------------------------------------------------------+
|constraint            |constraint_status |constraint_msg                                                                                |
+----------------------+------------------+----------------------------------------------------------------------------------------------+
|CompletenessConstraint|Success           |Value: 1.0 Notnull condition should be satisfied                                              |
|UniquenessConstraint  |Success           |Value: 1.0 Uniqueness condition should be satisfied                                           |
|PatternMatchConstraint|Failure           |Expected type of column CHD_ACCOUNT_NUMBER to be StringType                                   |
|MinimumConstraint     |Success           |Value: 5.1210650000005 Minimum value should be greater than 10.000000                         |
|HistogramConstraint   |Failure           |Can't execute the assertion: key not found: 1242.0!Percentage should be greater than 10.000000|
+----------------------+------------------+----------------------------------------------------------------------------------------------+


scala> df
.withColumn("Value",regexp_extract($"constraint_msg","Value: (\\d+\\.\\d+)",1))
.show(false)
+----------------------+------------------+----------------------------------------------------------------------------------------------+---------------+
|constraint            |constraint_status |constraint_msg                                                                                |Value          |
+----------------------+------------------+----------------------------------------------------------------------------------------------+---------------+
|CompletenessConstraint|Success           |Value: 1.0 Notnull condition should be satisfied                                              |1.0            |
|UniquenessConstraint  |Success           |Value: 1.0 Uniqueness condition should be satisfied                                           |1.0            |
|PatternMatchConstraint|Failure           |Expected type of column CHD_ACCOUNT_NUMBER to be StringType                                   |               |
|MinimumConstraint     |Success           |Value: 5.1210650000005 Minimum value should be greater than 10.000000                         |5.1210650000005|
|HistogramConstraint   |Failure           |Can't execute the assertion: key not found: 1242.0!Percentage should be greater than 10.000000|               |
+----------------------+------------------+----------------------------------------------------------------------------------------------+---------------+
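
The expected output in the question shows null for rows without a value, while regexp_extract yields an empty string there. One possible way to close that gap, sketched under some assumptions, is to wrap the extraction in when(...) without an otherwise(...). The local SparkSession, the app name, and the optional cast to double are assumptions for illustration; in spark-shell the existing spark session can be used instead.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_extract, when}

// Assumption: a local session for experimentation; spark-shell already provides `spark`
val spark = SparkSession.builder().master("local[*]").appName("value-extract").getOrCreate()
import spark.implicits._

// Two rows taken from the question's sample data
val df = Seq(
  ("MinimumConstraint", "Success", "Value: 5.1210650000005 Minimum value should be greater than 10.000000"),
  ("PatternMatchConstraint", "Failure", "Expected type of column CHD_ACCOUNT_NUMBER to be StringType")
).toDF("constraint", "constraint_status", "constraint_msg")

// Digits, a literal dot, then digits, directly after "Value: "
val extracted = regexp_extract(col("constraint_msg"), "Value: (\\d+\\.\\d+)", 1)

val result = df.withColumn(
  "Value",
  // when(...) without otherwise(...) leaves null for rows that fail the condition;
  // the cast to double is optional and can be dropped to keep the value as a string
  when(extracted =!= "", extracted.cast("double"))
)

result.show(false)
```

Keeping the extraction in a single regexp_extract call avoids the positional assumption of the split-based approach, since the pattern anchors on the literal "Value: " prefix wherever it appears in the message.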
