How to match multiple regex patterns against a column value in Spark?
Tags: regex, scala, apache-spark, dataframe, pattern-matching

I have a column, and I tried the following:
val originalSqlLikePatternMap = Map("item (%) is blacklisted%" -> "BLACK_LIST",
"%Testing%" -> "TESTING",
"%purchase count % is too low %" -> "TOO_LOW_PURCHASE_COUNT")
val javaPatternMap = originalSqlLikePatternMap.map(v => v._1.replaceAll("%", ".*") -> v._2)
val df = Seq(
"Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low",
"Foo purchase count (12, 4) is too low ", "#!@", "item (mejwnw) is blacklisted",
"item (1) is blacklisted, #!@"
).toDF("raw_type")
val converter = (value: String) => javaPatternMap.find(v => value.matches(v._1)).map(_._2).getOrElse("Unknown")
val converterUDF = udf(converter)
val result = df.withColumn("updatedType", converterUDF($"raw_type"))
But it gives:
+---------------------------------------------------------+----------------------+
|raw_type |updatedType |
+---------------------------------------------------------+----------------------+
|Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low|TESTING |
|Foo purchase count (12, 4) is too low |TOO_LOW_PURCHASE_COUNT|
|#!@ |Unknown |
|item (mejwnw) is blacklisted |BLACK_LIST |
|item (1) is blacklisted, #!@ |BLACK_LIST |
+---------------------------------------------------------+----------------------+
But I want "Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low" to give two values, "TESTING, TOO_LOW_PURCHASE_COUNT", as follows:
+---------------------------------------------------------+--------------------------------+
|raw_type |updatedType |
+---------------------------------------------------------+--------------------------------+
|Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low|TESTING, TOO_LOW_PURCHASE_COUNT |
|Foo purchase count (12, 4) is too low |TOO_LOW_PURCHASE_COUNT |
|#!@ |Unknown |
|item (mejwnw) is blacklisted |BLACK_LIST |
|item (1) is blacklisted, #!@ |BLACK_LIST, Unknown |
+---------------------------------------------------------+--------------------------------+
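Can anyone tell me what I am doing wrong?

OK. There are a few things here:
1. find is not the right choice. To get the desired output you need to check each row against every regular expression, whereas find returns only the first value produced by the iterator that satisfies the predicate, if any.
2. % is replaced with .*
3. The last pattern becomes "%purchase count % is too low%", without the space before the final %, so that a value ending in "too low" with no trailing space (as in the first row) can match.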
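The difference between stopping at the first match and keeping every match can be sketched in plain Scala, without Spark. The object name FindVersusCollect and the helpers firstLabel and allLabels are hypothetical names introduced for illustration; the patterns are the already-converted regexes from this thread.

```scala
import scala.util.matching.Regex

object FindVersusCollect {
  // Patterns from this thread, already converted from LIKE syntax to regex.
  val patterns: Seq[(Regex, String)] = Seq(
    "item (.*) is blacklisted.*".r -> "BLACK_LIST",
    ".*Testing.*".r -> "TESTING",
    ".*purchase count .* is too low.*".r -> "TOO_LOW_PURCHASE_COUNT"
  )

  // find stops at the first pattern that matches, so at most one label comes back.
  def firstLabel(value: String): Option[String] =
    patterns.find { case (regex, _) => regex.findFirstIn(value).isDefined }.map(_._2)

  // collect keeps the label of every pattern that matches.
  def allLabels(value: String): Seq[String] =
    patterns.collect {
      case (regex, label) if regex.findFirstIn(value).isDefined => label
    }
}
```

For the first row of the example data, firstLabel yields Some("TESTING") while allLabels yields both TESTING and TOO_LOW_PURCHASE_COUNT; for "#!@" allLabels is empty, which is where a default such as "Unknown" would be substituted.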
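// Revised version: every pattern is tested, and the labels of all matches are joined.
// Assumes a SparkSession named spark is in scope, as in spark-shell.
import org.apache.spark.sql.functions.udf
import spark.implicits._

val originalSqlLikePatternMap = Map(
  "item (%) is blacklisted%" -> "BLACK_LIST",
  "%Testing%" -> "TESTING",
  "%purchase count % is too low%" -> "TOO_LOW_PURCHASE_COUNT")
// Convert each LIKE pattern to a compiled Regex by replacing % with .*
val javaPatternMap = originalSqlLikePatternMap.map(v => v._1.replaceAll("%", ".*").r -> v._2)
val df = Seq(
  "Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low",
  "Foo purchase count (12, 4) is too low ", "#!@", "item (mejwnw) is blacklisted",
  "item (1) is blacklisted, #!@"
).toDF("raw_type")
val converter = (value: String) => {
  // Emit the label of each pattern found in the value; non-matches yield "" and are filtered out
  val res = javaPatternMap.map(v => {
    v._1.findFirstIn(value) match {
      case Some(_) => v._2
      case None => ""
    }
  })
    .filter(_.nonEmpty).mkString(", ")
  // Fall back to "Unknown" only when no pattern matched at all
  if (res.isEmpty) "Unknown" else res
}
val converterUDF = udf(converter)
val result = df.withColumn("updatedType", converterUDF($"raw_type"))
result.show(false)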
Output:
+---------------------------------------------------------+-------------------------------+
|raw_type |updatedType |
+---------------------------------------------------------+-------------------------------+
|Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low|TESTING, TOO_LOW_PURCHASE_COUNT|
|Foo purchase count (12, 4) is too low |TOO_LOW_PURCHASE_COUNT |
|#!@ |Unknown |
|item (mejwnw) is blacklisted |BLACK_LIST |
|item (1) is blacklisted, #!@ |BLACK_LIST |
+---------------------------------------------------------+-------------------------------+
Hope this helps!
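One caveat with replacing % by .* directly is that any regex metacharacters in the LIKE pattern stay active. In "item (%) is blacklisted%", for instance, the parentheses become a capture group instead of literal characters. A minimal sketch of a stricter conversion follows, assuming only % needs to be treated as a wildcard; likeToRegex and matchesLike are hypothetical helper names, and the LIKE wildcard _ and escape characters are not handled.

```scala
import java.util.regex.Pattern
import scala.util.matching.Regex

object LikeToRegex {
  // Hypothetical helper: split on %, quote each literal piece, rejoin with .*
  // The -1 limit keeps trailing empty pieces, so a trailing % still becomes .*
  def likeToRegex(like: String): Regex =
    like.split("%", -1).map(Pattern.quote).mkString(".*").r

  // LIKE compares against the whole value, so require a full match here.
  def matchesLike(like: String, value: String): Boolean =
    likeToRegex(like).pattern.matcher(value).matches()
}
```

With this conversion, "item (%) is blacklisted%" matches "item (mejwnw) is blacklisted" but not "item mejwnw is blacklisted", whereas the unquoted regex accepts both. Note that matchesLike requires the whole value to match, like String.matches in the question; the answer's findFirstIn also accepts a match that starts in the middle of the value.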
Comments:
- The UDF is currently written so that it returns Unknown when nothing matches. When several regular expressions match, all of their labels are returned, as in the top row. Unknown is only the default value. To change that, use case None => "Unknown".
- So if there are multiple values it will match them, and if an unknown value is followed by another match, the unknown part will not be matched?
- Yes, because javaPatternMap has no specific rule for Unknown.