How to match multiple regex patterns against a column value in Spark?

I have a column, processed as follows:

import org.apache.spark.sql.functions.udf
import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell

val originalSqlLikePatternMap = Map(
  "item (%) is blacklisted%" -> "BLACK_LIST",
  "%Testing%" -> "TESTING",
  "%purchase count % is too low %" -> "TOO_LOW_PURCHASE_COUNT")

// Turn each SQL LIKE pattern into a Java regex string
val javaPatternMap = originalSqlLikePatternMap.map(v => v._1.replaceAll("%", ".*") -> v._2)

val df = Seq(
  "Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low",
  "Foo purchase count (12, 4) is too low ", "#!@", "item (mejwnw) is blacklisted",
  "item (1) is blacklisted, #!@"
).toDF("raw_type")

// Return the label of the first pattern that matches the whole value
val converter = (value: String) => javaPatternMap.find(v => value.matches(v._1)).map(_._2).getOrElse("Unknown")
val converterUDF = udf(converter)

val result = df.withColumn("updatedType", converterUDF($"raw_type"))
But it gives:

+---------------------------------------------------------+----------------------+
|raw_type                                                 |updatedType           |
+---------------------------------------------------------+----------------------+
|Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low|TESTING               |
|Foo purchase count (12, 4) is too low                    |TOO_LOW_PURCHASE_COUNT|
|#!@                                                      |Unknown               |
|item (mejwnw) is blacklisted                             |BLACK_LIST            |
|item (1) is blacklisted, #!@                             |BLACK_LIST            |
+---------------------------------------------------------+----------------------+
But I want "Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low" to give 2 values, "TESTING, TOO_LOW_PURCHASE_COUNT", as shown below:

+---------------------------------------------------------+--------------------------------+
|raw_type                                                 |updatedType                     |
+---------------------------------------------------------+--------------------------------+
|Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low|TESTING, TOO_LOW_PURCHASE_COUNT |
|Foo purchase count (12, 4) is too low                    |TOO_LOW_PURCHASE_COUNT          |
|#!@                                                      |Unknown                         |
|item (mejwnw) is blacklisted                             |BLACK_LIST                      |
|item (1) is blacklisted, #!@                             |BLACK_LIST, Unknown             |
+---------------------------------------------------------+--------------------------------+

Can someone tell me what I am doing wrong?

OK, there are a few things here:

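  • About find: to get the desired output you need to check each value against every regex, so find is not the right choice. It returns only:

    the first value produced by the iterator satisfying a predicate, if any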

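  • Watch the regex: there is a space left after "low", which is why the first row does not match it. Please also reconsider whether you want to replace % with .* as is.

    %purchase count % is too low %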

  • So, with those changes, your code will look like:

    import org.apache.spark.sql.functions.udf
    import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell

    val originalSqlLikePatternMap = Map(
      "item (%) is blacklisted%" -> "BLACK_LIST",
      "%Testing%" -> "TESTING",
      "%purchase count % is too low%" -> "TOO_LOW_PURCHASE_COUNT") // no trailing space after "low"

    // Compile each pattern once into a scala.util.matching.Regex
    val javaPatternMap = originalSqlLikePatternMap.map(v => v._1.replaceAll("%", ".*").r -> v._2)

    val df = Seq(
      "Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low",
      "Foo purchase count (12, 4) is too low ", "#!@", "item (mejwnw) is blacklisted",
      "item (1) is blacklisted, #!@"
    ).toDF("raw_type")

    val converter = (value: String) => {
      // Test the value against every pattern and keep the label of each one that matches
      val res = javaPatternMap.map(v => {
        v._1.findFirstIn(value) match {
          case Some(_) => v._2
          case None => ""
        }
      })
        .filter(_.nonEmpty).mkString(", ")

      // Fall back to "Unknown" only when no pattern matched at all
      if (res.isEmpty) "Unknown" else res
    }

    val converterUDF = udf(converter)

    val result = df.withColumn("updatedType", converterUDF($"raw_type"))

    result.show(false)
    
    Output:

    +---------------------------------------------------------+-------------------------------+
    |raw_type                                                 |updatedType                    |
    +---------------------------------------------------------+-------------------------------+
    |Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low|TESTING, TOO_LOW_PURCHASE_COUNT|
    |Foo purchase count (12, 4) is too low                    |TOO_LOW_PURCHASE_COUNT         |
    |#!@                                                      |Unknown                        |
    |item (mejwnw) is blacklisted                             |BLACK_LIST                     |
    |item (1) is blacklisted, #!@                             |BLACK_LIST                     |
    +---------------------------------------------------------+-------------------------------+
    

    Hope this helps!
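
To make the first two points concrete without needing a Spark session, here is a minimal plain-Scala sketch. The object name LikePatterns and the helpers likeToRegex, firstLabel and allLabels are hypothetical names introduced for illustration; they are not part of the original code. It assumes that quoting the literal text between the % wildcards is acceptable, and it handles only %, not the single-character wildcard _ or escape sequences. It also uses a full match, as the question did, rather than the findFirstIn search used in the answer.

```scala
import java.util.regex.Pattern

object LikePatterns {
  // Rules copied from the answer (no trailing space after "low").
  val rules: Seq[(String, String)] = Seq(
    "item (%) is blacklisted%"      -> "BLACK_LIST",
    "%Testing%"                     -> "TESTING",
    "%purchase count % is too low%" -> "TOO_LOW_PURCHASE_COUNT"
  )

  // Hypothetical helper: quote each literal piece between the % wildcards so
  // that characters such as ( and ) keep their literal meaning, then join the
  // pieces with ".*". Only % is treated as a wildcard here.
  def likeToRegex(like: String): Pattern =
    Pattern.compile(like.split("%", -1).map(s => Pattern.quote(s)).mkString(".*"))

  private val compiled: Seq[(Pattern, String)] =
    rules.map { case (like, label) => likeToRegex(like) -> label }

  // find stops at the first rule that matches, so at most one label comes back.
  def firstLabel(value: String): Option[String] =
    compiled.find { case (p, _) => p.matcher(value).matches() }.map(_._2)

  // collect keeps the label of every rule that matches.
  def allLabels(value: String): Seq[String] =
    compiled.collect { case (p, label) if p.matcher(value).matches() => label }
}
```

With these definitions, firstLabel on the first sample row yields only TESTING, while allLabels yields both TESTING and TOO_LOW_PURCHASE_COUNT. The quoting matters because the plain replaceAll("%", ".*") conversion turns "item (%) is blacklisted%" into a regex whose parentheses form a capture group, so it would also accept "item 1 is blacklisted"; the quoted version requires the literal parentheses.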

    The way the udf is currently written, it gives Unknown when there are no matches. If several regexes match, they are all returned, as for the top row. Unknown is the default value. To change that, use case None => "Unknown".

    So if there are multiple values it will match them, but if an unknown value is followed by other matches, it will not be matched?

    Yes, because there is no specific rule for Unknown in javaPatternMap.
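
The exchange above can be checked with a small plain-Scala sketch of the converter, again without Spark. The object name UnknownFallback and the function convert are hypothetical names used only for this illustration; the three regexes are the ones the answer's replaceAll("%", ".*") conversion produces.

```scala
object UnknownFallback {
  // The regexes the answer obtains from the SQL LIKE patterns.
  val rules = Seq(
    "item (.*) is blacklisted.*".r       -> "BLACK_LIST",
    ".*Testing.*".r                      -> "TESTING",
    ".*purchase count .* is too low.*".r -> "TOO_LOW_PURCHASE_COUNT"
  )

  // Collect the label of every rule found in the value; "Unknown" is only
  // a fallback for the case where nothing matched.
  def convert(value: String): String = {
    val labels = rules.collect {
      case (regex, label) if regex.findFirstIn(value).isDefined => label
    }
    if (labels.isEmpty) "Unknown" else labels.mkString(", ")
  }
}
```

Here "item (1) is blacklisted, #!@" maps to BLACK_LIST alone: the trailing "#!@" is absorbed by the final .* of the first rule, and since no rule produces Unknown for a part of the string, Unknown is never appended next to another label.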