How do I remove tags using RegexTokenizer() in Spark/Scala ML?

regex scala apache-spark

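I have a feature column that contains HTML tags, and I want to remove all of them. An example of one row of data in the "body" column:

"<p>Are questions related to and similar products on-topic?</p>"

The output I want:

"are questions related to and similar products on-topic?"

Here is what I have so far: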

val regexTokenizer = new RegexTokenizer()
  .setInputCol("body")
  .setOutputCol("removedTags")
  .setPattern("")

I think I need to fix .setPattern(), but I am not sure how.

Assuming your strings contain no other < or > characters, the pattern

<[^>]+>

replaced with an empty string might work reasonably well.
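To connect the pattern back to the question, here is a minimal sketch of two ways it might be applied in Spark. It assumes a local SparkSession and a DataFrame with a string column named "body"; the object name RemoveTagsSketch and the helpers withRegexpReplace and withTokenizer are hypothetical names made up for illustration. Note that RegexTokenizer splits text on the pattern rather than replacing matches, so its output column is an array of the text segments found between tags, while regexp_replace keeps the result as a single string.

```scala
import org.apache.spark.ml.feature.RegexTokenizer
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lower, regexp_replace}

object RemoveTagsSketch {
  // Pattern from the answer: "<", one or more non-">" characters, then ">".
  val TagPattern = "<[^>]+>"

  // Option 1 (hypothetical helper): replace every tag-like match with an
  // empty string and lowercase the result; the output stays a string column.
  def withRegexpReplace(df: DataFrame): DataFrame =
    df.withColumn("removedTags", lower(regexp_replace(col("body"), TagPattern, "")))

  // Option 2 (hypothetical helper): the asker's RegexTokenizer with the
  // pattern filled in. With the default settings the pattern is treated as
  // a separator and text is lowercased, so the output is an array of the
  // segments between tags, not a single string.
  def withTokenizer(df: DataFrame): DataFrame =
    new RegexTokenizer()
      .setInputCol("body")
      .setOutputCol("removedTags")
      .setPattern(TagPattern)
      .transform(df)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for trying this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("remove-tags-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq("<p>Are questions related to and similar products on-topic?</p>").toDF("body")

    withRegexpReplace(df).show(false)
    withTokenizer(df).show(false)

    spark.stop()
  }
}
```

If a plain string column is what the rest of the pipeline expects, the regexp_replace variant is probably the closer fit; the tokenizer variant may be preferable when the next stage wants an array of tokens anyway.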

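If you want to simplify, modify, or explore the expression, an online regex tester can explain each part of it and show how it matches against some sample inputs.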
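The same kind of exploration can be done locally without Spark. The sketch below uses only the Scala standard library; the object name ExploreTagPattern and its helpers are hypothetical. The third sample input is there to show what happens when the "no other < or >" assumption does not hold.

```scala
object ExploreTagPattern {
  // Pattern from the answer, compiled as a Scala Regex.
  val TagPattern = "<[^>]+>".r

  // Hypothetical helper: lists every substring the pattern treats as a tag.
  def matches(input: String): List[String] = TagPattern.findAllIn(input).toList

  // Hypothetical helper: removes every match from the input.
  def stripTags(input: String): String = TagPattern.replaceAllIn(input, "")

  def main(args: Array[String]): Unit = {
    val samples = Seq(
      "<p>Are questions related to and similar products on-topic?</p>",
      "no tags here",
      "a < b and c > d" // stray angle brackets: "< b and c >" is matched as if it were a tag
    )
    samples.foreach { s =>
      println(s"input:    $s")
      println(s"matches:  ${matches(s).mkString(", ")}")
      println(s"stripped: ${stripTags(s)}")
    }
  }
}
```

Text that legitimately contains comparison operators or other angle brackets would lose content with this pattern, which is why the answer is limited to inputs where tags are the only place those characters appear.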