Regex: Removing punctuation from text in Scala-Spark (tags: regex, scala, apache-spark, punctuation)


Here is a sample of my data:

case time (especially it's purse), read manual care, follow care instructions make stays waterproof -- example, inspect rubber seals doors (especially battery/memory card door open time) 
xm "life support" picture . flip part bit flimsy guessing won't long . sound great altec speaker dock it! chance back base (xm3020) . traveling bag connect laptop extra speaker . amount paid ($25).
I want to remove all punctuation except the dot (.) and to drop words shorter than 3 characters. For example, my expected output is:

case time especially its purse read manual care follow care instructions . make stays waterproof example inspect rubber seals doors especially batterymemory card door open time
life support picture . flip part bit flimsy guessing wont long . sound great altec speaker dock chance back base xm3020 . traveling bag connect laptop extra speaker . amount paid $25 .
This should be implemented in Scala. I tried:

replaceAll( """\\W\s""", "")
replaceAll(""""[^a-zA-Z\.]""", "")
But these do not work well. Can anyone help me?

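How about this:

replaceAll("(\\(|\\)|'|/)", "")

You can then add more punctuation marks to remove by separating them with |. Make sure to escape characters such as ( and ) with a double backslash.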
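A character class is a more compact way to write the same idea, since parentheses need no escaping inside square brackets. The following is a minimal sketch only; the object name `StripChars` and the helper `stripChars` are hypothetical names chosen for illustration.

```scala
object StripChars {
  // Character-class equivalent of the alternation (\(|\)|'|/).
  // Inside [...], parentheses are literal and need no backslash.
  private val Unwanted = "[()'/]"

  // Removes every occurrence of the listed characters.
  def stripChars(s: String): String = s.replaceAll(Unwanted, "")
}
```

To remove further marks, add them inside the brackets, for example `[()'/,!]`. A hyphen is safest placed first or last in the class so that it is not read as a range.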

You can try filtering the string like this:

val example = "Hey there! It's me, myself and I."
example.filterNot(x => x == ',' || x == '!' || x == 'm')
 res3: String = Hey there It's e yself and I.
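When the list of characters grows, chaining `||` comparisons becomes unwieldy. A possible variation, sketched under the assumption that a fixed set of unwanted characters is known in advance, passes a `Set[Char]` to `filterNot` directly (a `Set` can be used as a predicate). The name `FilterPunctuation` and the contents of the set are illustrative assumptions, not part of the original answer.

```scala
object FilterPunctuation {
  // Assumed list of characters to drop; adjust as needed. The dot is kept.
  private val unwanted: Set[Char] = Set(',', '!', '(', ')', '"', '\'', '/', '-')

  // A Set[Char] is itself a Char => Boolean function, so it can be
  // handed to filterNot without writing a lambda.
  def dropUnwanted(s: String): String = s.filterNot(unwanted)
}
```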

Try this, it should work:

val str = """
  |case time (especially it's purse), read manual care, follow care instructions make stays waterproof -- example, inspect rubber seals doors (especially battery/memory card door open time) 
  |xm "life support" picture . flip part bit flimsy guessing won't long . sound great altec speaker dock it! chance back base (xm3020) . traveling bag connect laptop extra speaker . amount paid ($25).
  """.stripMargin('|')

println(str)
val pat = """[^\w\s\.\$]"""
val pat2 = """\s\w{2}\s"""
println(str.replaceAll(pat, "").replaceAll(pat2, ""))
Output:

case time especially its purse read manual care follow care instructions make stays waterproof  example inspect rubber seals doors especially batterymemory card door open time 
life support picture . flip part bit flimsy guessing wont long . sound great altec speaker dockchance back base xm3020 . traveling bag connect laptop extra speaker . amount paid $25.
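Note that the output above contains `dockchance`: because `pat2` matches the whitespace on both sides of the short word and replaces everything with an empty string, the neighbouring words are glued together. It also only matches words of exactly two characters. One way to avoid both issues, sketched here with the hypothetical helper `dropShortWords`, is to match the short word with word boundaries, replace it with a space, and then collapse repeated spaces. Restricting the class to letters (instead of `\w`) is an assumption made so that `25` in `$25` survives.

```scala
object ShortWords {
  // Whole words made of one or two ASCII letters.
  // Letters only, so digits such as "25" are left alone.
  private val ShortWord = """\b[A-Za-z]{1,2}\b"""

  // Replace each short word with a space so neighbours stay separated,
  // then squeeze runs of spaces back down to one.
  def dropShortWords(s: String): String =
    s.replaceAll(ShortWord, " ").replaceAll(" {2,}", " ").trim
}
```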

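Looking at the regex javadoc, we see that the character class for punctuation is \p{Punct}, and that an intersection such as [a-z&&[^def]] can be used to subtract characters from a character class. From there, it is easy to define a regular expression that removes all punctuation except the dot: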

s.replaceAll("""[\p{Punct}&&[^.]]""", "")
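As a quick check of what the intersection does, the sketch below wraps the expression in a hypothetical helper, `stripPunctExceptDot`. Keep in mind that `\p{Punct}` covers the ASCII punctuation characters, which include `$`, so the dollar sign is removed as well.

```scala
object KeepDots {
  // \p{Punct} minus '.', written as a character-class intersection.
  private val PunctExceptDot = """[\p{Punct}&&[^.]]"""

  def stripPunctExceptDot(s: String): String =
    s.replaceAll(PunctExceptDot, "")
}
```

For instance, applying it to `amount paid ($25).` leaves `amount paid 25.`: the parentheses and the `$` are gone, and the dot stays.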

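Removing words of one or two letters is:

s.replaceAll("""\b\p{IsLetter}{1,2}\b""", "")

Both steps can be combined into a single expression that also consumes the whitespace after each match:

s.replaceAll("""([\p{Punct}&&[^.]]|\b\p{IsLetter}{1,2}\b)\s*""", "")

To remove only specific punctuation marks, list them explicitly:

s.replaceAll("""([?.!:]|\b\p{IsLetter}{1,2}\b)\s*""", "")

Note that $25 contains a special character ($) that your expected output does not remove. To keep it, exclude $ from the punctuation class as well:

s.replaceAll("""([\p{Punct}&&[^.$]]|\b\p{IsLetter}{1,2}\b)\s*""", "")

Comment: if you use the second pattern, you may want to replace with " " to preserve whitespace. With the current pattern, "dog mr cat" becomes "dogcat".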
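One caveat with the combined single-pass patterns: the short-word alternative runs before the apostrophe has been removed, so in `it's` both `it` and `s` count as separate short words and are dropped, and `won't` loses its `t`. If the goal is `its` and `wont`, as in the expected output of the question, a two-pass version may be closer. The sketch below is one possible arrangement; the names `TwoPassClean` and `clean` are hypothetical.

```scala
object TwoPassClean {
  // Pass 1: ASCII punctuation except '.' and '$' (the expected output keeps "$25").
  private val Punct = """[\p{Punct}&&[^.$]]"""
  // Pass 2: whole words of one or two letters, plus any whitespace after them.
  private val ShortWord = """\b\p{IsLetter}{1,2}\b\s*"""

  // Strip punctuation first so "it's" becomes "its" before word length is checked.
  def clean(s: String): String =
    s.replaceAll(Punct, "").replaceAll(ShortWord, "")
}
```

In Spark, a function like this could be applied to each line, for example inside a `map` over an RDD or Dataset of strings; that wiring is not shown here.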