R/regex with stringi/ICU: why is a '+' considered a non-[:punct:] character?


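I am trying to remove non-alphabetic characters from a vector of strings. I thought the [:punct:] grouping would cover it, but it seems to ignore the +. Does this character belong to another group?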
library(stringi)
string1 <- c(
"this is a test"
,"this, is also a test"
,"this is the final. test"
,"this is the final + test!"
)

string1 <- stri_replace_all_regex(string1, '[:punct:]', ' ')
string1 <- stri_replace_all_regex(string1, '\\+', ' ')

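POSIX character classes need to be wrapped inside a character class (a bracket expression); the correct form is [[:punct:]]. Do not confuse the POSIX term "character class" with what is normally called a regex character class.

In the ASCII range, this POSIX named class matches all non-control, non-alphanumeric, non-space characters.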
ascii <- rawToChar(as.raw(0:127), multiple = TRUE)
paste(ascii[grepl('[[:punct:]]', ascii)], collapse="")
# [1] "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"
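
The claim above can be checked in base R with a minimal sketch. It assumes an ASCII-compatible locale, since POSIX classes are locale-dependent; the variable names `is_punct` and `is_other` are only illustrative. It compares the characters matched by [[:punct:]] with those matched by the control, alphanumeric and space classes, and then looks up '+' specifically.

```r
# ASCII code points 1 to 127 as single-character strings
ascii <- rawToChar(as.raw(1:127), multiple = TRUE)

# Characters matched by [[:punct:]] ...
is_punct <- grepl("[[:punct:]]", ascii)
# ... versus control, alphanumeric and space characters
is_other <- grepl("[[:cntrl:][:alnum:][:space:]]", ascii)

any(is_punct & is_other)   # does any character fall into both groups?
all(is_punct | is_other)   # does every character fall into one of them?
"+" %in% ascii[is_punct]   # is '+' matched by [[:punct:]] in base R?
```

If the description holds on your system, the two groups should not overlap and '+' should be among the [[:punct:]] matches.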


Alternatively, fall back to gsub from base R, which handles this very well:
gsub('[[:punct:]]', ' ', string1)
# [1] "this is a test"            "this  is also a test"     
# [3] "this is the final  test"   "this is the final   test "


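In POSIX-like regex engines, punct stands for the character class corresponding to the ispunct() classification function (on UNIX-like systems, check out man 3 ispunct). According to ISO/IEC 9899:1990 (ISO C90), the ispunct() function tests for any printing character except for space or a character for which isalnum() is true. However, in a POSIX setting, the details of which characters belong to which class depend on the current locale. So the punct class here will not lead to portable code; see the ICU User Guide section on C/POSIX migration for more details.

On the other hand, the ICU library, on which stringi relies and which fully conforms to the Unicode standard, defines some character classes in its own way, but one that is well defined and always portable.

In particular, according to the Unicode standard, the PLUS SIGN (U+002B) is in the Symbol, Math (Sm) category (and is not a Punctuation mark (P)).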
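
As a quick check of the categories described above, here is a minimal sketch that asks stringi which Unicode general category '+' falls into. It then replaces punctuation and symbols together, which is one possible way (not necessarily the only one) to get the same result as the base R gsub call. The two-element `string1` and the names `plus_is_math`, `plus_is_punct` and `cleaned` are illustrative.

```r
library(stringi)

# '+' is in the Unicode general category Sm (Symbol, Math) ...
plus_is_math <- stri_detect_charclass("+", "\\p{Sm}")
# ... and not in any of the Punctuation (P) categories
plus_is_punct <- stri_detect_charclass("+", "\\p{P}")

string1 <- c("this, is also a test", "this is the final + test!")

# Replace every character that is either punctuation (P) or a symbol (S)
cleaned <- stri_replace_all_regex(string1, "[\\p{P}\\p{S}]", " ")
cleaned
# [1] "this  is also a test"      "this is the final   test "
```

Using the Unicode property classes directly avoids relying on how a given engine interprets the POSIX names.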
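library("stringi")
# All ASCII code points 1..127 as a single string
ascii <- stri_enc_fromutf32(1:127)
stri_extract_all_regex(ascii, "[[:punct:]]")[[1]]
##  [1] "!"  "\"" "#"  "%"  "&"  "'"  "("  ")"  "*"  ","  "-"  "."  "/"  ":"  ";"  "?"  "@"  "["  "\\" "]"  "_"  "{"  "}" 
stri_extract_all_regex(ascii, "[[:symbol:]]")[[1]]
## [1] "$" "+" "<" "=" ">" "^" "`" "|" "~"

It shouldn't, at least according to the linked references.

@davide, actually, your second link lists "+" under the [:punct:] characters, and grepl("[[:punct:]]", "+") returns TRUE. So in base R regex, at least, "+" is considered a punctuation character.

R regex needs an extra set of "[]" for character class arguments to succeed. See ?regex

@bondedust, that does not appear to have been done in the OP's example.

Another case of non-standard evaluation by a supposedly "helpful" wrapper ending up confusing us. Another example of why I have never thought it appropriate to use stringi or stringr. Plain R regex is already quite clean and "regular". Wrapping it only adds capacity for error.

@bondedust, actually the main advantage of stringi is speed. It is not a wrapper but a complete rewrite, unlike stringr, which, as far as I know, is basically a wrapper.

I was under an incorrect impression. It seems it also offers vectorization over the pattern and replacement arguments, but without better documentation it is not much use to me.

I use it for speed reasons; this is for a file with 50MM rows. When it works properly, stringi is about 100 times faster than stringr.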