R conditional replace/trim with padding (regex, gsub, gregexpr, regmatches)


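I have a question about conditional replacement.

Basically, I want to find every run of digits and, for each consecutive digit after the 4th, replace it with a space.

The solution needs to be vectorized, and speed is essential.

Here is a working (but inefficient) solution: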

data

Here is a fast approach that needs only a single gsub call:
gsub("(?<!\\d)(\\d{4})\\d*", "\\1", data$input, perl = TRUE)
# [1] "STRING WITH 2 FIX(ES): 1234    0987  1111   "        
# [2] " PADDED STRING WITH 3 FIX(ES): 1234    0987  1111   "
# [3] " STRING WITH 0 FIX(ES): 12        098     111   "    
# [4] NA                                                    
# [5] "1234"                                                
# [6] "   1234   6789    "  

Here is an approach using gregexpr and regmatches:

# find all numbers with more than 4 digits
m <- gregexpr("\\d{5,}", data$input)

# replace the digits after the 4th with spaces for those matches
zz <- lapply(regmatches(data$input, m), function(x) {
  mapply(function(x, n) formatC(substr(x, 1, 4), width = -n), x, nchar(x))
})

# combine with the original (non-matching) pieces
data$output2 <- unlist(Map(function(a, b) paste0(a, c(b, ""), collapse = ""),
    regmatches(data$input, m, invert = TRUE), zz))

You can do the same in one line (one space per digit) with a single gsub call.

Details of the pattern:
(?:        # non-capturing group: the two possible entry points
    \G     # either the position after the last match or the start of the string
    (?!\A) # exclude the start of the string position
  |        # OR
    \d{4}  # four digits
)          # close the non-capturing group
\K         # removes all on the left from the match result
\d         # a single digit
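
To make the pattern above concrete, here is a minimal sketch that applies it with gsub and perl = TRUE. The vector x is hypothetical sample input, not the data from the question.

# Hypothetical sample input; not the data from the question
x <- c("ID 1234567 END", "12 345", "123456 7890123", NA)

# Each match is exactly one surplus digit: either the digit right after
# four digits (\d{4}\K\d) or the digit right after the previous match (\G)
padded <- gsub("(?:\\G(?!\\A)|\\d{4})\\K\\d", " ", x, perl = TRUE)
padded

# One space replaces one digit, so the lengths should be unchanged
identical(nchar(padded), nchar(x))

Because the replacement happens one digit at a time, no length bookkeeping is needed, and NA values pass through as NA.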

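So you are actually shortening the strings rather than replacing the extra digits with spaces. I think that differs from what the OP asked for.

Sven, thanks for your help, but your answer does not give the same result as mine. I see some differences: in row 1 there should be 6 spaces between 1234 and 0987, and row 4 has no trailing spaces. I don't want to just delete the characters after the 4th; I want to replace them with spaces, so the string should have the same length after the replacement.

@Brad See the update. I have not tested whether this approach is faster than yours.

Thanks, Sven. Flick, your answer makes a lot of sense. I like this approach.

Flick, when I make the cutoff dynamic, I get an error:

Well, don't do that. Actually, I don't know what you mean. What exactly did you change? What is the error?

You are using mapply incorrectly. You can't just add cutoff=cutoff, because it has length one and mapply expects all arguments to have the same length. You have to pass it through the MoreArgs= parameter, for example MoreArgs=list(cutoff=cutoff). See ?mapply for more information.

Thanks again! Really appreciate it.

@SvenHohenstein: It prevents \G from matching at the start of the string. The entry point for the first match has to be the second alternative (i.e. \d{4}). If I allowed \G to match at the start of the string and the string began with a digit, that digit would be replaced as well.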
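
As a sketch of the MoreArgs= advice from the comments: the helper pad_digits, the vector runs, and the cutoff value below are all hypothetical names and values chosen for illustration. They are not taken from the original answers.

# Hypothetical helper: keep the first `cutoff` characters of `x` and pad
# with spaces up to `n` characters (a negative width left-justifies)
pad_digits <- function(x, n, cutoff) {
  formatC(substr(x, 1, cutoff), width = -n)
}

cutoff <- 3                    # assumed value, for illustration only
runs <- c("12345", "9876543")  # hypothetical digit runs

# `cutoff` is the same for every call, so it is passed through MoreArgs
# instead of being listed with the vectorized arguments
res <- mapply(pad_digits, runs, nchar(runs),
              MoreArgs = list(cutoff = cutoff), USE.NAMES = FALSE)
res

Each result keeps the width of its input run, with the digits beyond the cutoff turned into trailing spaces.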
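Updated version that overwrites the extra digits with spaces, so the string length is preserved:

matches <- gregexpr("(?<=\\d{4})\\d+", data$input, perl = TRUE)
mapply(function(m, d) {
  # skip NA inputs and strings without a match (gregexpr returns -1)
  if (!is.na(d) && m[1] != -1L) {
    len <- attr(m, "match.length")
    for (i in seq_along(m)) {
      # overwrite each run of surplus digits with the same number of spaces
      substr(d, m[i], m[i] + len[i] - 1L) <- strrep(" ", len[i])
    }
  }
  d
}, matches, data$input)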

# [1] "STRING WITH 2 FIX(ES): 1234      0987    1111   "          
# [2] " PADDED STRING WITH 3 FIX(ES): 1234      0987    1111     "
# [3] " STRING WITH 0 FIX(ES): 12        098     111   "          
# [4] NA                                                          
# [5] "1234      "                                                
# [6] "   1234    6789     "  


The one-line gsub call with the \G-based pattern:

gsub("(?:\\G(?!\\A)|\\d{4})\\K\\d", " ", data$input, perl = TRUE)
