R: Using gsub to remove words containing 3 or more repeated letters

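I need to use gsub to remove from a string every word in which a letter is repeated 3 or more times in a row. For example:

"It has been raining verrrry badly heeere last few days"

I need to use the gsub function to get the following:

"It has been raining badly last few days"

The words "verrrry" and "heeere" should be removed from the string.

(The helper below is the function form of one of the answers; it is applied to a data frame further down.)

rm.repeatLetters <- function(x){
  # split the string into words
  xvec <- unlist(strsplit(x, " "))
  # flag words in which any character occurs 3 or more times in a row
  rmword <- grepl("(\\w)\\1{2,}", xvec)
  # rejoin the words that were not flagged
  return(paste(xvec[!rmword], collapse = " "))
}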

This looks like the output string you are aiming for:

origStr = "It has been raining verrrry badly heeere last few days"

newStr <- gsub("e{3,}", "e", origStr)   # collapse runs of 3 or more e's into a single e
(newStr <- gsub("r{3,}", "r", newStr))  # collapse runs of 3 or more r's into a single r

# [1] "It has been raining very badly here last few days"
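
The two calls above handle only e and r, and they shorten the words rather than removing them. As a rough sketch of how the same idea could be generalized, a backreference lets one gsub call collapse a run of any repeated word character. The variable name `collapsed` and the use of `perl = TRUE` are my own assumptions, not part of the answer.

```r
origStr <- "It has been raining verrrry badly heeere last few days"

# (\\w) captures one word character; \\1{2,} requires at least two more
# copies of it, so a run of 3 or more is replaced by a single copy.
collapsed <- gsub("(\\w)\\1{2,}", "\\1", origStr, perl = TRUE)
collapsed
# [1] "It has been raining very badly here last few days"
```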

Here is one approach:

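library(tm)
data("acq")

# tokenize one sample document into words
sometext <- acq[[12]]$content
q <- tm::MC_tokenizer(x = sometext)
q[131] <- "eeee"  # plant a word with a repeated letter for testing

# one column per letter: TRUE where a token has 3 or more of that letter in a row
zz <- sapply(letters, FUN = function(x) {
    grepl(paste0(x, "{3,}"), x = q, ignore.case = TRUE)
})

# count matching letters per token and keep only tokens with no match
# (the original test was flag == 1, which kept tokens matching two letters)
flag <- rowSums(zz)
newq <- q[flag == 0]
final <- paste(newq, collapse = " ")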

First construct the regular expression for your case; here is one possible solution:

# build one alternative per lowercase letter: "a{3,}", "b{3,}", ...
regExp <- paste(sapply(letters, paste, "{3,}", sep = ""), collapse = "|")
regExp
# [1] "a{3,}|b{3,}|c{3,}|d{3,}|e{3,}|f{3,}|g{3,}|h{3,}|i{3,}|j{3,}|k{3,}|l{3,}|m{3,}|n{3,}|o{3,}|p{3,}|q{3,}|r{3,}|s{3,}|t{3,}|u{3,}|v{3,}|w{3,}|x{3,}|y{3,}|z{3,}"

# origStr is the string defined in the earlier answer
words <- unlist(strsplit(origStr, "\\s+"))
cleanStr <- paste(words[!grepl(regExp, words)], collapse = " ")
cleanStr
# [1] "It has been raining badly last few days"
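
One thing worth checking with a pattern built from `letters` is case: it lists lowercase letters only, so a word such as "SOOO" is not flagged unless matching is made case-insensitive. The small sketch below illustrates this; the sample vector `sample_words` is made up for the illustration.

```r
regExp <- paste(paste0(letters, "{3,}"), collapse = "|")

# Hypothetical sample words; "good" has only two o's and must never match
sample_words <- c("It", "was", "SOOO", "good", "sooo")

# Case-sensitive: only the lowercase "sooo" is flagged
case_sensitive <- grepl(regExp, sample_words)

# Case-insensitive: "SOOO" is flagged as well
case_insensitive <- grepl(regExp, sample_words, ignore.case = TRUE)

paste(sample_words[!case_insensitive], collapse = " ")
# [1] "It was good"
```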

Option 1

x <- "It has been raining verrrry badly heeere last few days"
m <- gregexpr('\\s\\b\\w*(\\w)\\1{2,}\\w*\\b\\s', x, perl = TRUE)
regmatches(x, m) <- ' '
x
# [1] "It has been raining badly last few days"
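
Since the question asks for gsub specifically, here is a minimal sketch of a single-call variant. The helper name `remove_repeat_words` is hypothetical, and the pattern is my own adaptation of the one above: it removes the whitespace in front of each flagged word instead of requiring a space on both sides, which, as far as I can tell, also lets it handle flagged words at the start or end of the string and two flagged words next to each other.

```r
# Hypothetical helper name; not taken from any of the answers.
remove_repeat_words <- function(x) {
  # \\s*           optional whitespace before the word (removed with it)
  # \\b\\w*        start of a word and any leading characters
  # (\\w)\\1{2,}   one character followed by 2 or more copies of itself
  # \\w*\\b        the rest of the word
  out <- gsub("\\s*\\b\\w*(\\w)\\1{2,}\\w*\\b", "", x, perl = TRUE)
  trimws(out)  # drop the space left behind when the first word is removed
}

remove_repeat_words("It has been raining verrrry badly heeere last few days")
# [1] "It has been raining badly last few days"
```

Because gsub is vectorized, the same helper can be called directly on a character vector.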

A quick search turned up regex backreferences. Using them, you can do the following:

## string with repeated letters
s <- "It has been raining verrrry badly heeere last few days"

## split string into vector of words to select
svec <- unlist(strsplit(s, " "))

## flag words with 3 or more repeated letters/numbers
## (for any general symbol use '.' instead of '\\w')
rmword <- grepl("(\\w)\\1{2,}", svec)

## join words into single string again, removing the unwanted ones
paste(svec[!rmword], collapse = " ")

## output:
## [1] "It has been raining badly last few days"
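
A note on how the flagged words are dropped: removing them by negative index with the result of grep() behaves badly when no word matches, whereas a logical mask from grepl() does not. The sketch below shows the difference on a made-up vector with no repeated letters; `perl = TRUE` is my own choice.

```r
svec <- c("It", "has", "been", "raining")

# No word matches, so grep() returns integer(0) ...
idx <- grep("(\\w)\\1{2,}", svec, perl = TRUE)

# ... and negative indexing with integer(0) selects nothing at all
dropped_by_index <- svec[-idx]

# A logical mask keeps every word when nothing is flagged
dropped_by_mask <- svec[!grepl("(\\w)\\1{2,}", svec, perl = TRUE)]

length(dropped_by_index)  # 0: every word was lost
length(dropped_by_mask)   # 4: every word was kept
```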

The same logic, wrapped as the rm.repeatLetters function, can then be used on a data frame:

df <- data.frame(id=c(1, 2, 3), text=c(s, s, s), stringsAsFactors=FALSE)
## > df
##   id                                                   text
## 1  1 It has been raining verrrry badly heeere last few days
## 2  2 It has been raining verrrry badly heeere last few days
## 3  3 It has been raining verrrry badly heeere last few days


df$text <- sapply(df$text, rm.repeatLetters)
## > df
##   id                                    text
## 1  1 It has been raining badly last few days
## 2  2 It has been raining badly last few days
## 3  3 It has been raining badly last few days
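
As a variation on the call above, vapply can be used in place of sapply so that the result is guaranteed to be a plain character vector with one string per row. This is only a sketch: the function is repeated so the block runs on its own, and the second and third sample rows are invented to show a row with nothing to remove and a row that ends up empty.

```r
rm.repeatLetters <- function(x) {
  xvec <- unlist(strsplit(x, " "))
  rmword <- grepl("(\\w)\\1{2,}", xvec, perl = TRUE)
  paste(xvec[!rmword], collapse = " ")
}

s <- "It has been raining verrrry badly heeere last few days"
df <- data.frame(id = c(1, 2, 3),
                 text = c(s, "no repeats here", "sooo"),  # made-up rows 2 and 3
                 stringsAsFactors = FALSE)

# character(1) makes vapply fail loudly if any row returns something
# other than a single string; USE.NAMES = FALSE drops the input as names
df$text <- vapply(df$text, rm.repeatLetters, character(1), USE.NAMES = FALSE)
df$text
```

A row that consists only of flagged words becomes an empty string rather than NA.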

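I don't think this is really an R question. I suggest you look up "backreferences regex" on Google; you should be able to find it. If you are still stuck, post your attempt here and I'm sure people will help.

I'm new to regex/gsub. I need to embed this functionality in a script.

How long is the string you need to remove words from? Are we talking about a whole book or just a few pages?

The string is not very big... say one page of text.

Thanks Bob. If the string contains words with a letter repeated more than twice, I want those words removed from the string. In my example, "verrrry" and "heeere" are removed because they have letters repeated more than twice. Also, I am looking for a general function that can be applied to any word containing any repeated letter... Sorry I wasn't clear earlier. Thanks a lot for your help.

Thanks Chris. Your solution works. I'd like to know how to extend it to a data frame (I have a data frame with two columns: ID and text). The processing is done on the text field.

I suggest you use one of the solutions shown above... they are more efficient than the one I gave.

Wow, a lot of similar answers appeared while I was writing this...

Thanks Gabe... thanks for sharing the solution. Could you suggest how to apply it to a data frame with two columns, ID and text? The processing needs to happen on the text column.

Is each text row of the data frame a single word or a full sentence (or paragraph)?

Each row is a paragraph (in the text column).

I just realized you may not be notified when I edit my answer. It has been updated to use it on a data frame.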