R: remove all sentences from a text whose digit-to-character ratio is above the average
Is it possible to find and remove all sentences with a high ratio of digits to characters? I wrote the following function to calculate that ratio for a given string:
a <- "1aaaaaa2bbbbbbb3"
Num_Char_Ration <- function(string){
length(unlist(regmatches(string,gregexpr("[[:digit:]]",string))))/nchar(as.character(string))
}
Num_Char_Ration(a)
#0.1875
You need to split the long string into separate pieces first (with strsplit(), for example).
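As a rough illustration of that suggestion, the sketch below splits a text on "." and applies the ratio function from the question to each piece. The variable names `text`, `pieces`, and `ratios` are made up for this example, and splitting on a literal period is an assumption about what counts as a sentence.

```r
# Ratio function from the question: digits divided by total characters
Num_Char_Ration <- function(string) {
  length(unlist(regmatches(string, gregexpr("[[:digit:]]", string)))) /
    nchar(as.character(string))
}

# Hypothetical sample text, same shape as the data used in the answers
text <- " aa111111. bbbbbb22. cccccc3."

# Split on a literal "." and drop surrounding whitespace from each piece
pieces <- trimws(strsplit(text, ".", fixed = TRUE)[[1]])

# One ratio per piece; vapply guarantees a numeric vector
ratios <- vapply(pieces, Num_Char_Ration, numeric(1))
ratios
```

The resulting vector can then be compared against whatever threshold is appropriate (a fixed value, the maximum, or the mean).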
I would use the stringr package to count the digits and the alphanumeric characters:
# Original data
input <- " aa111111. bbbbbb22. cccccc3."
# Split by .
inputSplit <- strsplit(input, "\\.")[[1]]
# Count digits and all alphanumeric characters in each split piece
counts <- sapply(inputSplit, stringr::str_count, c("[[:digit:]]", "[[:alnum:]]"))
# Get ratios and collapse text back
paste(inputSplit[counts[1, ] / counts[2, ] < 0.5], collapse = ".")
# [1] " bbbbbb22. cccccc3"
Here is how I did it in base R, adapted from Andre's code:
my_string <- " aa111111. bbbbbb22. cccccc3."
#Split paragraph into sentences based on '.'
my_string <- unlist(strsplit(my_string, '(?<=\\.)\\s+', perl=TRUE))
#Removing sentences with more numbers than letters
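# Note: [^A-Za-z] is used instead of [^A-z], because A-z also matches the punctuation between "Z" and "a"
my_string <- subset(my_string, nchar(gsub("\\D", "", my_string)) <= nchar(gsub("[^A-Za-z]", "", my_string, perl = TRUE)))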
my_string
##[1] "bbbbbb22." "cccccc3."
Here is a simple base R solution:
x <- strsplit(input,"\\.")[[1]]
x <- x[nchar(x) < 2 * nchar(gsub("\\d","",x))]
paste(x,collapse=".")
# [1] " bbbbbb22. cccccc3"
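The length comparison in this solution is a rearranged form of "digit ratio below 0.5": with n total characters and d non-digit characters, (n - d) / n < 0.5 is the same as n < 2 * d. The sketch below checks that equivalence on the sample pieces. The variable names are made up for this illustration, and note that the leading space counts as a character here because the pieces are not trimmed.

```r
# Sample pieces as produced by strsplit(input, "\\.")[[1]]
x <- c(" aa111111", " bbbbbb22", " cccccc3")

n_total    <- nchar(x)                   # all characters, including the leading space
n_nondigit <- nchar(gsub("\\d", "", x))  # characters left after removing digits
n_digit    <- n_total - n_nondigit       # number of digits

by_ratio  <- n_digit / n_total < 0.5     # explicit ratio test
by_length <- n_total < 2 * n_nondigit    # test used in the solution

identical(by_ratio, by_length)
# [1] TRUE
```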
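By a higher ratio, do you mean the maximum? Yes, for example. Or all sentences whose ratio is higher than the average ratio across all sentences.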
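For the "above the average ratio" reading, a minimal base R sketch could look like the following. The helper names `digit_ratio` and `drop_above_mean` are hypothetical and not taken from any of the answers, and the sketch assumes that sentences are separated by a literal "." and that the ratio is measured on whitespace-trimmed sentences.

```r
# Hypothetical helper: fraction of characters in each string that are digits
digit_ratio <- function(x) {
  x <- trimws(x)
  digits <- lengths(regmatches(x, gregexpr("[0-9]", x)))
  # Guard against empty strings, which would otherwise give 0 / 0
  ifelse(nchar(x) > 0, digits / nchar(x), 0)
}

# Hypothetical helper: drop every sentence whose ratio exceeds the mean ratio
drop_above_mean <- function(text) {
  sentences <- strsplit(text, ".", fixed = TRUE)[[1]]
  ratios <- digit_ratio(sentences)
  keep <- ratios <= mean(ratios)
  paste(sentences[keep], collapse = ".")
}

input <- " aa111111. bbbbbb22. cccccc3."
drop_above_mean(input)
# [1] " bbbbbb22. cccccc3"
```

On this sample the ratios are 0.75, 0.25, and about 0.14, so the mean is about 0.38 and only the first sentence is dropped. Swapping `mean(ratios)` for a fixed value gives the fixed-threshold behaviour used in the answers.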
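# counts (from the stringr answer) holds the digit counts in row 1 and the alphanumeric counts in row 2.
# To get the ratio of digits to characters, divide the first row by the second row.
#      aa111111 bbbbbb22 cccccc3
# [1,]        6        2       1
# [2,]        8        8       7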
# Simplified num-to-char ratio function (vectorized over x)
Num_Char_Ration <- function(x) {
  lengths(regmatches(x, gregexpr("[0-9]", x))) / nchar(x)
}
# Split on ".", then keep only the sentences whose digit ratio is below 0.5
clear_nmbstring <- function(x) {
  x <- strsplit(x, ".", fixed = TRUE)[[1]]
  cleanx <- trimws(x)  # ignore surrounding whitespace when computing the ratio
  x <- x[Num_Char_Ration(cleanx) < 0.5]
  paste(x, collapse = ".")
}
# Example:
string <- c(" aa111111. bbbbbb22. cccccc3.")
clear_nmbstring(string)
# [1] " bbbbbb22. cccccc3"
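# Collapse the remaining sentences (my_string from the base R answer) back into a single string
paste(my_string, collapse = " ")
## [1] "bbbbbb22. cccccc3."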