R: regex pattern matching goes wrong when extracting text into two columns of a data frame
Consider the following hypothetical data:
x <- "There is a horror movie running in the iNox theater. : If row names are supplied of length one and the data
frame has a single row, the row.names is taken to specify the row names and not a column (by name or number).
If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify
the row names and not a column (by name or number) Can we go : Please"
y <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data
frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. :
If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify
the row names and not a column (by name or number) Can we go : Please"
z <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number).
If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify
the row names and not a column (by name or number) Can we go : Please"
df <- data.frame(Text = c(x, y, z), row.names = NULL, stringsAsFactors = F)
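Desired output for 'y' (because ':' is not found within the first three sentences):
As with the result for 'y' above, the desired output for 'z' should be:
Col1 Col2
NA all of the text from 'z'
What I tried: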
resX <- data.frame(Col1 = gsub("\\s\\:.*$","\\1", df$Text[[1]]),
Col2 = gsub("^[^:]+(?:).\\s","\\1", df$Text[[1]]))
resY <- data.frame(Col1 = gsub("\\s\\:.*$","\\1", df$Text[[2]]),
Col2 = gsub("^[^:]+(?:).\\s","\\1", df$Text[[2]]))
resZ <- data.frame(Col1 = gsub("\\s\\:.*$","\\1", df$Text[[3]]),
Col2 = gsub("^[^:]+(?:).\\s","\\1", df$Text[[3]]))
resX
You can try this negative-lookahead regex:
^(?s)(?!(?:(?:[^:]*?\\.){3,}))(.*?):(.*)$
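Update:
If your condition is satisfied, the regex matches and you get two parts.
Group 1 contains the text before the first ':' and group 2 contains the text after it.
If the condition is not satisfied, copy the whole string into column 2 and put whatever you need into column 1.
The updated sample code contains a function named processData that does this for you. If the condition is met, it splits the data and puts the parts into Col1 and Col2. If the input does not meet the condition, as with y and z, it puts NA into Col1 and the whole value into Col2.
The runnable sample source and its output are listed further down, in the code that defines processData.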
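To make the condition concrete, here is a minimal sketch that applies the same pattern to a few short strings. The vector `short` and its contents are made up for illustration and are not part of the original question. The third string shows that this pattern counts every period, including the ones inside `row.names`.

```r
# Pattern from the answer above, as an R string (backslashes doubled)
patt <- "^(?s)(?!(?:(?:[^:]*?\\.){3,}))(.*?):(.*)$"

# Hypothetical one-line inputs (assumption: not taken from the question)
short <- c("One. : Two. Three.",               # one period before the colon
           "One. Two. Three. : Four.",         # three periods before the colon
           "Use row.names or col.names. : x")  # one sentence, but three periods

# TRUE only when fewer than three periods precede the first colon
ok <- grepl(patt, short, perl = TRUE)
print(ok)  # expected: TRUE FALSE FALSE

# Capture groups for each string; non-matching strings give character(0)
m <- regmatches(short, regexec(patt, short, perl = TRUE))
print(m[[1]][2:3])  # group 1 and group 2 of the first string
```

Group 1 keeps the space before the colon and group 2 keeps the space after it, so `trimws()` may be useful if clean column values are wanted.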
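Brief
I took my inspiration from Rizwan's answer, so you will see that my answer is a refinement of his. What I did not like about his pattern is that it breaks on periods that do not end a sentence (for example row.names), although the sample text provided by the OP contains no case where row.names occurs 3 times within the first two sentences to demonstrate this. I also made sure that the capture groups/columns are numbered exactly as the OP expects and that there is always a match. My answer really is an improvement on Rizwan's.
Note 1: I assume a "sentence" is delimited by a period/dot followed by at least one horizontal whitespace character.
Note 2: This works with PCRE regex. It is untested in other regex flavors and may need adapting to work there (i.e. the if/else construct and the vertical and horizontal whitespace tokens).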
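Code
^(?(?!(?:[^:\v]*?\.\h){3,})([^:\v]*?)\s*:\s*|)(.*)$
Results
Input
The three strings x, y and z from the question, each joined onto a single line.
Output
Match 1
- Group 1:
There is a horror movie running in the iNox theater.
- Group 2:
If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please
Match 2
- Group 1: empty (did not participate in the match)
- Group 2:
There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please
Match 3
- Group 1: empty (did not participate in the match)
- Group 2:
There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please
Explanation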
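- ^ asserts the position at the start of the string.
- (?(?!(?:[^:\v]*?\.\h){3,})([^:\v]*?)\s*:\s*|) is a conditional of the form (?(?!...)x|y): an if statement that uses the negative lookahead (?!...) as its condition.
  - Condition: (?:[^:\v]*?\.\h){3,} matches the following at least 3 times:
    - [^:\v]*? matches any character not in the set (not a colon or a vertical whitespace character) any number of times, but as few as possible.
    - \.\h matches a literal dot followed by a horizontal whitespace character (space or tab).
    Because the lookahead is negative, the condition is true when this sequence cannot be matched, that is, when fewer than three sentences precede the first colon.
  - If the condition is true, match ([^:\v]*?)\s*:\s*
    - ([^:\v]*?) captures into group 1 any character not in the set (not a colon or a vertical whitespace character) any number of times, but as few as possible.
    - \s*:\s* matches any number of whitespace characters, followed by a colon, followed by any number of whitespace characters. (Note: if a "sentence" may itself contain ':', you can change * to + to improve the matching.)
  - If the condition is false, match nothing.
- (.*) captures into group 2 any character (excluding newlines when the s flag is off) any number of times.
- $ asserts the position at the end of the string.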
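The pattern explained above can be tried in R roughly as follows. This is a sketch under a few assumptions: the input strings in `txt` are invented one-line examples, base R's `regexec()` is used with `perl = TRUE` (available since R 3.3.0) because the conditional construct is PCRE-specific, and an empty group 1 is turned into `NA` to match the output the OP asked for.

```r
# Conditional pattern from this answer, as an R string (backslashes doubled)
patt <- "^(?(?!(?:[^:\\v]*?\\.\\h){3,})([^:\\v]*?)\\s*:\\s*|)(.*)$"

# Hypothetical one-line inputs (assumption: not taken from the question)
txt <- c("One. : Two. Three.",
         "One. Two. Three. : Four.",
         "Use row.names or col.names. : x")

# Each list element holds: full match, group 1, group 2
m <- regmatches(txt, regexec(patt, txt, perl = TRUE))
g1 <- vapply(m, function(g) g[2], character(1))
g2 <- vapply(m, function(g) g[3], character(1))

# Group 1 is empty when the "no" branch of the conditional was taken
res <- data.frame(Col1 = ifelse(nzchar(g1), g1, NA_character_),
                  Col2 = g2,
                  stringsAsFactors = FALSE)
print(res)
```

Since `.` does not match a newline here, multi-line strings such as the original x, y and z would need their line breaks replaced first, for example with `gsub("\\s+", " ", x)`.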
Split the text into sentences, grep for the position of the first occurrence of ':', and split the original text conditionally:
sp <- strsplit(x, '(?<=\\.)(?=\\s+\\S)', perl = TRUE)[[1L]]
sp <- if (grep(':', sp)[1L] < 3L)
sub(':\\s+', '$', x) else paste0('$', x)
sp <- gsub('\\v', '', sp, perl = TRUE)
str(read.table(text = sp, sep = '$', col.names = paste0('Col', 1:2), as.is = TRUE))
# 'data.frame': 1 obs. of 2 variables:
# $ Col1: chr "There is a horror movie running in the iNox theater. "
# $ Col2: chr "If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names an"| __truncated__
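To see what the sentence split produces, here is a small sketch on a short made-up string (`text` is hypothetical, and `b.c` stands in for something like `row.names`). It also adds a guard for input without any colon, where `grep(...)[1L]` is `NA` and a bare `if (NA < 3L)` would stop with an error. The guard is an addition for illustration, not part of the answer above.

```r
text <- "A b.c d. E f. : G"  # hypothetical input; "b.c" must not be split

# Split after a period that is followed by whitespace and a non-space character
sp <- strsplit(text, "(?<=\\.)(?=\\s+\\S)", perl = TRUE)[[1L]]
print(sp)

# Index of the first piece that contains a colon (NA if there is none)
first_colon <- grep(":", sp, fixed = TRUE)[1L]

# Guarded version of the condition used in the answer
has_early_colon <- !is.na(first_colon) && first_colon < 3L
print(has_early_colon)
```

Here the colon sits in the third piece, so the text would not be split into two columns under the `< 3L` rule.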
Negative lookaheads are expensive and hard to read. Here is a simpler solution:
library(stringr)
# throw out everything after first :, and count the number of sentences
split = str_count(sub(':.*', '', df$Text), fixed('. ')) < 3
# assemble the required data (you could also avoid ifelse if really needed)
data.frame(col1 = ifelse(split, sub(':.*', '', df$Text), NA),
col2 = ifelse(split, sub('.*?:', '', df$Text), df$Text))
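Do you mean something like this? resZ
@MadhuSareen I have updated the answer with some sample source; you can take a look.
Why doesn't it work for Y and Z? Both come out as NA and NA. The first column should be NA and t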
library(stringr)
x <- "There is a horror movie running in the iNox theater. : If row names are supplied of length one and the data
frame has a single row, the row.names is taken to specify the row names and not a column (by name or number).
If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify
the row names and not a column (by name or number) Can we go : Please"
y <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data
frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. :
If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify
the row names and not a column (by name or number) Can we go : Please"
z <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number).
If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify
the row names and not a column (by name or number) Can we go : Please"
df <- data.frame(Text = c(x, y, z), row.names = NULL, stringsAsFactors = F)
resDF <- data.frame("Col1" = character(), "Col2" = character(), stringsAsFactors=FALSE)
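processData <- function(a) {
  # Match only when fewer than three periods precede the first colon
  patt <- "^(?s)(?!(?:(?:[^:]*?\\.){3,}))(.*?):(.*)$"
  if (grepl(patt, a, perl = TRUE)) {
    result <- str_match(a, patt)
    col1 <- result[2]  # text before the first colon
    col2 <- result[3]  # text after the first colon
  } else {
    col1 <- NA_character_  # a real missing value rather than the string "NA"
    col2 <- a
  }
  c(col1, col2)
}
# Process each row and append the result to resDF
for (i in seq_len(nrow(df))) {
  tmp <- df[i, ]
  resDF[nrow(resDF) + 1, ] <- processData(tmp)
}
print(resDF)
Col1
1 There is a horror movie running in the iNox theater.
2 <NA>
3 <NA>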
Col2
1 If row names are supplied of length one and the data \n frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). \n If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify \n the row names and not a column (by name or number) Can we go : Please
2 There is a horror movie running in the iNox theater. If row names are supplied of length one and the data \n frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : \n If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify \n the row names and not a column (by name or number) Can we go : Please
3 There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). \n If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify \n the row names and not a column (by name or number) Can we go : Please
^(?(?!(?:[^:\v]*?\.\h){3,})([^:\v]*?)\s*:\s*|)(.*)$
There is a horror movie running in the iNox theater. : If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please
There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please
There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please
f <- function(text, end_of_sentence = '.', n = 3L, sep = '$') {
p <- sprintf('(?<=[%s])(?=\\s+\\S)', end_of_sentence)
sp <- strsplit(text, p, perl = TRUE)[[1L]]
sp <- if (grep(':', sp)[1L] <= n)
sub(':\\s+', sep, text) else paste0(sep, text)
sp <- trimws(gsub('\\v', '', sp, perl = TRUE))
read.table(text = sp, sep = sep, col.names = paste0('Col', 1:2),
stringsAsFactors = FALSE)
}
## test
f(x); f(y); f(z)
## vectorize it to work on more than one string
f <- Vectorize(f, SIMPLIFY = FALSE, USE.NAMES = FALSE)
do.call('rbind', f(df$Text))
# Col1
# 1 There is a horror movie running in the iNox theater.
# 2 <NA>
# 3 <NA>
# Col2
# 1 If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please
# 2 There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please
# 3 There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please